North Holland is an imprint of Elsevier
The Boulevard, Langford Lane, Kidlington, Oxford, OX5 1GB, UK
Radarweg 29, PO Box 211, 1000 AE Amsterdam, The Netherlands
225 Wyman Street, Waltham, MA 02451, USA

First edition 2011
Copyright © 2011 Elsevier B.V. All rights reserved

No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means electronic, mechanical, photocopying, recording or otherwise without the prior written permission of the publisher.

Permissions may be sought directly from Elsevier’s Science & Technology Rights Department in Oxford, UK: phone (+44) (0) 1865 843830; fax (+44) (0) 1865 853333; email: [email protected]. Alternatively you can submit your request online by visiting the Elsevier web site at http://elsevier.com/locate/permissions, and selecting Obtaining permission to use Elsevier material.

British Library Cataloguing in Publication Data
A catalogue record for this book is available from the British Library

Library of Congress Cataloging-in-Publication Data
A catalog record for this book is available from the Library of Congress

ISBN: 978-0-444-52936-7
ISSN: 1874-5857

For information on all North Holland publications visit our web site at elsevierdirect.com

Printed and bound in Great Britain
INTRODUCTION
While the more narrow research program of inductive logic is an invention of the 20th century, philosophical reflection about induction as a mode of inference is as old as philosophical reflection about deductive inference. Aristotle was concerned with what he called epagoge, and he studied it with the same systematic intent with which he approached the logic of syllogisms. However, it turned out that inductive inferences are much harder to evaluate, and it took another 2300 years to make substantial progress on these issues. Along the way, a number of philosophical and scientific turning points were achieved, and we can now look back on the excitingly rich history that this handbook covers in considerable detail. After Aristotle, our history took off in the 18th century with the ingenious insights and contributions of two philosophers: David Hume famously formulated the problem of induction with tremendous clarity. This problem (also called Hume’s Problem) has kept philosophers busy ever since; many responses have been put forward and, in turn, criticized, and variants of a major philosophical claim (“scepticism”) have been defended on its basis. At around the same time, Blaise Pascal and the philosophers of the School of Port Royal developed probability theory and laid the groundwork for decision theory. Both developments eventually led to a much better understanding of inductive inferences, and it would be difficult to see how their impact on philosophy and science could be overestimated. The strong bond between developments in science and philosophy (as far as they can be separated) can also be observed in the later course of this history. Think, for example, of the work by Carnap, Hintikka, Ramsey and de Finetti, and the contemporary endeavours in learning theory and Bayesian inference. The close interaction between science and philosophy is obvious here, which makes the field of inductive logic rather special.
While there are many examples where a science split from philosophy and became autonomous (such as physics with Newton and biology with Darwin), and while there are, perhaps, topics that are of exclusively philosophical interest, inductive logic — as this handbook attests — is a research field where philosophers and scientists fruitfully and constructively interact. A final development should be noted: While much of deductive logic has been developed in an anti-psychologistic spirit (an exception is van Lambalgen and Stenning’s Human Reasoning and Cognitive Science, MIT Press 2008), inductive logic profits considerably from empirical studies. And so it is no wonder that contemporary cognitive psychologists pay much attention to inductive reasoning and set out to study it empirically. In the course of this work philosophical accounts (such as Bayesianism) can be critically evaluated, and alternatives might be inspired.

Handbook of the History of Logic. Volume 10: Inductive Logic. Volume editors: Dov M. Gabbay, Stephan Hartmann and John Woods. General editors: Dov M. Gabbay and John Woods. © 2011 Elsevier B.V. All rights reserved.
It is to be hoped that philosophers and psychologists will interact on these issues more closely in the future, and that the new trend in experimental philosophy will prove beneficial. It was our intention to include a chapter on the Port Royal contributions to probability theory and decision theory. For reasons of space, we decided to avoid duplication with Russell Wahl’s excellent chapter, “Port Royal: The Stirrings of Modernity”, which appears in volume two of the Handbook, Mediaeval and Renaissance Logic. The Editors are deeply and most gratefully in the debt of the volume’s superb authors. For support and encouragement thanks are also due to Nancy Gallini, Dean of Arts, and Margaret Schabas, Head of Philosophy (and her successor Paul Bartha), at UBC, and Christopher Nicol, Dean of Arts and Science, and Michael Stingl, Chair of Philosophy, University of Lethbridge. Special thanks to Jane Spurr, Publications Administrator in London; Carol Woods, Production Associate in Vancouver; and our colleagues at Elsevier, Senior Acquisitions Editor Lauren Schultz and Assistant Editor Gavin Becker.

Dov M. Gabbay, King’s College London
Stephan Hartmann, Tilburg University
John Woods, University of British Columbia, King’s College London, and University of Lethbridge
CONTRIBUTORS
Nick Chater, University College London, UK. [email protected]
Frederick Eberhardt, Washington University, USA. [email protected]
Malcolm Forster, University of Wisconsin-Madison, USA. [email protected]
Maria Carla Galavotti, University of Bologna, Italy. [email protected]
Clark Glymour, Carnegie Mellon University, USA. [email protected]
Ulrike Hahn, Cardiff University, UK. hahnu@cardiff.ac.uk
Evan Heit, University of California, Merced, USA. [email protected]
James Joyce, University of Michigan, USA. [email protected]
Marc Lange, University of North Carolina at Chapel Hill, USA. [email protected]
Hannes Leitgeb, University of Bristol, UK. [email protected]
Ulrike von Luxburg, University of Tuebingen, Germany. [email protected]
John Milton, King’s College London, UK. [email protected]
Alan Musgrave, University of Otago, New Zealand. [email protected]
Ilkka Niiniluoto, University of Helsinki, Finland. ilkka.niiniluoto@helsinki.fi
Mike Oaksford, Birkbeck College London, UK. [email protected]
Daniel Osherson, Princeton University, USA. [email protected]
Ronald Ortner, Montanuniversität Leoben, Austria. [email protected]
Stathis Psillos, University of Athens, Greece. [email protected]
Jan-Willem Romeijn, University of Groningen, The Netherlands. [email protected]
Bernhard Schoelkopf, University of Tuebingen, Germany. [email protected]
Robert Schwartz, University of Wisconsin-Milwaukee, USA. [email protected]
Jan Sprenger, Tilburg University, The Netherlands. [email protected]
Scott Weinstein, University of Pennsylvania, USA. [email protected]
Jonathan Weisberg, University of Toronto, Canada. [email protected]
Sandy Zabell, Northwestern University, USA. [email protected]
INDUCTION BEFORE HUME

J. R. Milton

The word ‘Induction’ and its cognates in other languages, of which for present purposes the most important is Latin ‘inductio’, have a complex semantic history, as does the Greek ἐπαγωγή from which they were derived. Though some of these uses — electromagnetic induction, or the induction of a clergyman into a new benefice — are manifestly irrelevant, others that still diverge significantly from any of the uses current among present-day philosophers and logicians are not. As will soon become apparent, any attempt to write a history that focused solely on the direct ancestors of modern usage would be arduous if not impossible to execute, and deeply unsatisfactory if it could be brought to a conclusion. The net must, at least initially, be cast more widely. Another potential problem is that there may have been philosophers who discussed problems of inductive inference without using the word ‘induction’ (or its equivalents) at all. The most conspicuous suspect here is David Hume, who has been widely seen — in the twentieth century at least1 — as an inductive sceptic, even though it is notorious that he rarely used the word, and never in the passages where his inductive scepticism has been located. Whether or not this interpretation of Hume is correct lies outside the scope of this chapter, but it is at least entirely clear that the issue cannot be decided simply from an analysis of Hume’s vocabulary. In the Hellenistic era discussions of non-deductive inference were centred on what became known as inference from signs (semeiosis). This was concerned with arguments from the apparent to the non-apparent — either the temporarily and provisionally non-apparent (for example something at a distance), or the permanently and intrinsically non-apparent (for example invisible bodies such as atoms).
How useful it is for modern historians to employ the terminology of induction when dealing with this material is disputed: some do so quite freely, e.g. [Asmis, 1984], while others reject it altogether [Barnes, 1988]. In the present study no attempt will be made to discuss this material in any detail; for some modern accounts see [Burnyeat, 1982; Sedley, 1982; Allen, 2001].

1 THE ANCIENT WORLD
Human beings have been making generalisations since time immemorial, and certainly long before any logicians arrived on the scene to analyse what they were doing. Techniques could sometimes go well beyond induction by simple enumeration, as the following remarkable passage from the Old Testament shows:

1 [Stove, 1973; Winkler, 1999; Howson, 2000; Okasha, 2001].
And Gideon said unto God, If thou wilt save Israel by mine hand, as thou hast said, Behold I will put a fleece of wool in the floor; and if the dew be on the fleece only, and it be dry upon all the earth beside, then shall I know that thou wilt save Israel by mine hand, as thou hast said. And it was so: for he rose up early on the morrow, and thrust the fleece together, and wringed the dew out of the fleece, a bowl full of water. And Gideon said unto God, Let not thine anger be hot against me, and I will speak but this once: let me prove, I pray thee, but this once with the fleece; let it now be dry only upon the fleece, and upon all the ground let there be dew. And God did so that night: for it was dry upon the fleece only, and there was dew on all the ground. (Judges, vi. 36–40). Neither the writer of this passage nor his readers had ever read Mill, or heard of the Method of Agreement or Method of Difference, but few could have found Gideon’s procedures difficult to comprehend. As Locke was to comment sardonically, ‘God has not been so sparing to Men to make them barely two-legged Creatures, and left it to Aristotle to make them Rational’ (Essay, IV. xvii. 4; [Locke, 1975, p. 671]). It was, nevertheless, Aristotle who was the first philosopher to give inductive reasoning a name and to provide an account, albeit a brief and imperfect one, of what it was and how it worked. The name chosen was ἐπαγωγή (epagoge), derived from the verb ἐπάγειν, variously translated, according to context, as to bring or lead in, or on. Like ‘induction’ in modern English, epagoge had (and continued to have) a variety of other, irrelevant meanings: Plato had used it for an incantation (Republic 364c), and Aristotle himself employed it for the ingestion of food (De Respiratione 483a9).
1.1 Socrates and Plato

Although none of Aristotle’s predecessors had anticipated him in using the term epagoge for inductive arguments, he had himself picked out Socrates for his use of what Aristotle called ἐπακτικοὺς λόγους (Metaphysics 1078b28). Though Aristotle would have had testimony about Socrates’ activities that has since been lost, there can be little doubt that his main source of information was Plato. In the early dialogues, Socrates was often portrayed as using modes of argument that Aristotle would certainly have classed as epagoge, for example in Protagoras 332C, where Socrates is reporting his interrogation of Protagoras:

Once more, I said, is there anything beautiful? Yes. To which the only opposite is the ugly? There is no other. And is there anything good? There is.
To which the only opposite is the evil? There is no other. And there is the acute in sound? True. To which the only opposite is the grave? There is no other, he said, but that. Then every opposite has one opposite only and no more? He [Protagoras] assented. [Plato, 1953, vol. I, p. 158] Here and elsewhere (e.g. Charmides 159–160; Ion 540) the conclusion is a philosophical one that could have been grasped directly by someone intelligent and clear-sighted enough. Plato was concerned with truths such as these, not with empirical generalisations involving white swans or other sensory particulars [Robinson, 1953, pp. 33–48; McPherran, 2007].
1.2 Aristotle
Aristotle’s theory of induction — or to put it more neutrally, of epagoge, since there is disagreement even about the most appropriate translation of that term — has long been a matter of controversy. It is widely regarded as incomplete and in various respects imperfect: one modern commentator has referred to ‘the common belief [that] Aristotle’s concept of induction is incomplete, ill-conceived, unsystematic and generally unsatisfactory’, at least in comparison with his theory of deduction [Upton, 1981, p. 172]. Though not everyone might agree with this, it is clear that there is no consensus either about what exactly Aristotle was trying to do, or about how successful he was.2

2 A selection of diverse views can be found in [Kosman, 1973; Hamlyn, 1976; Engberg-Pedersen, 1979; Upton, 1981; Caujolle-Zaslavsky, 1990; McKirahan, 1992, pp. 250–7; De Rijk, 2002, pp. 140–8].

When Aristotle used the word epagoge to characterise his own arguments, his employment of the term is thoroughly Socratic, or at least Platonic; the arguments were seldom empirical generalisations, or anything like them. The following passage from Metaphysics I is in this respect entirely typical:

That contrariety is the greatest difference is made clear by induction [ἐκ τῆς ἐπαγωγῆς]. For things which differ in genus have no way to one another, but are too far distant and are not comparable; and for things that differ in species the extremes from which generation takes place are the contraries, and the distance between extremes — and therefore that between the contraries — is greatest. (1055a5–10).

Similar remarks can be found elsewhere in the same book, e.g. in 1055b17 and 1058b9. Aristotle discussed epagoge in three passages, none of them very long. The earliest is in Topics A12, where dialectical arguments are divided into two kinds,
syllogismos and epagoge. The meaning of the former term is certainly broader than ‘syllogism’ as now generally understood, and as the word is used in Aristotle’s later writings; it can probably best be translated as ‘deduction’. Epagoge is characterised quite briefly:

Induction is the progress from particulars to universals; for example, ‘If the skilled pilot is the best pilot, and the skilled charioteer the best charioteer, then in general the skilled man is the best man in any particular sphere.’ Induction is more convincing and more clear and more easily grasped by sense perception and is shared by the majority of people, but reasoning [syllogismos] is more cogent and more efficacious against argumentative opponents (105a12–19).

The first part of this subsequently became the standard definition of induction in the Middle Ages and Renaissance. It is natural for a modern reader to interpret it as meaning that induction is the mode of inference that proceeds from particular to universal propositions, but the Greek does not quite say this. Induction is merely the passage (ἔφοδος) from individuals to universals, τὰ καθόλου, and in other places (notably Posterior Analytics B19) these universals would seem to be, or at least to include, universal concepts. It should also not be automatically assumed that ‘ἔφοδος’ means inference in any technical sense [De Rijk, 2002, pp. 141–4]. Aristotle’s longest account of epagoge is in Prior Analytics B23:

Now induction, or rather the syllogism which springs out of induction [ὁ ἐξ ἐπαγωγῆς συλλογισμός], consists in establishing syllogistically a relation between one extreme and the middle by means of the other extreme, e.g. if B is the middle term between A and C, it consists in proving through C that A belongs to B. For this is the manner in which we make inductions. For example let A stand for long-lived, B for bileless, and C for the particular long-lived animals, e.g. man, horse, mule.
A then belongs to the whole of C: for whatever is bileless is long-lived.3 But B also (‘not possessing bile’) belongs to all C. If then C is convertible with B, and the middle term is not wider in extension, it is necessary that A should belong to B. For it has already been proved that if two things belong to the same thing, and the extreme is convertible with one of them, then the other predicate will belong to the predicate that is converted. But we must apprehend C as made up of all the particulars. For induction proceeds through an enumeration of all the cases. (68b15–29).

3 The phrase given here in italics makes no sense here; it may be an interpolation and if so should be excised [Aristotle, 1973, p. 514], even though there is no manuscript support for doing this [Ross, 1949, p. 486].

This is not an easy passage to understand, and has been the subject of much discussion. Aristotle appears to be applying his method of conversion, devised as part of his account of syllogisms, to a case where it is not obviously applicable: hence the mention of middle terms. The crucial step in the argument is that B belongs to all C, i.e. that every long-lived animal is bileless. This could mean that every individual long-lived animal is bileless, or it could mean that every species of such animals is bileless. The latter seems to be indicated by the examples given — man, horse, mule, rather than (say) Socrates, Bucephalas etc. If so, then Aristotle appears to have been giving an example of what has subsequently come to be termed perfect (i.e. complete) induction: an inference from a finite sample that is sufficiently small for all the particular cases to be examined. This might seem to be what is indicated by the final remark, that ‘induction proceeds through an enumeration of all the cases’, but here (as often) the Oxford translation supplies words not present in the Greek, which merely says ‘for induction [is] through all’, ἡ γὰρ ἐπαγωγὴ διὰ πάντων. It is perhaps significant here that the proposition being proved — that all bileless animals are long-lived — is a generalisation about the natural world, and therefore very unlike the propositions argued for by Socrates in the early Platonic dialogues. It is manifestly not something that could in principle be grasped immediately by intuition. The same is true of another proposition described as having been derived by induction: in Posterior Analytics A13 (78a30–b4) Aristotle gave a celebrated example of a scientific demonstration:
(1) The planets do not twinkle.
(2) Whatever does not twinkle is near.
Therefore
(3) The planets are near.
This counts as a demonstration, as distinct from a merely valid syllogism, because it states the cause: it is because the planets are near (i.e. nearer than the fixed stars) that they do not twinkle. Premise (2) is described as having been reached ‘by induction or through sense-perception’ (78a34–5), though the same must in fact be true also of premise (1). For (1) the argument is straightforward and unproblematic — Mercury does not twinkle, Venus does not twinkle, etc. — but for (2) it is not. There is clearly no difficulty in assembling a long list of particular non-twinkling objects that are also nearby, but how could the general proposition that all such objects are nearby be established? If it is supposed to be the conclusion of an inductive argument, then the enumeration is manifestly incomplete, and the inference correspondingly fallible. The demonstrations analysed in the Posterior Analytics are syllogistic arguments (here ‘syllogism’ is being used in the strict sense) which proceed from premises that are ‘true, primary, immediate, better known than, prior to, and causative of their conclusion’ (71b20–2). All these premises are universal in form, and this raises an obvious question: if the primary premises from which demonstrations proceed cannot themselves be demonstrated, how are they to be known? It was an issue that Aristotle deferred until the final chapter of the second book. The problem is stated quite clearly at the beginning of the chapter, but the discussion that follows at first sight seems rather puzzling: rather than discussing
inductive arguments, Aristotle appears to be trying to account for the acquisition of universal concepts — from the perception of several individual men to the species man, and then to the genus animal (100a3–b3). He then commented (this is the only place in which the word epagoge occurs in the whole chapter): ‘Thus it is clear that it is necessary for us to come to know the first principles by induction, because this is also the way in which universals are put into us by sense perception’ (100b3–5). The whole passage is undeniably difficult, and has been diversely interpreted, as the two main English commentaries on the Posterior Analytics show. Sir David Ross took it that Aristotle was concerned with both concept formation and induction, and treated them together because ‘the formation of general concepts and the grasping of universal propositions are inseparably interwoven’ [Ross, 1949, p. 675]. Jonathan Barnes, on the other hand, held that ‘Here “induction” is used in a weak sense, to refer to any cognitive progress from the less to the more general . . . Thus construed, 100b3–5 says no more than that concept acquisition proceeds from the less to the more general.’ [Barnes, 1975, p. 256]. On Barnes’s reading, the passage is not concerned with the inference from singular to universal propositions at all. This is not a dispute that can easily be resolved: the relevant texts are quite short, and all the participants in the debate are thoroughly familiar with them. My own inclination is to side with Ross. Aristotle’s position here is very different from that found in a later empiricist like Locke. Locke had an account of how humans — unlike the other animals that he called ‘brutes’ — had a capacity to frame abstract general ideas from the ideas of particular things given in perception [Locke, 1975, pp. 159–60], but this process had nothing to do with an inductive ascent from particular to universal propositions, about which Locke said virtually nothing. 
For Aristotle what comes to rest in the soul (more specifically, in the intellect) is not a mere Lockean abstract general idea, a particular entity that has the capacity to function as a universal sign, but rather a real universal thing, a form freed from matter and thereby de-individuated. This is why the same psychological process can be used to explain both the acquisition of universal concepts and the knowledge of first principles. In the Posterior Analytics the account of this is little more than a sketch, but it was subsequently fully worked out by Aristotle’s followers in late antiquity and in the Middle Ages. There is no hint whatever in Aristotle that epagoge is merely one of several ways by which we can gain knowledge of first principles. The view found in many modern empiricists that while some universal truths are known — or at least receive some degree of evidential support — a posteriori, by induction, others (for example Euclid’s axiom that all right angles are equal) are known a priori, is entirely foreign to his way of thinking. For Aristotle it is impossible to view (θεωρῆσαι) universals except through induction (Posterior Analytics 81b2). In all the passages mentioned so far, epagoge is treated as a process leading to universals, whether concepts, or propositions, or both. This is explicit in the definition in the Topics, but it can also be seen in the Prior and the Posterior Analytics. Often, however, and especially in the practical affairs of life, we are
concerned with reasoning from particulars to other particulars — whether the sun will rise tomorrow, whether this loaf of bread will nourish me, and so on. Aristotle was, of course, well aware that we do this, and classified such inferences as ‘examples’ (paradeigmata). What is less clear is whether paradeigma is a type of induction, or whether it is a different kind of argument, resembling induction in various ways, but not a sub-variety of it. In Prior Analytics B24, the chapter immediately after the chapter on induction, there is an account of paradeigmata. To give one specimen of such an argument, Athens against Thebes and Thebes against Phocis are both cases of wars against neighbours; the war against Phocis was bad for Thebes, so a war against Thebes would be bad for Athens (68b41–69a13). The inference might appear to proceed via a more general principle that war against neighbours is always bad (69a4, 6), which would make it an application of induction: a two-part argument involving an inductive ascent to a generalisation followed by a deductive descent to a particular case. Aristotle, however, insisted that the two kinds of inference were distinct: example is not reasoning from part to whole or from whole to part, but from part to part (69a14–15). Induction proceeds by an examination of all the individual cases (ἐξ ἁπάντων τῶν ἀτόμων), while example does not (69a16–19). In Aristotle’s Rhetoric, however, induction and example seem much closer, if not identical:

just as in dialectic there is induction on the one hand and syllogism or apparent syllogism on the other, so it is in rhetoric. The example is an induction, the enthymeme4 is a syllogism, and the apparent enthymeme is an apparent syllogism. I call the enthymeme a rhetorical syllogism and the example a rhetorical induction. Every one who effects persuasion through proof does in fact use either enthymemes or examples: there is no other way.
And since every one who proves anything at all is bound to use either syllogisms or inductions (and this is clear to us from the Analytics), it must follow that enthymemes are syllogisms and examples are inductions (1356b1–10).

The exhaustive division of all arguments into either syllogismos or epagoge is not peculiar to the Rhetoric: it can be found in both parts of the Analytics (68b13–14, 71a5–6), as can the identification of enthymeme and example as their rhetorical counterparts (71a9–11). One very plausible way of interpreting this is that enthymeme and example are not sub-varieties of syllogismos and epagoge, still less entirely different types of argument, but rather instances of syllogismos and epagoge ‘when these occur in a rhetorical speech rather than in a dialectical argument’ [Burnyeat, 1994, p. 16]. If this is done, however, the notion of epagoge must be broadened to include most if not all non-deductive argument, since one thing that is absolutely certain about paradeigma is that it concerns arguments from particulars to particulars.

4 Aristotle’s account of enthymeme is complex and has often been misunderstood, but lies outside the scope of this chapter; for a penetrating modern analysis, see [Burnyeat, 1994].
None of Aristotle’s surviving works contains a detailed and systematic account of induction, and there is no evidence that one was ever produced. Why this should have been the case is not obvious, given the potential importance of such reasoning in his theory of knowledge, but one explanation may be that the separation of form and content, which had been central to his analysis of the syllogism, was (and still remains) more difficult to achieve in the case of induction. At all events, Aristotle did not bequeath to his successors an account of induction that was in any way comparable to his treatment of the syllogism.
1.3 Hellenistic and later Greek accounts

In the three centuries that followed Aristotle’s death, his technical writings were not much studied outside the (declining) Peripatetic school, and the terms that he had devised were replaced by others. The problems involved in inference from particular to universal propositions were raised occasionally, but they seem not to have become the central issue of discussion, unlike the problems of inference from signs.

Alcinous

The lack of any serious interest in induction among the Platonists is indicated by the extremely brief treatment in one of the few philosophical textbooks to survive, the Handbook of Platonism (Didaskalikos) attributed to a certain Alcinous, often identified with the Middle Platonist Albinus (2nd century AD):

Induction is any logical procedure which passes from like to like, or from the particular to the general. Induction is particularly useful for activating the natural concepts (Didaskalikos, 6.7; [Dillon, 1993, p. 10]).

The last remark may allude to the well-known passage in the Meno where the slave boy is being led to reveal his innate knowledge of geometry [Dillon, 1993, p. 77]. One finds here a characteristic blend of Platonism and Aristotelianism: the role of induction is to provide particular examples that can bring to full consciousness the concepts implanted in us by nature.

Diogenes Laertius

Two other Greek writers from the Roman period had rather more to say about induction: the biographer Diogenes Laertius (early 3rd century?), and the Pyrrhonian sceptic, Sextus Empiricus (late 2nd or early 3rd century?). Neither was an original thinker, and indeed Diogenes was barely a thinker at all, but rather a scissors-and-paste compiler whose labours would have been ignored by posterity had they not resulted in the only extensive compendium of philosophical biographies to have survived from antiquity.
Diogenes’ remarks on induction are in his life of Plato (III. 53–55). Epagoge is defined as an argument in which we infer from some true premises a conclusion resembling them. There are two varieties: from opposites (κατ᾿ ἐναντίωσιν), and from implication (ἐκ τῆς ἀκολουθίας). The former is a mode of argument that bears little resemblance to any modern notion of induction: If man is not an animal he will be either a stick or a stone. But he is not a stick or a stone, for he is animate and self-moved. Therefore he is an animal. But if he is an animal, and if a dog or an ox is also an animal, then man by being an animal will be a dog and an ox as well. The first part of this is clear enough — it seems that either Diogenes or his source was using an ancient version of the question ‘Animal, Vegetable or Mineral?’ — but the last part is considerably more obscure. The second kind of induction is much more familiar. There are two sub-varieties: one, described as belonging to rhetoric, in which the argument is from particulars to other particulars, and the other, belonging to dialectic, in which it is from particulars to universals. The former is clearly the Aristotelian paradeigma, though that term was not used. An instance of the latter is the argument that the soul is immortal: And this is proved in the dialogue on the soul [presumably the Phaedo] by means of a certain general proposition, that opposites proceed from opposites. And the general proposition is established by means of some propositions which are particular, as that sleep comes from waking and vice-versa, and the greater from the less and vice-versa. These are not examples of empirical generalisations. Sextus Empiricus Among the immense range of sceptical arguments preserved and deployed by Sextus Empiricus, inductive scepticism is inconspicuous, though not wholly absent. In the Outlines of Pyrrhonism II. 
204 inductive arguments were dismissed in a very cursory, almost contemptuous, manner: It is also easy, I consider, to set aside the method of induction [τὸν περὶ ἐπαγωγῆς τρόπον]. For, when they propose to establish the universal from the particulars by means of induction, they will effect this by a review either of all or of some of the particular instances. But if they review some, the induction will be insecure, since some of the particulars omitted in the induction may contravene the universal; while if they are to review all, they will be toiling at the impossible, since the particulars are infinite and indefinite. Thus on both grounds, as I think, the consequence is that induction is invalidated.5 [Sextus, 1967, p. 283].

5 Literally, ‘shaken’, or ‘made to totter’.
J. R. Milton
Another passage a few pages earlier (II. 195) supplies a little more detail: Well then, the premiss ‘Every man is an animal’ is established by induction from particular instances; for from the fact that Socrates, who is a man, is also an animal, and Plato likewise, and Dion and each one of the particular instances, they think it is possible to assert that every man is an animal. . . [Sextus, 1967, p. 277]. Sextus was not persuaded: if even a single counter-example can be found, the universal conclusion is not sound (ὑγιής, i.e. healthy), ‘thus, for example, when most animals move the lower jaw, and only the crocodile the upper, the premiss “Every animal moves the lower jaw” is not true.’ [Sextus, 1967, p. 277]. At first sight this differs from the familiar modern textbook example of ‘All swans are white’ being falsified by the observation of a single individual black swan, but in fact the differences are small. In the case of the swans, what makes the falsification effective is that it was a species of black swans that was discovered. Logically speaking, a single negative instance can falsify a universal proposition; in practice it usually would not, as a variety of what Imre Lakatos called ‘monster-barring’ stratagems would come into play. It is very unlikely that the generalisation about how animals move their jaws, with the crocodile as an exception, was original to Sextus: the same example can be found in Apuleius’ Peri Hermeneias [Apuleius, 1987, p. 95]. It had probably long been a stock example, repeated from author to author.

Alexander of Aphrodisias

The view that conclusions drawn from inductive arguments are not conclusively established was not peculiar to the sceptics — indeed it can be found among the Aristotelians themselves, notably the late second-century commentator Alexander of Aphrodisias. On the passage in Topics 105a10ff quoted above, Alexander observed: So induction has the quality of persuasiveness; but it does not have that of necessity.
For the universal does not follow by necessity from the particulars once these have been conceded, because we cannot get something through induction by going over all the particular cases, since the particular cases are impossible to go through [Alexander, 2001, p. 93]. As this and other remarks to be quoted in what follows show quite clearly, it is utterly mistaken to suppose that Hume was the first person to notice that inductive arguments are not deductively valid, and that any universal generalisation which covers a field that is either infinite or too large to survey completely is vulnerable to counter-examples. To suppose this would be unfair both to Hume, who was certainly doing something more radical and much less banal, and to his predecessors, who had taken the fallibility of such inferences for granted.
1.4 Roman philosophy
Cicero and the rhetorical tradition

The Romans, unlike their medieval successors, had little interest in logic as a technical discipline,6 but rhetoric was a central — perhaps the central — element of their educational curriculum. When philosophy began to be written in Latin, a new technical vocabulary needed to be devised. Who introduced the term ‘inductio’ for epagoge is not now known, but in the surviving corpus of Latin literature the word first appears with this sense in a youthful work by Cicero, De Inventione. Here it is described as a form of argument in which the speaker first gets his opponent to agree on some undisputed propositions, and then leads him to assent to others resembling them. In the example Cicero gave, Pericles’ sharp-witted mistress Aspasia is interrogating the wife of a certain Xenophon (not the historian): ‘Please tell me, if your neighbour had a better gold ornament than you have, would you prefer that one, or your own?’ ‘That one’, she said. ‘And if she had clothes or other finery more expensive than you have, would you prefer yours or hers?’ ‘Hers, of course’, she replied. ‘Well then, if she had a better husband than you have, would you prefer yours or hers?’ At this, the woman blushed. (I. 55). Clearly this is not a specimen of inductive generalisation, but rather of what Aristotle called paradeigma. Cicero had little interest in the kinds of generalisation that might be made by a natural philosopher: his concern, here as elsewhere, was with the strategies that can be used in public speaking or in a court of law. In a later rhetorical treatise, the Topics, induction is mentioned very briefly as merely one variety of a more extensive class of arguments from similarity. The example Cicero gave — that if honesty is required of a guardian, a partner, a bailee and a trustee, it is required of an agent (Topics, 42) — is described as an epagoge (the Greek term was used), but it is clearly a case of what Aristotle had called paradeigma.
In the rhetorical tradition, it was the analysis and employment of arguments of this type that attracted most interest. Cicero’s account of induction was followed by the writers of rhetorical treatises and textbooks, notably Quintilian’s Institutio Oratoria, V. x. 73, xi. 2 [Quintilian, 1921, vol. II, pp. 241, 273], though the treatment is fairly cursory: induction was merely one rather unimportant variety of reasoning, less deserving of extended analysis than either arguments from signs or examples. This subsumption of induction into the theory of rhetoric had the unwelcome result (for analytically minded historians of philosophy) that what they have thought of as the Problem of Induction — the enquiry into how (if at all) universal propositions can be proved, or
6 Though the aversion was by no means universal: see [Barnes, 1997, ch. 1].
at least made probable,7 from evidence of particular cases — was never properly raised, let alone answered.

Boethius

It was only in the final twilight of the ancient world, after the fall of the Empire in the west, that Aristotle’s writings started to be translated into Latin. Boethius had planned to translate and comment on the entire corpus, but by the time of his premature death only a small part of this exceedingly ambitious project had been completed. The only translations that have survived were of the Categories and De Interpretatione, but Boethius’ own logical writings gave his early medieval successors some information about the content of Aristotle’s other works on logic. Induction was dealt with fairly briefly in De Topicis Differentiis [Stump, 1978, pp. 44–46], being described in Aristotelian rather than Ciceronian terms as a progression from particulars to universals. This is taken directly from Aristotle’s account in the Topics, as was the example given to illustrate it: just as a pilot should be chosen on the basis of possessing the appropriate skill rather than by lot, and similarly with a charioteer, so generally if one wants something governed properly one should choose someone on the basis of their skill. The main historical importance of Boethius’ account is not that it added anything to earlier analyses — it did not — but that it provided his early medieval readers with information then unavailable from any other source.

Summary

It is striking that no sustained discussion of inductive reasoning has survived from the ancient world. Of course the vast majority of Greek and Roman philosophical works have perished, and are accessible only from fragments quoted by other writers, or often not at all. If more had been preserved, then the patchy and episodic account given above could unquestionably have been made considerably longer and more detailed.
There is nevertheless no sign that a major and systematic account of inductive reasoning has been lost: among the many lists of works given by Diogenes Laertius there is no trace of any treatise with the title Peri Epagoges or something similar. It would appear, therefore, that induction was not something that any of the ancients regarded as one of the central problems of philosophy. Several reasons for this state of affairs can be discerned. One is that the general drift of philosophy, especially in late antiquity, was away from the kind of systematic empirical enquiry practised by Aristotle and his immediate successors. Plotinus, for example, used the word epagoge only twice, once for an argument to show that there is nothing contrary to substance, and once for an argument that whatever is destroyed is composite (Enneads, I. 8. 6; II. 4. 6). The kind of understanding gained through empirical generalisation was too meagre and unimportant for the modes of argument leading to it to merit sustained analysis. Another reason is that interest in the systematic investigation of the natural world was intermittent and localised. It did not help that the scientific discipline in which the greatest advances were made had been mathematical astronomy, and this was not a field where the problems posed by inductive reasoning would have surfaced, still less become pressing. Constructing a model for the motions of a planet was a highly complex business, but it did not involve generalisation from data in the form ‘this A is B’ and ‘this A is B’ to ‘every A is B’. Ptolemy, indeed, seems to have felt so little urge to generalise that his models for the individual planets are all given separately, and (in the Almagest at least) not integrated into a single coherent system. Finally, the centrality of rhetoric in ancient education meant that when inductive arguments were discussed, they tended to be evaluated for their persuasiveness, not for their logical merits. Inductive arguments became almost lost in a mass of miscellaneous un-formalised arguments that were not investigated for their validity, or any inductive analogue thereof, but for their plausibility in the context of a speech.

7 On the meaning of probabilis and related terms in Cicero and other ancient authors, see [Glucker, 1995]. On subsequent history, see [Hacking, 1975; Cohen, 1980; Franklin, 2001].
2 THE MIDDLE AGES
2.1 Arabic accounts

Two civilisations inherited the legacy of ancient philosophy. Starting in the late eighth century, a large part of the philosophical literature that had been fashionable in late antiquity was translated into Arabic, including most of the corpus of Aristotle’s writings and many of the works of his commentators. The accounts of induction in the Prior Analytics, the Posterior Analytics and the Topics became the starting point of subsequent treatments. Very little of the Greek technical terminology was directly transliterated, a notable exception being the word for philosophy itself (falsafah). Epagoge was translated as istiqrā, a word whose root meaning was investigation or examination.8

Al-Fārābī

The first Arabic writer to give a systematic account of induction was al-Fārābī (c.870–c.950) [Lameer, 1994, pp. 143–154, 169–175]. His conception of induction differed in one important respect from Aristotle’s. According to Joep Lameer: For Aristotle, induction is the advance from a number of related particular cases to the corresponding universal. In opposition to this, al-Fārābī explains induction in terms of an examination of the particulars. This view must be taken to be a natural consequence of the fact that in the Arabic Prior Analytics, epagōgē was rendered as istiqrā (‘collection’ in the sense of a scrutiny of the particulars). [Lameer, 1994, p. 173, cf. p. 144]. This conception of induction as proceeding by a one-by-one examination of the particulars had the consequence that inductions have full probative force only when they are complete [Lameer, 1994, pp. 144, 147].
Al-Fārābī also made a distinction between induction and what he called methodic experience (tajriba, equivalent to Greek empeiria): methodic experience means that we examine the particular instances of universal premises to determine whether a given universal is predicable of each one of the particular instances, and we follow this up with all or most of them until we obtain necessary certainty, in which case that predication applies to the whole of that species. Methodic experience resembles induction, except the difference between methodic experience and induction is that induction does not produce necessary certainty by means of universal predication, whereas methodic experience does. [McGinnis and Reisman, 2007, p. 67].

8 For insight into the meaning of Arabic terminology I am grateful to my colleague Peter Adamson.
Induction is inferior to methodic experience because it does not uncover necessary truths or lead to certain knowledge.

Avicenna

The same distinction between induction and methodic experience appears in Avicenna (Ibn Sina, 980–1037) [McGinnis, 2003], [McGinnis, 2008]. In his main philosophical work, The Cure (Book of Demonstration, I. 9. §§ 12, 21), induction was described as inferior to methodic experience, in that unless it proceeds from an examination of all the relevant cases, it leads only to probable belief [McGinnis and Reisman, 2007, pp. 149, 152]. Methodic experience is not like induction . . . methodic experience is like our judging that the scammony plant [Convolvulus scammonia] is a purgative for bile; for since this is repeated many times, it stops being a case of something that occurs by chance, and the mind then judges and grants that it is characteristic of scammony to purge bile. Purging bile is a concomitant accident of scammony. [McGinnis and Reisman, 2007, p. 149]. The Aristotelian background is apparent here: events due merely to chance do not recur regularly, and a regular succession is therefore a sign that something is occurring naturally: Now one might ask: ‘This is not something whose cause is known, so how are we certain that scammony cannot be sound of nature, and yet not purge bile?’ I say: Since it is verified that purging bile so happens to belong to scammony, and that becomes evident by way of much repetition, one knows that it is not by chance, for chance is not always or for the most part. Then one knows that this is something scammony necessarily brings about by nature, since there is no way it can be an act of choice on the part of scammony. [McGinnis and Reisman, 2007, p. 149] To use the language of more recent philosophers, we know a priori that the physical world is full of natural law-like regularities, and we merely need enough experience to show that the apparent regularity we are considering is one of these, and not something due purely to chance. Even if this is granted, however, methodic experience does not produce certainty: there is always the risk of coming up with a generalisation that is too wide: We also do not preclude that in some country, some temperament and special property is connected with or absent from the scammony such that it does not purge. Nevertheless, the judgment based on methodic experience that we possess must be that the scammony commonplace among and perceived by us purges bile, whether owing to its essence
or a nature in it, unless opposed by some obstacle. [McGinnis and Reisman, 2007, p. 151] Another problem is effectively identical to the white-swan problem of modern textbooks: Were we to imagine that there were no people but Sudanese, and that only black people were repeatedly perceived, then would that not necessarily produce a conviction that all people are black? On the one hand, if it does not, then why does one repetition produce such a belief, and another repetition does not? On the other hand, if the one instance of methodic experience does produce the belief that there are only black people, it has in fact produced an error and falsehood. [McGinnis and Reisman, 2007, p. 150] It was a very pertinent question, and Avicenna’s response was rather opaque: you can easily resolve the puzzle concerning the Sudanese and their procreation of black children. In summary form, when procreation is taken to be procreation by black people, or people of one such country, then methodic experience will be valid. If procreation is taken to be that of any given people, then methodic experience will not end with the aforementioned particular instances; for that methodic experience concerned a black people, but people absolutely speaking are not limited to black people. [McGinnis and Reisman, 2007, p. 150] Though Avicenna’s writings had an immense influence on the philosophers in the universities of medieval Europe, this particular work was never translated into Latin. The purgative powers of scammony, however, became a stock example in scholastic discussions, probably through its use in Avicenna’s medical writings, which had an immense influence on medical education in the Latin west [Weinberg, 1965, pp. 134–135].
2.2 The Latin West

Boethius’ categorisation of induction as a progression from particulars to universals was only one definition current during the Middle Ages. Another was a more rhetorical definition, derived ultimately from Cicero and transmitted by authors such as Victorinus and Alcuin, that made no mention of universality. Alcuin defined induction as an argument that from certain things proves uncertain ones, and compels the assent of the unwilling [Halm, 1863, p. 540].9 The reception and translation of the main body of Aristotle’s writings into Latin during the course of the twelfth and thirteenth centuries, initially from Arabic, but subsequently directly from Greek, focused the attention of philosophers in the universities on the logical rather than the rhetorical tradition. There are two main locations for discussions of induction in the works of the schoolmen. One was in commentaries and questions on the Prior and Posterior Analytics, the other in general treatises on logic and logic textbooks, though few of these dealt with it at length.10

9 Inductio est oratio quae per certas res quaedam incerta probat et nolentem ducit in assensionem, Disputatio de Rhetorica et De Virtutibus, 30. According to Victorinus, Inductio est oratio, quae rebus non dubiis captat adsensiones eius, quicum instituta est, Explanationum in Rhetoricam M. Tullii Ciceronis Libri Duo, I. 31, [Halm, 1863, p. 240].

Robert Grosseteste

One of the most elaborate and most interesting commentaries on the Posterior Analytics was one of the first, written by Robert Grosseteste before 1230 [Hackett, 2004, p. 161]. Grosseteste’s account of induction was based closely on the final chapter of the Posterior Analytics, though the specific example he used came from Avicenna: For when the senses several times observe two singular occurrences, of which one is the cause of the other or is related to it in some other way, and they do not see the connections between them, as, for example, when someone frequently notices that the eating of scammony happens to be accompanied by the discharge of red bile and does not see that it is the scammony that attracts and withdraws the red bile, then from constant observation of these two observable things it begins to form [estimare] a third, unobservable thing, namely that scammony is the cause that withdraws the red bile [Grosseteste, 1981, pp. 214–215; Crombie, 1953, pp. 73–74] This looks much more like the advancing of a causal hypothesis than a specimen of inductive generalisation. The next part of Grosseteste’s account followed Aristotle closely: repeated perceptions are stored in the memory, and this in turn leads to reasoning: Reason begins to wonder and consider whether things really are as the sensible recollection says, and these two lead the reason to the experiment [ad experientiam], namely, that scammony should be administered after all other causes purging red bile have been isolated and excluded.
But when he has administered scammony many times with the sure exclusion of all other things that purge red bile, then there is formed in the reason this universal, namely that all scammony of its nature withdraws red bile; and this is the way in which it comes from sensation to a universal experimental principle. [Grosseteste, 1981, p. 215; Crombie, 1953, p. 74] If the conclusion had merely been that all scammony draws out red bile, then the argument would be a clear and unproblematic case of inductive generalisation. In fact the conclusion is stronger than this: that scammony of its nature [secundum se] draws out red bile. No doubt Grosseteste would have replied that the inference would only be safe if the power of drawing out bile really was part of the nature of scammony. The framework and most of the details of Grosseteste’s account are plainly Aristotelian, but there is one important difference. In the Posterior Analytics a plurality of memories constitute a single experience (empeiria), and this, unlike the memories from which it had arisen, is universal (100a5–6); there is no suggestion whatever that anything that we would now describe as an experiment needs to be undertaken. Grosseteste’s procedure was much more interventionist: scammony is to be administered in a variety of situations in which all the other substances that are known to purge bile have been excluded, and it is this systematic variation of the circumstances that provides the justification for the universal conclusion.

10 Little has been written specifically on medieval accounts of induction, but for two short general surveys, see [Weinberg, 1965; Bos, 1993].

William of Ockham

The most detailed account of induction by any of the writers on logic was given by Ockham in his Summa Logicae, Part III, section iii, chapters 31–36. In the first of these, induction was defined in the manner of Aristotle and Boethius, as a progression from singulars to a universal [Ockham, 1974, p. 707]. In both the premises and the conclusion the predicate remains the same, and variation occurs merely in the subject: for example ‘This [man] runs, that [man] runs, and so on for other singulars [et sic de singulis], therefore every man runs’, or ‘Socrates runs, Plato runs, and so on for other singulars, therefore every man runs’ [Ockham, 1974, p. 708]. In all these examples Ockham was concerned with propositions ascribing a predicate to an individual (Socrates, this man, that white thing), and not a species. This is fully consonant with his thoroughgoing nominalism: only individuals exist, and universals are merely signs that represent them.
In the chapters that follow Ockham gave a series of rules for sound and unsound inductive inferences. He began by considering non-modal propositions about present states of affairs (de praesenti et de inesse). There are three rules for these:

1. Every true universal proposition has some true singular.

2. If all the singulars of some universal proposition are true, then the universal is true.

3. If a negative universal proposition is false, then it follows that at least one of its singulars is false. [Ockham, 1974, pp. 708–709]

The first of these points to a fundamental difference between medieval and modern post-Fregean logic, in which ‘Every A is B’ does not imply that ‘Some A is B’. The second rule might seem obvious, but as becomes apparent in the chapters that
follow, there are types of proposition for which Ockham thought that it did not apply. For some modal propositions — those in sensu divisionis 11 — the same rules apply: just as we can draw the conclusion that ‘Every man runs’ from ‘Socrates runs’, ‘Plato runs’, etc., so we can make the inference ‘Socrates is contingently an animal, Plato is contingently an animal, and so on for other singulars, therefore every man is contingently an animal’ [Ockham, 1974, p. 715]. In cases where the modality is in sensu compositionis, however, different rules apply: this rule is not generally true [vera] ‘all the singulars are necessary, therefore the universal is necessary’. Similarly . . . this rule is not general ‘the universal is necessary therefore the singulars are necessary’. [Ockham, 1974, p. 717] Another inference that is not valid (non valet) is ‘all the singulars are possible, therefore the universal is possible’: For it does not follow ‘this is possible: this contingent proposition is true; and this is possible: that contingent proposition is true, and so for the other singulars; therefore this is possible: every contingent proposition is true’. [Ockham, 1974, p. 718] It is clear that in these chapters Ockham was not concerned with the problems discussed in modern treatises on probability and induction. Although the subject matter was described as induction, the problems addressed are those of deductive logic, in particular the relations between universal propositions — or to be more accurate propositions involving universal quantification — and their associated singular propositions. When, for example, he wrote that ‘this rule is not valid, the singulars are contingent, therefore the universal is contingent’,12 it is quite clear that he meant all the singulars, and not merely some of them. The problems involved in generalisation from a finite sample were not even raised, let alone answered: here at least Ockham was not engaged in that kind of enquiry. 
Jean Buridan

Ockham never wrote a commentary on either the Prior or the Posterior Analytics, but one fourteenth-century nominalist who did was Jean Buridan (c.1300–1358). Buridan took it for granted that inductive arguments are invalid if only some of the singulars are considered: ‘an induction is not a good consequence [bona consequentia] unless all the singulars are enumerated in it. But we cannot enumerate all of them because they are infinitely many.’ [Biard, 2001, p. 92]. We do nevertheless draw general conclusions from finite samples: For when you have often seen rhubarb purge bile and have memories of this, and have never found a counterexample in the many different circumstances you have considered, then the intellect, not as a necessary consequence, but only from its natural inclination to the truth, assents to the universal principle and understands it as if it were an evident principle based on an induction such as ‘this rhubarb purged bile, and that [rhubarb]’, and so on for many others, which have been sensed and held in memory. Then the intellect supplies the little clause [clausulam] ‘and so on for the [other] singulars’, because it has never witnessed a counterexample . . . nor is there any reason or dissimilarity apparent why there should be a counterexample. [Biard, 2001, p. 93] Parts of this may remind a modern reader of Hume’s account of the operation of the mind, but there is one crucial difference: Buridan’s ‘inclinations’ are inclinations to the truth, not mere habits grounded on the association of ideas. In the background there is the unquestioned assumption — notoriously absent in Hume — that God has equipped us with faculties that, when not mis-used, will lead us to truth rather than error.

11 Modal propositions in sensu divisionis (or in sensu diviso) were those where the modal operator was applied to part of the proposition, not the whole; propositions in sensu compositionis (or in sensu composito) were those where the operator was applied to the whole proposition: see [Broadie, 1993, pp. 59–60]; on Ockham’s usage, see [Lagerlund, 2000, pp. 98–100].

12 ‘ista regula non valet, singulares sunt contingentes, igitur universalis est contingens’, ch. 36, [Ockham, 1974, p. 720].

3 THE RENAISSANCE
3.1 The revival of rhetoric

At the risk of some simplification, it seems fair to say that the Renaissance saw a rise in the status of rhetoric, and a fall in the status of logic, or at least formal logic, though the process was far from uniform or complete. Aristotle continued generally to be treated with respect, even by those who did not think of themselves as Aristotelians, but the refinements of later medieval logic, with its intricate subtleties and (to the humanists) barbarous grammar and terminology, were quite another matter. Hostility towards formal logic can be traced back at least as far as Petrarch, but the first sustained attack was at the hands of Lorenzo Valla (1407–1457). The earliest textbook in the new style was the De Inventione Dialectica of Rudolph Agricola (1444–1485), first published in 1515 and reprinted sufficiently often thereafter for it to have been described as ‘the first humanist work in logic to become a best seller’ [Monfasani, 1990, p. 181]. Similar criticisms of scholastic logic were made by Juan Luis Vives (1492–1540), who had studied logic in the University of Paris as an undergraduate, and had not enjoyed the experience [Broadie, 1993, pp. 192–206]; his In Pseudodialecticos was first published in 1520 [Vives, 1979]. Valla’s opposition to traditional logic was deeper than that of his successors, in that he disliked not merely late medieval subtleties, but formal logic as such [Mack, 1993, pp. 83–4]. The rules of sound reasoning, like the rules of good writing,
were to be drawn ad consuetudinem eruditorum atque elegantium [Valla, 1982, p. 217], that is, from the actual Latin usage of the best writers of the best period. Logic therefore became merely one part — and a relatively unimportant one at that — of rhetoric. Valla explicitly indicated his dislike of, and dissent from, the Boethian description of induction as a progression from particulars to universals [Valla, 1982, p. 346]: for him it was the rhetorical argument from particulars to particulars that mattered. Agricola preferred the term ‘enumeratio’ to ‘inductio’, even though both had been used by Cicero: ‘to me it seems that induction should be more rightly called enumeration, since Cicero called it an argument from the enumeration of all the parts’ [Agricola, 1992, p. 316]. Some of the examples given are inductions of the traditional kind, but some certainly are not, for example: ‘the wall is mine, the foundation is mine, the roof is mine, the rest of the parts are mine. Therefore the house is mine.’ [Agricola, 1992, p. 316]. This kind of argument seems to have become a recognised type of induction in the rhetorical tradition: in the early eighteenth century Vico’s Institutiones Oratoriae drew a distinction between two kinds of induction, inductio partium and inductio similium. The former in turn had two sub-varieties, one involving an enumeration of all the species that made up a genus, the other an enumeration of all the parts that make up a totality, such as the limbs and organs of the human body [Vico, 1996, p. 90].
3.2 Zabarella
One of the most interesting sixteenth-century accounts of induction was by one of the professors at Padua, at that time the leading university in Italy, and arguably in Europe, and one where the study of logic continued to flourish [Grendler, 2002, pp. 250–253, 257–266]. Jacopo Zabarella (1533–1589) has been described by Charles Schmitt as ‘in the methodological matters . . . without a doubt the most acute and most influential of the Italian Renaissance Aristotelians’ [Schmitt, 1969, p. 82]. In chapter 4 of his short treatise De Regressu, he distinguished two kinds of induction: dialectical and demonstrative. Dialectical induction is used when the subject matter is mutable and contingent (in materia mutabili et contingente) and has no strength (nil roboris habet) unless all the particulars are considered without exception [Zabarella, 1608, col. 485d]. Demonstrative induction, by contrast, can be employed in necessary [subject] matter, and in things which have an essential connection among themselves, and for that reason in it [demonstrative induction] not all the particulars are considered, for our mind having inspected certain of these at once grasps the essential connection [statim essentialem connexum animadvertit], and leaving aside the remainder of the singulars, at once infers [colligit] the universal: for it knows it to be necessary that things are thus with the remainder [Zabarella, 1608, col. 485d–e].
22
J. R. Milton
A similar account can be found in the longer treatise De Methodis, III. 14 [Zabarella, 1608, col. 255f]. For Aristotelians like Zabarella, demonstrative induction was needed because it alone among the varieties of induction could lead to certain knowledge of the universal propositions that serve as the premises of demonstrative syllogisms. Such truths can become known to us not by a complete survey of all the particulars, which is impossible, but by enough of them being inspected for the appropriate universal to be formed in the soul. This is not merely a universal concept, but a real universal, a form abstracted from matter and thereby de-individuated. The situation may be represented by a diagram:
(a) Singular propositions
(b) Universal propositions
(c) Real individuals
(d) Real universals
Logically speaking, induction is an inference from (a) to (b) — this much was agreed by everyone working in the Aristotelian (as distinct from the rhetorical) tradition. For the medieval and post-medieval realists, including Zabarella, this inference from (a) to (b) was mirrored by the relation between the real individuals (c) and the real universals (d): what made a universal proposition true was what later philosophers might have called a universal fact. The existence of these facts explained why a universal proposition could be known to be true even though not all the relevant particulars had been surveyed — indeed sometimes when only a few (aliqua pauca) of them had been [Zabarella, 1608, col. 255f]. Once the intellect had grasped the universal, further investigation of the particulars was no longer required. In demonstrative induction this kind of grasp could be achieved, and certainty was therefore attainable. It is clear that this account of induction presupposed a realist account of universals, of a kind apparently held (though in a form that still remains a subject of dispute) by Aristotle, and certainly developed in a variety of different and incompatible forms by his successors in the Middle Ages and later [Milton, 1987]. It was not available to nominalists such as Ockham for whom the entities in class (d), the supposed real universals, were wholly non-existent. Despite the brilliance of several of its advocates, nominalism always remained a minority option among the university-based Aristotelians. In the seventeenth century it was to become much more popular.
Induction before Hume
4
THE SEVENTEENTH CENTURY AND EARLY EIGHTEENTH CENTURY
Many of the most original and creative philosophers of the seventeenth and early eighteenth century had little or nothing to say about induction. The word does not appear in either Spinoza’s Ethics or Locke’s Essay, and only once in passing in Berkeley’s Principles of Human Knowledge, § 50. That Spinoza had nothing to say is perhaps not very surprising,13 but the reason for Locke’s virtual silence — the term was used once in The Conduct of the Understanding 14 — is less immediately obvious. Part of the explanation may be that he had no confidence that natural philosophy would ever become a science, and that his own experience had mainly been as a physician, reasoning about particular cases and using general rules only as fallible guides to practice.
4.1
Bacon
Though Francis Bacon was the first thinker to invert the traditional priority and give induction precedence over deduction, it is potentially misleading to describe him as the founder of inductive logic. Bacon was not a logician either by temperament or doctrine, and it would be unhelpful to see him as a remote precursor of Carnap. His treatment of induction should be seen in the context of a massive but incomplete programme for the discovery of a new kind of scientific knowledge [Malherbe, 1996; Gaukroger, 2001, pp. 132–159]. Like Descartes a generation later, Bacon while still quite young became profoundly dissatisfied with all the many and various kinds of natural philosophy currently taught in the universities, but while Descartes was repelled by the uncertainty of this so-called knowledge, Bacon despised it for its uselessness — its utter failure to provide any grounding for practically effective techniques of controlling nature. Bacon’s disdain for traditional philosophy was made plain in 1605, in his first major publication, the Advancement of Learning. His low opinion of the logic taught by the schoolmen extended to their treatment of induction: Secondly, the Induction which the Logitians speake of, and which seemeth familiar with Plato, whereby the Principles of Sciences may be pretended to be invented, and so the middle propositions by derivation from the Principles; their fourme of Induction, I say is utterly vitious and incompetent . . . For to conclude uppon an Enumeration of particulars, without instance contradictorie is no conclusion but a coniecture; for who can assure (in many subjects) uppon those particulars which appeare of a side, that there are not other on the contrarie side, which appeare not? [Bacon, 2000a, pp. 109–110]

13 The word does occur in ch. 11 of the Tractatus Theologico-Politicus [Spinoza, 2004, p. 158].
14 Our observations ‘may be establish’d into Rules fit to be rely’d on, when they are justify’d by a sufficient and wary Induction of Particulars’ [Locke, 1706, p. 49].
What Bacon proposed to use instead of this vicious and incompetent form of induction is not explained, though he did promise the reader that ‘if God give me leave’ he would one day publish an account of his new method, which he called the Interpretation of Nature [Bacon, 2000a, p. 111]. The promise was eventually honoured in 1620 with the publication of Bacon’s most substantial philosophical work, the Novum Organum, designed as the second part, though the first to be published, of a massive — and unfinished — six-part project, the Great Instauration (Instauratio Magna). The title chosen for this second part made it clear that Bacon was making an open challenge to Aristotle. Aristotle’s logical works had become known collectively as the Organon, or tool, and the New Organon was intended not merely as a supplement, but as a replacement. Bacon’s case against the traditional logic of the schools had two main strands. In the first place the old logic was concerned with talk rather than action: For the ordinary logic professes to contrive and prepare helps and guards for the understanding, as mine does; and in this one point they agree. But mine differs from it in three points especially; viz., in the end aimed at; in the order of demonstration; and in the starting point of the inquiry. For the end which this science of mine proposes is the invention not of arguments but of arts; not of things in accordance with principles, but of principles themselves; not of probable reasons, but of designations and directions for works. And as the intention is different, so accordingly is the effect; the effect of the one being to overcome an opponent in argument, of the other to command nature in action. [Bacon, 1857–74, IV, pp. 23–24] Training in traditional logic encouraged the wrong kind of mental skills: it placed a premium on intellectual subtlety, but ‘the subtlety of nature is far greater than the subtlety of the senses and understanding’ (Novum Organum, I. 10).
Facility with words and agility in debate are not what is required when one is trying to penetrate the workings of nature. Secondly, by concentrating on the forms of argument, syllogistic logic draws attention away from defects in their matter, which are far more dangerous: The syllogism consists of propositions, propositions consist of words, words are symbols of notions. Therefore if the notions themselves (which is the root of the matter) are confused and over-hastily abstracted from the facts, there can be no firmness in the superstructure. Our only hope therefore lies in a true induction. [Novum Organum, I. 14] This is the first occasion on which induction was mentioned in the Novum Organum (as distinct from the parts of the Instauratio Magna that preceded it), but there is no subsequent explanation of how induction could contribute to the rectification of
defective concepts. One thing that is apparent, however, is that Bacon’s approach was quite different to Descartes’: there was no suggestion that the establishment of a set of clear and distinct ideas either could or should precede the investigations undertaken with their help. The improvement of concepts and the growth of knowledge had to take place together, by slow increments. Despite these harsh remarks, Bacon did not reject syllogistic reasoning entirely, but he restricted its use to areas of human life where ‘popular’, superficial concepts are employed: Although therefore I leave to the syllogism and these famous and boasted modes of demonstration their jurisdiction over popular arts and such as are matter of opinion (in which department I leave all as it is), yet in dealing with the nature of things I use induction throughout . . . [Bacon, 1857–74, IV, p. 24]15 Bacon was the opposite of an ‘ordinary language’ philosopher: he had no belief whatever that the concepts embedded since time immemorial in common speech would prove to be the ones needed in a reformed natural philosophy — indeed quite the contrary. One of his fundamental objections to Aristotle was that Aristotle had taken as his starting point popular notions and merely ordered and systematised them, instead of replacing them by something better. Bacon had no liking for neologisms, and whenever possible preferred ‘to retaine the ancient tearmes, though I sometimes alter the uses and definitions, according to the Moderate proceeding in Civill government’ [Bacon, 2000a, p. 81]. But though he was prepared to retain the traditional vocabulary, the kind of induction he was planning to use would be very unlike anything described by his predecessors: In establishing axioms, another form of induction must be devised than has hitherto been employed, and it must be used for proving and discovering not first principles (as they are called) only, but also the lesser axioms, and the middle, and indeed all.
For the induction which proceeds by simple enumeration is childish; its conclusions are precarious and exposed to peril from a contradictory instance; and it generally decides on too small a number of facts, and on those only which are at hand. [Novum Organum, I. 105]16 The fallibility of induction by simple enumeration could hardly be more clearly expressed. Bacon had no intention of retaining it and merely adding safeguards that would make its use less risky and any conclusions reached more probable. He wanted it to be discarded in favour of something entirely different:

15 See also the letter of 30 June 1622 to Fr. Redemptus Baranzan [Bacon, 1857–74, XIV, p. 375].
16 Axioms here are not the axioms of modern mathematics and logic, but rather important general principles; the term comes from Stoic logic [Frede, 1974, pp. 32–37; Kneale and Kneale, 1962, pp. 145–147].
But the induction which is to be available for the discovery and demonstration of sciences and arts, must analyse nature by proper rejections and exclusions; and then, after a sufficient number of negatives, come to a conclusion on the affirmative instances . . . But in order to furnish this induction or demonstration well and duly for its work, very many things are to be provided which no mortal has yet thought of; insomuch that greater labour will have to be spent in it than has hitherto been spent on the syllogism. [Novum Organum, I. 105] The last part of this was a warning that Bacon’s own account of this new kind of induction would (at this stage) be far from complete. He never supposed that his method could be described in detail, prior to its employment in actual investigations. The specimen given in the Novum Organum of an enquiry made using the new kind of induction was explicitly described as a First Vintage, or provisional interpretation (interpretatio inchoata, II. 20); a full account would have to wait until the final part of the Instauratio Magna, the Scientia Activa, which was never written, or indeed even begun. One thing that was clear from the start, however, is that it would be a form of eliminative induction, relying on ‘rejections and exclusions’. A great mass of merely confirming instances, however large, is never enough. Bacon’s own preliminary account of his method is given in Book II of the Novum Organum. There are three stages: the compilation of a ‘natural and experimental history’ of the nature under investigation, the ordering of this in tables, and finally induction. While the first two of these are described in considerable detail (Novum Organum, II. 10–14), the account of induction itself is strikingly brief: We must make, therefore, a complete solution and separation of nature, not indeed by fire, but by the mind, which is a kind of divine fire.
The first work therefore of true induction (as far as regards the discovery of Forms) is the rejection or exclusion of the several natures which are not found in some instance where the given nature is present, or are found in some instance where the given nature is absent, or are found to increase in some instance when the given nature decreases, or to decrease when the given nature increases. [Novum Organum, II. 16] Bacon’s theory of forms is notoriously obscure — they are certainly not the substantial forms of the Aristotelians — but it is clear that, whatever they might be in ontological terms, they are the causes of the (phenomenal) natures [Pérez-Ramos, 1988, pp. 65–132]. The form of heat is something which is present in all hot bodies, absent from all cold bodies, and which varies in intensity according to the degree of heat found in a body. The conclusion of the process of induction was described in a vivid (but opaque) metaphor taken from contemporary chemistry: ‘after the rejection and exclusion has been duly made, there will remain at the bottom, all light opinions vanishing into smoke, a Form affirmative, solid, and true and well defined’ (Novum Organum, II. 16), like a puddle of gold at the bottom of an alchemist’s crucible. Bacon’s own
comment on this is entirely apposite: ‘this is quickly said; but the way to come at it is winding and intricate.’ Bacon’s confidence that his method of eliminative induction would produce certain knowledge rested on several presuppositions, of which the most important is what Keynes subsequently termed a Principle of Limited Variety. Though the world as we experience it appears unendingly varied, all this complexity arises from the combination of a finite — indeed quite small — number of simple natures. There is an alphabet of nature,17 the contents of which cannot be guessed or discovered by speculation, but which will start to be revealed once the correct inductive procedures are employed. Bacon made no attempt to give an a priori justification of this, and there is no reason to suppose that he would have regarded any such justification as either possible or necessary. As always, validation would be retrospective — by having supplied those who employed the method correctly with power over nature.
4.2
Descartes
Induction played no significant role in Descartes’ mature philosophy, but there are some remarks on it in the early and unfinished Regulae ad Directionem Ingenii (c.1619–c.1628). Whether Descartes had read any of Bacon’s works at this stage in his life is not known — he certainly became familiar with Bacon’s thought subsequently [Clarke, 2006, p. 104] — but his account in the Regulae appears to have owed nothing whatever to the Novum Organum. In the Regulae the most certain kind of knowledge comes from intuition, a direct apprehension of the mind unmediated by any other intellectual operations. Deduction is needed because some chains of reasoning are too complex to be grasped by a single act of thought: we can grasp intuitively the link between each element in the chain and its predecessor, but not all the links between the elements at once. Induction18 was dealt with more briefly, Rule VII stating that In order to make our knowledge complete, every single thing relating to our undertaking must be surveyed in a continuous and wholly uninterrupted sweep of thought, and be included in a sufficient and well ordered enumeration [sufficienti et ordinata enumeratione]. [Descartes, 1995, I, p. 25] It would seem that for Descartes the words ‘inductio’ and ‘enumeratio’ were merely alternative names for the same thing; their equivalence is suggested by phrases such as ‘enumeratio, sive inductio’ and ‘enumerationem sive inductionem’ in the passages quoted below [Descartes, 1908, pp. 388, 389], [Marion, 1993, p. 103].

17 On this, and Bacon’s work on an Abecedarium Naturae, see [Bacon, 2000b, pp. xxix–xl, 305].
18 The word occurs three times in Rule VII [Descartes, 1908, pp. 388, 389, 390] and once in Rule XI (p. 408). There is one place in Rule III (p. 368) where ‘inductio’ appears in the first edition of 1701, but this may be a transcriber’s or printer’s error for ‘deductio’; for a discussion of the problem, see [Descartes, 1977, pp. 117–119].
The function of a sufficient enumeration is given in the explication of Rule VII: We maintain furthermore that enumeration is required for the completion of our knowledge [ad scientiae complementum]. The other Rules do indeed help us resolve most questions, but it is only with the aid of enumeration that we are able to make a true and certain judgement about whatever we apply our minds to. By means of enumeration nothing will wholly escape us and we shall be seen to have some knowledge on every question. In this context enumeration, or induction, consists in a thorough investigation of all the points relating to the problem at hand, an investigation which is so careful and accurate that we may conclude with manifest certainty that we have not inadvertently overlooked anything. So even though the object of our enquiry eludes us, provided we have made an enumeration we shall be wiser at least to the extent that we shall perceive with certainty that it could not possibly be discovered by any method known to us. [Descartes, 1995, I, pp. 25–26] If an enumeration is to lead to a negative conclusion that the knowledge of something lies entirely beyond the reach of the human mind, then it is essential that it should be ‘sufficient’: We should note, moreover, that by ‘sufficient enumeration’ or ‘induction’ [sufficientem enumerationem sive inductionem] we just mean the kind of enumeration which renders the truth of our conclusions more certain than any other kind of proof [aliud probandi genus] (simple intuition excepted) allows. But when our knowledge of something is not reducible to simple intuition and we have cast off our syllogistic fetters, we are left with this one path, which we should stick to with complete confidence. [Descartes, 1995, I, p. 26] This notion of a sufficient enumeration plays a crucial role in Descartes’ account, and it is unfortunate that his explication of it singularly fails to meet his own professed ideal of perfect clarity [Beck, 1952, p. 131]. 
It is not the same as completeness. If I wish to determine how many kinds of corporeal entities there are, I need to distinguish them from one another and make a complete enumeration of all the different kinds. But if I wish to show in the same way that the rational soul is not corporeal, there is no need for the enumeration to be complete; it will be sufficient if I group all bodies together into several classes so as to demonstrate that the rational soul cannot be assigned to any of these. [Descartes, 1995, I, pp. 26–27] The thought here appears to be that we do not need to make a complete list of all the different kinds of body: if we are merely attempting to establish a negative
thesis about what the rational soul is not, a division into several broad classes is enough. One relatively straightforward example of an enumeration is given in Rule VIII. Someone attempting to investigate all the kinds of knowledge will have to begin by considering the pure intellect, since the knowledge of everything else depends on this; then ‘among what remains he will enumerate [enumerabit] whatever instruments of knowledge we possess in addition to the intellect; and there are only two of these, namely imagination and sense perception’ [Descartes, 1995, I, p. 30]. He will make a precise enumeration [enumerabit exacte] of all the paths to truth which are open to men, so that he may follow one which is reliable. There are not so many of these that he cannot immediately discover them all by means of a sufficient enumeration [sufficientem enumerationem]. . . [Descartes, 1995, I, p. 30] Another example of an enumeration — here explicitly described as an induction — is more puzzling: To give one last example, say I wish to show by enumeration that the area of a circle is greater than the area of any other geometrical figure whose perimeter is the same length as the circle’s. I need not review every geometrical figure. If I can demonstrate that this fact holds for some particular figures, I shall be entitled to conclude by induction that the same holds true in all the other cases as well. [Descartes, 1995, I, p. 27] This does not appear to be an example of a complex proof composed of separate proofs of a finite set of more specific cases, as when a theorem about triangles in general is established by showing it to be true for acute, obtuse and right-angled triangles. There are clearly an infinite number of polygons with the same perimeter as a given circle but with smaller areas. 
How the argument is meant to proceed is not clear, but it is certainly not through an exhaustive case-by-case analysis.19 There is no further discussion of induction in the works that Descartes published. The Regulae was not printed until 1701, though copies circulated in manuscript and were read by Leibniz, and (possibly) by Locke.20 The account of knowledge that Locke gave in book IV of the Essay concerning Human Understanding certainly has close parallels with the account in the Regulae, but there is no mention at all of induction.
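The geometrical fact behind Descartes’ circle example can be stated, in modern terms — a gloss that goes beyond anything in the Regulae itself — as the isoperimetric inequality:

```latex
% Isoperimetric inequality (modern statement of the fact Descartes invokes):
% a closed plane curve of length L encloses an area A satisfying
\[ 4\pi A \;\le\; L^{2}, \]
% with equality if and only if the curve is a circle. Every polygon with
% the same perimeter as a given circle therefore has strictly smaller area,
% which is why no case-by-case survey of figures could ever be exhaustive.
```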
4.3
Gassendi
19 The result is not elementary. A non-rigorous proof, first given by Zenodorus (2nd century bc?), is preserved in Book V of Pappus’ Collections [Cuomo, 2000, pp. 61–62].
20 If Locke had seen a copy, it would have been between 1683 and 1689 when he was in exile in the Netherlands. There is no mention of this anywhere in his private papers.

Pierre Gassendi discussed induction in his Institutio Logica [Gassendi, 1981], first published in 1658 as part of his posthumous Opera Omnia. Following in the
tradition of the Prior Analytics, induction was treated as a kind of syllogism, for example: Every walking animal lives, every flying animal lives, and also every swimming animal, every creeping animal, every plant-like animal; therefore every animal lives. [Gassendi, 1981, p. 53] In such an induction there is a concealed premise: Every animal is either walking, or flying, or swimming, or creeping, or plant-like. Without this, the inference would have no force (consequutionis vis nulla foret), since if there were another kind of animal in addition to these, a false conclusion could emerge. If an induction is to be valid (legitima) it has to be based on an enumeration of all the relevant species or parts, and as Gassendi commented, such an enumeration is usually difficult if not impossible to achieve [Gassendi, 1981, p. 54]. The same account appeared with only minor changes in the French Abrégé de la philosophie de Gassendi, published by Gassendi’s disciple, François Bernier [Bernier, 1684, I, pp. 132–133].
4.4
Arnauld and Nicole
The work most strongly influenced by Descartes’ as yet unpublished Regulae was La Logique ou l’art de penser, published in 1662 by Antoine Arnauld and Pierre Nicole. Induction is introduced in traditional and broadly neutral terms: Induction occurs whenever an examination of several particular things leads us to knowledge of a general truth. Thus when we experience several seas in which the water is salty, and several rivers in which the water is fresh, we infer that in general sea water is salty and river water is fresh. [Arnauld and Nicole, 1996, p. 202] Induction is described as the beginning of all knowledge, because singular things are presented to us before universals. This sounds thoroughly Aristotelian, but the resemblance is only superficial. Though I might never have started to think about the nature of triangles if I had not seen an individual example, ‘it is not the particular examination of all triangles which allows me to draw the general and certain conclusion about all of them . . . but the mere consideration of what is contained in the idea of the triangle which I find in my mind’ [Arnauld and Nicole, 1996, p. 202]. The same is true of the very general axioms that have application in fields quite remote from geometry, for example the principle that a whole is greater than its part, the ninth and last of Euclid’s Common Notions. According to certain philosophers — un-named, but presumably Gassendi and his followers — we know this only because ever since our infancy we have observed that a man is larger than his head, a house larger than a room, a forest larger than a tree, and
so on. Arnauld and Nicole replied that ‘if we were sure of this truth . . . only from the various observations we had made since childhood, we would be sure only of its probability [nous n’en serions probablement assurés], since induction is a certain means of knowing something only when the induction is complete’ [Arnauld and Nicole, 1996, p. 247]. It is striking that several of Arnauld and Nicole’s examples of over-confident reliance on inherently fallible inductive arguments are taken from recent developments in the physical sciences. Natural philosophers had long believed that a piston could not be drawn out of a perfectly sealed syringe and that a suction pump could lift water from any depth, and they supposed these alleged truths to be founded on ‘a very certain induction based on an infinity of experiments [expériences]’ [Arnauld and Nicole, 1996, p. 203, translation modified]. Again, it was assumed that if water was contained in a curved vessel (e.g. a U-tube) of which one arm was wider than the other, the level in the two arms would be equal; experiment had shown that this was not true when one arm was very narrow, allowing capillary attraction to become significant [Arnauld and Nicole, 1996, p. 247]. It would probably be going too far to say that new discoveries in the natural sciences were the main force fuelling inductive scepticism, but they do seem to have played a part in reducing confidence in the age-old experiential data on which Aristotelian science had been based [Dear, 1995].
4.5
Hobbes and Wallis
Despite Hobbes’s strong and unswerving commitment to nominalism, inductive reasoning did not play a large role in his philosophy, and he had little to say about it. In the mid-1650s he became involved in a series of acrimonious arguments with the mathematician John Wallis [Jesseph, 1999], part of which touched on Wallis’s use of inductive arguments. Hobbes thought that induction had no place in mathematics, or at least in mathematical demonstration: The most simple way (say you) of finding this and some other Problemes, is to do the thing it self a little way, and to observe and compare the appearing Proportions, and then by Induction, to conclude it universally. Egregious Logicians and Geometricians, that think an Induction without a Numeration of all the particulars sufficient to infer a Conclusion universall, and fit to be received for a Geometricall Demonstration! [Hobbes, 1656, p. 46] Hobbes clearly thought that there were only two kinds of induction, one founded on a complete enumeration of all the particulars, which could lead to certainty, and the other founded on a partial enumeration, which could not. The remarks by Wallis that Hobbes had found so objectionable were in the Arithmetica Infinitorum of 1656, but his fullest treatment of the issue is to be found in his much later Treatise of Algebra [Wallis, 1685], where he was responding to the criticisms of a very much more capable mathematician than Hobbes, Pierre
Fermat [Stedall, 2004, pp. xxvi–xxvii]. Wallis insisted that inductive arguments have a legitimate role in mathematics: As to the thing itself, I look upon Induction as a very good Method of Investigation; as that which doth very often lead us to the easy discovery of a General Rule; or is at least a good preparative to such an one. And where the Result of such Inquiry affords to the view, an obvious discovery; it needs not (though it may be capable of it,) any further Demonstration. And so it is, when we find the Result of such Inquiry, to put us into a regular orderly Progression (of what nature soever,) which is observable to proceed according to one and the same general Process; and where there is no ground of suspicion why it should fail, or of any case which might happen to alter the course of such Process. [Wallis, 1685, p. 306] The example Wallis gave of this was the expansion of the binomial (a + e)^n:

(a + e)^2 = a^2 + 2ae + e^2,
(a + e)^3 = a^3 + 3a^2e + 3ae^2 + e^3,
(a + e)^4 = a^4 + 4a^3e + 6a^2e^2 + 4ae^3 + e^4,

and so on. The coefficients in each case can be found by repeated multiplication, but more easily by using the diagram now known as Pascal’s triangle, in which each element is the sum of the two diagonally above it in the row above:

1
1 1
1 2 1
1 3 3 1
1 4 6 4 1
1 5 10 10 5 1
The general result, that this procedure can be used for any power, is described as being established by induction. Wallis remarked: But most Mathematicians that I have seen, after such Induction continued for some few Steps, and seeing no reason to disbelieve its proceeding in like manner for the rest, are satisfied (from such evidence,) to conclude universally, and so in like manner for the consequent Powers. And such Induction hath hitherto been thought (by such as do not list to be captious) a conclusive Argument. [Wallis, 1685, p. 308] There is no indication here or anywhere else that the mathematical results reached by this kind of induction are merely probable, or in any way uncertain. The reason is clear: ‘there is, in the nature of Number, a sufficient ground for such a sequel’ [Wallis, 1685, p. 307]. Wallis was not using what has since become known as mathematical induction [Cajori, 1918]: his argument is much closer to the demonstrative induction of the later Aristotelians such as Zabarella, in which the examination of a few cases is enough to reveal the underlying regularity.
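The rule Wallis relies on can be sketched in a few lines of Python — an illustration of the triangle’s recurrence, not anything found in Wallis; the helper name `pascal_row` is ours:

```python
# A minimal sketch of the rule described above: each entry in Pascal's
# triangle is the sum of the two diagonally above it, and row n gives the
# coefficients in the expansion of (a + e)^n.

def pascal_row(n: int) -> list[int]:
    """Build row n of Pascal's triangle (row 0 is [1]) by repeated addition."""
    row = [1]
    for _ in range(n):
        # Each new row sums adjacent pairs of the previous row,
        # with an implicit 1 at each end.
        row = [1] + [row[i] + row[i + 1] for i in range(len(row) - 1)] + [1]
    return row

# Row 4 reproduces the coefficients of (a + e)^4 quoted in the text.
print(pascal_row(4))  # -> [1, 4, 6, 4, 1]
```

Checking a handful of rows and concluding that the pattern continues ‘in like manner for the rest’ is exactly the inductive step Wallis took to be conclusive.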
Wallis’s account of induction in Book III, chapter 15 of his Institutiones Logicae [Wallis, 1687, pp. 167–172] is more traditional, and much of it is concerned with the reduction of perfect inductions to various figures of the syllogism. In imperfect inductions the conclusion is described as only conjectural, or probable, on the familiar grounds that it can be overturned by a single negative instance [Wallis, 1687, p. 170]. No attempt was made to estimate any degrees of probability. Wallis insisted several times that the weakness (imbecillitas) that characterised all imperfect inductions did not lie in their form, which was that of a syllogism, but in their matter: For example, if someone argues Teeth in the upper jaw are absent in all horned animals; because it is thus in the Ox, the Sheep, the Goat, nor is it otherwise (as far as we know) in the others; Therefore (at least as far as we know) in all. This conclusion is not certain, but only probable [verisimilis]; not through a defect in the Syllogistic form, but through the uncertainty of the matter, or the truth of the premises. [Wallis, 1687, pp. 170–171] In other words an inductive argument is not a fallible inference from reliably established premises such as ‘The ox has no teeth in the upper jaw’ and ‘The sheep has no teeth in the upper jaw’, but rather a deductive inference from premises like ‘The ox, the sheep and the goat have no teeth in the upper jaw, nor is it otherwise in the other horned animals’, all of which are uncertain and are provisionally accepted merely because no counter-examples are known to exist.
4.6 Leibniz
Induction was not a central issue in Leibniz’s philosophy, but given his omnivorous intellectual curiosity, it is not surprising either that he said something or that what he had to say is of considerable interest [Rescher, 1981; 2003; Westphal, 1989]. In the preface to his New Essays on Human Understanding Leibniz explained that one point of fundamental disagreement between him and Locke concerned the existence or non-existence of innate principles in the soul. This in turn raised the question of ‘whether all truths depend on experience, that is on induction and instances, or if some of them have some other foundation’. Leibniz chose the second answer: the senses ‘never give us anything but instances, that is particular or singular truths. But however many instances confirm a general truth, they do not suffice to establish its universal necessity; for it does not follow that what has happened will always happen in the same way.’ [Leibniz, 1981, p. 49]. Our knowledge of truths of reason, such as those of arithmetic and geometry, is not based on induction at all. Not all truths are, however, truths of reason, and the other kind of truths — truths of fact — need to be discovered in a different way, at least by human beings. As Leibniz remarked in a paper ‘On the souls of men and beasts’, written around 1710:
34
J. R. Milton
there are in the world two totally different sorts of inferences, empirical and rational. Empirical inferences are common to us as well as to beasts, and consist in the fact that when sensing things that have a number of times been experienced to be connected we expect them to be connected again. Thus dogs that have been beaten a number of times when they have done something displeasing expect a beating again if they do the same thing, and therefore they avoid doing it; this they have in common with infants. [Leibniz, 2006, p. 66]

Beasts and infants are not alone in making such inferences: so too do human beings. As he noted in § 28 of the Monadology:

Men act like beasts insofar as the sequences of their perceptions are based only on the principle of memory, like empirical physicians who have a simple practice without theory. We are all mere empirics in three-fourths of our actions. For example, when we expect daylight tomorrow, we act as empirics, because this has always happened up to the present. Only the astronomer concludes it by reason. [Leibniz, 1969, p. 645, translation modified]21

There is however one difference between beasts and mere empirics: ‘beasts (as far as we can tell) are not aware of the universality of propositions . . . And although empirics are sometimes led by inductions to true universal propositions, nevertheless it only happens by accident, not by the force of consequence.’ [Leibniz, 2006, p. 67]. Human beings when relying purely on experience make generalisations that are often wrong, but beasts seem not to generalise at all. It would seem from this that there are three kinds of reasoning (using this word in a large sense): (1) inferences from particulars to other particulars, the kind of reasoning that earlier philosophers had called paradeigma or example; (2) inductive generalisation proper; and (3) deduction.
Something of this kind seems to be indicated in a note he made on the back of a draft letter dated May 1693, where a distinction is made between three grades of confirmation (firmitas): logical certainty, physical certainty, which is only logical probability, and physical probability.

The first example [is] in propositions of eternal truth, the second in propositions which are known to be true by induction, as that every man is a biped, for sometimes some are born with one foot or none; the third that the south wind brings rain, which is usually true but not infrequently false. [Couturat, 1961, p. 232]

Physical certainty is identified with moral certainty in New Essays IV. vi. 13 [Leibniz, 1981, p. 406]. The implication is that the conclusions reached by inductive inferences can at least sometimes be morally certain.

21 The parallel between beasts and empirical physicians is one that Leibniz drew several times: Principles of Nature and Grace, § 5 [Leibniz, 1969, p. 638]; New Essays, preface [Leibniz, 1981, p. 50].
An indication of how such certainty can be obtained is provided by one of the earliest expositions of Leibniz’s views on induction. The Dissertatio de Stilo Philosophico Nizolii was a preface written in 1670 for a new edition of Mario Nizzoli’s De Veris Principiis et Vera Ratione Philosophandi contra Pseudophilosophos, first published in 1553. Nizzoli (1488–1567) was an idiosyncratic thinker who has been described as a Ciceronian Ockhamist, and for Leibniz his fundamental error was his nominalism — his denial of the existence of real universals:

If universals were nothing but collections of individuals, it would follow that we could attain no knowledge through demonstration . . . but only through collecting individuals or by induction.22 But on this basis knowledge would straightway be made impossible, and the skeptics would be victorious. For perfectly universal propositions can never be established on this basis because you are never certain that all individuals have been considered. You must always stop at the proposition that all the cases which I have experienced are so. But . . . it will always remain possible that countless other cases which you have not examined are different. [Leibniz, 1969, p. 129]

Leibniz admitted that we believe confidently that fire burns, and that we will ourselves be burned if we place our hand in one, but this kind of moral certainty does not depend on induction alone and is reached only with the assistance of other universal propositions:

1. if the cause is the same or similar in all cases, the effect will be the same or similar in all;
2. the existence of a thing which is not sensed is not assumed; and, finally,
3. whatever is not assumed, is to be disregarded in practice until it is proved. [Leibniz, 1969, p. 129]

The second and third of these are methodological principles, similar though not identical to Ockham’s Razor.
The first is a more carefully worded version of Hume’s principle that ‘like causes always produce like effects’.23 Without the aid of these helping propositions (adminicula), as Leibniz called them, not even moral certainty would be possible. Our knowledge of the adminicula cannot therefore be grounded on induction: ‘For if these helping propositions, too, were derived from induction, they would need new helping propositions, and so on to infinity’ [Leibniz, 1969, p. 130]. Leibniz’s language here was Baconian,24 but his thought manifestly was not: it is much closer to Hume.

22 As the Latin (collectionem singularium, seu inductionem) makes clear, this is one process named in two ways, not two distinct processes [Leibniz, 1840, p. 70].
23 The sixth of Hume’s rules by which to judge of causes and effects, Treatise of Human Nature, I. iii. 15.
24 The Adminicula inductionis were announced as a topic of future discussion in Novum Organum, II. 21, but never described in detail.
There is an illuminating comparison to be made between this kind of sophisticated induction, buttressed by the adminicula, and the demonstrative induction described by Zabarella. In demonstrative induction the conclusion can be made certain to us because the intellect grasps the universal nature on which the truth of the universal proposition is grounded. In Leibniz the help is provided by principles of a much higher degree of generality, such as the Law of Continuity [Leibniz, 1969, pp. 351–352], and ultimately the Principle of Sufficient Reason. In the words of Foucher de Careil:

Thus Leibniz has seen that in order to be introduced into science, induction needs the help of certain universal propositions that in no way depend on it. And since there can be no obstacle to the complete and systematic unity of science except the diversity of the facts of experience, he saw that the law of continuity, which is the link between the universal and the particular, and which unites them in science, is the true basis of induction . . . without it induction is sterile, with it, it generates moral certainty. [Leibniz, 1857, p. 422]

It is the regularity of nature — i.e. the fact that it is law-governed — that makes properly conducted inductive inferences safe.

5 CONCLUSION

The story told in the pages above has been an episodic and fragmentary one, with remarks about induction extracted from the writings of authors who were almost always concerned primarily with other matters, and for whom inductive reasoning was a matter of relatively minor importance. The one significant exception was Bacon, and even his treatment of his new method of eliminative induction was remarkably brief, given its pivotal role in his programme. Even in the modern world a philosopher is not required to deal with induction at any length — or indeed at all — in order to be considered a candidate for greatness.
Given the direction of his interests, no one would have expected Nietzsche, for instance, to have focused his considerable talents on the problem, and the same is true of a large number of his predecessors. What is striking is not that many philosophers chose to concentrate on other matters, but that virtually everyone did. The notion that there is a general and far-reaching ‘problem of induction’ is relatively recent. One of the earliest and most influential uses of the phrase was in J. S. Mill’s System of Logic, III. iii. 3, where the discussion of induction concluded with the following peroration:

Why is a single instance, in some cases, sufficient for a complete induction, while in others, myriads of concurring instances, without a single exception known or presumed, go such a very little way towards establishing an universal proposition? Whoever can answer this question
knows more of the philosophy of logic than the wisest of the ancients, and has solved the problem of induction. [Mill, 1973–4, p. 314]

Whether anyone has subsequently succeeded in solving — or dissolving — the problem may be doubted, though confident (and sometimes absurd) claims have continued to be made. What does seem clear is that no one before the nineteenth century saw induction as posing a single, general problem, still less regarded a failure to solve it as being, in C. D. Broad’s often-quoted words, ‘the scandal of Philosophy’ [Broad, 1926, p. 67].

BIBLIOGRAPHY

[Agricola, 1992] R. Agricola. De Inventione Dialectica Libri Tres. Edited by L. Mundt. Tübingen: Max Niemeyer Verlag, 1992.
[Alexander, 2001] Alexander of Aphrodisias. On Aristotle Topics I. Translated by J. M. Van Ophuijsen. London: Duckworth, 2001.
[Allen, 2001] J. Allen. Inference from Signs: Ancient Debates about the Nature of Evidence. Oxford: Clarendon Press, 2001.
[Apuleius, 1987] D. Londey and C. Johanson. The Logic of Apuleius. Leiden/New York/Copenhagen/Cologne: Brill, 1987.
[Aquinas, 1970] Thomas Aquinas. Commentary on the Posterior Analytics of Aristotle. Translated by F. R. Larcher. Albany: Magi Books, 1970.
[Aristotle, 1966] Aristotle. Posterior Analytics, Topica. Translated by H. Tredennick and E. S. Forster. London/Cambridge MA: William Heinemann and Harvard University Press, 1966.
[Aristotle, 1973] Aristotle. Categories, On Interpretation, Prior Analytics. Translated by H. P. Cooke and H. Tredennick. London/Cambridge MA: William Heinemann and Harvard University Press, 1973.
[Arnauld and Nicole, 1996] A. Arnauld and P. Nicole. Logic or the Art of Thinking. Translated by J. V. Buroker. Cambridge: Cambridge University Press, 1996.
[Asmis, 1984] E. Asmis. Epicurus’ Scientific Method. Ithaca: Cornell University Press, 1984.
[Atherton, 1999] M. Atherton. The Empiricists: Critical Essays on Locke, Berkeley and Hume. Lanham MD: Rowman & Littlefield, 1999.
[Bacon, 1857–74] F. Bacon.
The Works of Francis Bacon. Collected and edited by J. Spedding, R. L. Ellis and D. D. Heath. London: Longman & Co., 1857–74.
[Bacon, 2000a] F. Bacon. Advancement of Learning. Edited by M. Kiernan. The Oxford Francis Bacon, vol. IV. Oxford: Clarendon Press, 2000.
[Bacon, 2000b] F. Bacon. The Instauratio magna: Last Writings. Edited and translated by G. Rees. The Oxford Francis Bacon, vol. XIII. Oxford: Clarendon Press, 2000.
[Bacon, 2004] F. Bacon. The Instauratio magna Part II: Novum organum and Associated Texts. Edited and translated by G. Rees and M. Wakeley. The Oxford Francis Bacon, vol. XI. Oxford: Clarendon Press, 2004.
[Barnes, 1975] J. Barnes. Aristotle’s Posterior Analytics. Oxford: Clarendon Press, 1975.
[Barnes, 1988] J. Barnes. Epicurean Signs. Oxford Studies in Ancient Philosophy, Supplementary Volume, 1988, pp. 91–134. Oxford: Clarendon Press, 1988.
[Barnes, 1997] J. Barnes. Logic and the Imperial Stoa. Leiden/New York/Cologne: Brill, 1997.
[Barnes et al., 1982] J. Barnes, J. Brunschwig, M. Burnyeat and M. Schofield. Science and Speculation: Studies in Hellenistic theory and practice. Cambridge: Cambridge University Press, 1982.
[Beck, 1952] L. J. Beck. The Method of Descartes. Oxford: Clarendon Press, 1952.
[Bernier, 1684] F. Bernier. Abregé de la philosophie de Gassendi. Lyon: Anisson, Posuel & Rigaud, 1684.
[Biard, 2001] J. Biard. The Natural Order in John Buridan. In [Thijssen and Zupko, 2001, pp. 77–96].
[Bos, 1993] E. P. Bos. A Contribution to the History of Theories of Induction in the Middle Ages. In [Jacobi, 1993, pp. 553–576].
[Broad, 1926] C. D. Broad. The Philosophy of Francis Bacon. Cambridge: Cambridge University Press, 1926.
[Broadie, 1993] A. Broadie. Introduction to Medieval Logic. Oxford: Clarendon Press, 1993.
[Burnyeat, 1982] M. F. Burnyeat. The Origins of Non-Deductive Inference. In [Barnes et al., 1982, pp. 193–238].
[Burnyeat, 1994] M. F. Burnyeat. Enthymeme: the Logic of Persuasion. In [Furley and Nehamas, 1994, pp. 3–56].
[Cajori, 1918] F. Cajori. Origin of the Name ‘Mathematical Induction’. American Mathematical Monthly, 25: 197–201, 1918.
[Caujolle-Zaslavsky, 1990] F. Caujolle-Zaslavsky. Étude préparatoire à une interprétation du sens aristotélicien d’ἐπαγωγή. In [Devereux and Pellegrin, 1990, pp. 365–387].
[Cicero, 1949] Marcus Tullius Cicero. De Inventione, De Optimo Genere Oratorum, Topica. Translated by H. M. Hubbell. London/Cambridge MA: William Heinemann and Harvard University Press, 1949.
[Clarke, 2006] D. M. Clarke. Descartes: A Biography. Cambridge: Cambridge University Press, 2006.
[Cohen, 1980] L. J. Cohen. Some Historical Remarks on the Baconian Conception of Probability. Journal of the History of Ideas, 41: 219–231, 1980.
[Couturat, 1961] L. Couturat. Opuscules et fragments inédits de Leibniz. Hildesheim: Georg Olms, 1961.
[Crombie, 1953] A. C. Crombie. Robert Grosseteste and the Origins of Experimental Science, 1100–1700. Oxford: Clarendon Press, 1953.
[Cuomo, 2000] S. Cuomo. Pappus of Alexandria and the Mathematics of Late Antiquity. Cambridge: Cambridge University Press, 2000.
[Dear, 1995] P. Dear. Discipline and Experience: The Mathematical Way in the Scientific Revolution. Chicago: University of Chicago Press, 1995.
[De Rijk, 2002] L. M. De Rijk. Aristotle: Semantics and Ontology. Volume I: General Introduction. Works on Logic. Leiden/Boston/Cologne: Brill, 2002.
[Descartes, 1908] R. Descartes. Oeuvres de Descartes, vol. 10. Edited by C. Adam and P. Tannery. Paris: J. Vrin, 1908.
[Descartes, 1977] R. Descartes.
Règles utiles et claires pour la direction de l’esprit en la recherche de la vérité. Edited by J.-L. Marion. The Hague: Martinus Nijhoff, 1977.
[Descartes, 1985] R. Descartes. The Philosophical Writings of Descartes. Edited by J. Cottingham, R. Stoothoff and D. Murdoch. Cambridge: Cambridge University Press, 1985.
[Devereux and Pellegrin, 1990] D. Devereux and P. Pellegrin. Biologie, logique et métaphysique chez Aristote: actes du séminaire C.N.R.S.–N.S.F. Paris: Éditions du C.N.R.S., 1990.
[Dillon, 1993] J. Dillon. Alcinous: The Handbook of Platonism. Oxford: Clarendon Press, 1993.
[Diogenes Laertius, 1980] Diogenes Laertius. Lives of Eminent Philosophers. Translated by R. D. Hicks. London/Cambridge MA: William Heinemann and Harvard University Press, 1980.
[Engberg-Pedersen, 1979] T. Engberg-Pedersen. More on Aristotelian Epagoge. Phronesis, 24: 301–319, 1979.
[Franklin, 2001] J. Franklin. The Science of Conjecture. Baltimore/London: Johns Hopkins University Press, 2001.
[Frede, 1974] M. Frede. Die Stoische Logik. Göttingen: Vandenhoeck & Ruprecht, 1974.
[Furley and Nehamas, 1994] D. J. Furley and A. Nehamas, editors. Aristotle’s Rhetoric: Philosophical Essays. Princeton: Princeton University Press, 1994.
[Gassendi, 1981] Pierre Gassendi’s Institutio Logica (1658). Edited and translated by H. Jones. Assen: Van Gorcum, 1981.
[Gaukroger, 2001] S. Gaukroger. Francis Bacon and the Transformation of Early-modern Philosophy. Cambridge: Cambridge University Press, 2001.
[Glucker, 1995] J. Glucker. Probabile, Veri Simile and Related Terms. In [Powell, 1995, pp. 115–144].
[Grendler, 2002] P. F. Grendler. The Universities of the Italian Renaissance. Baltimore/London: Johns Hopkins University Press, 2002.
[Grosseteste, 1981] R. Grosseteste. Commentarius in Posteriorum Analyticorum Libros. Edited by P. Rossi. Firenze: L. S. Olschki, 1981.
[Hackett, 2004] J. Hackett. Robert Grosseteste and Roger Bacon on the Posterior Analytics. In [Lutz-Bachmann et al., 2004, pp. 161–212].
[Hacking, 1975] I. Hacking. The Emergence of Probability. Cambridge: Cambridge University Press, 1975.
[Halm, 1863] K. F. von Halm. Rhetores Latini Minores. Leipzig: B. G. Teubner, 1863.
[Hamlyn, 1976] D. Hamlyn. Aristotelian Epagoge. Phronesis, 21: 167–184, 1976.
[Hobbes, 1656] T. Hobbes. Six Lessons To the Professors of the Mathematiques. London: Andrew Crook, 1656.
[Howson, 2000] C. Howson. Hume’s Problem: Induction and the Justification of Belief. Oxford: Clarendon Press, 2000.
[Jacobi, 1993] Argumentationstheorie: Scholastische Forschungen zu den logischen und semantischen Regeln korrekten Folgerns. Edited by K. Jacobi. Leiden/New York/Cologne: Brill, 1993.
[Jesseph, 1999] D. M. Jesseph. Squaring the Circle: The War between Hobbes and Wallis. Chicago: University of Chicago Press, 1999.
[Kneale and Kneale, 1962] W. Kneale and M. Kneale. The Development of Logic. Oxford: Clarendon Press, 1962.
[Kosman, 1973] L. A. Kosman. Understanding, Explanation and Insight in Aristotle’s Posterior Analytics. In [Lee et al., 1973, pp. 374–392].
[Lagerlund, 2000] H. Lagerlund. Modal Syllogistics in the Middle Ages. Leiden/Boston/Cologne: Brill, 2000.
[Lameer, 1994] J. Lameer. Al-Farabi and Aristotelian Syllogistics. Leiden/New York/Cologne: Brill, 1994.
[Lee et al., 1973] E. N. Lee, A. P. D. Mourelatos and R. M. Rorty. Exegesis and Argument. Assen: Van Gorcum, 1973.
[Leibniz, 1840] G. W. Leibniz. Opera Philosophica quae extant Latina Gallica Germanica omnia. Edited by J. E. Erdmann. Berlin: G. Eichler, 1840.
[Leibniz, 1857] G. W. Leibniz. Nouvelles lettres et opuscules de Leibniz. Edited by A. Foucher de Careil. Paris: Auguste Durand, 1857.
[Leibniz, 1969] G. W. Leibniz. Philosophical Papers and Letters. Translated by L. E. Loemker. Dordrecht/Boston/London: Reidel, 1969.
[Leibniz, 1980] G. W. Leibniz. Philosophische Schriften, Band 2: 1663–1672. Edited by H. Schepers, W. Kabitz and W. Schneiders. Berlin: Akademie Verlag, 1980.
[Leibniz, 1981] G. W. Leibniz.
New Essays on Human Understanding. Translated by P. Remnant and J. Bennett. Cambridge: Cambridge University Press, 1981.
[Leibniz, 2006] G. W. Leibniz. The Shorter Leibniz Texts: A Collection of New Translations. Edited and translated by L. H. Strickland. London: Continuum, 2006.
[Locke, 1706] J. Locke. Posthumous Works of Mr. John Locke. London: A. and J. Churchill, 1706.
[Locke, 1975] J. Locke. An Essay concerning Human Understanding. Edited by P. H. Nidditch. Oxford: Clarendon Press, 1975.
[Lutz-Bachmann et al., 2004] M. Lutz-Bachmann, A. Fidora and P. Antolic. Erkenntnis und Wissenschaft: Probleme der Epistemologie in der Philosophie des Mittelalters. Berlin: Akademie Verlag, 2004.
[Mack, 1993] P. Mack. Renaissance Argument: Valla and Agricola in the Traditions of Rhetoric and Dialectic. Leiden/New York/Cologne: Brill, 1993.
[Malherbe, 1996] M. Malherbe. Bacon’s Method to Science. In [Peltonen, 1996, pp. 75–98].
[Marion, 1993] J.-L. Marion. Sur l’ontologie grise de Descartes. Paris: Vrin, 1993.
[Marrone, 1986] S. P. Marrone. Robert Grosseteste on the Certitude of Induction. In [Wenin, 1986, Volume II, pp. 481–488].
[McGinnis, 2003] J. McGinnis. Scientific Methodologies in Medieval Islam. Journal of the History of Philosophy, 41: 307–327, 2003.
[McGinnis, 2008] J. McGinnis. Avicenna’s Naturalized Epistemology and Scientific Method. In [Rahman et al., 2008, pp. 129–152].
[McGinnis and Reisman, 2007] J. McGinnis and D. C. Reisman, editors. Classical Arabic Philosophy. Indianapolis/Cambridge: Hackett, 2007.
[McKirahan, 1992] R. D. McKirahan Jr. Principles and Proofs: Aristotle’s Theory of Demonstrative Science. Princeton: Princeton University Press, 1992.
[McPherran, 2007] M. L. McPherran. Socratic Epagoge and Socratic Induction. Journal of the History of Philosophy, 45: 347–364, 2007.
[Mill, 1973–4] J. S. Mill. A System of Logic, Ratiocinative and Inductive. Edited by J. M. Robson and R. F. McRae. Collected Works of John Stuart Mill, vols. VII and VIII. Toronto: University of Toronto Press, 1973, 1974.
[Milton, 1987] J. R. Milton. Induction before Hume. British Journal for the Philosophy of Science, 38: 49–74, 1987.
[Monfasani, 1990] J. Monfasani. Lorenzo Valla and Rudolph Agricola. Journal of the History of Philosophy, 28: 181–200, 1990.
[Ockham, 1974] William of Ockham. Summa Logicae. Edited by P. Boehner, G. Gál and S. Brown. Opera Philosophica, vol. I. St Bonaventure, NY: Franciscan Institute, 1974.
[Okasha, 2001] S. Okasha. What did Hume really show about Induction? Philosophical Quarterly, 51: 307–327, 2001.
[Oliver, 2004] S. Oliver. Robert Grosseteste on Light, Truth and Experimentum. Vivarium, 42: 151–180, 2004.
[Peltonen, 1996] M. Peltonen. The Cambridge Companion to Bacon. Cambridge: Cambridge University Press, 1996.
[Pérez-Ramos, 1988] A. Pérez-Ramos. Francis Bacon’s Idea of Science and the Maker’s Knowledge Tradition. Oxford: Clarendon Press, 1988.
[Plato, 1953] The Dialogues of Plato. Translated by B. Jowett. 4th Edition. Oxford: Clarendon Press, 1953.
[Powell, 1995] J. G. F. Powell, editor. Cicero the Philosopher. Oxford: Clarendon Press, 1995.
[Quintilian, 1921] Marcus Fabius Quintilianus. Institutio Oratoria. Edited by H. E. Butler. London/Cambridge MA: William Heinemann and Harvard University Press, 1921.
[Rahman et al., 2008] S. Rahman, A. Street and H. Tahiri, editors. The Unity of Science in the Arabic Tradition. New York/London: Springer, 2008.
[Rescher, 1981] N. Rescher. Inductive Reasoning in Leibniz. In N. Rescher, Leibniz’s Metaphysics of Nature, pp. 20–28. Dordrecht/Boston/London: Reidel, 1981.
[Rescher, 2003] N. Rescher. The Epistemology of Inductive Reasoning in Leibniz. In N. Rescher, On Leibniz, pp. 117–126. Pittsburgh: University of Pittsburgh Press, 2003.
[Robinson, 1953] R. Robinson.
Plato’s Earlier Dialectic. Oxford: Clarendon Press, 1953.
[Ross, 1949] Aristotle’s Prior and Posterior Analytics. Introduction and commentary by W. D. Ross. Oxford: Clarendon Press, 1949.
[Schmitt, 1969] C. B. Schmitt. Experience and Experiment: A Comparison of Zabarella’s View with Galileo’s in De motu. Studies in the Renaissance, 16: 80–138, 1969.
[Sedley, 1982] D. Sedley. On Signs. In [Barnes et al., 1982, pp. 239–272].
[Serene, 1979] E. F. Serene. Robert Grosseteste on Induction and Demonstrative Science. Synthese, 40: 97–115, 1979.
[Sextus, 1967] Sextus Empiricus. Outlines of Pyrrhonism. Translated by R. G. Bury. London/Cambridge MA: William Heinemann and Harvard University Press, 1967.
[Spinoza, 2004] B. Spinoza. A Theologico-Political Treatise and A Political Treatise. New York: Dover Books, 2004.
[Stedall, 2004] J. A. Stedall. The Arithmetic of Infinitesimals: John Wallis 1656. New York/London: Springer, 2004.
[Stove, 1973] D. C. Stove. Probability and Hume’s Inductive Scepticism. Oxford: Clarendon Press, 1973.
[Stump, 1978] E. Stump. Boethius’ De topicis differentiis: Translated with notes and essays on the text. Ithaca/London: Cornell University Press, 1978.
[Thijssen, 1987] J. M. M. H. Thijssen. John Buridan and Nicholas of Autrecourt on Causality and Induction. Traditio, 43: 237–255, 1987.
[Thijssen and Zupko, 2001] J. M. M. H. Thijssen and J. Zupko, editors. The Metaphysics and Natural Philosophy of John Buridan. Leiden/Boston/Cologne: Brill, 2001.
[Upton, 1981] T. V. Upton. A Note on Aristotelian epagoge. Phronesis, 26: 172–176, 1981.
[Valla, 1982] L. Valla. Repastinatio Dialectice et Philosophice. Edited by G. Zippel. Padova: Editrice Antenore, 1982.
[Vico, 1996] G. Vico. The Art of Rhetoric: (Institutiones Oratoriae, 1711–1741). Edited by G. Crifò; translated by G. A. Pinton and A. W. Shippee. Amsterdam/Atlanta: Editions Rodopi, 1996.
[Vives, 1979] J. L. Vives. Against the Pseudodialecticians: A Humanist Attack on Medieval Logic.
Translated with an introduction by R. Guerlac. Dordrecht/Boston/London: Reidel, 1979.
[Wallis, 1685] J. Wallis. A Treatise of Algebra, Both Historical and Practical. London: John Playford, 1685.
[Wallis, 1687] J. Wallis. Institutio Logicae, Ad communes usus accommodata. Oxford: E Theatro Sheldoniano, 1687.
[Weinberg, 1965] J. R. Weinberg. Abstraction, Relation, and Induction: Three Essays in the History of Thought. Madison/Milwaukee: University of Wisconsin Press, 1965.
[Wenin, 1986] L’homme et son univers au Moyen Âge: Actes du septième congrès international de philosophie médiévale. Edited by C. Wenin. Philosophes médiévaux, 26–27. Louvain-la-Neuve: Éditions de l’Institut supérieur de philosophie, 1986.
[Westphal, 1989] J. Westphal. Leibniz and the Problem of Induction. Studia Leibnitiana, 21: 174–187, 1989.
[Winkler, 1999] K. Winkler. Hume’s Inductive Skepticism. In [Atherton, 1999, pp. 183–212].
[Zabarella, 1608] J. Zabarella. Opera Logica. Frankfurt: Lazarus Zetzner, 1608; reprinted Hildesheim: G. Olms, 1966.
HUME AND THE PROBLEM OF INDUCTION

Marc Lange
1 INTRODUCTION
David Hume first posed what is now commonly called “the problem of induction” (or simply “Hume’s problem”) in 1739 — in Book 1, Part iii, section 6 (“Of the inference from the impression to the idea”) of A Treatise of Human Nature (hereafter T). In 1748, he gave a pithier formulation of the argument in Section iv (“Sceptical doubts concerning the operations of the understanding”) of An Enquiry Concerning Human Understanding (E).1 Today Hume’s simple but powerful argument has attained the status of a philosophical classic. It is a staple of introductory philosophy courses, annually persuading scores of students of either the enlightening or the corrosive effect of philosophical inquiry — since the argument appears to undermine the credentials of virtually everything that passes for knowledge in their other classes (mathematics notably excepted2). According to the standard interpretation, Hume’s argument purports to show that our opinions regarding what we have not observed have no justification. The obstacle is irremediable; no matter how many further observations we might make, we would still not be entitled to any opinions regarding what we have not observed. Hume’s point is not the relatively tame conclusion that we are not warranted in making any predictions with total certainty. Hume’s conclusion is more radical: that we are not entitled to any degree of confidence whatever, no matter how slight, in any predictions regarding what we have not observed. We are not justified in having 90% confidence that the sun will rise tomorrow, or in having 70% confidence, or even in being more confident that it will rise than that it will not. There is no opinion (i.e., no degree of confidence) that we are entitled to have regarding a claim concerning what we have not observed. This conclusion “leaves not the lowest degree of evidence in any proposition” that goes beyond our present observations and memory (T, p. 267).
Our justified opinions must be “limited to the narrow sphere of our memory and senses” (E, p. 36).

1 All page references to the Treatise are to [Hume, 1978]. All page references to the Enquiry are to [Hume, 1977].
2 However, even in mathematics, inductive logic is used, as when we take the fact that a computer search program has found no violation of Goldbach’s conjecture up to some enormously high number as evidence that Goldbach’s conjecture is true even for higher numbers. For more examples, see [Franklin, 1987]. Of course, such examples of inductive logic in mathematics must be sharply distinguished from “mathematical induction”, which is a form of deductive reasoning.
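The kind of computer search mentioned in footnote 2 can be sketched as follows; this is a minimal illustration with a deliberately small bound, not any particular historical program. Finding no counterexample below the bound is exactly the sort of inductive evidence at issue:

```python
def is_prime(n):
    """Trial division, adequate for small n."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def goldbach_witness(n):
    """Return a pair of primes (p, n - p) summing to the even number n,
    or None if no such pair exists (a counterexample to the conjecture)."""
    for p in range(2, n // 2 + 1):
        if is_prime(p) and is_prime(n - p):
            return (p, n - p)
    return None

# Inductive evidence: no counterexample among the even numbers checked.
counterexamples = [n for n in range(4, 10_000, 2) if goldbach_witness(n) is None]
print(counterexamples)  # → []
```

An empty list here supports the conjecture for the unchecked cases only inductively, in contrast to a proof by mathematical induction, which would be deductive.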
Handbook of the History of Logic. Volume 10: Inductive Logic. Volume editors: Dov M. Gabbay, Stephan Hartmann and John Woods. General editors: Dov M. Gabbay and John Woods. © 2011 Elsevier BV. All rights reserved.
Hume’s problem has not gained its notoriety merely from Hume’s boldness in denying the epistemic credentials of all of the proudest products of science (and many of the humblest products of common-sense). It takes nothing for someone simply to declare himself unpersuaded by the evidence offered for some prediction. Hume’s problem derives its power from the strength of Hume’s argument that it is impossible to justify reposing even a modest degree of confidence in any of our predictions. Again, it would be relatively unimpressive to argue that since a variety of past attempts to justify inductive reasoning have failed, there is presumably no way to justify induction and hence, it seems, no warrant for the conclusions that we have called upon induction to support. But Hume’s argument is much more ambitious. Hume purports not merely to show that various, apparently promising routes to justifying induction all turn out to fail, but also to exclude every possible route to justifying induction. Naturally, many philosophers have tried to find a way around Hume’s argument — to show that science and common-sense are justified in making predictions inductively. Despite these massive efforts, no response to date has received widespread acceptance. Inductive reasoning remains (in C.D. Broad’s famous apothegm) “the glory of Science” and “the scandal of Philosophy” [Broad, 1952, p. 143]. Some philosophers have instead embraced Hume’s conclusion but tried to characterize science so that it does not involve our placing various degrees of confidence in various predictions. For example, Karl Popper has suggested that although science refutes general hypotheses by finding them to be logically inconsistent with our observations, science never confirms (even to the smallest degree) the predictive accuracy of a general hypothesis. 
Science has us make guesses regarding what we have not observed by using those general hypotheses that have survived the most potential refutations despite sticking their necks out furthest, and we make these guesses even though we have no good reason to repose any confidence in their truth:

I think that we shall have to get accustomed to the idea that we must not look upon science as a ‘body of knowledge,’ but rather as a system of hypotheses; that is to say, a system of guesses or anticipations which in principle cannot be justified, but with which we work as long as they stand up to tests, and of which we are never justified in saying that we know that they are ‘true’ or ‘more or less certain’ or even ‘probable’. [Popper, 1959, p. 317; cf. Popper, 1972]

However, if we are not justified in having any confidence in a prediction’s truth, then it is difficult to see how it could be rational for us to rely upon that prediction [Salmon, 1981]. Admittedly, “that we cannot give a justification . . . for our guesses does not mean that we may not have guessed the truth.” [Popper, 1972, p. 30] But if we have no good reason to be confident that we have guessed the truth, then we would seem no better justified in being guided by the predictions of theories that have passed their tests than in the predictions of theories that have failed their
Hume and the Problem of Induction
tests. There would seem to be no grounds for calling our guesswork “rational”, as Popper does. Furthermore, Popper’s interpretation of science seems inadequate. Some philosophers, such as van Fraassen [1981; 1989], have denied that science confirms the truth of theories about unobservable entities (such as electrons and electric fields), the truth of hypotheses about the laws of nature, or the truth of counterfactual conditionals (which concern what would have happened under circumstances that actually never came to pass — for example, “Had I struck the match, it would have lit”). But these philosophers have argued that these pursuits fall outside of science because we need none of them in order to confirm the empirical adequacy of various theories, a pursuit that is essential to science. So even these interpretations of science are not nearly as austere as Popper’s, according to which science fails to accumulate evidence for empirical predictions.

In this essay, I will devote sections 2, 3, and 4 to explaining Hume’s argument and offering some criticism of it. In section 6, I will look at the conclusion that Hume himself draws from it. In sections 5 and 7-11, I will review critically a few of the philosophical responses to Hume that are most lively today.[3]
2
TWO PROBLEMS OF INDUCTION
Although Hume never uses the term “induction” to characterize his topic, today Hume’s argument is generally presented as targeting inductive reasoning: any of the kinds of reasoning that we ordinarily take as justifying our opinions regarding what we have not observed. Since Hume’s argument exploits the differences between induction and deduction, let’s review them. For the premises of a good deductive argument to be true, but its conclusion to be false, would involve a contradiction. (In philosophical jargon, a good deductive argument is “valid”.) For example, a geometric proof is deductive since the truth of its premises ensures the truth of its conclusion by a maximally strong (i.e., “logical”) guarantee: on pain of contradiction! That deduction reflects the demands of non-contradiction (a semantic point) has a metaphysical consequence — in particular, a consequence having to do with necessity and possibility. A contradiction could not come to pass; it is impossible. So it is impossible for the premises of a good deductive argument to be true but its conclusion to be false. (That is why deduction’s “guarantee” is maximally strong.) It is impossible for a good deductive argument to take us from a truth to a falsehood (i.e., to fail to be “truth-preserving”) because such failure would involve a contradiction and contradictions are impossible. A good deductive argument is necessarily truth-preserving.

In contrast, no contradiction is involved in the premises of a good inductive argument being true and its conclusion being false. (Indeed, as we all know, this

[Footnote 3] My critical review is hardly exhaustive. For an admirable discussion of some responses to Hume in older literature that I neglect, see [Salmon, 1967].
Marc Lange
sort of thing is a familiar fact of life; our expectations, though justly arrived at by reasoning inductively from our observations, sometimes fail to be met.) For example, no matter how many human cells we have examined and found to contain proteins, there would be no contradiction between our evidence and a given as yet unobserved human cell containing no proteins. No contradiction is involved in a good inductive argument’s failure to be truth-preserving. Once again, this semantic point has a metaphysical consequence if every necessary truth is such that its falsehood involves a contradiction (at least implicitly): even if a given inductive argument is in fact truth-preserving, it could have failed to be. It is not necessarily truth-preserving.[4]

These differences between deduction and induction lead to many other differences. For example, the goodness of a deductive argument does not come in degrees; all deductive arguments are equally (and maximally) strong. In contrast, some inductive arguments are more powerful than others. Our evidence regarding the presence of oxygen in a room that we are about to enter is much stronger than our evidence regarding the presence of oxygen in the atmosphere of a distant planet, though the latter evidence may still be weighty. As we examine more (and more diverse) human cells and find proteins in each, we are entitled to greater confidence that a given unobserved human cell also contains proteins; the inductive argument grows stronger. Furthermore, since the premises of a good deductive argument suffice to ensure its conclusion on pain of contradiction, any addition to those premises is still enough to ensure the conclusion on pain of contradiction. In contrast, by adding to the premises of a good inductive argument, its strength may be augmented or diminished.
By adding to our stock of evidence the discovery of one human cell that lacks proteins, for example, we may reduce the strength of our inductive argument for the prediction of proteins in a given unobserved human cell. That inductive arguments are not deductive — that they are not logically guaranteed to be truth-preserving — plays an important part in Hume’s argument (as we shall see in a moment). But the fact that the premises of a good inductive argument cannot give the same maximal guarantee to its conclusion as the premises of a deductive argument give to its conclusion should not by itself be enough to cast doubt on the cogency of inductive reasoning. That the premises of an inductive argument fail to “demonstrate” the truth of its conclusion (i.e., to show that the conclusion could not possibly be false, given the premises) does not show that its premises fail to confirm the truth of its conclusion — to warrant us (if

[Footnote 4] Sometimes it is said that since the conclusion of a good deductive argument is true given the premises on pain of contradiction, the conclusion is implicitly contained in the premises. A good deductive argument is not “ampliative”. It may make explicit something that was already implicit in the premises, and so we may learn things through deduction, but a deductive argument does not yield conclusions that “go beyond” its premises. In contrast, a good inductive argument is ampliative; it allows us to “go beyond” the evidence in its premises. This “going beyond” is a metaphor that can be cashed out either semantically (the contrary of an inductive argument’s conclusion does not contradict its premises) or metaphysically (it is possible for the conclusion to be false and the premises true).
our belief in the premises is justified) in placing greater confidence (perhaps even great confidence) in the conclusion. (Recall that Hume purports to show that even modest confidence in the conclusions reached by induction is not justified.) That the conclusion of a given inductive argument can be false, though its premises are true, does not show that its premises fail to make its conclusion highly plausible. In short, inductive arguments take risks in going beyond our observations. Of course, we all know that some risks are justified, whereas others are unwarranted. The mere fact that inductive arguments take risks does not automatically show that the risks they take are unreasonable. But Hume (as it is standard to interpret him) purports to show that we are not justified in taking the risks that inductive inferences demand. That induction is fallible does not show that inductive risks cannot be justified. To show that, we need Hume’s argument. It aims to show that any scheme purporting to justify taking those risks must fail.

It is important to distinguish two questions that could be asked about the risks we take in having opinions that go beyond the relatively secure ground of what we observe:

1. Why are we justified in going beyond our observations at all?

2. Why are we justified in going beyond our observations in a certain specific way: by having the opinions endorsed by inductive reasoning?

To justify our opinions regarding what we have not observed, it would not suffice merely to justify having some opinions about the unobserved rather than none at all. There are many ways in which we could go beyond our observations. But we believe that only certain ways of doing so are warranted. Our rationale for taking risks must be selective: it must reveal that certain risks are worthy of being taken whereas others are unjustified [Salmon, 1967, p. 47].
In other words, an adequate justification of induction must justify induction specifically; it must not apply equally well to all schemes, however arbitrary or cockeyed, for going beyond our observations. For example, an adequate justification of induction should tell us that as we examine more (and more diverse) human cells and find proteins in each, we are entitled (typically) to greater confidence that a given unobserved human cell also contains proteins, but not instead to lesser confidence in this prediction — and also not to greater confidence that a given unobserved banana is ripe. In short, an answer to the first of the two questions above that does not privilege induction, but merely supports our taking some risks rather than none at all, fails to answer the second question. An adequate justification of induction must favor science over guesswork, wishful thinking, necromancy, or superstition; it cannot place them on a par.

3

HUME’S FORK: THE FIRST OPTION

Consider any inductive argument. Its premises contain the reports of our observations. Its conclusion concerns something unobserved. It may be a prediction
regarding a particular unobserved case (e.g., that a given human cell contains proteins), a generalization concerning all unobserved cases of a certain kind (that all unobserved human cells contain proteins), or a generalization spanning all cases observed and unobserved (that all human cells contain proteins) — or even something stronger (that it is a law of nature that all human cells contain proteins). Although Hume’s argument is not limited to relations of cause and effect, Hume typically gives examples in which we observe a cause (such as my eating bread) and draw upon our past experiences of events that have accompanied similar causes (our having always derived nourishment after eating bread in the past) to confirm that a similar event (my deriving nourishment) will occur in this case. Another Hume favorite involves examples in which a body’s presenting a certain sensory appearance (such as the appearance of bread) and our past experiences confirm that the body possesses a certain disposition (a “secret power,” such as to nourish when eaten). How can the premises of this inductive argument justify its conclusion? Hume says that in order for the premises to justify the conclusion, we must be able to reason from the premises to the conclusion in one of two ways: All reasonings may be divided into two kinds, namely demonstrative reasoning, or that concerning relations of ideas, and moral reasoning, or that concerning matter of fact and existence. (E, p. 22) By “demonstrative reasoning”, Hume seems to mean deduction. As we have seen, deduction concerns “relations of ideas” in that a deductive argument turns on semantic relations: certain ideas contradicting others. Then “moral reasoning, or that concerning matter of fact and existence” would apparently have to be induction. 
(“Moral reasoning”, in the archaic sense that Hume uses here, does not refer specifically to reasoning about right and wrong; “moral reasoning” could, in the strongest cases, supply “moral certainty”, a degree of confidence beyond any reasonable doubt but short of the “metaphysical certainty” that a proof supplies.[5]) I will defer the half of Hume’s argument concerned with non-demonstrative reasoning until the next section.

Is there a deductive argument taking us from the premises of our inductive argument about bread (and only those premises) to the argument’s conclusion? We cannot think of one. But this does not show conclusively that there isn’t one. As we all know from laboring over proofs to complete our homework assignments for high-school geometry classes, we sometimes fail to see how a given conclusion can be deduced from certain premises even when there is actually a way to do it. But Hume argues that even if we used greater ingenuity, we could not find a way to reason deductively from an inductive argument’s premises to its conclusion. No way exists. Here is a reconstruction of Hume’s argument:

If the conclusion of an inductive argument could be deduced from its premises, then the falsehood of the conclusion would contradict the truth of the premises. But the falsehood of its conclusion does not contradict the truth of its premises. So the conclusion of an inductive argument cannot be deduced from its premises.

How does Hume know that the conclusion’s falsehood does not contradict the truth of the argument’s premises? Hume says that we can form a clear idea of the conclusion’s being false along with the premises being true, and so this state of affairs must involve no contradiction. Here is the argument in some of Hume’s words:

The bread, which I formerly eat, nourished me; that is, a body of such sensible qualities, was, at that time, endued with such secret powers: But does it follow, that other bread must also nourish me at another time, and that like sensible qualities must always be attended with like secret powers? The consequence seems nowise necessary. . . . That there are no demonstrative arguments in the case, seems evident; since it implies no contradiction, that the course of nature may change, and that an object, seemingly like those which we have experienced, may be attended with different or contrary effects. May I not clearly and distinctly conceive, that a body, falling from the clouds, and which, in all other respects, resembles snow, has yet the taste of salt or the feeling of fire? . . . Now whatever is intelligible, and can be distinctly conceived, implies no contradiction, and can never be proved false by any demonstrative argument or abstract reasoning a priori. (E, pp. 21-2)

This passage takes us to another way to understand “Hume’s fork”: the choice he offers us between two different kinds of reasoning for taking us from an inductive argument’s premises to its conclusion.

[Footnote 5] For other examples of this usage, see the seventh definition of the adjective “moral” in The Oxford English Dictionary.
We may interpret “demonstrative reasoning” as reasoning a priori (that is, reasoning where the step from premises to conclusion makes no appeal to what we learn from observation) and reasoning “concerning matter of fact and existence” as reasoning empirically (that is, where the step from premises to conclusion depends on observation). According to Hume, we can reason a priori from p to q only if “If p, then q” is necessary — i.e., only if it could not have been false. That is because if it could have been false, then in order to know that it is true, we must check the actual world — that is, make some observations. If “If p, then q” is not a necessity but merely happens to hold (i.e., holds as a “matter of fact”), then we must consult observations in order to know that it is the case. So reasoning “concerning matter of fact and existence” must be empirical. Now Hume can argue once again that by reasoning a priori, we cannot infer from an inductive argument’s premises to its conclusion. Here is a reconstruction:
If we could know a priori that the conclusion of an inductive argument is true if its premises are true, then it would have to be necessary for the conclusion to be true if the premises are true. But it is not necessary for the conclusion to be true if the premises are true. So we cannot know a priori that the conclusion of an inductive argument is true if its premises are true.

Once again, Hume defends the middle step on the grounds that we can clearly conceive of the conclusion being false while the premises are true. Hence, there is no contradiction in the conclusion’s being false while the premises are true, and so it is not necessary for the conclusion to be true if the premises are true. Hume says:

To form a clear idea of any thing, is an undeniable argument for its possibility, and is alone a refutation of any pretended demonstration against it. (T, p. 89; cf. T, pp. 233, 250)

Of course, if the premises of our inductive argument included not just that bread nourished us on every past occasion when we ate some, but also that all bread is alike in nutritional value, then there would be an a priori argument from the premises to the conclusion. It would be a contradiction for all bread to be nutritionally alike, certain slices of bread to be nutritious, but other slices not to be nutritious. However, Hume would ask, how could we know that all bread is alike in nutritional value? That premise (unlike the others) has not been observed to be true. It cannot be inferred a priori to be true, given our observations, since its negation involves no contradiction with our observations.

Earlier I said that Hume aims to show that we are not entitled even to the smallest particle of confidence in our predictions about (contingent features of) what we have not observed. But the arguments I have just attributed to Hume are directed against conclusions of the form “p obtains”, not “p is likely to obtain” or “p is more likely to obtain than not to obtain”.
How does Hume’s argument generalize to cover these conclusions? How does it generalize to cover opinions short of full belief — opinions involving a degree of confidence less than certainty? The appropriate way to extend Hume’s reasoning depends on what it is to have a degree of belief that falls short of certainty. Having such a degree of belief in p might be interpreted as equivalent to (or at least as associated with) having a full belief that p has a given objective chance of turning out to be true, as when our being twice as confident that a die will land on 1, 2, 3, or 4 as that it will land on 5 or 6 is associated with our believing that the die has twice the objective chance of landing on 1, 2, 3, or 4 as of landing on 5 or 6. As Hume says,
surpasses the opposite chances, the probability receives a proportionable increase, and begets still a higher degree of belief or assent to that side, in which we discover the superiority. If a die were marked with one figure or number of spots on four sides, and with another figure or number of spots on the two remaining sides, it would be more probable, that the former would turn up than the latter; though, if it had a thousand sides marked in the same manner, and only one side different, the probability would be much higher, and our belief or expectation of the event more steady and secure. (E, p. 37; cf. T, p. 127)

Suppose, then, that our having n% confidence in p must be accompanied by our believing that p has n% chance of obtaining. Then Hume could argue that since there is no contradiction in the premises of an inductive argument being true even while its conclusion lacks n% chance of obtaining — for any n, no matter how low — we cannot proceed a priori from an inductive argument’s premises to even a modest degree of confidence in its conclusion. For example, there is no contradiction in a die’s having landed on 1, 2, 3, or 4 twice as often as it has landed on 5 or 6 in the many tosses that we have already observed, but its not having twice the chance of landing on 1, 2, 3, or 4 as on 5 or 6; the die could even be strongly biased (by virtue of its mass distribution) toward landing on 5 or 6, but nevertheless have happened “by chance” to land twice as often on 1, 2, 3, or 4 as on 5 or 6 in the tosses that we have observed. Hume sometimes seems simply to identify our having n% confidence in p with our believing that p has n% chance of obtaining.
For example, he considers this plausible idea:

Shou’d it be said, that tho’ in an opposition of chances ‘tis impossible to determine with certainty, on which side the event will fall, yet we can pronounce with certainty, that ‘tis more likely and probable, ‘twill be on that side where there is a superior number of chances, than where there is an inferior: . . . (T, p. 127)

Though we are not justified in having complete confidence in a prediction (e.g., that the die’s next toss will land on 1, 2, 3, or 4), we are entitled to a more modest degree of belief in it. (One paragraph later, he characterizes confidence in terms of “degrees of stability and assurance”.) He continues:

Shou’d this be said, I wou’d ask, what is here meant by likelihood and probability? The likelihood and probability of chances is a superior number of equal chances; and consequently when we say ‘tis likely the event will fall on the side, which is superior, rather than on the inferior, we do no more than affirm, that where there is a superior number of chances there is actually a superior, and where there is an inferior there is an inferior; which are identical propositions, and of no consequence. (T, p. 127)
In other words, to have a greater degree of confidence that the die’s next toss will land on 1, 2, 3, or 4 than that it will land on 5 or 6 is nothing more than to believe that the former has a greater chance than the latter. However, it is not the case that our having n% confidence that p must be accompanied by our believing that p has n% chance of obtaining. Though the outcomes of die tosses may be governed by objective chances (dice, after all, is a “game of chance”), some of our predictions concern facts that we believe involve no objective chances at all, and we often have non-extremal degrees of confidence in those predictions. For instance, I may have 99% confidence that the next slice of bread I eat will be nutritious, but I do not believe that there is some hidden die-toss, radioactive atomic decay, or other objectively chancy process responsible for its nutritional value. For that matter, I may have 90% confidence that the dinosaurs’ extinction was preceded by an asteroid collision with the earth (or that the field equations of general relativity are laws of nature), but the objective chance right now that an asteroid collided with the earth before the dinosaurs’ extinction is 1 if it did or 0 if it did not (and likewise for the general-relativistic field equations).

Suppose that our having n% confidence that the next slice of bread I eat will be nutritious need not be accompanied by any prediction about which we must have full belief — such as that the next slice of bread has n% objective chance of being nutritious, or that n% of all unobserved slices of bread are nutritious, or that n equals the limiting relative frequency of nutritious slices among all bread slices.
Since there is no prediction q about which we must have full belief, Hume cannot show that there is no a priori argument from our inductive argument’s premises to n% confidence that the next slice of bread I eat will be nutritious by showing that there is no contradiction in those premises being true while q is false. We have here a significant gap in Hume’s argument [Mackie, 1980, pp. 15-16; Stove, 1965]. If our degrees of belief are personal (i.e., subjective) probabilities rather than claims about the world, then there is no sense in which the truth of an inductive argument’s premises fails to contradict the falsehood of its conclusion — since there is no sense in which its conclusion can be false (or true), since its conclusion is a degree of belief, not a claim about the world. (Of course, the conclusion involves a degree of belief in the truth of some claim about the world. But the degree of belief itself is neither true nor false.) Hence, Hume cannot conclude from such non-contradiction that there is no a priori argument from the inductive argument’s premises to its conclusion. Of course, no a priori argument could demonstrate that the premises’ truth logically guarantees the conclusion’s truth — since, once again, the conclusion is not the kind of thing that could be true (or false). But there could still be an a priori argument from the opinions that constitute the inductive argument’s premises to the degrees of belief that constitute its conclusion — an argument showing that holding the former opinions requires holding the latter, on pain of irrationality. This a priori argument could not turn entirely on semantic relations because a degree of belief is not the sort of thing that can be true or false, so it cannot
be that one believes a contradiction in having the degrees of belief in an inductive argument’s premises without the degree of belief forming its conclusion. Thus, the a priori argument would not be deductive, as I characterized deduction in section 2. Here we see one reason why it is important to distinguish the two ways that Hume’s fork may be understood: (i) as deduction versus induction, or (ii) as a priori reasoning versus empirical reasoning. Hume apparently regards all a priori arguments as deductive arguments, and hence as arguments that do not yield mere degrees of belief, since degrees of belief do not stand in relations of contradiction and non-contradiction. (At E, p. 22, Hume explicitly identifies arguments that are “probable only” with those “such as regard matter of fact and real existence, according to the division above mentioned” — his fork. See likewise T, p. 651.) If degrees of belief can be interpreted as personal probabilities, then there are a priori arguments purporting to show that certain degrees of belief cannot rationally be accompanied by others: for example, that 60% confidence that p is true cannot be accompanied by 60% confidence that p is false — on pain not of contradiction, but of irrationality (“incoherence”). Whether such a priori arguments can resolve Hume’s problem is a question that I will take up in section 9.

On the other hand, even if our degrees of belief are personal probabilities rather than claims about the world, perhaps our use of induction to generate our degrees of belief must (on pain of irrationality) be accompanied by certain full beliefs about the world. Suppose we regard Jones as an expert in some arcane subject — so much so that we take Jones’ opinions on that subject as our own.
Surely, we would be irrational to regard Jones as an expert and yet not believe that there is a higher fraction of truths among the claims in Jones’ area of expertise about which Jones is highly confident than among the claims in Jones’ area of expertise about which Jones is highly doubtful (presuming that there are many of both). If we did not have this belief, then how could we consider Jones to be an expert? (A caveat: Perhaps our belief that Jones is an expert leaves room for the possibility that Jones has a run of bad luck so that by chance, there is a higher fraction of truths among the claims about which Jones is doubtful than among the claims about which Jones is highly confident. However, perhaps in taking Jones to be an expert, we must at least believe there to be a high objective chance that there is a higher fraction of truths among the claims about which Jones is highly confident than among the claims about which Jones harbors grave doubts.) We use induction to guide our predictions. In effect, then, we take induction as an expert; we take the opinions that induction yields from our observations and make them our own. Accordingly, we must believe that there is (a high chance that there is) a higher fraction of truths among the claims to which induction from our observations assigns a high degree of confidence than among the claims to which induction from our observations assigns a low degree of confidence (presuming that there are many of both). (Perhaps we must even believe that there is (a high chance that there is) a high fraction of truths among the claims to which induction from our observations assigns a high degree of confidence. Otherwise, why would we have such a high degree of confidence in their truth?)
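The charge of “incoherence” mentioned above (60% confidence that p is true alongside 60% confidence that p is false) can be put in a small sketch. This is an illustration, not part of the essay’s argument; it simply assumes, as the surrounding discussion does, that degrees of belief are modeled as personal probabilities, so that credences in p and in not-p must each lie in [0, 1] and together sum to 1:

```python
def coherent(cred_p, cred_not_p, tol=1e-9):
    """Check whether a pair of credences in p and in not-p is
    probabilistically coherent: each lies in [0, 1], and together
    they sum to 1 (up to a small numerical tolerance)."""
    in_range = 0.0 <= cred_p <= 1.0 and 0.0 <= cred_not_p <= 1.0
    additive = abs((cred_p + cred_not_p) - 1.0) <= tol
    return in_range and additive

print(coherent(0.6, 0.4))  # a coherent pair of credences
print(coherent(0.6, 0.6))  # the "incoherent" pair from the text
```

No claim about the world is true or false here; the constraint is a rational requirement on the credences themselves, which is why such arguments can be a priori without being deductive in the sense of section 2.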
We may now formulate an argument [Skyrms, 1986, pp. 25-27] in the spirit of Hume’s. To be justified in using induction to generate our degrees of belief, we must be justified in believing that there is (a high chance that there is) a higher fraction of truths among the claims to which induction from our observations assigns a high degree of confidence than among the claims to which induction from our observations assigns a low degree of confidence. But the falsehood of this claim does not contradict our observations. So we cannot know a priori (or deductively) that this claim is true given our observations.

For us to be justified in using induction, would it suffice that we justly possess a high degree of confidence that there is (a high chance that there is) a higher fraction of truths among the claims to which induction from our observations assigns a high degree of confidence than among the claims to which induction from our observations assigns a low degree of confidence? Perhaps.[6] If so, then once again, our Humean argument is vulnerable to the reply that there may be an a priori argument for our having this high degree of confidence, given our observations, even if there is no contradiction between our observations and the negation of the claim in which we are placing great confidence. On the other hand, consider our expert Jones. Suppose we merely possess a high degree of confidence that there is a higher fraction of truths among the claims in Jones’ area of expertise about which Jones is highly confident than among the claims in Jones’ area of expertise about which Jones is highly doubtful. Then although we might give great weight to Jones’ opinions, we might well not take Jones’ opinions as our own. We should, if possible, consult many other experts along with Jones and weigh each one’s opinion regarding p by our confidence in the expert who holds it in order to derive our own opinion regarding p.
We should take into account whether we believe that a given expert is more likely to err by placing great confidence in claims about which he should be more cautious or by having grave doubts regarding claims in which he should place greater confidence. But our relation to the expert Jones would then be very different from our relation to our “in-house” expert Induction. In contrast to Jones’ opinions, the opinions that induction generates from our observations we take unmodified as our own. If we possessed merely a high degree of confidence that there is a higher fraction of truths among the claims to which induction from our observations assigns a high degree of confidence than among the claims to which induction from our observations assigns a low degree of confidence, then we would have to take the degrees of belief recommended by induction and amend them in light of our estimates of induction’s tendency to excessive confidence and tendency to excessive caution. We do not seem to rest our reliance on induction upon any balancing (or even contemplation) of these correction factors.
[Footnote 6] Though Hume doesn’t seem to think so: “If there be any suspicion, that the course of nature may change, and that the past may be no rule for the future, all experience becomes useless, and can give rise to no inference or conclusion.” (E, p. 24)
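The expert-weighting procedure described above, deriving our own opinion regarding p by weighing each expert’s credence by our confidence in that expert, amounts to what is commonly called linear pooling. A minimal sketch; the expert names, credences, and weights are invented purely for illustration:

```python
def pooled_credence(opinions):
    """Linear pooling: combine the experts' credences in p, weighting
    each expert's opinion by our (normalized) confidence in that expert.
    `opinions` maps an expert's name to (credence_in_p, our_weight)."""
    total_weight = sum(w for _, w in opinions.values())
    if total_weight == 0:
        raise ValueError("at least one expert must carry positive weight")
    return sum(c * w for c, w in opinions.values()) / total_weight

# Hypothetical experts: we trust Jones three times as much as Smith.
experts = {"Jones": (0.90, 3.0), "Smith": (0.50, 1.0)}
print(round(pooled_credence(experts), 3))  # 0.8
```

The contrast drawn in the text is that we do not treat our “in-house” expert, induction, this way: its outputs are adopted unmodified rather than weighted against rivals or corrected for estimated over- or under-confidence.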
4
HUME’S FORK: THE SECOND OPTION
Let’s now turn to the second option in Hume’s fork: Is there an inductive (rather than deductive) — or empirical (rather than a priori) — argument taking us from the premises of a given inductive argument to its conclusion? Of course there is: the given inductive argument itself! But since that is the very argument that we are worrying about, we cannot appeal to it to show that we are justified in proceeding from its premises to its conclusion. Is there any independent way to argue inductively (or empirically) that this argument’s conclusion is true if its premises are true? Hume argues that there is not. He believes that any inductive (or empirical) argument that we would ordinarily take to be good is of the same kind as the argument that we are worrying about, and so cannot be used to justify that argument on pain of circularity:

[A]ll experimental conclusions [what Hume on the following page calls “inferences from experience”] proceed upon the supposition that the future will be conformable to the past. To endeavour, therefore, the proof of this last supposition by probable arguments, or arguments regarding existence, must be evidently going in a circle, and taking that for granted, which is the very point in question. (E, p. 23)

[P]robability is founded on the presumption of a resemblance betwixt those objects, of which we have had experience, and those, of which we have had none; and therefore ‘tis impossible this presumption can arise from probability. (T, p. 90)[7]

Since all non-deductive arguments that we consider good are based on the “principle of the uniformity of nature” (that unexamined cases are like the cases that we have already observed), it would be begging the question to use some such argument to take us from the premises to the conclusion of an inductive argument. For example, suppose we argued as follows for a high degree of confidence that the next slice of bread to be sampled will be nutritious:

1.
We have examined many slices of bread for their nutritional value and found all of them to be nutritious. 2. (from 1) If unobserved slices of bread are like the slices of bread that we have already examined, then the next slice of bread we observe will be nutritious. 3. When in the past we examined things that had not yet been observed, we usually found them to be like the things that we had already observed. 7 Although Hume’s is the canonical formulation of the argument, the ideas behind it seem to have been in the air. In 1736, Joseph Butler [1813, p. 17] identified the probability “that all things will continue as we experience they are” as “our only natural reason for believing the course of the world will continue to-morrow, as it has done as far as our experience or knowledge of history can carry us back.”
56
Marc Lange
4. So (from 3) unobserved slices of bread are probably like examined slices of bread. 5. Therefore (from 2 and 4) it is likely that the next slice of bread we observe will be nutritious. But the step from (3) to (4) is based on our confidence that unobserved things are like observed things, which — had we been entitled to it — could have gotten us directly from (1) to (5) without any detour through (2), (3), and (4). As Hume wrote, Shou’d it be said, that we have experience, that the same power continues united with the same object, and that like objects are endow’d with like powers, I wou’d renew my question, why from this experience we form any conclusion beyond those past instances, of which we have had experience. If you answer this question in the same manner as the preceding, your answer gives still occasion to a new question of the same kind, even in infinitum; which clearly proves, that the foregoing reasoning had no just foundation. (T , p. 91) To justify induction by arguing that induction is likely to work well in the future, since it has worked well in the past, is circular.8 It might be suggested that although a circular argument is ordinarily unable to justify its conclusion, a circular argument is acceptable in the case of justifying a fundamental form of reasoning. After all, there is nowhere more basic to turn, so all that we can reasonably demand of a fundamental form of reasoning is that it endorse itself. However, certain ludicrous alternatives to induction are also self-supporting. For instance, if induction is based on the presupposition that unexamined cases are like the cases that we have already observed, then take “counterinduction” to be based on the opposite presupposition: that unexamined cases are unlike the cases that we have already observed. 
For example, induction urges us to expect unexamined human cells to contain proteins, considering that every human cell that has been tested for proteins has been found to contain some. Accordingly, given that same evidence, counterinduction urges us to expect unexamined human cells not to contain proteins.9 Counterinduction is plainly bad reasoning. However, just as induction supports itself (in that induction has worked well in the past, so by induction, it is likely to work well in the future), counterinduction supports itself (in that counterinduction has not worked well in the past, so by counterinduction, it is likely to work well in the future). If we allow induction to justify itself circularly, then we shall have to extend the same privilege to counterinduction (unless we just beg the question by presupposing that induction is justified whereas counterinduction is not). But as I pointed out at the close of section 2, an adequate justification of induction must justify induction specifically; it must not apply equally well to all schemes, however arbitrary or cockeyed, for going beyond our observations. Even counterinduction is self-supporting, so being self-supporting cannot suffice for being justified. [Salmon, 1967, pp. 12–17]

It might be objected that there are many kinds of inductive arguments — not just the “induction by enumeration” (taking regularities in our observations and extrapolating them to unexamined cases) that figures in Hume’s principal examples, but also (for example) the hypothetico-deductive method, common-cause inference [Salmon, 1984], and inference to the best explanation [Harman, 1965; Thagard, 1978; Lipton, 1991]. Does this diversity undermine Hume’s circularity argument? One might think not: even if an inference to the best explanation could somehow be used to support the “uniformity assumption” grounding one of Hume’s inductions by enumeration, we would still need a justification of inference to the best explanation in order to justify the conclusion of the inductive argument. There is some justice in this reply.

8 Moreover, surely we did not have to wait to accumulate evidence of induction’s track record in order to be justified in reasoning inductively. It has sometimes been suggested (for instance, by [Black, 1954]) that an inductive justification of induction is not viciously circular. Roughly speaking, the suggestion is that the argument from past observations of bread to bread predictions goes by a form of reasoning involving only claims about bread and other concrete particulars, whereas the argument justifying that form of reasoning (“It has worked well in past cases, so it will probably work well in future cases”) goes by a form of reasoning involving only claims about forms of reasoning involving only claims about bread and the like. In short, the second form of reasoning is at a higher level than and so distinct from the first. Therefore, to use an argument of the second form to justify an argument of the first form is not circular. This response to the problem of induction has been widely rejected on two grounds [BonJour, 1986, pp. 105–6]: (i) Even if we concede that these two forms of argument are distinct, the justification of the first form remains conditional on the justification of the second form, and so on, starting an infinite regress. No form ever manages to acquire unconditional justification. (ii) The two forms of argument do not seem sufficiently different for the use of one in justifying the other to avoid begging the question.
However, this reply also misunderstands the goal of Hume’s argument. Hume is not merely demanding that we justify induction, pointing out that we have not yet done so, and suggesting that until we do so, we are not entitled to induction’s fruits. Hume is purporting to show that it is impossible to justify induction. To do that, Hume must show that any possible means of justifying induction either cannot reach its target or begs the question in reaching it. The only way that inference to the best explanation (or some other non-deductive kind of inference) can beg the question is if it, too, is based on some principle of the uniformity of nature. That it has not yet itself been justified fails to show that induction cannot be justified.

9 Of course, expressed this crudely, “counterinduction” would apparently lead to logically inconsistent beliefs — for instance, that the next emerald we observe will be yellow (since every emerald we have checked so far has been found not to be yellow) and that the next emerald we observe will be orange (since every emerald we have checked so far has been found not to be orange). One way to reply is to say: so much the worse, then, for any argument that purports to justify counterinduction! Another reply is to say that like induction, counterinduction requires that we form our expectations on the basis of all of our evidence to date, so we must consider that every emerald we have checked so far has been found not merely to be non-yellow and non-orange, but to be green, so by counterinduction, we should expect only that the next emerald to be observed will not be green. Finally, we might point out that induction must apply the principle of the uniformity of nature selectively, on pain of leading to logically inconsistent beliefs, as Goodman’s argument will show (in a moment). “Counterinduction” must likewise be selective in applying the principle of the non-uniformity of nature. But no matter: let’s suppose that counterinduction allows that principle to be applied in the argument that I am about to give by which counterinduction supports itself.

In other words, Hume’s argument is not that if one non-deductive argument is supported by another, then we have not yet justified the first argument because the second remains ungrounded. Rather, Hume’s argument is that every non-deductive argument that we regard as good is of the same kind, so it would be circular to use any of them to support any other. In other words, Hume is arguing that there is a single kind of non-deductive argument (which we now call “induction”) that we consider acceptable. Consequently, it is misleading to characterize Hume’s fork as offering us two options: deduction and induction. To put the fork that way gives the impression that Hume is entitled from the outset of his argument to presume that induction is a single kind of reasoning. But that is part of what Hume needs to and does argue for:

[A]ll arguments from experience are founded on the similarity, which we discover among natural objects, and by which we are induced to expect effects similar to those, which we have found to follow from such objects. (E, p. 23)

If some good non-deductive argument that does not turn on a uniformity-of-nature presumption could be marshaled to take us from the premises to the conclusion of an inductive argument, then we could invoke that argument to justify induction. As long as the argument does not rely on a uniformity-of-nature presumption, we beg no question in using it to justify induction; it is far enough away from induction to avoid circularity. Hume’s point is not that any non-deductive scheme for justifying an induction leaves us with another question: how is that scheme to be justified?
Hume’s point is that any non-deductive scheme for justifying an induction leaves us with a question of the same kind as we started with, because every non-deductive scheme is fundamentally the same kind of argument as we were initially trying to justify. Let me put my point in one final way. Hume has sometimes been accused of setting an unreasonably high standard for inductive arguments to qualify as justified: that they be capable of being turned into deductive arguments [Stove, 1973; Mackie, 1974]. In other words, Hume has been accused of “deductive chauvinism”: of presupposing that only deductive arguments can justify. But Hume does not begin by insisting that deduction is the only non-circular way to justify induction. Hume argues for this by arguing that every non-deductive argument is of the same kind. If there were many distinct kinds of non-deductive arguments, Hume would not be able to suggest that any non-deductive defense of induction is circular. Hume’s argument, then, turns on the thought that every inductive argument is based on the same presupposition: that unobserved cases are similar to the cases that we have already observed. However, Nelson Goodman [1954] famously showed that such a “principle of the uniformity of nature” is empty. No matter
what the unobserved cases turn out to be like, there is a respect in which they are similar to the cases that we have already observed. Therefore, the “principle of the uniformity of nature” (even if we are entitled to it) is not sufficient to justify making one prediction rather than another on the basis of our observations. Different possible futures would continue different past regularities, but any possible future would continue some past regularity. [Sober, 1988, pp. 63–69] For example, Goodman says, suppose we have examined many emeralds and found each of them at the time of examination to be green. Then each of them was also “grue” at that time, where

Object x is grue at time t iff x is green at t, where t is earlier than the year 3000, or x is blue at t, where t is during or after the year 3000.10

Every emerald that we have found to be green at a certain moment we have also found to be grue at that moment. So if emeralds after 3000 are similar to examined emeralds in their grueness, then they will be blue, whereas if emeralds after 3000 are similar to examined emeralds in their greenness, then they will be green. Obviously, the point generalizes: no matter what the color(s) of emeralds after 3000, there will be a respect in which they are like the emeralds that we have already examined. The principle of the uniformity of nature is satisfied no matter how “disorderly” the world turns out to be, since there is inevitably some respect in which it is uniform. So the principle of the uniformity of nature is necessarily true; it is knowable a priori. The trouble is that it purchases its necessity by being empty. Thus, we can justify believing in the principle of the uniformity of nature. But this is not enough to justify induction. Indeed, by applying the “principle of the uniformity of nature” indiscriminately (both to the green hypothesis and to the grue hypothesis), we make inconsistent predictions regarding emeralds after 3000.
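Goodman’s predicate can be put in quasi-formal terms. The following sketch (Python, with an invented observation record; the year-3000 cutoff is the one in the definition above) shows that the very same body of pre-3000 evidence fits the green hypothesis and the grue hypothesis equally well, while the two hypotheses diverge about emeralds observed after 3000:

```python
from datetime import date

CUTOFF = date(3000, 1, 1)  # the year-3000 boundary from Goodman's definition

def is_grue(color_at_t: str, t: date) -> bool:
    """An object is grue at time t iff it is green at t before the cutoff,
    or blue at t during/after the cutoff."""
    if t < CUTOFF:
        return color_at_t == "green"
    return color_at_t == "blue"

# Every emerald examined so far (all before 3000) has been green...
observations = [("green", date(1900, 1, 1)), ("green", date(2024, 6, 1))]

# ...so each observation confirms "all emeralds are green" AND "all emeralds are grue":
assert all(color == "green" for color, _ in observations)
assert all(is_grue(color, t) for color, t in observations)

# Yet the hypotheses diverge about an emerald observed after the cutoff:
future = date(3000, 6, 1)
# "all green" predicts it is green; "all grue" predicts it is grue, i.e. blue then
assert is_grue("blue", future) and not is_grue("green", future)
```

The point of the sketch is only that no observation made before 3000 can discriminate between the two hypotheses, which is exactly why the bare “principle of the uniformity of nature” licenses both extrapolations.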
So to justify induction, we must justify expecting certain sorts of past uniformities rather than others to continue.

The same argument has often been made in terms of our fitting a curve through the data points that we have already accumulated and plotted on a graph. Through any finite number of points, infinitely many curves can be drawn. These curves disagree in their predictions regarding the data points that we will gather later. But no matter where those points turn out to lie, there will be a curve running through them together with our current data. Of course, we regard some of the curves passing through our current data as making arbitrary bends later (at the year 3000, for instance); we would not regard extrapolating those curves as justified. To justify induction requires justifying those extrapolations we consider “straight” over those that make unmotivated bends.

It might be alleged that, of course, at any moment at which something is green, there is a respect in which it is like any other thing at any moment when it is green, whereas no property is automatically shared by any two objects while they are both grue; they must also both lie on the same side of the year 3000, so that they are the same color. Thus, “All emeralds are grue” is just a linguistic trick for papering over a non-uniformity and disguising it as a uniformity. But this move begs the question: why do green, blue, and other colors constitute respects in which things can be alike whereas grue, bleen, and other such “schmolors” do not? Even if there is some metaphysical basis for privileging green over grue, our expectation that unexamined emeralds are green, given that examined ones are, can be based on the principle of the uniformity of nature only if we already know that all green objects are genuinely alike. How could we justify that without begging the question?

The principle of the uniformity of nature does much less to ground some particular inductive inference than we might have thought. At best, each inductive argument is based on some narrower, more specific presupposition about the respect in which unexamined cases are likely to be similar to examined cases. Therefore, Hume is mistaken in thinking that all inductive arguments are of the same kind in virtue of their all turning on the principle of the uniformity of nature. Hence, Hume has failed to show that it is circular to use one inductive argument to support another. Of course, even if this gap in Hume’s argument spoils Hume’s own demonstration that there is no possible way to justify induction, it still leaves us with another, albeit less decisive argument (to which I alluded a moment ago) against the possibility of justifying any particular inductive argument.

10 Here I have updated and simplified Goodman’s definition of “grue.” He defines an object as “grue” if it is green and examined before a given date in the distant future, or is blue otherwise. My definition, which is more typical of the way that Goodman’s argument is presented, defines what it takes for an object to be grue at a certain moment and does without the reference to the time at which the object was examined. Notice that whether an object is grue at a given moment before 3000 does not depend on whether the object is blue after 3000, just as to qualify as green now, an object does not need to be green later.
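The curve-fitting version of the argument admits a toy illustration (Python; the data points and the two candidate curves are invented for the example). Both curves pass through every point collected so far, yet they disagree about the next point; nothing in the data alone privileges the “straight” extrapolation over the one that bends:

```python
# Hypothetical past observations: (x, y) pairs lying on the line y = x.
data = [(1, 1), (2, 2), (3, 3)]

def straight(x):
    # The "straight" extrapolation of the data.
    return x

def bent(x):
    # Agrees with straight() at every observed point (the cubic term
    # vanishes at x = 1, 2, 3) but bends afterward, like a curve that
    # changes course at the year 3000.
    return x + (x - 1) * (x - 2) * (x - 3)

# Both curves fit all the data we have:
assert all(straight(x) == y and bent(x) == y for x, y in data)

# Yet they disagree about the very next observation:
print(straight(4), bent(4))  # prints: 4 10
```

Adding a fourth data point would rule out one of these two curves, but infinitely many others agreeing on all four points and diverging at the fifth would remain, which is the regress the text describes.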
Rather than argue that any inductive justification of induction is circular, we can offer a regress argument. If observations of past slices of bread justify our high confidence that the next slice of bread we eat will be nutritious, then what justifies our regarding those past observations of bread as confirming that the next slice of bread we eat will be nutritious? Apparently, other observations justify our believing in this link between our bread observations and our bread predictions — by justifying our high confidence that unexamined slices of bread are similar in nutritional value to already observed slices of bread. But whatever those other observations are, this bread uniformity does not follow deductively from them. They confirm it only by virtue of still other observations, which justify our believing in this link between certain observations and the bread uniformity. But what justifies our regarding those observations, in turn, as confirming this link, i.e., as confirming that the bread uniformity is likely if the first set of observations holds? We are off on a regress. A given inductive link is justified (if at all) only by observations, but those observations justify that link (if at all) only through another inductive link, which is justified (if at all) only by observations, which justify that link (if at all) only through another inductive link. . . . How can it end? If all non-deductive arguments can be justified only by way of observations, then any argument that
we might use to justify a given non-deductive argument can itself be justified only by way of observations, and those observations could justify that argument only by an argument that can be justified only by way of other observations, and those observations could justify that argument only by an argument that can be justified only by way of still other observations, and so on infinitely. No bottom ever seems to be reached, so none of these arguments is actually able to be justified. In other words, any non-deductive argument rests upon high confidence in some contingent (i.e., non-necessary) truth (such as that slices of bread are generally alike nutritionally) linking its conclusion to its premises. We cannot use a non-deductive argument to justify this confidence without presupposing high confidence in some other contingent truth. We have here a close cousin of Hume’s argument — one that leads to the same conclusion, but through a regress rather than a circle.11 Even if there is no single “principle of the uniformity of nature” on which every inductive argument rests, there remains a formidable argument that no inductive argument can be justified inductively.12

11 Perhaps it is even Hume’s argument. See the passage I quoted earlier from T, p. 91, where Hume’s argument takes the form of a regress.

12 Contrast John Norton [2003], who gives similar arguments that there is no general rule of inductive inference. Norton contends that different inductions we draw are grounded on different opinions we have regarding various particular contingent facts (e.g., that samples of chemical elements are usually uniform in their physical properties). He concludes that there is no special problem of induction. There is only the question of how there can be an end to the regress of justifications that begins with the demand that we justify those opinions regarding particular contingent facts on which one of our inductive arguments rests.

5 THREE WAYS OF REJECTING HUME’S PROBLEM

Let’s sum up Hume’s argument. We cannot use an inductive argument to justify inferring an inductive argument’s conclusion from its premises, on pain of circularity. We cannot use a deductive argument to justify doing so, because there is no deductive argument from the premises of an inductive argument to its conclusion. So there is no way to justify the step from an inductive argument’s premises to its conclusion. Hume, according to the standard interpretation, holds that we are not entitled to our opinions regarding what we have not observed; those opinions are unjustified. This is a bit different from the conclusion that there is no way to justify that step — that no successful (e.g., non-question-begging) argument can be given for it. Perhaps there is no argument by which induction can be justified, but we are nevertheless justified in using induction, and so we are entitled to the opinions that we arrive at inductively. In this section, I shall look briefly at some forms that this view has taken.

It has often been thought that certain beliefs are justified even though there is no argument by which they acquire their justification; they are “foundational”. Many epistemologists have been foundationalists, arguing that unless certain beliefs are
justified without having to inherit their justification by inference from other beliefs that already possess justification, none of our beliefs is justified. (After all, a regress seems to loom if every belief acquires its justification from other beliefs that acquire their justification from other beliefs. . . . How could this regress end except with beliefs that are justified without having to have acquired their justification from other beliefs? This regress argument poses one of the classic problems of epistemology.) The beliefs that we acquire directly from making certain observations have often been considered foundational. Another kind of belief that has often been considered foundational consists of our beliefs in certain simple propositions that we know a priori, from which we infer the rest of our a priori knowledge — for example, that a person is tall if she is tall and thin. We rest our knowledge of this fact on no argument. It has sometimes been maintained that we just “see” — by a kind of “rational insight” — that this fact obtains (indeed, that it is necessary). Some philosophers have suggested that the proper lesson to take from Hume’s argument is that induction is likewise foundational. For instance, Bertrand Russell [1959, pp. 60–69] offers an inductive principle and suggests that it does not need to rest on anything to be justified. It is an independent, fundamental rule of inference.13 But this approach has all of the advantages of theft over honest toil.14 It fails to explain why induction rather than some alternative is a fundamental rule of inference. It does not tell us why we should expect the products of inductive reasoning to be true. It tries to make us feel better about having no answer to Hume’s problem — but fails. As Wesley Salmon writes:

This is clearly an admission of defeat regarding Hume’s problem, but it may be an interesting way to give up on the problem. The search for the weakest and most plausible assumptions sufficient to justify alternative inductive methods may cast considerable light upon the logical structure of scientific inference. But, it seems to me, admission of unjustified and unjustifiable postulates to deal with the problem is tantamount to making scientific method a matter of faith. [Salmon, 1967, pp. 47–8]

13 Russell offers the following as a primitive inductive principle: “(a) When a thing of a certain sort A has been found to be associated with a thing of a certain other sort B, and has never been found dissociated from a thing of the sort B, the greater the number of cases in which A and B have been associated, the greater is the probability that they will be associated in a fresh case in which one of them is known to be present; (b) Under the same circumstances, a sufficient number of cases of association will make the probability of a fresh association nearly a certainty, and will make it approach certainty without limit.” [1959, p. 66; cf. Russell, 1948, pp. 490–1] Of course, this principle is vulnerable to Goodman’s “grue” problem. Other approaches offering primitive inductive principles are Mill’s [1872] “axiom of the uniformity of the course of nature” and Keynes’ [1921] presumption of “limited independent variety”.

14 The phrase is Russell’s: “The method of ‘postulating’ what we want has many advantages; they are the same as the advantages of theft over honest toil.” [1919, p. 71]

When philosophers have identified certain sorts of beliefs as foundational, they have generally offered some positive account of how those beliefs manage to be
non-inferentially justified (e.g., of how certain of us qualify as able to make certain kinds of observations, or of how we know certain facts a priori). Simply to declare that induction counts as good reasoning seems arbitrary. A similar problem afflicts the so-called “ordinary-language dissolution” of the problem of induction. Many philosophers have suggested that induction is a fundamental kind of reasoning and that part of what we mean by evidence rendering a given scientific theory “justified”, “likely”, “well supported”, and so forth is that there is a strong inductive argument for it from the evidence. Hence, to ask “Why is inductive reasoning able to justify?” is either to ask a trivial question (because by definition, inductive reasoning counts as able to justify) or to ask a meaningless question (because, in asking this question, we are not using the word “justify” in any familiar, determinate sense). As P.F. Strawson remarks:

It is an analytic proposition that it is reasonable to have a degree of belief in a statement which is proportional to the strength of the evidence in its favour; and it is an analytic proposition, though not a proposition of mathematics, that, other things being equal, the evidence for a generalization is strong in proportion as the number of favourable instances, and the variety of circumstances in which they have been found, is great. So to ask whether it is reasonable to place reliance on inductive procedures is like asking whether it is reasonable to proportion the degree of one’s convictions to the strength of the evidence. Doing this is what ‘being reasonable’ means in such a context. . . . In applying or withholding the epithets ‘justified’, ‘well founded’, &c., in the case of specific beliefs, we are appealing to, and applying, inductive standards. But to what standards are we appealing when we ask whether the application of inductive standards is justified or well grounded? If we cannot answer, then no sense has been given to the question. [Strawson, 1952, pp. 256–7]15

15 Cf. [Horwich, 1982, pp. 97–98; Salmon, Barker, and Kyburg, 1965]. For critique of this view, I am especially indebted to [BonJour, 1998, pp. 196–199; Salmon, 1967, pp. 49–52; and Skyrms, 1986, pp. 47–54].

In contending that it is either trivial or meaningless to ask for a justification of induction, the ordinary-language approach does not purport to “solve” the problem of induction, but rather to “dissolve” it: to show that the demand for a justification of induction should be rejected. One might reply that this line of thought offers us no reason to believe that the conclusions of strong inductive arguments from true premises are likely to be true. But the ordinary-language theorist disagrees: that these conclusions are the conclusions of strong inductive arguments from true premises is itself a good reason to believe that they are likely to be true. What else could we mean by a “good reason” than the kind of thing that we respect as a good reason, and what’s more respectable than induction? In his Philosophical Investigations,
Ludwig Wittgenstein recognizes that we may feel the need for a standard that grounds our standards for belief, but he urges us to resist this craving:

480. If it is now asked: But how can previous experience be a ground for assuming that such-and-such will occur later on? — the answer is: What general concept have we of grounds for this kind of assumption? This sort of statement about the past is simply what we call a ground for assuming that this will happen in the future. . . .

481. If anyone said that information about the past could not convince him that something would happen in the future, I should not understand him. One might ask him: what do you expect to be told, then? What sort of information do you call a ground for such a belief? . . . If these are not grounds, then what are grounds? — If you say these are not grounds, then you must surely be able to state what must be the case for us to have the right to say that there are grounds for our assumption. . . .

482. We are misled by this way of putting it: ‘This is a good ground, for it makes the occurrence of the event probable.’ That is as if we had asserted something further about the ground, which justified it as a ground; whereas to say that this ground makes the occurrence probable is to say nothing except that this ground comes up to a particular standard of good grounds — but the standard has no grounds! . . .

484. One would like to say: ‘It is a good ground only because it makes the occurrence really probable.’ . . .

486. Was I justified in drawing these consequences? What is called a justification here? — How is the word ‘justification’ used? . . . [Wittgenstein, 1953; cf. Rhees and Phillips, 2003, pp. 73–77]

But this argument makes the fact that induction counts for us as “good reasoning” seem utterly arbitrary. We have not been told why we should respect induction in this way. We have simply been reminded that we do.
If part of what “good reason” means is that inductive reasons qualify as good, then so be it. We can still ask why we ought to have a term that applies to inductive arguments (and not to bogus arguments instead or in addition) and where a consequence of its applying to some argument is that we ought to endorse that argument. The mere fact that these circumstances of application and consequences of application are coupled in the meaning of “good reason” cannot prevent us from asking why they ought to be coupled — just as (to use Michael Dummett’s example) the term “Boche” has “German” as its circumstance of application and “barbaric” as its consequence of application, so that it is contradictory to say “The Boche are not really barbaric” or “The Germans are not really the Boche”, but we can still ask why we ought (not) to have “Boche” in our language. [Dummett,
1981, p. 454; Brandom, 1994, pp. 126–127]. It is not contradictory to conclude that we should not use the term “Boche” because it is not true that the Germans are barbaric. Analogously, we can ask why we ought to have a term like “good argument” if an inductive argument automatically qualifies as a “good argument” and any “good argument” is automatically one that conveys justification from its premises to its conclusion. Without some account of why those circumstances of application deserve to go with those consequences of application, we have no reason to put them together — and so no reason to think better of arguments that qualify by definition as “good.” As Salmon says,

It sounds very much as if the whole [ordinary-language] argument has the function of transferring to the word ‘inductive’ all of the honorific connotations of the word ‘reasonable’, quite apart from whether induction is good for anything. The resulting justification of induction amounts to this: If you use inductive procedures you can call yourself ‘reasonable’ — and isn’t that nice! [Salmon, 1957, p. 42; cf. Strawson, 1958]

It does not show us why we ought to be “reasonable”. However, the ordinary-language dissolutionist persists: to ask why we ought to use the term “good reason” — why we ought to couple its circumstances and consequences of application — is just to ask for good reasons for us to use it. We cannot find some point outside of all of our justificatory standards from which to justify our standards of justification. What standards of justification do we mean when we demand a justification of induction? If an argument meets some “standards of justification” that are not our usual ones, then it does not qualify as a justification. On the other hand, if it meets our usual standards of justification, then (since no deductive argument can succeed in justifying induction) the argument will inevitably be inductive and so beg the question, as Hume showed.
But we do not need to specify our standards of justification in advance in order for our demand for a justification of induction to make sense. We know roughly what a justification is, just as we know roughly what it is for an argument to beg the question. A justification of induction would consist of an argument that we believe is properly characterized as justificatory — an argument that, we can show, meets the same standards as the familiar arguments that we pretheoretically recognize as justificatory. In showing this, we may be led to new formulations of those standards — formulations that reveal features that had been implicit in our prior use of “justification”. When we ask for a justification of induction, we are not trying to step entirely outside of our prior standards of justification, but at the same time, we are not asking merely to be reminded that induction is one of the kinds of reasoning that we customarily recognize as good. Rather, we are asking for some independent grounds for recognizing induction as good — a motivation for doing so that is different enough to avoid begging the question, but not so different that it is unrecognizable as a justification. Of course, it may be unclear
Marc Lange
what sort of reasoning could manage to walk this fine line until we have found it. But that does not show that our demand for a justification of induction is trivial or meaningless. Here is an analogy. Suppose we want to know whether capital punishment counts as “cruel” in the sense in which the United States Constitution, the English Bill of Rights, and the Universal Declaration of Human Rights outlaw cruel punishment. One might argue that since capital punishment was practiced when these documents were framed (and is practiced today in the United States), “cruel” as they (and we) mean it must not apply to capital punishment. But the question of whether capital punishment is cruel cannot be so glibly dismissed as trivial (if we mean “cruel” in our sense) or meaningless (if we mean “cruel” in some other, unspecified sense) — and could not be so dismissed even if no one had ever thought that capital punishment is cruel. What we need, in order to answer the question properly, is an independent standard of what it takes for some punishment to qualify as “cruel”. The standard must fit enough of our pretheoretic intuitions about cruel punishment (as manifested, for example, in legal precedents) that we are justified in thinking that it has managed to make this notion more explicit, rather than to misunderstand it. Furthermore, the standard must derive its credentials independently from whatever it says about capital punishment, so that we avoid begging the question in using this standard to judge whether capital punishment qualifies as cruel. Of course, prior to being sufficiently creative and insightful to formulate such a standard, we may well be unable to see how it could be done. But that is one reason why it takes great skill to craft good arguments for legal interpretations — and, analogously, why it is difficult to address Hume’s problem. 
Another approach that deems induction to be good reasoning, even though no non-question-begging argument can be given to justify it, appeals to epistemological naturalism and externalism. On this view, if inductive reasoning from true premises does, in fact, tend to lead to the truth, then an inductive argument has the power to justify its conclusion even though the reasoner has no non-circular basis for believing that the conclusions of inductive arguments from true premises are usually true [Brueckner, 2001; Kornblith, 1993; Papineau, 1993, pp. 153–160; Sankey, 1997; van Cleve, 1984]. To my mind, this approach simply fails to engage with Hume’s problem of induction. The externalist believes that we qualify as having good reasons for our opinions regarding the future as long as inductive arguments from true premises do in fact usually yield the truth regarding unexamined cases. But the problem of induction was to offer a good reason to believe that a given inductive argument from true premises will likely yield the truth regarding unexamined cases. Suppose the externalist can persuade us that to be justified in some belief is to arrive at it by reliable means. Then we are persuaded that if induction is actually reliable, then the conclusion of an inductive argument (from justified premises) is justified. We are also persuaded that if induction actually is reliable, then an inductive reasoner is justified in her belief (arrived at inductively, from the frequent success of past
inductive inferences) that induction will continue to be reliable.

6 HUME’S CONCLUSION
Hume, according to the standard interpretation of his view, is an “inductive skeptic”: he holds that we are not entitled to our opinions regarding what we have not observed. There are plenty of textual grounds for this interpretation. For example, in a passage that we have already quoted (T, p. 91), he says that an inductive argument for induction has “no just foundation”, suggesting that his main concern is whether induction has a just foundation. Sometimes Hume appears to concede induction’s justification: I shall allow, if you please, that the one proposition [about unexamined cases] may justly be inferred from the other [about examined cases]: I know in fact, that it always is inferred. (E, p. 22). But his “if you please” suggests that this concession is merely rhetorical – for the sake of argument. His point is that someone who believes that we are justified in having expectations regarding unexamined cases should be concerned with uncovering their justification. When Hume finds no justification, he concludes that these expectations are unjustified. Accordingly, Hume says, our expectations are not the product of an “inference” (E, p. 24) or some “logic” (E, p. 24) or “a process of argument or ratiocination” (E, p. 25): it is not reasoning which engages us to suppose the past resembling the future, and to expect similar effects from causes, which are, to appearance, similar. (E, p. 25) Rather, Hume says, our expectations regarding what we have not observed are the result of the operation of certain innate “instincts” (E, pp. 30, 37, 110). When (for example) the sight of fire has generally been accompanied by the feeling of heat in our past experience, these instinctual mental mechanisms lead us, when we again observe fire, to form a forceful, vivid idea of heat — that is, to expect heat: What then is the conclusion of the whole matter? A simple one; though, it must be confessed, pretty remote from the common theories of philosophy. 
All belief of matter of fact or real existence is derived merely from some object, present to the memory or senses, and a customary conjunction between that and some other object. Or in other words; having found, in many instances, that any two kinds of objects, flame and heat, snow and cold, have always been conjoined together; if flame or snow be presented anew to the senses, the mind is carried by custom to expect heat or cold, and to believe, that such
a quality does exist, and will discover itself upon a nearer approach. This belief is the necessary result of placing the mind in such circumstances. It is an operation of the soul, when we are so situated, as unavoidable as to feel the passion of love, when we receive benefits; or hatred, when we meet with injuries. All these operations are a species of natural instincts, which no reasoning or process of the thought and understanding is able, either to produce, or to prevent. (E, p. 30) With prose like that, is it any wonder that Hume’s argument has become such a classic? Hume believes that we cannot help but form these expectations, in view of the way our minds work. So Hume does not recommend that we try to stop forming them. Any such attempt would be in vain. But Hume’s failure to recommend that we try to resist this irresistible psychological tendency should not lead us to conclude (with Garrett [1997]) that Hume believes our expectations to be justified or that Hume is uninterested in evaluating their epistemic standing. Hume sometimes uses normative-sounding language in giving his naturalistic, psychological account of how we come by our expectations: [N]one but a fool or a madman will ever pretend to dispute the authority of experience, or to reject that great guide of human life. . . (E, p. 23) But by “authority” here, he presumably means nothing normative, but merely the control or influence that experience in fact exercises over our expectations — experience’s “hold over us”.16 As Hume goes on to explain on the same page, he wants “to examine the principle of human nature, which gives this mighty authority to experience. . . ” That “principle of human nature” has no capacity to justify our expectations; it merely explains them. (And since it is irresistible, none can reject it and none but a fool or a madman will pretend to reject it.) 
Hume often terms this principle of the association of ideas “custom” or “habit”: ’Tis not, therefore, reason which is the guide of life, but custom. That alone determines the mind, in all instances, to suppose the future conformable to the past. However easy this step may seem, reason would never, to all eternity, be able to make it. (T, p. 652) 16 That “authority” here should be interpreted as brute power to bring about rather than entitlement (“rightful authority”) to bring about is evident from other passages: “If the mind be not engaged by argument to make this step, it must be induced by some other principle of equal weight and authority [namely, custom]. . . ” (E, p. 27). An interpretation of “authority” as normative has led some to regard Hume not as an inductive skeptic, but instead as offering a reductio of some particular conception of knowledge on the grounds that it would deem what we know inductively not to be knowledge: “Far from being a skeptical challenge to induction, Hume’s ‘critique’ is little more than a prolonged argument for the general position that Newton’s inductive method must replace the rationalistic model of science” according to which a priori reasoning is “capable of deriving sweeping factual conclusions.” [Beauchamp and Rosenberg, 1981, p. 43] See also [Smith, 1941; Stroud, 1977].
For wherever the repetition of any particular act or operation produces a propensity to renew the same act or operation, without being impelled by any reasoning or process of the understanding; we always say, that this propensity is the effect of Custom. By employing that word, we pretend not to have given the ultimate reason of such a propensity. We only point out a principle of human nature, which is universally acknowledged, and which is well known by its effects. . . . [A]fter the constant conjunction of two objects, heat and flame, for instance, weight and solidity, we are determined by custom alone to expect the one from the appearance of the other. (E, p. 28) Hume is thus offering a theory of how our minds work.17 His “arguments” for this scientific theory are, of course, inductive. What else could they be? For instance, Hume points out that in the cases we have seen, a correlation in some observer’s past observations (such as between seeing fire and feeling heat) is usually associated with that observer’s forming a certain expectation in future cases (e.g., expecting heat, on the next occasion of seeing fire). Having noted this association between an observer’s expectations and the correlations in her past observations, Hume extrapolates the association; Hume is thereby led to form certain expectations regarding what he has not yet observed, and so to believe in a general “principle of human nature”.18 Some interpreters have suggested that since Hume is here using induction, he must believe that he is (and we are) entitled to do so — and so that induction is justified.19 However, in my view, Hume’s use of induction shows no such thing. Hume says that we cannot help but form expectations in an inductive way — under the sway (“authority”) of certain mental instincts. Hume’s behavior in forming his own expectations regarding the expectations of others is just one more example of these mental instincts in action. 
So Hume’s own belief in his theory of human nature is accounted for by that very theory. Like his expectations regarding the next slice of bread he will eat, his expectations regarding human belief-formation fail to suggest that Hume regards our expectations regarding what we have not observed to be justified. By the same token, Hume notes that as we become familiar with cases where someone’s expectations regarding what had not yet been observed turn out to be accurate and other cases where those expectations turn out not to be met, we notice the features associated with these two sorts of cases. We then tend to be 17 As we have seen, the “principle of the uniformity of nature” must be applied selectively, on pain of leading to contradiction. So insofar as Hume’s theory of the mind incorporates such a principle as governing the association of ideas, it does not suffice to account for our expectations. 18 By analogous means, Hume arrives at other parts of his theory, such as that every simple idea is a copy of a prior impression, that there are various other principles of association among ideas, etc. 19 See, for instance, [Garrett, 1997], according to which Hume’s main point is not to give “an evaluation of the epistemic worth of inductive inferences” (p. 94) but rather to do cognitive psychology — to identify the component of the mind that is responsible for those opinions (imagination rather than reason). For more discussion of this line of interpretation, see [Read and Richman, 2000].
guided by these associations in forming future expectations. This is the origin of the “Rules by which to judge of causes and effects” that Hume elaborates (T, p. 173): “reflexion on general rules keeps us from augmenting our belief upon every encrease of the force and vivacity of our ideas” (T, p. 632). We arrive at these “rules” by way of the same inductive instincts that lead us to form our other expectations regarding what we have not observed. Some have argued that Hume’s offering these rules shows that Hume is not an inductive skeptic, since if there are rules distinguishing stronger from weaker inductive arguments, then such arguments cannot all be bad.20 But as I have explained, Hume’s endorsement of these rules does not mean that he believes that expectations formed in accordance with them are justified. The rules do not distinguish stronger from weaker inductive arguments. Rather, they result from our instinctively forming expectations regarding the expectations we tend to form under various conditions. Today, in the wake of Darwin’s theory of evolution by natural selection, we might argue that natural selection has equipped us with various innate belief-forming instincts. Creatures with these instincts stood at an advantage in the struggle for existence, since these instincts gave them accurate expectations and so enabled them to reproduce more prolifically. But once again, this theory does not solve Hume’s problem; it does not justify induction. To begin with, we have used induction of some kind to arrive at this scientific explanation of the instincts’ origin. Furthermore, even if the possession of a certain belief-forming instinct was advantageous to past creatures because it tended to lead to accurate predictions, we would need to use induction to justify regarding the instinct’s past predictive success as confirming its future success.
7 BONJOUR’S A PRIORI JUSTIFICATION OF INDUCTION
Laurence BonJour [1986; 1998, pp. 203–216] has maintained that philosophers have been too hasty in accepting Hume’s argument that there is no a priori means of proceeding from the premises of an inductive argument to its conclusion. BonJour accepts that there is no contradiction in an inductive argument’s premises being true and its conclusion false. But BonJour rejects Hume’s view that the only truths that can be established a priori are truths that hold on pain of contradiction. BonJour is inclined to think that there is something right in the view that only necessary truths can be known a priori. But again, he believes that there are necessary truths that are not analytic (i.e., necessary truths the negations of which are not contradictions). BonJour concedes that we cannot know a priori that anything like the “principle of the uniformity of nature” holds. However, he thinks that a good inductive argument has as its premise not merely that the fraction of Gs among examined Fs is m/n, but something considerably less likely to be a coincidence: that the fraction converged to m/n and has since remained approximately steady as more (and more diverse) Fs have been examined. BonJour suggests that (when there is no relevant background information on the connection between being F and being G or on the incidence of Gs among Fs) we know by a priori insight (for certain properties F and G) that when there has been substantial variation in the locations and times at which our observations were made, the character of the observers, and other background conditions, the fraction m/n is unlikely to remain steady merely as a brute fact (i.e., a contingent fact having no explanation) or just by chance — e.g., by there being a law that approximately r/n of all Fs are G, but “by chance” (analogous to a fair coin coming up much more often heads than tails in a long run of tosses), the Fs we observed were such that the relative frequency of Gs among them converged to a value quite different from r/n and has since remained about there. We recognize a priori that it is highly unlikely that any such coincidence is at work. Moreover, as the observed Fs become ever more diverse, it eventually becomes a priori highly unlikely that the explanation for the steady m/n fraction of Gs among them is that although it is not the case that there is a law demanding that approximately m/n of all Fs are G, the Fs that we have observed have all been Cs and there is a law demanding that approximately m/n of all FCs are G. As the pool of observed Fs becomes larger and more diverse, it becomes increasingly a priori unlikely that our observations of Fs are confined only to Cs where the natural laws demand that FCs behave differently from other Fs. For that to happen would require an increasingly unlikely coincidence: a coordination between the range of our observations and the natural laws. 
20 See prior note.
(Analogous arguments apply to other sorts of possible explanations of the steady m/n fraction of Gs among the observed Fs, such as that the Fs we observed in each interval happened to consist of about the same fraction of Cs, and the laws assign a different likelihood to FCs being G than to F∼Cs being G.) Thus, in the case of a good inductive argument, it is a priori likely (if the act of observing an F is not itself responsible for its G-hood) that our evidence holds only if it is a law that approximately m/n of all Fs are G. In other words, we know a priori that the most likely explanation of our evidence is the “straight inductive explanation”. BonJour does not say much about the likelihoods that figure in these truths that we know a priori. They seem best understood as “logical probabilities” like those posited by John Maynard Keynes [1921] and Rudolf Carnap [1950], among others — logical, necessary, probabilistic relations obtaining among propositions just in virtue of their content.21 Just as we know by rational insight that the premises of a deductive argument can be true only if the conclusion is true, so likewise (BonJour seems inclined to say) we know by rational insight that the premises of a good inductive argument make its conclusion highly likely. As Keynes wrote: Inasmuch as it is always assumed that we can sometimes judge directly that a conclusion follows from a premiss, it is no great extension of this assumption to suppose that we can sometimes recognize that a conclusion partially follows from, or stands in a relation of probability to a premiss. [Keynes, 1921, p. 52] Presumably, part of what we grasp in recognizing these probabilistic relations is that (in the absence of other relevant information) we should have high confidence in any proposition that stands to our evidence in a relation of logically high probability. But I wonder why we should. If a conclusion logically follows from our evidence, then we should believe the conclusion because it is impossible for the premises to be true without the conclusion being true. But we cannot say the same in the case of a conclusion that is merely “highly logically probabilified” by our evidence. To say that the conclusion is made likely, and so we should have great confidence that it is true, is to risk punning on the word “likely”. There is some sort of logical relation between the premises and the conclusion, and we call this relation “high logical probabilification” presumably because it obeys the axioms of probability and we think that (in the absence of other relevant information) we ought to align our subjective degrees of probability (i.e., our degrees of confidence) with it. But then this relation needs to do something to deserve being characterized as “high logical probabilification”. What has it done to merit this characterization? The problem of justifying induction then boils down to the problem of justifying the policy of being highly confident in those claims that stand in a certain logical relation to our evidence. Calling that relation “high logical probabilification” or claiming rational insight into that relation’s relevance to our assignment of subjective probability does not reveal what that relation does to merit our placing such great weight upon it. Why should my personal probability distribution be one of the “logical” probability functions? How do we know that my predictions would then tend to be more accurate? 
21 BonJour [personal communication] is sympathetic to the logical interpretation of probability.
(To say that they are then “likely” to be more accurate, in the sense of “logical” probability, is to beg the question.) Let’s turn to a different point. BonJour recognizes that no matter how many Fs we observe, there will always be various respects C in which they are unrepresentative of the wider population of Fs. (All Fs so far observed existed sometime before tomorrow, to select a cheap example.) Of course, we can never show conclusively that the steady m/n frequency of Gs among the observed Fs does not result from a causal mechanism responsible for the G-ness of FCs but not for the G-ness of other Fs. That concession is no threat to the justification of induction, since strong inductions are not supposed to be proofs. But how can we be entitled even to place high confidence in the claim that there is no such C? BonJour writes (limiting himself to spatial Cs for the sake of the example): [Our data] might be skewed in relation to some relevant factor C . . . because C holds in the limited area in which all the observations are in fact made, but not elsewhere. It is obviously a quite stubborn empirical fact that all of our observations are made on or near the surface of the earth, or, allowing for the movement of the earth, in the general region of the solar system, or at least in our little corner of the galaxy, and
it is possible that C obtains there but not in the rest of the universe, in which case our standard inductive conclusion on the basis of those observations would presumably be false in relation to the universe as a whole, that is, false simpliciter. . . . The best that can be done, I think, is to point out that unless the spatio-temporal region in which the relevant C holds is quite large, it will still be an unlikely coincidence that our observations continue in the long run to be confined to that region. And if it is quite large, then the inductive conclusion in question is in effect true within this large region in which we live, move, and have our cognitive being. [BonJour, 1998, p. 215] BonJour seems to be saying that we know a priori that it would be very unlikely for all of our observations so far to have been in one spatiotemporal region (or to have been made under one set of physical conditions C) but for the laws of nature to treat that region (or those conditions) differently from the region in which (or conditions under which) our next observation will be made. But this does not seem very much different from purporting to know a priori that considering the diversity of the Fs that we have already examined and the steady rate at which Gs have arisen among them, the next case to be examined will probably be like the cases that we have already examined. Inevitably, there will be infinitely many differences between the Fs that we have already examined and the Fs that we will shortly examine (if we are willing to resort to “gruesome” respects of similarity and difference). Is the likely irrelevance of these differences (considering the irrelevance of so many other factors, as manifested in the past steadiness of the m/n fraction) really something that we could know a priori? 
Isn’t this tantamount to our knowing a priori (at least for certain properties F and G) that if it has been the case at every moment from some long past date until now that later Fs were found to be Gs at the same rate as earlier Fs, then (in the absence of other relevant information) it is likely that the Fs soon to be examined will be Gs at the same rate as earlier Fs? That seems like helping ourselves directly to induction a priori. Consider these two hypotheses:
(h) A law requires that approximately m/n of all Fs are G;
(k) A law requires that approximately m/n of all Fs before today are G but that no Fs after today are G.
On either of these hypotheses, it would be very likely that approximately m/n of any large, randomly-selected sample of Fs before today will be G.22 Do we really have a priori insight into which of these hypotheses provides the most likely explanation of this fact? BonJour apparently thinks that we do, at least for certain Fs and Gs, since (k) would require an a priori unlikely coordination between the present range of our observations and the discriminations made by the natural laws. Moreover, insofar as (k) is changed so that the critical date is pushed back from today to the year 3000 (or 30,000, or 300,000. . . ), (k)’s truth would require less of an a priori unlikely coordination between the laws and the present range of our observations — but, BonJour seems to be saying, it then becomes increasingly the case that “the inductive conclusion in question is in effect true within this large region in which we live, move, and have our cognitive being.”
22 See section 10.

8 REICHENBACH’S PRAGMATIC JUSTIFICATION OF INDUCTION
Hans Reichenbach [1938, pp. 339–363; 1949a, pp. 469–482; 1968, pp. 245–246] has proposed an intriguing strategy for justifying induction. (My discussion is indebted to [Salmon, 1963].) Reichenbach accepts Hume’s argument that there is no way to show (without begging the question) that an inductive argument from true premises is likely to lead us to place high confidence in the truth. However, Reichenbach believes that we can nevertheless justify induction by using pragmatic (i.e., instrumental, means-ends) reasoning to justify the policy of forming our expectations in accordance with an inductive rule. Of course, Reichenbach’s argument cannot justify this policy by showing that it is likely to lead us to place high confidence in the truth — since that approach is blocked by Hume’s argument. Reichenbach believes that to justify an inductive policy, it is not necessary to show that induction will probably be successful, or that induction is more likely to succeed than to fail, or even that the claims on which induction leads us to place high confidence are at least 10% likely to be true. It suffices, Reichenbach thinks, to show that if any policy for belief-formation will do well in leading us to the truth, then induction will do at least as well. That is, Reichenbach believes that to justify forming our opinions by using induction, it suffices to show that no other method can do better than induction — even if we have not shown anything about how well induction will do. Induction is (at least tied for) our best hope, according to Reichenbach, though we have no grounds for being at all hopeful.23 BonJour [1998, pp. 194–196] has objected that Reichenbach’s argument cannot justify induction because it does not purport to present us with good grounds for believing that induction will probably succeed. It does not justify our believing in the (likely) truth of the claims that receive great inductive support. 
It purports to give us pragmatic rather than epistemic grounds for forming expectations in accordance with induction. Surely, BonJour says, if we are not entitled to believe that a hypothesis that has received great inductive support is likely to be true, then we have not really solved Hume’s problem. Reichenbach writes: A blind man who has lost his way in the mountains feels a trail with his stick. He does not know where the path will lead him, or whether it may take him so close to the edge of a precipice that he will be plunged into the abyss. Yet he follows the path, groping his way step by step; for if there is any possibility of getting out of the wilderness, it is by feeling his way along the path. As blind men we face the future; but we feel a path. And we know: if we can find a way through the future it is by feeling our way along this path. [Reichenbach, 1949a, p. 482] BonJour replies: We can all agree that the blind man should follow the path and that he is, in an appropriate sense, acting in a justified or rational manner in doing so. But is there any plausibility at all to the suggestion that when we reason inductively, or accept the myriad scientific and commonsensical results that ultimately depend on such inference, we have no more justification for thinking that our beliefs are likely to be true than the blind man has for thinking that he has found the way out of the wilderness? [BonJour, 1998, pp. 195–196] BonJour’s objection illustrates how much Reichenbach is prepared to concede to Hume. Reichenbach’s point is precisely that an agent “makes his posits because they are means to his end, not because he has any reason to believe in them.” [Reichenbach, 1949b, p. 548]24 The policy for forming our expectations that Reichenbach aims to justify is the “straight rule”: If n Fs have been examined and m have been found to be G, then take m/n to equal (within a certain degree of approximation) the actual fraction of Gs among Fs — or, if there are infinitely many Fs and Gs (and so the fraction is infinity divided by infinity), take m/n to approximate the limiting relative frequency of Gs among Fs. Reichenbach presents his policy as yielding beliefs about limiting relative frequencies, rather than as yielding degrees of confidence, because Reichenbach identifies limiting relative frequencies with objective chances. Accordingly, the “straight rule” is sometimes understood as follows: If n Fs have been examined and m have been found to be G, then take m/n to equal (within a certain degree of approximation) an F’s objective chance of being G. 
23 Feigl [1950] distinguishes “validating” a policy (i.e., deriving it from more basic policies) from “vindicating” it (i.e., showing that it is the right policy to pursue in view of our goal). A fundamental policy cannot be validated; it can only be vindicated. Accordingly, Reichenbach is often interpreted as purporting to “vindicate” induction.
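The straight rule is mechanical enough to be stated as a procedure. Here is a minimal sketch (my own illustration, not Reichenbach’s notation), treating the rule as an estimator applied to a record of whether each examined F was found to be G:

```python
from fractions import Fraction

def straight_rule(observations):
    """Reichenbach's straight rule: having examined n Fs and found m to be G,
    posit m/n as the limiting relative frequency (or chance) of Gs among Fs."""
    n = len(observations)
    if n == 0:
        raise ValueError("the rule issues no posit before any Fs are examined")
    m = sum(1 for found_g in observations if found_g)
    return Fraction(m, n)

# The clairvoyant case from the text: every observed prediction (F) was
# correct (G), so the rule posits a limiting relative frequency of 1.
print(straight_rule([True, True, True, True]))   # 1
print(straight_rule([True, False, True, True]))  # 3/4
```

The posit is revised as n grows; if a limiting relative frequency exists at all, the successive posits m/n converge to it, and that conditional guarantee is all Reichenbach claims for the rule.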
Reichenbach then argues that if there is a successful policy for forming beliefs about limiting relative frequencies, then the straight rule will also succeed. For instance, suppose that some clairvoyant can predict the outcome of our next experiment with perfect accuracy. Then let F be that the clairvoyant makes a certain prediction regarding the outcome and G be that the clairvoyant is correct. Since m/n (from our past observations of Fs and Gs) equals 1, the straight rule endorses our taking 1 to be the limiting relative frequency of truths among the clairvoyant’s predictions (or chance that a given prediction by the clairvoyant will come to pass). In short, Reichenbach argues that if the world is non-uniform (i.e., if there is no successful policy for forming beliefs about limiting relative frequencies), then the straight rule will fail but so will any other policy, whereas if the world is uniform (i.e., if there is a successful policy), then the straight rule will seize upon the uniformity (or policy). Hence, the straight rule can do no worse than any other policy for making predictions. Under any circumstances, it is at least tied for best policy. Therefore, its use is pragmatically justified.

[24] However, Reichenbach [1968, p. 246] says “in my theory good grounds are given to treat a posit as true”.

Even this rough statement of Reichenbach’s argument suffices to reveal several important difficulties it faces. First, as we saw in connection with the principle of the uniformity of nature, the straight rule licenses logically inconsistent predictions. For instance, if the Fs are the emeralds, then “G” could be “green” or “grue.” In either case, m/n is 1, but we cannot apply the straight rule to both hypotheses on pain of believing that emeralds after 3000 are all green and all blue. For the rule to license logically consistent predictions, it must consist of the straight rule along with some principle selecting the hypotheses to which the straight rule should be applied. However, a straight rule equipped with a principle of selection is no longer guaranteed to succeed if any rule will. If all emeralds are grue (and there are emeralds after 3000), then a straight rule equipped with a principle of selection favoring the grue hypothesis over the green hypothesis will succeed whereas a straight rule favoring green over grue will fail (as long as all of our emeralds are observed before the year 3000).25

[25] Reichenbach responds, “The rule of induction . . . leads only to posits that are justified asymptotically.” [1949a, p. 448] In the long run, we observe emeralds after 3000. So although “applying the rule of induction to [grue], we shall first make bad posits, but while going on will soon discover that [emeralds after 3000 are not grue]. We shall thus turn to positing [green] and have success.” This response is vulnerable to the reply that we make all of our actual predictions in the short run rather than the long run (as I will discuss momentarily). Moreover, if we “apply the rule of induction” to grue as well as to green, then we make predictions that are contradictory, not merely inaccurate.

Here is a related point. The straight rule does not tell us what properties to take as F and G. It merely specifies, given F and G, what relative frequency (or chance) to assign to unexamined Fs being G. But then there could be an F and a G to which the straight rule would lead us to assign an accurate relative frequency, but as it happens, we fail to think of that F and G. (For instance, we might simply not think of tallying the rate at which the clairvoyant’s predictions have been accurate in the past, or of taking “grue” as our G. [Putnam, 1994, p. 144]) In other words, the straight rule is concerned with justifying hypotheses once they have been thought up — not with thinking them up in the first place. (It is not a “method of discovery”; it is a “method of justification.”) So the sense in which the straight rule is guaranteed to lead us to the genuine limiting relative frequency, if any rule could, is somewhat limited.

Let’s set aside this difficulty to focus on another. Reichenbach compares the straight rule to other policies for making predictions. But what about the policy of making no predictions at all? Of course, the straight rule is more likely (or, at least, not less likely) than this policy to arrive at accurate predictions. But it is also more likely than this policy to arrive at inaccurate predictions. If our goal is to make accurate predictions and we incur no penalty for making inaccurate ones, then the straight rule is obviously better than the no-prediction policy. This seems to be Reichenbach’s view:

We may compare our situation to that of a man who wants to fish in an unexplored part of the sea. There is no one to tell him whether or not there are fish in this place. Shall he cast his net? Well, if he wants to fish in that place I should advise him to cast the net, to take the chance at least. It is preferable to try even in uncertainty than not to try and be certain of getting nothing. [Reichenbach, 1938, pp. 362–363]26

[26] Salmon [1991, p. 100] takes himself to be following Reichenbach in arguing that the policy of making no predictions fails whether nature is uniform or not, so it cannot be better than using induction, since the worst that induction can do is to fail. But although the no-prediction rule fails in making successful predictions, it succeeds in not making unsuccessful predictions.

This argument presumes that there is no cost to trying — that the value of a strategy is given by the number of fish that would be caught by following it, so that if a strategy leads us to try and fail, then its value is zero, which is the same as the value of the strategy of not trying at all. So the fisherman has everything to gain and nothing to lose by casting his net. But doesn’t casting a net come with some cost? (It depletes the fisherman’s energy, for instance.) In other words, the straight rule offers us some prospect of making accurate predictions, whereas the policy of making no predictions offers us no such prospect, so (Reichenbach concludes) the straight rule is guaranteed to do no worse than the no-prediction rule. But why shouldn’t our goal be “the truth, the whole truth, and nothing but the truth”, so that we favor making accurate predictions over making inaccurate predictions or no predictions, but we favor making no predictions over making inaccurate predictions? Reichenbach then cannot guarantee that the straight rule will do at least as well as any other policy, since if the straight rule fails, then the policy of making no predictions does better. Thus, Reichenbach’s argument may favor the straight rule over alternative methods of making predictions, but it does not justify making some predictions over none at all.

Let us now look at Reichenbach’s more rigorous formulation of his argument. Consider a sequence of Fs and whether or not each is G. Perhaps the sequence is G, ∼G, ∼G, G, ∼G, ∼G, ∼G . . . , and so at each stage, the relative frequency of Gs among the Fs is 1/1, 1/2, 1/3, 1/2, 2/5, 1/3, 2/7 . . . . Either this sequence converges to a limiting relative frequency or it does not. (According to Reichenbach, this is equivalent to: either there is a corresponding objective chance or there is not.) For instance, if 1/4 of the first 100 Fs are G, 3/4 of the next 1000 are G, 1/4 of the next 10,000 are G, and so forth, then the relative frequency of Gs among Fs never converges and so there is no limiting relative frequency. No method can succeed in arriving at the limit if there is no limit, so in that event, the straight rule does no worse than any other method. On the other hand, if there is a limiting relative frequency, then the straight rule is guaranteed to converge to it in the long run: the rule’s prediction is guaranteed eventually to come within any given degree of approximation to the limiting relative frequency, and thenceforth to remain within that range. That is because by definition, L is the limit of the sequence a1, a2, a3, . . . exactly when for any small positive number ε, there is an integer N such that for any n > N, an is within ε of L. So if L is the limit of
the sequence a1, a2, a3, . . . where an is the fraction (m/n) of Gs among the first n Fs, then at some point in the sequence, its members come and thenceforth remain within ε of L, and since at any point the straight rule’s prediction of the limit is just the current member m/n of the sequence, the straight rule’s prediction of the limit is guaranteed eventually to come and thenceforth to remain within ε of L. If instead our goal is the accurate estimation of the objective chance of an F’s being G, then if there is such a chance, the straight rule is 100% likely to arrive at it — to within any specified degree of approximation — in the long run. (That is, although a fair coin might land heads repeatedly, the likelihood of its landing heads about half of the time becomes arbitrarily high as the number of tosses becomes arbitrarily large.) So the straight rule is “asymptotic”: its prediction is guaranteed to converge to the truth in the long run, if any rule will. Surely, Reichenbach seems to be suggesting, it would be irrational to knowingly employ a rule that is not asymptotic if an asymptotic rule is available.

Plenty of rules are not asymptotic. For instance, consider the “counterinductive rule”: If n Fs have been examined and m have been found to be G, then take (1 − m/n) to approximate the actual fraction of Gs among Fs. The counterinductive rule is not asymptotic since, unless the limit is 50%, its prediction is guaranteed to diverge from the straight rule’s in the long run, and the straight rule’s is guaranteed to converge to the truth (if there is a truth to converge to) in the long run. In this way, Reichenbach’s argument purports to justify following the straight rule rather than the counterinductive rule. (Notice how this argument aims to walk a very fine line: to justify induction without saying anything about induction’s likelihood of leading us to the truth!)
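The asymptotic point can be made concrete with a small sketch (the function name and the 0.7 limiting frequency are my own illustrative assumptions, not anything in Reichenbach). It first computes the running relative frequencies for the sample sequence G, ∼G, ∼G, G, ∼G, ∼G, ∼G discussed above, then compares the straight rule with the counterinductive rule on a long simulated sequence whose limiting relative frequency is 0.7:

```python
from fractions import Fraction
import random

def straight_rule(outcomes):
    """Running straight-rule posits: after n outcomes with m Gs, posit m/n."""
    m, estimates = 0, []
    for n, is_g in enumerate(outcomes, start=1):
        m += is_g
        estimates.append(Fraction(m, n))
    return estimates

# The sample sequence from the text: G, ~G, ~G, G, ~G, ~G, ~G.
# Running relative frequencies: 1/1, 1/2, 1/3, 1/2, 2/5, 1/3, 2/7.
print(straight_rule([1, 0, 0, 1, 0, 0, 0]))

# A long sequence whose limiting relative frequency of Gs is 0.7.
random.seed(0)
outcomes = [random.random() < 0.7 for _ in range(100_000)]
final = float(straight_rule(outcomes)[-1])
print(final)        # straight rule: near 0.7
print(1 - final)    # counterinductive rule (1 - m/n): near 0.3, far from the limit
```

The simulation only illustrates the long-run claim; it says nothing, as the text goes on to stress, about how quickly m/n settles down, or about our entitlement to trust it in the short run.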
This argument does not rule out another rule’s working better than the straight rule in the short run — that is, converging more quickly to the limiting relative frequency than the straight rule does. For instance, the rule that would have us (even before we have ever tossed a coin!) guess 50% as the approximate limiting relative frequency of heads among the coin-toss outcomes might happen to lead to the truth right away. So in this respect, it might do better than the straight rule. But it cannot do better than the straight rule in the long run, since the straight rule is guaranteed to lead to the truth in the long run. (The 50% rule is not guaranteed to lead to the truth in the long run.)

However, in the long run (as Keynes famously quipped), we are all dead! All of our predictions are made in the short run — after a finite number of observations have been made. Why should a rule’s success in the long run, no matter how strongly guaranteed, do anything to justify our employing it in the short run? We cannot know how many cases we need to accumulate before the straight rule’s prediction comes and remains within a certain degree of approximation of the genuine limit (if there is one). So the fact that the straight rule’s prediction is guaranteed to converge eventually to the limit (if there is one) seems to do little to justify our being guided by the straight rule in the short run. Why should the straight rule’s success under conditions that we have no reason to believe we currently (or will ever!) occupy give us a good reason to use the straight rule?
(This seems to me closely related to BonJour’s objection to Reichenbach.) A final problem for Reichenbach’s argument is perhaps the most serious. A nondenumerably infinite number of rules are entitled to make the same boast as the straight rule: each is guaranteed to converge in the long run to the limiting relative frequency if one exists. Here are a few examples:

If n Fs have been examined and m have been found to be G, then take m/n + k/n for some constant k (or, if this quantity exceeds 1, then take 1) to equal (within a certain degree of approximation) the actual fraction of Gs among Fs.

If n Fs have been examined and m have been found to be G, then take the actual fraction of Gs among Fs to equal 23.83%, if n < 1,000,000, or m/n, otherwise.

If n Fs have been examined and m have been found to be G, and among the first 100 Fs examined, M were found to be G, then take the actual fraction of Gs among Fs to equal (m + M)/(n + min{n, 100}). (In other words, “double count” the first 100 Fs.)

In the long run, each of these rules converges to the straight rule and so must converge to the limit (if there is one). The straight rule cannot be shown to converge faster than any other asymptotic rule. Moreover, the asymptotic rules disagree to the greatest extent possible in their predictions: for any evidence and for any prediction, there is an asymptotic rule that endorses making that prediction on the basis of that evidence.27

[27] Reichenbach [1938, pp. 353–354] notes this problem. In [Reichenbach, 1949a, p. 447], he favors the straight rule on grounds of “descriptive simplicity”. But although Reichenbach regards “descriptive simplicity” as relevant for selecting among empirically equivalent theories, rules are not theories. In any case, the rival rules do not endorse all of the same predictions in the short run, so they are not “equivalent” there. Of course, in the long run, they are “equivalent”, but why is that fact relevant?

The first of these three rivals to the straight rule violates the constraint that if G and H are mutually exclusive characteristics and a rule endorses taking p as G’s relative frequency and q as H’s relative frequency among Fs, then the rule should endorse taking (p + q) as the relative frequency of (G or H). Furthermore, the second and third of these three rivals to the straight rule violate the constraint that the rule endorse taking the same quantity as the limiting relative frequency for any sequence of Gs and ∼Gs with a given fraction of Gs and ∼Gs, no matter how long the sequence or in what order the Gs and ∼Gs appear in it. Constraints like these have been shown to narrow down the asymptotic rules to the straight rule alone. [Salmon, 1967, pp. 85–89, 97–108; Hacking, 1968, pp. 57–58] However, it is difficult to see how to justify such constraints without begging the question. For instance, if the sequence consists of the Fs in the order in which we have observed them, then to require making the same prediction regardless of the order (as long as the total fraction of Gs is the same) is tantamount to assuming that
later Fs are no different from earlier Fs — that the future is like the past, that each F is like an independent flip of the same coin as every other F (i.e., that each F had the same objective chance of being G).

9 BAYESIAN APPROACHES

Suppose it could be shown — perhaps by a Dutch Book argument or an argument from calibration [Lange, 1999] — that rationality obliges us (in typical cases) to update our opinions by Bayesian conditionalization (or some straightforward generalization thereof, such as Jeffrey’s rule). This would be the kind of argument that Hume fails to rule out (as I explained in section 3): an argument that is a priori (despite not turning solely on semantic relations, since a degree of belief is not capable of being true or false) and that proceeds from the opinions that constitute a given inductive argument’s premises to the degree of belief that constitutes its conclusion.

Such an argument would still be far from a justification of induction. Whether Bayesian conditionalization yields induction or counterinduction, whether it underwrites our ascribing high probability to “All emeralds are green” or to “All emeralds are grue,” whether it leads us to regard a relatively small sample of observed emeralds as having any bearing at all on unexamined emeralds — all depend on the “prior probabilities” plugged into Bayesian conditionalization along with our observations. So in order to explain why we ought to reason inductively (as an adequate justification of induction must do — see section 2 above), the rationality of Bayesian conditionalization would have to be supplemented with some constraints on acceptable priors.

This argument has been challenged in several ways. Colin Howson [2000] argues that a justification of induction need not explain why we ought to reason inductively. Thus, it can appeal to our prior probabilities; Bayesian conditionalization, acting on these particular priors, underwrites recognizably inductive updating.
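The claim that everything distinctively "inductive" lives in the priors can be checked with a toy calculation (a sketch; the specific numbers, hypothesis labels, and the helper `conditionalize` are my own illustrative assumptions). Both "All emeralds are green" and "All emeralds are grue" entail that every emerald examined before 3000 is green, so the evidence assigns them equal likelihoods, and conditionalization simply preserves whatever ratio the priors had:

```python
def conditionalize(priors, likelihoods):
    """Bayesian conditionalization: posterior is proportional to prior x likelihood."""
    unnorm = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(unnorm.values())
    return {h: p / total for h, p in unnorm.items()}

# Evidence: 10 emeralds examined before 3000, all green. Both uniformity
# hypotheses entail this, so each gets likelihood 1; a catch-all
# "no uniformity" hypothesis gives the evidence likelihood 0.5**10 (assumed).
likelihoods = {"all green": 1.0, "all grue": 1.0, "no uniformity": 0.5**10}

inductive_priors   = {"all green": 0.6, "all grue": 0.1, "no uniformity": 0.3}
grue_skewed_priors = {"all green": 0.1, "all grue": 0.6, "no uniformity": 0.3}

post_1 = conditionalize(inductive_priors, likelihoods)
post_2 = conditionalize(grue_skewed_priors, likelihoods)
# Same evidence, same update rule, opposite verdicts: post_1 favours
# "all green", post_2 favours "all grue"; in each case the ratio of the
# two posteriors equals the ratio of the two priors (6 : 1).
print(post_1, post_2)
```

The design point is that `conditionalize` is hypothesis-neutral; only the prior assignments distinguish the inductivist from the "grue"-inductivist, which is just the dependence on priors that the text describes.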
The “initial assignments of positive probability . . . cannot themselves be justified in any absolute sense”. [Howson, 2000, p. 239] But never mind, Howson says:

Inductive arguments are in this respect like sound deductive arguments, they don’t give you something for nothing: you must put synthetic judgements in to get synthetic judgements out. But get them out you do, and in a demonstrably consistent way that satisfies certainly the majority of those intuitive criteria for inductive reasoning which themselves stand up to critical examination. [Howson, 2000, p. 239, see also p. 171]

All we really want from a justification of induction is a justification for updating our beliefs in a certain way, and that is supplied by arguments showing Bayesian conditionalization to be rationally compulsory. As Frank Ramsey says:

We do not regard it as belonging to formal logic to say what should be a man’s expectation of drawing a white or black ball from an urn; his
original expectations may within the limits of consistency be any he likes, all we have to point out is that if he has certain expectations, he is bound in consistency to have certain others. This is simply bringing probability into line with ordinary formal logic, which does not criticize premisses but merely declares that certain conclusions are the only ones consistent with them. [Ramsey, 1931, p. 189]

Ian Hacking puts the argument thus:

At any point in our grown-up lives (let’s leave babies out of this) we have a lot of opinions and various degrees of belief about our opinions. The question is not whether these opinions are ‘rational’. The question is whether we are reasonable in modifying these opinions in light of new experience, new evidence. [Hacking, 2001, p. 256]

But the traditional problem of induction is whether by reasoning inductively, we arrive at knowledge. If knowledge involves justified true belief, then the question is whether true beliefs arrived at inductively are thereby justified. And if an inductive argument, to justify its conclusion, must proceed from a prior state of opinion that we are entitled to occupy, then the question becomes whether we are entitled to those prior opinions, and if so, how come. I said that Bayesian conditionalization can underwrite reasoning that is intuitively inductive, but with other priors plugged into it, Bayesian conditionalization underwrites reasoning that is counterinductive or even reasoning that involves the confirmation of no claims at all regarding unexamined cases.

However, it might be objected that if hypothesis h (given background beliefs b) logically entails evidence e, then as long as pr(h|b) is non-zero and pr(e|b) is non-zero (and less than 1), it follows that pr(e|h&b) = 1, and so by Bayes’s theorem, we have pr(h|e&b) = pr(h|b)pr(e|h&b)/pr(e|b) = pr(h|b)/pr(e|b) > pr(h|b), so by Bayesian conditionalization, e confirms h. On this objection, then, Bayesian conditionalization automatically yields induction.
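A toy probability space makes vivid the reply that follows (the four-world setup and the uniform priors are my own illustrative assumptions): when the two emeralds' colours are probabilistically independent under the priors, the evidence e raises the probability of the generalization h yet leaves the prediction g about the next emerald exactly where it was.

```python
# Two unexamined emeralds; a "world" assigns each a colour. Under these
# priors the two colours are probabilistically independent.
priors = {("green", "green"): 0.25, ("green", "blue"): 0.25,
          ("blue", "green"): 0.25, ("blue", "blue"): 0.25}

def pr(event):
    """Probability of an event (a predicate on worlds)."""
    return sum(p for w, p in priors.items() if event(w))

def pr_given(event, evidence):
    """Conditional probability pr(event | evidence)."""
    return pr(lambda w: event(w) and evidence(w)) / pr(evidence)

h = lambda w: w == ("green", "green")   # "both emeralds are green"
e = lambda w: w[0] == "green"           # "the examined emerald is green"
g = lambda w: w[1] == "green"           # "the next emerald is green"

print(pr(h), pr_given(h, e))   # 0.25 -> 0.5: e confirms h (h entails e)
print(pr(g), pr_given(g, e))   # 0.5 -> 0.5: e does not confirm g
```

So confirmation of the universal hypothesis by its instance comes automatically, but confirmation of h's predictive accuracy does not; with these priors the observed emerald is simply silent about the next one.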
However, this confirmation of h (of “All emeralds are green,” for example) by e (“The emerald currently under examination is green”) need not involve any inductive confirmation of h — roughly, any confirmation of h’s predictive accuracy. For example, it need not involve any confirmation of g: “The next emerald I examine will turn out to be green.” Since g (given b) does not logically entail e, pr(e|g&b) is not automatically 1, and so pr(g|e&b) is not necessarily greater than pr(g|b). Howson insists that it is no part of the justification of induction to justify the choice of priors, just as deductive logic does not concern itself with justifying the premises of deductive arguments. [Howson, 2000, p. 2, see also pp. 164, 171, 239; cf. Howson and Urbach, 1989, pp. 189–190] To my mind, this parallel between deduction and induction is inapt. It presupposes that prior probabilities are the premises of inductive arguments — are, in other words, the neutral input or substrate to which is applied Bayesian conditionalization, an inductive rule of inference. But, as Howson rightly emphasizes, it is Bayesian conditionalization
that is neutral; anything distinctively “inductive” about an episode of Bayesian updating must come from the priors. Consequently, a justification of induction must say something about how we are entitled to those priors.

Some personalists about probability have argued that if anything distinctively “inductive” about Bayesian updating must come from the priors, then so much the better for resolving the problem of induction, since we are automatically entitled to adopt any coherent probability distribution as our priors. This view is prompted by the notorious difficulties (associated with Bertrand’s paradox of the chord) attending any principle of indifference for adjudicating among rival priors. For example, Samir Okasha writes:

Once we accept that the notion of a prior distribution which reflects a state of ignorance is chimerical, then adopting any particular prior distribution does not constitute helping ourselves to empirical information which should be suppressed; it simply reflects the fact that an element of guess work is involved in all empirical enquiry. [Okasha, 2001, p. 322]

Okasha’s argument seems to be that a prior state of opinion embodies no unjustified information about the world since any prior opinion embodies some information. But the inductive sceptic should reply by turning this argument around: Since any prior opinion strong enough to support an inductive inference embodies some information, no prior opinion capable of supporting an inductive inference is justified. In other words, Okasha’s argument seems to be that there are no objectively neutral priors, so

if the inductive sceptic accuses our priors of being unjustified, we need only ask the sceptic ‘What prior probability do you recommend?’ [. . . ] It does not beg the question to operate with some particular prior probability distribution if there is no alternative to doing so.
Only if the inductive sceptic can show that there is an alternative, i.e., that ‘information-free’ priors do exist, would adopting some particular prior distribution beg the question. [Okasha, 2001, p. 323]

But there is an alternative to operating from a prior opinion strong enough to support the confirmation of predictions. If the sceptic is asked to recommend a prior probability, she should suggest a distribution that makes no probability assignment at all to any prediction about the world that concerns logically contingent matters of fact. By this, I do not mean the extremal assignment of zero subjective probability to such a claim. That would be to assign it a probability: zero. Nor do I mean assigning it a vague probability value. I mean making no assignment at all to any such claim. According to the inductive sceptic, there is no degree of confidence to which we are entitled regarding predictions about unexamined cases.28

[28] Admittedly, the sceptic’s prior distribution violates the requirement that the domain of a probability function be a sigma algebra. For example, it may violate the additivity axiom [pr(q or ∼q) = pr(q) + pr(∼q)] by assigning to (q or ∼q) a probability of 1 but making no probability assignment to q and none to ∼q. Some Bayesians would conclude that the sceptic’s “pr” does not qualify as a probability function. However, the sceptic is not thereby made vulnerable to a Dutch Book. She is not thereby irrational or incoherent. What is the worst thing that can be said of her? That she shows a certain lack of commitment. That characterization will hardly bother the sceptic! It may well be overly restrictive to require that the domain of a probability function be a sigma algebra. (See, for instance, [Fine, 1973, p. 62] or any paper discussing the failure of logical omniscience.)

Though an observation’s direct result may be to assign some probability to e, the sceptic’s prior distribution fails to support inductive inferences from our observations (since it omits some of the probabilities required by Bayesian conditionalization or any generalization of it). But that is precisely the inductive sceptic’s point. There is no alternative to operating with a prior distribution that embodies information about the world, as Okasha says, if we are going to use our observations to confirm predictions. But to presuppose that we are justified in using our observations to confirm predictions is obviously to beg the question against the inductive sceptic.

10 WILLIAMS’ COMBINATORIAL JUSTIFICATION OF INDUCTION

In 1947, Donald Williams offered an a priori justification of induction that continues to receive attention [Williams, 1947; Stove, 1986, pp. 55–75]. The first ingredient in Williams’ argument is a combinatorial fact that can be proved a priori: if there is a large (but finite) number of Fs, then in most collections of Fs beyond a certain size (far smaller than the total population of Fs), the fraction of Gs is close to the fraction of Gs in the total F population. For example, consider a population of 24 marbles, of which 16 (i.e., 2/3) are white. The number of 6-member sets that are also 2/3 white (i.e., 4 white, 2 non-white) is [(16 × 15 × 14 × 13)/(4 × 3 × 2)] × [(8 × 7)/2] = 50,960. The number of sets containing 5 white and 1 non-white marbles is 34,944, and the number containing 3 white and 3 non-white marbles is 31,360. So among the 134,596 6-member sets, about 87% contain a fraction of white marbles within 1 marble (16.7%) of the fraction of white marbles in the overall population. For a marble population of any size exceeding one million, more than 90% of the possible 3000-member samples have a fraction of white marbles within 3% of the overall population’s fraction, no matter what that fraction is (even if no sample has a fraction exactly matching the overall population’s). Notice that this is true no matter how small the sample may be as a fraction of the total population. [Williams, 1947, p. 96; Stove, 1986, p. 70]

The second ingredient in Williams’ argument is the rationality of what he calls the “statistical [or proportional] syllogism”: if you know that the fraction of As that are B is r and that individual a is A, then if you know nothing more about a that is relevant to whether it is B, then r is the rational degree of confidence for you to have that a is B. Williams regards this principle as an a priori logical truth: “the native wit of mankind . . . has found the principle self-evident.” [Williams,
1947, p. 8] It appears to be far enough from the inductive arguments in question that no circularity is involved in justifying them by appealing to it. A statistical syllogism would be another argument of the kind that Hume fails to rule out (as I explained in section 3): an argument that is a priori (despite not turning solely on semantic relations, since a degree of belief is not capable of being true or false) and that proceeds from the opinions constituting a given inductive argument’s premises to the degree of belief that constitutes its conclusion.

But how can we use a statistical syllogism to justify induction? Let the As be the large samples of Fs and let B be the property of having a fraction of Gs approximating (to whatever specified degree) the fraction in the overall F population. Since it is a combinatorial fact that the fraction of As that are B is large, it follows (by the statistical syllogism) that in the absence of any evidence to the contrary, we are entitled to have great confidence that the fraction of Gs in a given large sample of Fs approximates the fraction of Gs among all Fs.29 So if f is the fraction of Gs in the observed large sample of Fs, then we are entitled (in the absence of any countervailing evidence) to have great confidence that f approximates the fraction of Gs among all Fs. As Williams says:

Without knowing exactly the size and composition of the original population to begin with, we cannot calculate . . . exactly what proportion of our “hyper-marbles” [i.e., large samples of marbles] have the quality of nearly-matching-the-population, but we do know a priori that most of them have it. Before we choose one of them, it is hence very probable that the one we choose will be one of those which match or nearly match; after we have chosen one, it remains highly probable that the population closely matches the one we have, so that we need only look at the one we have to read off what the population, probably and approximately, is like. [Williams, 1947, pp. 98–99]

Induction is thereby justified.

[29] This argument, as used to infer from a population’s fraction of Gs to a sample’s likely fraction, is often called “direct inference”, and hence the statistical syllogism is termed “the principle of direct inference” [Carnap, 1950]. An inference in the other direction, from sample to population, is then termed “inverse reasoning”. That probably a sample’s fraction of Gs approximates the population’s fraction can apparently take us in either direction.

However, we might question whether the statistical syllogism is indeed a principle of good reasoning.30 It presupposes that we assign every possible large sample the same subjective probability of being selected (so since there are overwhelmingly more large samples that are representative of the overall population, we are overwhelmingly confident that the actual sample is representative). Why is this equal-confidence assignment rationally obligatory?

[30] It cannot be justified purely by Bayesian considerations. If P is that the fraction of Gs in the population is within a certain range of f, and S is that f is the fraction of Gs in the large observed sample, then Bayes’s theorem tells us that pr(P|S) = pr(P) pr(S|P)/pr(S), where all of these probabilities are implicitly conditional on the size of the sample. It is unclear how to assign the priors pr(S) and pr(P). Even pr(S|P) does not follow purely from combinatorial considerations.

The intuition seems to be that when we have no reason to assign any of these samples greater subjective probability than any other, we ought to assign them equal subjective probabilities. To do otherwise would be irrational. But perhaps, in the absence of any relevant information about them, we have no reason to assign any subjective probabilities to any of them. In other words, the motivation for the equal-confidence assignment seems to be that if we have no relevant information other than that most marbles in the urn (or most possible samples) are red (or representative), then it would be irrational to be confident that a non-red marble (or unrepresentative sample) will be selected. But this undoubted fact does not show that it would be rational to expect that a red (or representative) one will be selected. Perhaps we are not entitled to any expectation unless we have further information, such as that the marble (or sample) is selected randomly (i.e., that every one has an equal objective chance of being selected). Williams is quite correct in insisting that in the absence of any relevant information, we are not entitled to believe that the sample is selected randomly. So why are we entitled to other opinions in the absence of any relevant information?

Furthermore, we do have further information — for example, that samples with members that are remote from us in space or time will not be selected. This information does not suggest that the sample we select is unrepresentative — if we believe that Fs are uniform in space and time. But we cannot suppose so without begging the question.31 Since Williams’s argument is purely formal, we could apparently just as well take G as green or as grue while taking the Fs as the emeralds. But we cannot do both on pain of being highly confident both that all emeralds are green and that all emeralds are grue.
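The combinatorial fact behind Williams' first ingredient is easy to check directly. A short sketch (the variable and function names are mine) recomputes the 24-marble example above; note that the count of exactly-2/3-white samples comes out to 1820 × 28 = 50,960, which puts the share of 6-member samples within one marble of the population's 2/3 fraction at about 87%:

```python
from math import comb

white, nonwhite, sample = 16, 8, 6   # 24 marbles, 2/3 white; 6-member samples

def samples_with(w):
    """Number of 6-member samples containing exactly w white marbles."""
    return comb(white, w) * comb(nonwhite, sample - w)

total = comb(white + nonwhite, sample)
print(samples_with(4))   # 1820 * 28 = 50960 samples that are exactly 2/3 white
print(samples_with(5))   # 34944
print(samples_with(3))   # 31360
print(total)             # 134596

# Samples whose white fraction is within one marble of the population's 2/3,
# i.e. samples with 3, 4, or 5 white marbles out of 6:
close = sum(samples_with(w) for w in (3, 4, 5))
print(round(close / total, 3))   # 0.871
```

The calculation establishes only the combinatorial premise; as the surrounding discussion stresses, moving from "most samples are representative" to confidence that *our* sample is representative still requires the statistical syllogism.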
If we regard the fact that all of the emeralds are sampled before the year 3000 as further information suggesting that the sample may be unrepresentative, then neither hypothesis is supported.32 Finally, is there any reason to believe that statistical syllogisms will lead us to place high confidence in truths more often than in falsehoods (or, at least, that they have a high objective chance of doing so)? If, in fact, our samples are unrepresentative in most cases where we have no other relevant information, then statistical syllogisms will lead us to place high confidence in falsehoods more often than in truths. That we have no good reason to think that a sample is unrepresentative does not show that we are likely to reach the truth if we presume

31. For defense of the statistical syllogism, see [Williams, 1947, pp. 66–73 and p. 176; Carnap, 1959, p. 494; McGrew, 2001, pp. 161–167]. Maher [1996] has argued that the fraction of Gs in the sample may suggest that the sample is unrepresentative and so undercut the statistical syllogism. How are we entitled a priori to the opinion that a sample with a given fraction of Gs is no less likely if it is representative than if it is unrepresentative?

32. Stove [1986, pp. 140–142] says that he is not committed to reasoning with green in the same way as with grue, but he identifies no a priori ground for privileging one over the other. Campbell [1990], who endorses Williams’s argument, says that “a complex array of higher-order inductions, about natural kinds, about discontinuities in nature, and about the kinds of properties it is significant to investigate” justifies privileging green over grue. But aren’t these inductions also going to be subject to versions of the grue problem?
Marc Lange
it to be representative unless we believe that if it were unrepresentative, we would probably have a good reason to suspect so. But why should we believe that we would be so well-informed?33

11 THE INDUCTIVE LEAP AS MYTHICAL

Hume’s argument appears to show that our observations, apart from any theoretical background, are unable to confirm or to disconfirm any predictions. Supplemented by different background opinions (whether understood as prior probabilities or a uniformity principle), the same observations exert radically different confirmatory influences. But then to be warranted in making any inductive leap beyond the safety of our observations, we must be justified in holding some opinions regarding some prediction’s relation to our observations. These opinions may rest on various scientific theories, which in turn have been confirmed by other observations. But when we pursue the regress far enough — or look at cases where we have very little relevant background knowledge — how can any such opinions be justified? My view [Lange, 2004] is that especially in theoretically impoverished circumstances, the content of our observation reports may include certain expectations regarding the observations’ relations to as yet undiscovered facts — expectations enabling those observations to confirm predictions regarding those facts. An observation report (“That is an F ”) classifies something as belonging to a certain category (F ). That category may be believed to be a “natural kind” of a certain sort (e.g., a species of star, mineral, animal, disease, chemical . . . ). Part of what it is for a category to be a natural kind of a given sort is for its members generally to be alike in certain respects. (These respects differ for different sorts of natural kind.)
Members of the same species of star, for instance, are supposed to be generally similar in temperature, intrinsic luminosity, the mechanism by which they generate light, and so forth (and generally different in many of these respects from members of other star species). Therefore, to observe that certain stars are (for instance) Cepheid-type variables, an astronomer must be justly prepared (in the absence of any further information) to regard the examined Cepheids’ possession of various properties (of the sorts characteristic of natural kinds of stars) as confirming that unexamined Cepheids possess those properties as well. In that case, Cepheid observations (in the absence of any other evidence) suffice to justify astronomers in expecting that unexamined Cepheids will exhibit, say, a certain simple period-luminosity relation that examined Cepheids display. No opinions independent of Cepheid observations must be added to them in order to give them the capacity to confirm certain predictions. Hence, there arises no regress-inducing problem of justifying some such independent opinions.

33. I made an analogous point regarding BonJour’s a priori justification of induction. McGrew [2001, pp. 167–170] responds to this criticism of Williams’s argument. Kyburg [1956] argues that even if we knew that some method of inductive reasoning would more often lead to truths than to falsehoods in the long run, we could not justify using it except by a statistical syllogism.
To observe that certain stars are Cepheids, an astronomer must already have the resources for going beyond those observations. As Wilfrid Sellars says, in arguing for a similar point:

The classical ‘fiction’ of an inductive leap which takes its point of departure from an observation base undefiled by any notion as to how things hang together is not a fiction but an absurdity. . . . [T]here is no such thing as the problem of induction if one means by this a problem of how to justify the leap from the safe ground of the mere description of particular situations, to the problematic heights of asserting lawlike sentences and offering explanations. [Sellars, 1963b, p. 355]

I could not make observations categorizing things into putative natural kinds if I were not justified in expecting the members of those kinds to be alike in the respects characteristic of such kinds. The observations that we most readily become entitled to make in theoretically impoverished contexts are (perhaps paradoxically) precisely those with inductive import — those purporting to classify things into various sorts of natural kinds. That is because although an observation report (“That is an F ”) has noninferential justification, I am justified in making it only if I can justly infer, from my track record of accuracy in making similar responses in other cases, that my report on this occasion is probably true [Sellars, 1963a, pp. 167–170]. If I have no reason to trust myself — if I am not in a position to infer the probable accuracy of my report — then on mature reflection, I ought to disavow the report, regarding it as nothing more than a knee-jerk reaction. But obviously, this inference from my past accuracy in making similar responses is inductive.
I am entitled to regard my past successes at identifying F s as confirming that my latest F report is accurate only because I am entitled to expect that generally, unexamined F s look like the F s that I have accurately identified in the past and look different from various kinds of non-F s. (Only then is my past reliability in distinguishing F s good evidence — in the absence of countervailing information — for my future reliability in doing so.) That is just what we expect when the F s form a natural kind. My past reliability at identifying F s, where F s form a natural kind, confirms (in the absence of countervailing information) my future reliability without having to be supplemented by further regress-inducing background opinions. Accordingly, “taxonomic observations” (i.e., identifications of various things as members of various species) are among the sorts of observations that scientists are most apt to be in a position to make in a new field, where their background theory is impoverished. Of course, there is no logical guarantee that these putative observations are accurate. But one becomes qualified to make them precisely because of — rather than despite — the “thickness” of their content. If observers in theoretically impoverished contexts had not taken the F ’s as forming a certain sort of natural kind, then the range of cases in which those observers are justified in making “That is F ” reports would have been different. For example, suppose astronomers had taken the Cepheid-type variable stars as
consisting simply of all and only those variables having light curves shaped like those of the two prototype Cepheids (delta Cephei and eta Aquilae). Then astronomers would have classified as non-Cepheids certain stars that they actually deemed to be Cepheids (and would have deemed other stars to be “somewhat Cepheid” or “Cepheid-like”, categories that are never used). Instead, astronomers took the Cepheid category as extending from the prototypical Cepheids out to wherever the nearest significant gap appears in the distribution of stellar light curves. That is because they understood the Cepheids to be a natural kind of star sharply different from any other kind. A taxonomic observation report (such as “That is a Cepheid”) embodies expectations regarding as yet undiscovered facts, and these expectations — which ground the most basic inductive inferences made from those observations — are inseparable from the reports’ circumstances of application. The content of the observation reports cannot be “thinned down” so as to remove all inductive import without changing the circumstances in which the reports are properly made. Hume’s problem of induction depends on unobserved facts being “loose and separate” (E, p. 49) from our observational knowledge. But they are not.

12 CONCLUSION
Having surveyed some of the most popular recent responses to Hume’s argument (and having, in the previous section, bravely sketched the kind of response I favor), I give the final word to Hume:

Most fortunately it happens, that since reason is incapable of dispelling these clouds, nature herself suffices to that purpose, and cures me of this philosophical melancholy and delirium, either by relaxing this bent of mind, or by some avocation, and lively impression of my senses, which obliterates all these chimeras. I dine, I play a game of backgammon, I converse, and am merry with my friends; and when after three or four hours’ amusement, I wou’d return to these speculations, they appear so cold, and strain’d, and ridiculous, that I cannot find in my heart to enter into them any farther. Here then I find myself absolutely and necessarily determin’d to live, and talk, and act like other people in the common affairs of life. (T, p. 269)

Most fortunately for us, Hume did not act like other people in all affairs of life. Rather, he bequeathed to us an extraordinary problem from which generations of philosophers have derived more than three or four hours’ amusement. I, for one, am very grateful to him.

BIBLIOGRAPHY

[Beauchamp and Rosenberg, 1981] T. Beauchamp and A. Rosenberg. Hume and the Problem of Causation. Oxford University Press, New York, 1981.
[Black, 1954] M. Black. Inductive support of inductive rules. In Black, Problems of Analysis. Cornell University Press, Ithaca, NY, pp. 191–208, 1954.
[BonJour, 1986] L. BonJour. A reconsideration of the problem of induction. Philosophical Topics 14, pp. 93–124, 1986.
[BonJour, 1998] L. BonJour. In Defense of Pure Reason. Cambridge University Press, Cambridge, 1998.
[Brandom, 1994] R. Brandom. Making It Explicit. Harvard University Press, Cambridge, MA, 1994.
[Broad, 1952] C. D. Broad. Ethics and the History of Philosophy. Routledge and Kegan Paul, London, 1952.
[Brueckner, 2001] A. Brueckner. BonJour’s a priori justification of induction. Pacific Philosophical Quarterly 82, pp. 1–10, 2001.
[Butler, 1813] J. Butler. Analogy of religion, natural and revealed. In Butler, The Works of Joseph Butler, volume 1. William Whyte, Edinburgh, 1813.
[Campbell, 1990] K. Campbell. Abstract Particulars. Blackwell, Oxford, 1990.
[Carnap, 1950] R. Carnap. Logical Foundations of Probability. University of Chicago Press, Chicago, 1950.
[Dummett, 1981] M. Dummett. Frege: Philosophy of Language, 2nd ed. Harvard University Press, Cambridge, MA, 1981.
[Feigl, 1950] H. Feigl. De principiis non disputandum. . . ? In M. Black (ed.), Philosophical Analysis. Cornell University Press, Ithaca, pp. 119–156, 1950.
[Fine, 1973] T. Fine. Theories of Probability. Academic, New York, 1973.
[Franklin, 1987] J. Franklin. Non-deductive logic in mathematics. British Journal for the Philosophy of Science 38, pp. 1–18, 1987.
[Garrett, 1997] D. Garrett. Cognition and Commitment in Hume’s Philosophy. Oxford University Press, New York, 1997.
[Goodman, 1954] N. Goodman. Fact, Fiction and Forecast. Harvard University Press, Cambridge, MA, 1954.
[Hacking, 1968] I. Hacking. One problem about induction. In I. Lakatos (ed.), The Problem of Inductive Logic. North-Holland, Amsterdam, pp. 44–59, 1968.
[Hacking, 2001] I. Hacking. An Introduction to Probability and Inductive Logic. Cambridge University Press, Cambridge, 2001.
[Harman, 1965] G. Harman. Inference to the best explanation. Philosophical Review 74, pp. 88–95, 1965.
[Horwich, 1982] P. Horwich. Probability and Evidence. Cambridge University Press, Cambridge, 1982.
[Howson, 2000] C. Howson. Hume’s Problem: Induction and the Justification of Belief. Clarendon, Oxford, 2000.
[Howson and Urbach, 1989] C. Howson and P. Urbach. Scientific Reasoning: The Bayesian Approach. Open Court, La Salle, IL, 1989.
[Hume, 1977] D. Hume. An Enquiry Concerning Human Understanding, ed. Eric Steinberg. Hackett, Indianapolis, 1977.
[Hume, 1978] D. Hume. A Treatise of Human Nature, ed. L. A. Selby-Bigge and P. H. Nidditch, 2nd ed. Clarendon, Oxford, 1978.
[Keynes, 1921] J. M. Keynes. A Treatise on Probability. Macmillan, London, 1921.
[Kornblith, 1993] H. Kornblith. Inductive Inference and its Natural Ground. MIT Press, Cambridge, MA, 1993.
[Kyburg, 1956] H. Kyburg. The justification of induction. Journal of Philosophy 53, pp. 394–400, 1956.
[Lange, 1999] M. Lange. Calibration and the epistemological role of Bayesian conditionalization. Journal of Philosophy 96, pp. 294–324, 1999.
[Lange, 2004] M. Lange. Would direct realism resolve the classical problem of induction? Nous 38, pp. 197–232, 2004.
[Lipton, 1991] P. Lipton. Inference to the Best Explanation. Routledge, London, 1991.
[Mackie, 1980] J. L. Mackie. The Cement of the Universe. Clarendon, Oxford, 1980.
[Maher, 1996] P. Maher. The hole in the ground of induction. Australasian Journal of Philosophy 74, pp. 423–432, 1996.
[McGrew, 2001] T. McGrew. Direct inference and the problem of induction. The Monist 84, pp. 153–178, 2001.
[Mill, 1872] J. S. Mill. A System of Logic, 8th ed. Longmans, London, 1872.
[Norton, 2003] J. Norton. A material theory of induction. Philosophy of Science 70, pp. 647–670, 2003.
[Okasha, 2001] S. Okasha. What did Hume really show about induction? The Philosophical Quarterly 51, pp. 307–327, 2001.
[Papineau, 1993] D. Papineau. Philosophical Naturalism. Blackwell, Oxford, 1993.
[Popper, 1959] K. Popper. The Logic of Scientific Discovery. Basic Books, New York, 1959.
[Popper, 1972] K. Popper. Conjectural knowledge: my solution to the problem of induction. In Popper, Objective Knowledge. Clarendon, Oxford, pp. 1–31, 1972.
[Putnam, 1994] H. Putnam. Reichenbach and the limits of vindication. In Putnam, Words and Life. Harvard University Press, Cambridge, MA, pp. 131–148, 1994.
[Ramsey, 1931] F. Ramsey. Truth and probability. In Ramsey, The Foundations of Mathematics and Other Logical Essays. Routledge and Kegan Paul, London, pp. 156–198, 1931.
[Read and Richman, 2000] R. J. Read and K. A. Richman. The New Hume Debate. Routledge, London, 2000.
[Reichenbach, 1938] H. Reichenbach. Experience and Prediction. University of Chicago Press, Chicago, 1938.
[Reichenbach, 1949a] H. Reichenbach. The Theory of Probability. University of California Press, Berkeley, 1949.
[Reichenbach, 1949b] H. Reichenbach. Comments and criticism. Journal of Philosophy 46, pp. 545–549, 1949.
[Reichenbach, 1968] H. Reichenbach. The Rise of Scientific Philosophy. University of California Press, Berkeley, 1968.
[Rhees and Phillips, 2003] R. Rhees and D. Z. Phillips. Wittgenstein’s On Certainty: There — Like our Life. Blackwell, Oxford, 2003.
[Russell, 1919] B. Russell. Introduction to Mathematical Philosophy. George Allen & Unwin, London, 1919.
[Russell, 1948] B. Russell. Human Knowledge: Its Scope and Limits. Simon and Schuster, New York, 1948.
[Russell, 1959] B. Russell. The Problems of Philosophy. Oxford University Press, London, 1959.
[Salmon, 1957] W. Salmon. Should we attempt to justify induction? Philosophical Studies 8, pp. 33–48, 1957.
[Salmon, 1963] W. Salmon. Inductive inference. In B. Baumrin (ed.), Philosophy of Science: The Delaware Seminar, volume II. Interscience Publishers, New York and London, pp. 353–370, 1963.
[Salmon, 1967] W. Salmon. The Foundations of Scientific Inference. University of Pittsburgh Press, Pittsburgh, 1967.
[Salmon, 1981] W. Salmon. Rational prediction. British Journal for the Philosophy of Science 32, pp. 115–125, 1981.
[Salmon, 1984] W. Salmon. Scientific Explanation and the Causal Structure of the World. Princeton University Press, Princeton, 1984.
[Salmon, 1991] W. Salmon. Hans Reichenbach’s vindication of induction. Erkenntnis 35, pp. 99–122, 1991.
[Salmon, Barker, and Kyburg, 1965] W. Salmon, S. Barker, and H. Kyburg, Jr. Symposium on inductive evidence. American Philosophical Quarterly 2, pp. 265–280, 1965.
[Sankey, 1997] H. Sankey. Induction and natural kinds. Principia 1, pp. 239–254, 1997.
[Sellars, 1963a] W. Sellars. Empiricism and the philosophy of mind. In Sellars, Science, Perception and Reality. Routledge and Kegan Paul, London, pp. 127–196, 1963.
[Sellars, 1963b] W. Sellars. Some reflections on language games. In Sellars, Science, Perception and Reality. Routledge and Kegan Paul, London, pp. 321–358, 1963.
[Skyrms, 1986] B. Skyrms. Choice and Chance, 3rd ed. Wadsworth, Belmont, CA, 1986.
[Smith, 1941] N. K. Smith. The Philosophy of David Hume. Macmillan, London, 1941.
[Sober, 1988] E. Sober. Reconstructing the Past. Bradford, Cambridge, MA, 1988.
[Stove, 1965] D. Stove. Hume, probability, and induction. Philosophical Review 74, pp. 160–177, 1965.
[Stove, 1973] D. Stove. Probability and Hume’s Inductive Scepticism. Clarendon, Oxford, 1973.
[Stove, 1986] D. Stove. The Rationality of Induction. Oxford University Press, Oxford, 1986.
[Strawson, 1952] P. F. Strawson. An Introduction to Logical Theory. Methuen, London, 1952.
[Strawson, 1958] P. F. Strawson. On justifying induction. Philosophical Studies 9, pp. 20–21, 1958.
[Stroud, 1977] B. Stroud. Hume. Routledge, London and New York, 1977.
[Thagard, 1978] P. Thagard. The best explanation: criterion for theory choice. Journal of Philosophy 75, pp. 76–92, 1978.
[van Cleve, 1984] J. van Cleve. Reliability, justification, and the problem of induction. In P. French, T. Uehling, and H. Wettstein (eds.), Midwest Studies in Philosophy IX. University of Minnesota Press, Minneapolis, pp. 555–567, 1984.
[van Fraassen, 1981] B. van Fraassen. The Scientific Image. Clarendon, Oxford, 1981.
[van Fraassen, 1989] B. van Fraassen. Laws and Symmetry. Clarendon, Oxford, 1989.
[Williams, 1947] D. C. Williams. The Ground of Induction. Harvard University Press, Cambridge, MA, 1947.
[Wittgenstein, 1953] L. Wittgenstein. Philosophical Investigations. Blackwell, Oxford, 1953.
THE DEBATE BETWEEN WHEWELL AND MILL ON THE NATURE OF SCIENTIFIC INDUCTION

Malcolm Forster
1 WHY THE DEBATE IS NOT MERELY TERMINOLOGICAL

The very best examples of scientific induction were known in the time of William Whewell (1794–1866) and John Stuart Mill (1806–1873). It is puzzling, therefore, that there was such a deep disagreement between them about the nature of induction. It is perhaps astounding that the dispute is unresolved to this very day! What disagreement could there be about Newton’s discovery of universal gravitation? Prior to Newton, it was well known that gravity acts on objects near the Earth’s surface, and Copernicus even speculated that the planets have a spherical shape because they have their own gravity. But Newton was the first to understand that it’s the Earth’s gravity that keeps the Moon in orbit around the Earth, and that the Sun’s gravity keeps the Earth and the Moon in orbit around the Sun. At the root of this discovery was Newton’s explication of the kinematical concept of acceleration. To understand that the Moon (just like the fabled apple) is pulled by the Earth, one has to understand that the Moon is accelerating towards the Earth even if it is moving uniformly on its circular orbit around the Earth. Acceleration must not be defined as the time rate of change of speed, but as the time rate of change of velocity, where velocity has direction as well as magnitude. Thus, the Moon is accelerating towards the Earth because its velocity is changing direction. Galileo, on the other hand, worked with a circular law of inertia, according to which uniform circular motion around the Earth was a “natural” motion that required no force. Further explication of the new conception of acceleration led Newton to discover that if the line from a point O to a body B sweeps out equal areas (Kepler’s second law), then B is accelerating towards O.
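In modern notation, this first result can be sketched as follows (a reconstruction in polar coordinates, not Newton’s own geometric argument):

```latex
% Polar coordinates (r, \theta) centered at O; the acceleration of B
% has radial and transverse components
\[
  a_r = \ddot{r} - r\dot{\theta}^{2},
  \qquad
  a_\theta = \frac{1}{r}\,\frac{d}{dt}\!\bigl(r^{2}\dot{\theta}\bigr).
\]
% Kepler's second law says the areal rate swept by the line OB is constant:
\[
  \frac{dA}{dt} = \tfrac{1}{2}\,r^{2}\dot{\theta} = \text{const}
  \;\Longrightarrow\;
  \frac{d}{dt}\bigl(r^{2}\dot{\theta}\bigr) = 0
  \;\Longrightarrow\;
  a_\theta = 0.
\]
```

The acceleration thus has no transverse component and points entirely along the line OB. Showing that an elliptical path with O at one focus then forces the radial acceleration to fall off as the inverse square of the distance is the harder second step.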
If, in addition, the body follows an elliptical path with O at one focus (Kepler’s first law), then the acceleration towards O is inversely proportional to the square of the distance of B from O. In the case of the planets moving around the sun, if we assume that the constant of proportionality is the mass of the sun, then Kepler’s third law follows as well. Thus, Newton’s new conception of acceleration causes Kepler’s three laws to “jump together” in a way that tests the conceptions that Kepler had previously employed, involving ellipses,
Handbook of the History of Logic. Volume 10: Inductive Logic. Volume editors: Dov M. Gabbay, Stephan Hartmann and John Woods. General editors: Dov M. Gabbay and John Woods. © 2011 Elsevier BV. All rights reserved.
areas swept out by the line OB, the mean length of that line, and its period of revolution around the sun. For Whewell, the addition of the conceptions in each of these inductions is the defining characteristic of induction. Whewell introduced a new term for the process of binding the ‘facts’ by a new conception. He called it the colligation of facts, and used this phrase interchangeably with the word ‘induction’. Mill reacted negatively to this ‘improper’ use of the term. Mill agreed that new conceptions are often applied to the ‘facts’ during an induction, but he insisted that they are not part of the induction, and certainly not a defining characteristic. For Mill, induction consisted in extrapolating or interpolating a regularity from the known instances to the unknown instances, as is classically the case in examples of simple enumerative induction such as: All observed swans are white; therefore all swans are white. Whewell agreed that interpolation and extrapolation do, in general, result from a colligation of facts, but denied that they should be the property that defines induction. It is tempting at this point to dismiss the debate as merely terminological. Whewell has an unusual conception of what induction is, but once it is taken on board, it is possible to translate between the two vocabularies. I agree that there is a large terminological component in the debate, but I insist that it is not merely terminological. Behind the difference in terminology is a very deep disagreement about the objectivity of human knowledge. Mill and Whewell both want to defend the objectivity of human knowledge. But they have quite distinctive views on how it comes about, and Whewell’s idea is interesting and new. Mill is entrenched in the rather extreme empiricist view that human knowledge is objective because it is built on an objective foundation of empirically given statements from which higher claims are inferred using the objective canons of inductive reasoning.
Human knowledge maintains its objectivity (to the extent that it succeeds) by minimizing the influences of subjective elements at every stage of the process. For Whewell, subjective and objective elements are inseparable parts of human knowledge at any level in the hierarchy of knowledge, from the concept-ladenness of perceptual knowledge at the bottom, to the concept-ladenness of the highest forms of scientific knowledge at the top. His counter-proposal is that empirical success at the higher levels of knowledge, captured in terms of what he called the consilience of inductions, can help to secure the lower levels in a kind of bootstrapping effect. For example, Kepler’s colligations of facts are concept-laden in a way that makes them subjective at first, but once Newton used the new conception of force and acceleration to show how the facts, described in terms of Kepler’s colligations, lead successfully to a higher-level colligation of facts, then the subjective elements involved are successfully “objectified”. Knowledge is like a building in which the addition of higher floors helps strengthen the lower levels. Whewell harbored a deep disdain for Mill’s purely empiricist philosophy, which he saw as constantly downplaying the importance of the subjective component of knowledge, or as trying to reduce it to purely empirical elements at every stage.
In contrast, the conceptual components of knowledge are, for Whewell, the very instruments that ultimately explain how human knowledge is possible. They produce the colligations that may be confirmed by the consiliences of colligations, which serve to objectify the subjective elements, making knowledge possible. The introduction of new conceptions in the colligation of facts is therefore a defining characteristic of induction. There is a major problem in trying to understand the Whewell–Mill debate from what the authors wrote. Whewell was primarily a historian of science, but Mill did not have a good knowledge of the history of science. Whewell allowed Mill to center the debate on particular examples of induction such as Kepler’s inference that Mars moves on an ellipse. They got so tied up in that example that the larger philosophical differences got lost in the discussion. It’s possible that Whewell’s hierarchical view of knowledge led him to believe that the bigger picture is played out, on a smaller scale, in smaller examples. Unfortunately, Whewell did not recall the details of the Kepler example well enough to bring out those features of it. In section 2, I attempt to remedy that problem by describing the Kepler example in a way that challenges Mill’s picture of it. Section 3 turns to Whewell’s bigger picture by discussing his tests of hypotheses, while section 4 argues that the Whewell–Mill debate helps us understand why sophisticated methods of induction have not been programmed to run automatically on a computer. Finally, section 5 asks whether the Whewell–Mill debate may help us identify fundamental limitations in the scope of Bayesian and Likelihoodist theories of evidence and confirmation.

2 THE KEPLER EXAMPLE AND THE COLLIGATION OF FACTS

The colligation of facts was Whewell’s name for scientific induction. Its defining characteristic is the introduction of a new conception not previously applied to the data at hand, which unites and connects the data.
In curve fitting, the idea is easy to visualize. According to Whewell, “the Colligation of ascertained Facts into general Propositions” consists of (1) the Selection of the Idea, (2) the Construction of the Conception, and (3) the Determination of the Magnitudes. In curve fitting, these three steps correspond to (1) the determination of the Independent Variable, (2) the Formula, and (3) the Coefficients. Once the variables are chosen (Step 1), one chooses a particular functional relationship (Step 2; choose the Formula, Conception, family of curves) characterized in terms of some adjustable parameters (which Whewell calls coefficients), and then one fits the curves to the data in order to estimate the values of the parameters (Step 3; determining the magnitude of the coefficients). Consider the simplest possible example. Suppose we hang an object on a beam balance in order to infer its mass from the distance at which a unit mass must be slid along the beam to counterbalance the object in question. If the units are chosen appropriately, and the device is built well, then the mass value can be read straight from the distance at which the unit weight balances the beam. The dependent variable chosen in step 1 of the colligation of facts is x (there is no
independent variable), and the family of “curves” or the formula chosen in step 2 is x = m, where x is the distance of the unit mass from the fulcrum and m is an adjustable parameter, which represents the mass of the object in question. Whewell’s third step in the colligation of facts refers to the determination of the mass values by inferring them from the x values using the formula. The conception being introduced is the formula (∀o)(x(o) = m(o)), where ‘o’ ranges over a set of objects. The formula is something added to or imposed upon the facts by the mind of the investigator; it is not contained in, or read from, those facts. Of course, the magnitude of the mass is read from the facts; indeed, this is the third step in Whewell’s colligation of facts. But that does not mean that the formula itself is determined by the facts. This underdetermination implies that the subjective elements in the colligation of facts make the inductive conclusions uncertain and conjectural. In order to defend the objectivity of our knowledge, we have two choices. We can choose the Millean strategy of denying that there is ever any such underdetermination, or go for the Whewellian strategy of allowing that the consilience of inductions can later test the conjecture, and upgrade its confirmational status in light of this higher-level empirical success. To take our hindsight wisdom for granted, as Mill does, and to suppose that the initial induction had this status all along, is to commit the kind of error that non-historians often make. Though our beam balance example is not a real piece of history, the Millean mistake in that example would be to take the agreement of spring balance measurements of mass and beam balance measurements of mass for granted, and to assume that the justification for postulating ‘mass’ already existed prior to the consilience.
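Whewell’s three steps map directly onto modern curve fitting, and the mapping can be sketched in a few lines of code. This is an illustration, not anything in Whewell or Forster: step 1 fixes the variables, step 2 fixes the formula (here the straight line y = a·x + b, with the beam balance’s x = m as the degenerate case), and step 3 determines the magnitudes of the coefficients from the data:

```python
def colligate_line(xs, ys):
    """Step 3 of a colligation in Whewell's sense: given the choice of
    variables (step 1) and the formula y = a*x + b (step 2), determine
    the magnitudes of the coefficients a and b by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - a * mean_x
    return a, b

def colligate_constant(xs):
    """The beam-balance case x = m: with no independent variable, the
    least-squares estimate of the single magnitude m is just the mean
    of the observed x values."""
    return sum(xs) / len(xs)
```

What the fitting routine cannot supply is step 2 itself: the formula is chosen by the investigator, not read off the data, which is precisely the underdetermination at issue in the passage above.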
Unfortunately, the debate centers around the Kepler example, and neither author presents the details of this important example in sufficient depth for the purpose at hand. It is especially confusing because Mill held the very strange and rather complicated view that Kepler did not perform any induction at all, even in the very broad sense in which Mill uses the term. Mill’s strategy is to make a distinction between a description and an explanation, and to argue that the inductive conclusion in the Kepler example is merely a description of the data, and that, therefore, there was no induction performed by Kepler. For example, in his view, when the ancients hypothesized that the planets move by being embedded on crystalline spheres, they put forward an explanation of celestial motions. But when Ptolemy and Copernicus conceived of the motions in terms of the combinations of circles, they were merely putting forward a description. In Mill’s words: When the Greeks abandoned the supposition that the planetary motions were produced by the revolution of material wheels, and fell back upon the idea of “mere geometrical spheres or circles,” there was more in this change of opinion than the mere substitution of an ideal curve for a physical one. There was the abandonment of a theory, and the replacement of it by a mere description. No one would think of calling the doctrine of material wheels a mere description. That doctrine was an attempt to point out the force by which the planets were acted upon, and compelled to move in their orbits. But when, by a great step in philosophy, the materiality of the wheels was discarded, and the geometrical forms alone retained, the attempt to account for the motions was given up, and what was left of the theory was a mere description of the orbits. [Mill, 1872, Book III, Chapter ii, section 4]
The Debate between Whewell and Mill on the Nature of Scientific Induction
Malcolm Forster
It’s true that no one would think of calling the doctrine of material wheels a mere description. But it is very strange that Mill should insist that it becomes a mere description as soon as the materiality of the wheels is discarded. For these “mere descriptions” entail predictions that are not part of the data, and anything that goes beyond the data takes us from the known to the unknown, and should therefore count as an induction, according to Mill’s own definition. Thus, even if Kepler’s conclusion were a mere description, in the sense that Mill has just described, that should not disqualify Kepler’s inference from counting as an induction. In order to be as charitable as possible to Mill, let me begin with the example that he presents as the clearest in his favor. It is about the circumnavigation of an island: A navigator sailing in the midst of the ocean discovers land: he cannot at first, or by any one observation, determine whether it is a continent or an island; but he coasts along it, and after a few days finds himself to have sailed completely round it: he then pronounces it an island. Now there was no particular time or place of observation at which he could perceive that this land was entirely surrounded by water: he ascertained the fact by a succession of partial observations, and then selected a general expression which summed up in two or three words the whole of what he so observed. But is there anything of the nature of an induction in this process? 
Did he infer anything that had not been observed, from something else which had? Certainly not. He had observed the whole of what the proposition asserts. That the land in question is an island, is not an inference from the partial facts which the navigator saw in the course of his circumnavigation; it is the facts themselves; it is a summary of those facts; the description of a complex fact, to which those simpler ones are as the parts of a whole. [Mill, 1872, Book III, ch. ii, section 3] Astonishingly, even in this example, Mill’s case is very weak. For if we think carefully about what is observed in this example, it is the similarity of the view of the shoreline at the start and the end of the circumnavigation. The views are not exactly the same because the distance from the shore is different, the tides are different, and the times of day are different. It is not given in the facts that the views are of the same shoreline. That is a conclusion. The hypothesis that an island has been circumnavigated explains why the views look similar. That the conclusion is inductive is made plain by the fact that it makes a prediction,
which may be false. For it predicts that if we continue sailing further in the same direction, then we will see an ordered sequence of previously seen views of the shoreline. It is puzzling that Mill does not see this; he clearly defines induction, in his terms, as any inference from the known to the unknown. Perhaps he sees the logical gap as small in this case. But it gets much larger in the Kepler example because it is not merely a circumnavigation that is inferred, but also the exact path (Kepler’s first law) and rate of motion (Kepler’s area law). The puzzle is resolved a little once we look more carefully at Mill’s description of the Kepler example. He continues from the previous passage. Now there is, I conceive, no difference in kind between this simple operation [in the island example], and that by which Kepler ascertained the nature of the planetary orbits: and Kepler’s operation, all at least that was characteristic in it, was not more an inductive act than that of our supposed navigator. The object of Kepler was to determine the real path described by each of the planets, or let us say the planet Mars (since it was of that body that he first established the two of his three laws which did not require a comparison of planets). To do this there was no other mode than that of direct observation: and all which observation could do was to ascertain a great number of the successive places of the planet; or rather, of its apparent places. That the planet occupied successively all these positions, or at all events, positions which produced the same impressions on the eye, and that it passed from one of these to another insensibly, and without any apparent breach of continuity; thus much the senses, with the aid of the proper instruments, could ascertain. What Kepler did more than this, was to find what sort of a curve these points would make, supposing them to be all joined together. He expressed the whole series of the observed places of Mars by what Dr. 
Whewell calls the general conception of an ellipse. This operation was far from being as easy as that of the navigator who expressed the series of his observations on successive points of the coast by the general conception of an island. But it is the very same sort of operation; and if the one is not an induction but a description, this must also be true of the other. [Mill, 1872, Book III, ch. ii, section 3] Mill’s first naiveté is his passing from “the successive apparent places of the planet” to “the successive places of the planet”, as if there is no important gap between the 3-dimensional positions of Mars and the angular positions of Mars relative to the fixed stars. Then, without any additional argument, Mill simply affirms the analogy: “. . . if the one is not an induction but a description, this must also be true of the other.” Let’s read more. The only real induction concerned in the case, consisted in inferring that because the observed places of Mars were correctly represented
by points in an imaginary ellipse, therefore Mars would continue to revolve in that same ellipse; and in concluding (before the gap had been filled up by further observations) that the positions of the planet during the time which intervened between two observations, must have coincided with the intermediate points of the curve. For these were facts which had not been directly observed. They were inferences from the observations; facts inferred, as distinguished from facts seen. But these inferences were so far from being a part of Kepler’s philosophical operation, that they had been drawn long before he was born. Astronomers had long known that the planets periodically returned to the same places. [Mill, 1872, Book III, ch. ii, section 3] So, finally, Mill states why Kepler did not perform an induction. The induction was already performed by astronomers before him, who had concluded that the planets returned to the same places after a fixed period of time. Yes, astronomers before Kepler did assume that planets repeated exactly the same paths. But that inductive conclusion is very vague because it does not say what the path was. Specifying the path adds a great deal of predictive content, and so Kepler’s inference does take us from what is known to what is unknown even if we treat the periodicity of the orbits as known. The only way out for Mill is to insist that the full specification of the path (the particular ellipse) was a part of the data. Mill seems to be assuming that continuous sections of Mars’s orbit were observed at various times, and that over time these sections covered the whole ellipse. This is factually incorrect, as we shall see. But even if it were true, it still does not follow that Kepler’s conclusion is a mere description of the data, unless the observations are exact. Any margin of error allows for a multitude of possible paths that disagree in the accelerations attributed to the planets at different times. 
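The point about margins of error can be illustrated with a toy calculation (nothing to do with Kepler’s actual data; the numbers are made up). Two smooth paths can agree with every observation to within a small error δ while attributing very different accelerations:

```python
import math

# Two candidate "paths" that agree at all observation times to within a
# margin of error delta, yet differ greatly in their accelerations.
delta, omega = 0.01, 50.0          # hypothetical error margin and frequency

def path1(t):
    return t * t                   # acceleration (2nd derivative) = 2

def path2(t):
    # adds a small, fast wiggle: positions shift by at most delta, but
    # the acceleration shifts by up to delta * omega**2 = 25
    return t * t + delta * math.sin(omega * t)

times = [0.1 * k for k in range(11)]   # hypothetical observation times
max_gap = max(abs(path1(t) - path2(t)) for t in times)

print(max_gap <= delta)            # the data cannot tell the paths apart
print(delta * omega**2)            # yet their accelerations differ by up to 25.0
```

Any finite error margin thus underdetermines the accelerations, which is why inexact observations cannot by themselves single out the dynamical content of Kepler’s conclusion.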
The consequences that Kepler’s laws have concerning the (unobserved) instantaneous accelerations of the planets will be crucial in Newton’s higher-level colligation of Kepler’s three laws, according to which all the planets are attracted to the sun in inverse proportion to the square of their distances from the sun. It’s time to correct this series of mistakes (see also [Harper, 1989; 1993; 2002; Harper et al., 1994]; here I follow Hanson’s [1970, pp. 277–282] account). Mill’s first mistake was to ignore the difference between angular positions and 3-dimensional positions; this is a huge mistake. The correct story is complicated because it’s not so easy to fill this logical gap. To do it, Kepler first needed to determine the earth’s orbit around the sun in relation to a particular point on the orbit of Mars. The measured period of the Martian orbit was 687 days, which is a little under two years. From Tycho Brahe’s observations from earth at E, and at E1 687 days later, Kepler obtained the angle SE1M directly, and obtained the angle ESE1 from well known tabulations of the (angular) motion of the sun across the fixed stars. (See Fig. 1.) Mill is right that Kepler simply assumed that the orbits were periodic, even though it could never have been justified as exactly true (because it is not).
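This first step can be sketched with a toy computation (the angle values below are invented for illustration, not Kepler’s). Given two angles of the triangle S–E1–M, the third follows, and the law of sines fixes the distance SE1 as a ratio of the unknown distance SM:

```python
import math

# Hypothetical sketch of the triangulation: two angles of triangle S-E1-M
# are known (one observed from earth, one derived from solar tables), so
# the third follows, and the law of sines yields SE1 as a ratio of SM.
angle_E1 = math.radians(75.0)            # hypothetical angle at E1 (S-E1-M)
angle_S = math.radians(60.0)             # hypothetical angle at S (E1-S-M)
angle_M = math.pi - angle_E1 - angle_S   # angles of a triangle sum to pi

# Law of sines: SE1 / sin(angle_M) = SM / sin(angle_E1)
ratio_SE1_over_SM = math.sin(angle_M) / math.sin(angle_E1)
print(round(ratio_SE1_over_SM, 3))       # prints 0.732 for these angles
```

No absolute distance is obtained at this stage; everything is expressed as a ratio of the (unknown) reference distance SM, which is exactly why the later steps must reuse SM as a common yardstick.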
Figure 1. The first step in Kepler’s determination of Mars’s orbit was the calculation of the earth’s orbital motion. S denotes the sun, and M refers to Mars. As a check, Kepler might also have compared the two apparent positions of Mars relative to the fixed stars to obtain the third angle in the triangle, SME1 (given that Mars returns to the same position M after 687 days). This is an important check given that the periodicity assumption is not entirely secure. The shape of the triangle SE1M is thereby given, and this determines the distance SE1 as a ratio of the (unknown) distance SM. Similar calculations for triangles SE2M, etc., obtained when Mars had returned to the point M again, then give the distances SE2, etc., as ratios of SM also. By then fitting a smooth elliptic orbit to these discrete data points, Kepler determined the motion of the earth around the sun. Only now is Kepler able to return to the main problem of measuring the distance of Mars from the sun at different stages of its orbit. Consider another observation of Mars at M′, in opposition with the earth at E0, and 687 days later at E1. (See Fig. 2.) Again, the shape of the triangle SE1M′ is determined from the knowledge of its angles, and this gives the distance SM′ as a ratio of SE1. But the distances SE1 are known (as ratios of SM) from the previous colligation of the facts concerning the orbit of the earth. Therefore, SM′, SM′′, etc., are determined as ratios of SM. Kepler then fitted another elliptic curve to obtain the orbit of Mars around the sun as a continuous function of time, which he described in his first law (elliptic path) and second law (equal areas swept out in equal times). Here Kepler is adding a new conception by applying his elliptical formula to the inferred data. Although
these inductions are suggestive, and he may have eliminated many competing hypotheses, Kepler himself did not succeed in fully justifying his results. That was left to Newton.
Figure 2. The second step in Kepler’s calculation of the Martian orbit. Against Mill, it is now clear that Kepler’s data was only a discrete sampling of points on Mars’s orbit. Moreover, each point was inferred from measurements of the angles of a triangle and from distance ratios that were inferred from another colligation of facts. They were hardly the incorrigible “givens” that empiricists like Mill assume to be the bedrock of inductive inferences. The 3-dimensional positions attributed to Mars were determined in a heavily theory-laden way. However natural it might seem to assume, in hindsight, that the planets live in a 3-dimensional space, such attributions are not part of any theory-neutral observation language [Kuhn, 1970]. But, for Whewell, this does not signal the end of the objectivity of science. Higher-level consiliences discovered by Newton will eventually ground the validity of these lower-level conceptions. The same point applies to Kepler’s ellipse. Yes, the ellipse hypothesis might have produced the best fit with the data out of the nineteen hypotheses that Kepler tried, but that does not mean that it was completely secure at that time. It was later confirmed by the intimate connection between the inverse square law and Kepler’s first and second laws discovered by Newton, when he proved that any planet moving such that the line from the sun sweeps out equal areas in equal times is accelerating towards the sun, and further, that if the path is an ellipse, the sun-seeking acceleration is inversely proportional to the square of the distance. Furthermore, Kepler’s third law is icing on the cake because it also follows from
the inverse square law that the ratios R³/T² are independent measurements of the Sun’s mass, adding to the consilience of inductions. Colligation, for Mill, is a part of the invention process, whereas induction (properly so-called) is relevant to questions of justification. Whewell’s characterization of induction, Mill objects, belongs to (what we call) the ‘context of discovery’. Accordingly, Mill [1872, Book III, ch. ii, section 5] charges that “Dr Whewell calls nothing induction where there is not a new mental conception introduced and everything induction where there is.” “But,” he continues, “this is to confuse two very different things, Invention and Proof.” “The introduction of a new conception belongs to Invention: and invention may be required in any operation, but it is the essence of none.” Abstracting a general proposition from known facts without concluding anything about unknown instances, Mill goes on to say, is merely a “colligation of facts” and bears no resemblance to induction at all. In sum, Mill thinks that colligations of facts are mere descriptions that have nothing to do with the justification of scientific hypotheses. Contrary to what Mill thinks, colligations are not mere descriptions. They do add something unknown to the facts; any general proposition (in Whewell’s sense) can be tested further, either by untried instances, or by the consilience of inductions. It does, therefore, go beyond the data. Yes, mental acts are essential to invention and discovery. But they are also essential to justification. Conceptions are essential to the justification of the hypothesis that results from a colligation of facts, in spite of the fact that conceptions are mental, and therefore subjective. Conceptions are essential because there can be no consilience of inductions without them. 
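The R³/T² example mentioned above can be made explicit (a standard textbook derivation, stated here for a circular orbit for simplicity). For a planet of orbital radius R and period T, the inverse square attraction supplies the centripetal acceleration, so

```latex
\frac{G M_{\odot}}{R^{2}} \;=\; \frac{4\pi^{2} R}{T^{2}}
\quad\Longrightarrow\quad
M_{\odot} \;=\; \frac{4\pi^{2}}{G}\,\frac{R^{3}}{T^{2}} .
```

Each planet’s observed R and T thus yields an independent measurement of one and the same quantity, the Sun’s mass, and the agreement of these measurements across planets is precisely an agreement of magnitudes determined in separate inductions.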
For, the consilience of inductions often consists of the agreement of magnitudes (step 3 in the colligation of facts) determined in separate inductions, which derive from the new conception imposed upon the facts in those inductions. Mill has no good reason to accuse Whewell of confusing invention and proof. At its core, the dispute is really about the nature of evidence and justification, that is, about how hypotheses are tested and confirmed.
3 WHEWELL’S TESTS OF HYPOTHESES
Whewell distinguishes four tests of scientific hypotheses (although the last one is more like a sign than a test). By ‘instances’ he is referring to empirical data that can be fitted to the hypothesis in question:
1. The Prediction of Tried Instances;
2. The Prediction of Untried Instances;
3. The Consilience of Inductions; and
4. The Convergence of a Theory towards Simplicity and Unity.
Keep in mind that Whewell uses the term ‘colligation of facts’ interchangeably with ‘induction’. A consilience of inductions occurs when two, or more, colligations of
facts are successfully unified in some way. Newton’s theory of gravity applied the same form of equation to celestial and terrestrial motions (the inverse square law), and in the case of the moon and the apple, both colligations of facts made use of the same adjustable parameter (the earth’s mass). Consequently, the moon’s motion and an apple’s motion provided independent measurements of the earth’s mass, and the agreement of these independent measurements was an important test of Newton’s hypothesis. This test is more than a prediction of tried or untried instances. It leads to a prediction of facts of a different kind (facts about celestial bodies from facts about terrestrial bodies, and vice versa). The consilience of inductions leads to a convergence towards simplicity and unity because unified theories forge connections between disparate phenomena, and these connections may be tested empirically, usually by the agreement of independent measurements. So, a theory can be unified in response to a successful consilience of inductions. Simplicity and unity are necessary conditions for the consilience of inductions, but not sufficient. A theory like ‘everything is the same as everything else’ is highly unified, but not consilient. As Einstein once described it, science should be simple, but not too simple. In the Novum Organon Renovatum, Whewell [1989, 151] speaks of the consilience of inductions in the following terms: We have here spoken of the prediction of facts of the same kind as those from which our rule was collected [tests (1) and (2)]. But the evidence in favour of our induction is of a much higher and more forcible character when it enables us to explain and determine cases of a kind different from those which were contemplated in the formation of our hypothesis. The instances in which this has occurred, indeed, impress us with a conviction that the truth of our hypothesis is certain. No accident could give rise to such an extraordinary coincidence. 
No false supposition could, after being adjusted to one class of phenomena, exactly represent a different class, where the agreement was unforeseen and uncontemplated. That rules springing from remote and unconnected quarters should thus leap to the same point, can only arise from that being the point where truth resides. Accordingly the cases in which inductions from classes of facts altogether different have thus jumped together, belong only to the best established theories which the history of science contains. And as I shall have occasion to refer to this peculiar feature of their evidence, I will take the liberty of describing it by a particular phrase; and will term it the Consilience of Inductions. [Whewell, 1989, 153] “Real discoveries are . . . mixed with baseless assumptions” (Whewell, 1989, 145), which is why Whewell considers the consilience of inductions to provide additional guidance in finding the “point where the truth resides.” Whewell has been soundly criticized over the years for his claim that the consilience of inductions “impress us with a conviction that the truth of our hypothesis
is certain” and that “no false supposition could, after being adjusted to one class of phenomena, exactly represent a different class, where the agreement was unforeseen and uncontemplated.” Given the explication of the notion of truth that we use today, according to which a hypothesis is false if any small part of it is false, Whewell’s claims cannot be defended. But if they are suitably qualified, they cannot be so easily dismissed. It is true that such cases “belong only to the best established theories which the history of science contains.” In place of the consilience of inductions, Mill talks about the deductive subsumption of lower-level empirical laws under more fundamental laws, which is a well-known part of hypothetico-deductivism. Whewell’s account of consilience gets around the common objection that deductive subsumption is too easy to satisfy. For instance, hypothetico-deductivism tries to maintain that Galileo’s theory of terrestrial motion, call it G, and Kepler’s theory of celestial motion, K, are subsumed under Newton’s theory N because N deductively entails G and K. The problem is that G and K are also subsumed under the mere conjunction (G&K), so deductive subsumption by itself cannot fully capture the advantage that N is more unified or consilient. Many respond to the problem by saying that unification and simplicity must be added to the confirmational equation as non-empirical virtues. But this is to short-change empiricism, because N does make empirical predictions that (G&K) does not. Namely, N predicts the agreement of independent measurements of the earth’s mass from celestial and terrestrial phenomena. That is why Whewell’s theory is better than Mill’s theory. Many of these ideas about confirmation have been raised in the literature before [Forster, 1988]. Earman [1978] uses the idea that unified hypotheses have greater empirical content to make sense of Ramsey’s argument for realism. 
Friedman [1981; 1983] uses a similar idea to make sense of arguments for the reality of spacetime. Glymour [1980] discusses ideas about theory and evidence that have a distinctly Whewellian flavor. Norton [2000a; 2000b] emphasizes the overdetermination of parameters, and Harper and Myrvold [2002] and Harper [2002; 2007] emphasize the importance of the agreement of independent measurements, and provide excellent detailed examples. These authors appreciate the nuances involved in real examples of scientific discovery, yet there is still a failure to see two things very clearly: (1) the depth of the difficulties facing standard theories of confirmation, such as Bayesianism, and therefore (2) the relevance of Whewell’s ideas to contemporary debates about theory and evidence. To defend the objectivity of knowledge, we need to understand how conceptions introduced in our best explanations are “objectified” by the agreement of independent measurements in a hierarchy of successive generalizations. None of this is going to “fall out” of standard formal theories of epistemology.
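As a concrete illustration of the agreement of independent measurements (using rounded modern values, not Newton’s own figures), the earth’s gravitational parameter GM can be measured once from terrestrial free fall and once from the moon’s orbit, and the two results compared:

```python
import math

# Newton-style "moon test": two independent measurements of the earth's
# gravitational parameter GM.  Rounded modern values, illustration only.

# Colligation 1: terrestrial free fall, g = GM / R_earth**2
g, R_earth = 9.81, 6.371e6          # m/s^2 and meters
GM_from_fall = g * R_earth**2

# Colligation 2: the moon's orbit, GM = 4 * pi**2 * a**3 / T**2
a = 3.844e8                         # mean earth-moon distance (m)
T = 27.32 * 86400                   # sidereal month in seconds
GM_from_moon = 4 * math.pi**2 * a**3 / T**2

# The consilience test: the two measurements agree to about a percent
rel_diff = abs(GM_from_fall - GM_from_moon) / GM_from_fall
print(rel_diff < 0.05)              # prints True
```

Whether the residual discrepancy matters depends on the error model; the philosophical point is only that two quite different colligations measure one and the same theoretically postulated parameter.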
4 DISPUTES ABOUT INDUCTION THAT HAVE IGNORED THESE LESSONS
Hempel [1945] made an important distinction between the direct and indirect confirmation of hypotheses. Direct confirmation is the familiar process by which a generalization is confirmed by observed instances of it, while indirect confirmation arises from its place in a larger network of hypotheses. For example, the law of free fall on the moon is directly confirmed by the experiments done on the moon by the Apollo astronauts, but it was indirectly confirmed long before that by being deduced from Newton’s theory of gravitation, which has its own support. Whewell’s discussion of what he termed successive generalizations and the consilience of inductions can be seen as an account of indirect confirmation. Whewell’s idea is this: the aim of any inductive inference is to extract information from the data that can then be used in higher-level inductions. For example, Copernicus’s theory can be used to infer 3-dimensional positions of the planets relative to the sun from 2-dimensional positions relative to the fixed stars. The 3-dimensional positions were then used by Newton to provide instances of the inverse square law of gravitation, which enables us to make predictions about one planet based on observations of other planets. It was only this higher-level empirical success that finally confirmed Copernicus’s conjecture that the earth moved with the sun at the center. Only then can we fully trust the inferences about 3-dimensional positions drawn from Copernicus’s theory, on which Newton’s inductions were based. Whewell explains why this circle is not vicious. Mill’s mistake is to reduce Whewell’s innovative idea of the consilience of inductions to nothing more than the deductive subsumption of lower-level generalizations under higher-level laws. 
The problem with Mill’s idea is that it seems to involve a kind of circular reasoning: A is confirmed because A entails B and B is confirmed; but wait, B is now better confirmed because A is confirmed and A entails B. Mill fails to notice that higher-level generalizations have a direct kind of empirical confirmation in terms of the agreement of independent measurements of theoretically postulated quantities. In the case of Newton’s theory of planetary motions, it was the agreement of independent measurements of the earth’s mass, obtained by observing the moon’s motion and terrestrial projectiles, and the agreement of independent measurements of the sun’s mass, and of Jupiter’s mass, and so on. The consilience of inductions thereby relies on aspects of the data that play no role in the confirmation of lower-level generalizations. This is why indirect confirmation, on the Whewellian view, avoids the Millian circle. Whewell’s writings were responsible, in part, for the existence of Book III, On Induction, in Mill’s System of Logic, in which many footnotes and sections are devoted to the important task of separating Mill’s views from Whewell’s. In 1849, Whewell published a reply called “Of Induction, with Especial Reference to Mr. Mill’s System of Logic”. Near the beginning of his commentary, Whewell’s [1989, p. 267] main complaint is that Mill “has extended the use of the term Induction not only to cases in which general induction is consciously applied to particular
instances; but to cases in which the particular instance is dealt with by means of experience in the rude sense in which experience is asserted of brutes; and in which, of course, we can in no way imagine that the law is possessed or understood as a general proposition.” Mill has thus “overlooked the broad and essential difference between speculative knowledge and practical action; and has introduced cases which are quite foreign to the idea of science, alongside with cases from which we may hope to obtain some views of the nature of science and the processes by which it must be formed.” In a footnote to chapter i, Book III, Mill [1872] replies: “I disclaim, as strongly as Dr. Whewell can do, the application of such terms as induction, inference, or reasoning, to operations performed by mere instinct, that is from an animal impulse, without the exertion of any intelligence.” But the essence of Whewell’s complaint is that simple enumerative induction, and Mill’s other methods of induction, are no more complicated than animal impulses even when they are consciously employed; at least, they are not different in a way that accounts for the difference in intelligence. If the complaint is about the established use of the word “induction”, then I tend to think that Whewell is the one swimming against the tide. But it would be a mistake to think that this is merely a linguistic debate about the use of the word ‘induction’; for as Whewell notes, there is always a proposition that accompanies every definition, and the proposition in this case is something like: simple enumerative induction (such as inferring that all humans are mortal from the premise that John, Paul, . . . are mortal) adequately represents the habit of mind that brings about the highest forms of human knowledge. This is an assumption that should be questioned in light of what we know today. 
Whewell expands upon his worries by characterizing most generalizations of the form “All humans are mortal” as mere juxtapositions of particular cases [Whewell, 1989, 163]. Whewell agrees that induction is the operation of discovering and proving general propositions, but he appears to have a different understanding of the term “general”. For Whewell [1989, 47] it is necessary that “In each inductive process, there is some general idea introduced, which is given, not by the phenomena, but by the mind.” The inductive conclusion is, therefore, composed of facts and conceptions “bound together so as to give rise to those general propositions of which science consists”. “All humans are mortal” is not general in the appropriate sense because there has been no conception added to the fact that John, Paul, . . . are mortal. (It would be incorrect, however, to say that Whewell thinks that no generalization of the form “All As are Bs” can introduce a new conception. For example, “All metals conduct electricity” may qualify as an inductive conclusion because the term ‘metal’ may represent a new conception not contained in the facts. I owe this point to Dan Schneider.) Whewell insists that in every genuine induction, “The facts are known but they are insulated and unconnected . . . The pearls are there but they will not hang together until some one provides the string” [Whewell, 1989, 140-141]. The “pearls” are the data points and the “string” is a new conception that connects and unifies the data. The “pearls” in “All As are Bs” are unstrung because “All As are Bs”, though general in the sense that it is
universally quantified, does not connect or unify the facts; it does not colligate the facts. For Whewell, this process of uniting the facts under a general conception, which he calls the colligation of facts, is an essential step in the formation of human knowledge. Mill would gladly transfer Whewell’s description of the colligation of facts to his own pages, but fails to see that it has the kind of importance that Whewell attaches to it. There are two worries that everyone should have about simple enumerative induction: (1) It is not a habit of mind that we have in a great many cases; in fact, it is the subject of well known philosophical jokes. A philosopher jumps from the Empire State Building and is heard to say as he falls past the 99th floor “99 floors and I’m not dead!” As a different example, imagine a study of radioactive decay in which all the samples observed are radioactive, yet the very law of radioactive decay discovered from these observations leads us to deny that any finite sample will be radioactive for all times. (2) When such a habit of mind is desirable, it is very easy to implement. Simple associative learning is not what marks the difference between human intelligence and animal intelligence. I say ‘salt’ and you think ‘pepper’. Pavlov’s dogs are the most famous case of a kind of associative learning in animals known as classical conditioning. In more recent times, the same learning ability has been demonstrated in animals as primitive as sea slugs (Aplysia californica). It’s not just that “brutes” do it, sea slugs do it! A strong 1-sec electric shock to the mantle of the slug (called the unconditioned stimulus UCS) elicits a prolonged withdrawal of its siphon. The UCS in Pavlov’s dogs is the smell of meat, which elicits salivation. The aim of the experiments is to demonstrate an ability to learn to predict the UCS from a conditioned stimulus (CS). In Pavlov’s dogs, the CS was the sound of a bell. 
When presented immediately prior to the presentation of food on several occasions, the bell would eventually trigger the salivation response by itself, without the smell of meat, thereby indicating that the dogs had learned to predict the presence of meat from the sound of the bell. In the case of the sea slugs, one CS was a short tactile stimulation of the siphon, which elicited a short withdrawal of the siphon. When the CS was presented a short 0.5 sec before the UCS, and this was repeated 15 times, the CS would produce a siphon withdrawal more than 4 times as long as what would have resulted without the learned association between the CS and the UCS. (No such association is learned when the CS is presented after the UCS. See [Macphail, 1993, pp. 103-5] for a more complete description of the experiment, or the original source, [Carew, Hawkins, and Kandel, 1983].) Just as Pavlov’s dogs appear to learn to “predict” the presence of food from the sound of a bell, the sea slugs appear to anticipate a large electrical shock from a short tactile stimulation of the siphon. Sea slugs have about 20,000 nerve cells in their central nervous system, arranged in nine ganglia [Macphail, 1993, p. 32], compared to the approximately 10¹² neurons in a human being, some of which may have several thousand synaptic contacts [Nauta and Feirtag, 1986]. What is the function of these extra neurons? To learn a billion more associations of the same
108
Malcolm Forster
kind? If so, how are these learned associations organized or associated together? The most influential part of the System of Logic is Mill’s four methods of induction [Mill, 1872, Book III, Chapters VIII and IX]; but these are also the butt of many jokes. A philosopher goes to a bar on Monday and drinks whiskey and soda water all night. The next day he drinks vodka and soda. The following night, gin and soda, and then the night after that, bourbon and soda. Finally, on Friday, he comes into the bar and complains that he’s been too inebriated for the past week to get much work done, so tonight he’s going to drink whiskey without the soda. The philosopher has used Mill’s method of agreement to observe that the only common thread in the four times he’s been inebriated is that he’s been drinking soda water. Therefore, soda water causes inebriation. So much the worse for simple inductive rules mindlessly applied. Of Mill’s four methods, Whewell [1989, p. 286] writes: “Upon these methods, the obvious thing to remark is, that they take for granted the very thing which is the most difficult to discover, the reduction of the phenomena to formulae such as are here presented to us. When we have any set of complex facts offered to us; for instance… the facts of the planetary paths, of falling bodies, of refracted rays, of cosmical motions, of chemical analysis; and when, in any of these cases, we would discover the law of nature which governs them, or if any one chooses so to term it, the feature in which all the cases agree, where are we to look for our A, B, C, and a, b, c? Nature does not present to us the cases in this form…” Whewell’s point is very simple.
In order to discover a connection between two disparate phenomena, we need to be able to extract the relevant information from each domain, that is, introduce quantities that will prove to be connected; yet we don’t know that until after we collect the right kind of data and see whether the quantities fit together in higher-level regularities. This kind of catch-22 makes discovery extremely difficult, though not impossible for human beings. But for present-day machines, computer systems, and primitive organisms, it has not been possible. A failure to see the depth of the problem is the root cause of the overly optimistic forecasts in the 1960s about how AI systems would match human intelligence within 20 years. Even the apparent exceptions to this, such as the Deep Blue chess-playing program, prove the rule. In 1997, Deep Blue became the first computer system to defeat a reigning world champion (Garry Kasparov) in a match under standard chess tournament time controls. But it did it by brute force computing power, rather than the pattern-recognition techniques of the human chess masters, which enable them to play 40 opponents at once. (See [Dreyfus, 1992] for an in-depth analysis.) In 1987, researchers based at Carnegie Mellon University (CMU) published a book called Scientific Discovery: Computational Explorations of the Creative Process by Langley, Simon, Bradshaw, and Zytkow. Again, the basic Whewellian criticism was raised about computer programs such as Bacon, an AI system that rediscovered numeric laws such as Kepler’s third law, which equates the period of revolution of a planet around the Sun to the 3/2 power of the mean radius. It’s one
The Debate between Whewell and Mill on the Nature of Scientific Induction
109
thing to ask how to relate one variable to another when the variables are already given, but quite another to discover Kepler’s laws from raw data about the angular positions of the planets at various times. Even knowing that ‘position relative to the fixed stars’ and ‘time’ can be functionally related is a major step forward. Nothing like this has been replicated by any computer system. That’s not to say that it’s impossible (indeed [Langley and Bridewell, in press] speak in terms that remind me of Whewell). After all, our brains are computers and a network of these computers did solve the problem. But we must recognize that the requisite “explication of the conceptions”, to use Whewell’s term, is difficult. The most recent instance of this kind of disagreement surrounds the work by another group at CMU headed by Spirtes, Glymour and Scheines [1993], who have developed algorithms for discovering causal models or Bayes nets. Humphreys and Freedman [1996] published a critique, while Spirtes, Glymour and Scheines [1997] and Korb and Wallace [1997] published replies. Again, this research in computer-automated algorithms of scientific discovery is extremely valuable. The question is whether it could be improved by an implementation of Whewellian ideas (see [Forster, 2006]). In 1981, Hinton and Anderson edited an important volume on Parallel Models of Associative Memory, which was followed up by the very famous work on parallel distributed processing edited by Rumelhart and McClelland in 1986, which gave birth to a thriving industry on connectionist networks, otherwise known as artificial neural networks. The breakthrough was made possible by the mathematical discovery of how to implement a learning algorithm in neural networks that propagates backwards in the network to adjust connection weights so as to reduce the error in the output [Rumelhart et al., 1986].
Yet again, the lesson turned out to be the same: An all-purpose neural network is able to approximate any function in principle; but in practice too much flexibility creates difficulties. Top-down constraints need to be imposed on the network before data-driven search methods can match any of the cognitive abilities of human beings. My only point is that, in each of these episodes, it has taken quite some time to rediscover some of the points that were raised 150 years ago in the Whewell-Mill debate.
5 IMPLICATIONS FOR PROBABILISTIC THEORIES OF EVIDENCE AND CONFIRMATION
Allow me to predict a new example of the same thing. At the present time, there seems to me to be an overestimation of what the methods of statistical inference can achieve. In philosophy of science, major figures in the field endorse the view that Bayesian or Likelihoodist approaches to statistical reasoning can be extended to cover scientific reasoning more generally. In [Forster, 2007], I have argued that standard statistical methods of model selection, such as AIC [Akaike, 1973] and BIC [Schwarz, 1978], are fundamentally limited in their ability to replicate the methods of scientific discovery. (Note that connectionist networks are also implementing a standard statistical learning rule known as the method
of least squares.) In [Forster, 2006], I put forward a positive suggestion about how Whewellian ideas about the consilience of inductions enrich the relationship between theory and evidence, which could improve the rate of learning and the amount that can be learned. Continuing on the same theme, philosophers of science, such as Hesse [1968; 1971], Achinstein [1990; 1992; 1994], and more recently Myrvold [2003], have tried to capture the confirmational value of consilience and unification in terms of standard probabilistic theories of confirmation, but with limited success. The reason for their limited success is illustrated by the following schematic example. Suppose we have a set of three objects {a, b, c} that can be hung on a mass measuring device, either individually or in pairs, a*b, a*c, and b*c, where a*b denotes the object consisting of a conjoined with b, and so on. Suppose that the Data consists of six measurements of the distances at which the counterweight needs to be hung from the center of a beam balance in order to balance the object being measured. Let’s denote this observed distance as x(o), where o is the name of the object being measured. In order to talk about the consilience of inductions, we need two or more separate inductions; so let’s divide the data into two parts, and consider inductions performed on each part. Data1 = {x(a) = 1, x(b) = 2, x(c) = 3}, and Data2 = {x(a*b) = 3, x(a*c) = 4, x(b*c) = 5}. The core hypothesis under consideration is the assertion that for all objects o, x(o) = m(o), where m(o) denotes a theoretically postulated property of object o called mass.
M:
(∀o)(x(o) = m(o)).
The quantity x can be repeatedly measured, but no assumption is made that its value will be the same on different occasions. That depends on what the world is like. On the other hand, the hypothesis M asserts that masses are constant over time. The postulated constancy of m, combined with the equation, predicts that repeated measurements on the same object will be the same. It’s easy to equate some new quantity m with the outcome of measurement x, but it’s not so easy to defend the new quantity as representing something real underlying the observable phenomena. If we apply the conception that x(o) = m(o) to the two data sets, we notice that the hypothesis accommodates the data in each case, and there is no test of the hypothesis in the precise sense that the hypothesis would not have been refuted had the data been “generated by” a contrary hypothesis [Mayo, 1996]. The predictive content is not tested by single measurements of each mass. Yet, we
can arrive at an inductive conclusion from the data according to standard rules. In the case of Data1, we arrive at the hypothesis h1 :
M &{m(a) = 1, m(b) = 2, m(c) = 3}.
Note that h1 ⇒ Data1, where ‘⇒’ means ‘logically entails’. I have no problem with the claim that the data Data1 confirms the hypothesis h1 , although it does so by pointing to the particular predictive hypothesis out of all those compatible with M , rather than confirming M itself. Now let’s consider the inductive conclusion arrived at on the basis of Data2: h2
M & {m(a ∗ b) = 3, m(a ∗ c) = 4, m(b ∗ c) = 5}.
Again, h2 ⇒ Data2, and the data confirms the inductive hypothesis. On my understanding of Whewell and Mill, they would agree on this. To explain the difference between Whewell and Mill, let’s consider a stronger inductive conclusion that includes the standard Newtonian conception that the mass of a composite object such as a*b is the sum of the masses of the parts. We shall call this the law of the composition of masses (LCM), and write it more formally as: LCM
(∀o1)(∀o2)(m(o1 ∗ o2) = m(o1) + m(o2)).
Let’s denote the stronger inductive conclusions drawn from the data sets by H1 = h1 & LCM and H2 = h2 & LCM, respectively. Again, the data confirms the respective hypotheses, but only by picking out the mass values that correctly apply to the objects. There is no confirmation of the general propositions in the inductive hypotheses by Data1 or Data2. But all this changes when we consider the bigger picture; for H1 and H2 entail more than the data from which they were inductively inferred, they predict the other data set as well. That is, H1 ⇒ Data2, and H2 ⇒ Data1. This is an illustration of the idea behind Whewell’s consilience of inductions. . . “That rules springing from remote and unconnected quarters should thus leap to the same point, can only arise from that being the point where truth resides” [Whewell, 1989, p. 153]. The hypotheses h1 and h2 enjoy no such relationship with the data. Another way of seeing the same thing is to note that the two data sets, Data1 and Data2, provide independent measurements of the theoretically postulated masses, m(a), m(b), and m(c), and the independent measurements agree.4 From Data1, we obtain values of m(a), m(b), and m(c), and from Data2, we obtain values of m(a) + m(b) = 3, m(a) + m(c) = 4, and m(b) + m(c) = 5. Since there are three equations in three unknowns, these equations yield an independent set 4 “Independent” just means that the measurements are calculated from non-overlapping sets of data.
of values for the three masses, which agree with the first set. Therefore H = H1 & H2 is confirmed by the agreement of independent measurements of its postulated quantities, while h = h1 & h2 is not. The intuition just described is far more forceful if we embellish the example by including a set of mass measurements on a larger set of objects, say 25 objects. Then Data1 consists of 25 measurements of the 25 objects, whereas Data2 consists of 300 measurements of all possible pairings of the 25 objects, which provides 12 more independent measurements of each mass. The fact that 13 independent measurements of mass agree for each of 25 different objects is very strong evidence for the hypothesis H. Unfortunately, we cannot obtain this conclusion (that H is better supported by the Data than h) from the standard theories of confirmation used in contemporary philosophy of science or in statistics, such as Bayesianism and Likelihoodism.5 These views are committed to a likelihood theory of evidence that says that the degree to which the total evidence, the Data in our example, supports a hypothesis, such as H or h, is fully exhausted by the likelihoods P(Data|H) and P(Data|h). But H ⇒ Data and h ⇒ Data, and therefore P(Data|H) = 1 = P(Data|h). The relationship between theory and evidence is therefore the same for each of the hypotheses according to these (well-respected) accounts of the nature of evidence. I suspect that the Bayesians and Likelihoodists will respond to this example along the following lines. Instead of considering the hypotheses as I have defined them, which include the “determination of the magnitudes” (as Whewell would put it), we should consider just the generalizations M and (M & LCM). Then we can argue that (M & LCM) gives the Data a greater probability (i.e., the hypothesis has a greater likelihood).
They may argue that P(Data|M & LCM) > P(Data|M).6 The idea behind this claim is very simple, but first you need to understand that (by the axioms of probability) the likelihood of a family of hypotheses is equal to a weighted average of the likelihoods of the hypotheses in the family. (M & LCM) is a family of hypotheses in which one member, namely H, has likelihood 1, while all the others have likelihood 0 because they get at least one mass value wrong (out of the masses that have been measured). The same applies to M; it contains one hypothesis with likelihood 1 and the rest with likelihood 0. (Having likelihood 0 usually means that the hypothesis is refuted by the data.) Thus, (M & LCM) has a greater likelihood because its likelihood is calculated by
5 The one exception that I know of is Mayo [1996]. Her take on this example would be that H is severely tested by the Data because the probability is high that H would be refuted if H were false. But h is not severely tested by the Data because it would not be refuted if h were false. The uneasiness I have with this approach is the reference to counterfactual data. Other things being equal, I prefer a theory of confirmation that focuses only on the actual data.
6 Proof: P(Data|M & LCM) = P(Data1|M & LCM)P(Data2|M & LCM & Data1). But P(Data2|M & LCM & Data1) = 1, so P(Data|M & LCM) = P(Data1|M & LCM). But now it is clear that the hypotheses “say the same thing” about Data1, so P(Data1|M & LCM) = P(Data1|M), and it is obvious that P(Data1|M) > P(Data|M). Thus, the result follows.
averaging over a larger set of other hypotheses, all of which have zero likelihood. In other words, the likelihood of M is smaller because its maximum likelihood is washed out by averaging over a greater number of hypotheses. The first problem with this reply is that it changes the subject. We began by talking about the confirmation of H and h, and ended up talking about something else. But let’s consider the confirmation of (M & LCM) and M. The problem is that under any sensible way of averaging likelihoods, it turns out to be zero, zilch, nil. This is because there is only one point hypothesis that has non-zero likelihood, so any weighting that averages (integrates) over an infinite number (a continuum) of point hypotheses will yield an average likelihood of zero [Forster and Sober, 1994]. So, the claim that P(Data|M & LCM) > P(Data|M) is incorrect. It should have been P(Data|M & LCM) ≥ P(Data|M). And under the rather general conditions I have stated, P(Data|M & LCM) = P(Data|M). The core part of the Bayesian argument, the part that was right, derives from the inequality P(Data2|M & LCM & Data1) = 1 > P(Data2|M & Data1) = 0. But this inequality is just what lies at the heart of Whewell’s consilience of inductions! Once we see that the inequality is what’s crucial, then we can express what should be said about the original example in the language of probability, without changing the subject. For note that the hypothesis (M & LCM) & Data1 is logically equivalent to H1, as we previously defined it, and M & Data1 is logically equivalent to h1. So the inequality is just P(Data2|H1) = 1 > P(Data2|h1) = 0, to which we could add P(Data1|H2) = 1 > P(Data1|h2) = 0. In other words, the part of the likelihood analysis that makes sense rests on Whewellian principles. Why try to wrap it up in a Bayesian package with trappings that are false at worst and irrelevant at best?
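The averaging argument can be illustrated numerically. The sketch below is only a hypothetical discretisation of my own (the function names and grid are not from any source): each point hypothesis about a single mass value gets equal weight, only the one value that fits the data exactly has likelihood 1, and the weighted average tends to zero as the grid is refined.

```python
# Hypothetical illustration: the likelihood of a family of point hypotheses
# is a weighted average of member likelihoods. Only one member fits the
# data exactly, so refining the grid drives the average toward zero.
def average_likelihood(n_grid):
    # Equally weighted point hypotheses m(a) = theta on a grid over [0, 10).
    thetas = [i / n_grid for i in range(n_grid * 10)]
    likelihood = lambda theta: 1.0 if theta == 1.0 else 0.0  # fits Data1 only at 1.0
    return sum(likelihood(t) for t in thetas) / len(thetas)

for n in (10, 100, 1000):
    print(n, average_likelihood(n))  # prints 0.01, then 0.001, then 0.0001
```

In the continuum limit the average is exactly zero, which is the point made in the text: the strict inequality between the averaged likelihoods collapses.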
I suggest that it is philosophically more fruitful to understand the relationship between theory and evidence in Whewellian terms right from the beginning. To repeat, as Whewell points out, nature does not present inductive problems in a form that lends itself to any simple methods of induction. In the mass measurement example, we began with two sets of data, with two phenomena, each of which is colligated by the formula x(o) = m(o), but we can discover no deeper connection between them until we explicate the concept of mass by introducing the law of composition of masses (LCM). Question: How do we explain why these thirteen independent measurements agree? Answer: By concluding that they are measurements of the same quantity, the effects of a common cause. Arguing that we should explain many effects in terms of a common cause is the easy part of the discovery. The harder part is to arrive at the problem in this form. The same is true of the Kepler example.
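The arithmetic of the mass-measurement example can be checked directly. The sketch below is only an illustration of the numbers given above (the variable names are mine): it recovers the masses from Data2 by elimination, confirms that they agree with Data1, and verifies the counts quoted for the 25-object version.

```python
# A minimal numerical check of the mass-measurement example in the text.
from itertools import combinations

# Data1: direct measurements x(o) = m(o) on single objects.
data1 = {"a": 1.0, "b": 2.0, "c": 3.0}
# Data2: measurements of the composites a*b, a*c, b*c.
data2 = {("a", "b"): 3.0, ("a", "c"): 4.0, ("b", "c"): 5.0}

# Under M & LCM, Data2 gives three equations in three unknowns:
#   m(a)+m(b) = 3,  m(a)+m(c) = 4,  m(b)+m(c) = 5.
# Eliminating: m(a) = (x(a*b) + x(a*c) - x(b*c)) / 2, and so on.
m_a = (data2[("a", "b")] + data2[("a", "c")] - data2[("b", "c")]) / 2
m_b = data2[("a", "b")] - m_a
m_c = data2[("a", "c")] - m_a

# The independent measurements agree -- Whewell's consilience.
assert {"a": m_a, "b": m_b, "c": m_c} == data1

# The 25-object version: Data2 has one measurement per unordered pair,
# and the 24 objects other than a given one split into 12 disjoint pairs,
# each yielding one independent measurement of its mass, plus the direct one.
n = 25
pair_measurements = len(list(combinations(range(n), 2)))
independent_per_object = (n - 1) // 2 + 1
print(pair_measurements, independent_per_object)  # prints: 300 13
```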
ACKNOWLEDGEMENTS
I would like to thank Elizabeth Wrigley-Field and Daniel Schneider for very helpful comments on an earlier draft.
BIBLIOGRAPHY
[Achinstein, 1990] P. Achinstein. Hypotheses, Probability, and Waves. British Journal for the Philosophy of Science 41: 73–102, 1990.
[Achinstein, 1992] P. Achinstein. Inference to the Best Explanation: Or, Who Won the Mill–Whewell Debate? Studies in History and Philosophy of Science 23: 349–364, 1992.
[Achinstein, 1994] P. Achinstein. Explanation v. Prediction: Which Carries More Weight? In David Hull and Richard M. Burian (eds.), PSA 1994, vol. 2, East Lansing, MI: Philosophy of Science Association, 156–164, 1994.
[Akaike, 1973] H. Akaike. Information Theory and an Extension of the Maximum Likelihood Principle. In B. N. Petrov and F. Csaki (eds.), 2nd International Symposium on Information Theory: 267–281. Budapest: Akademiai Kiado, 1973.
[Carew et al., 1983] T. J. Carew, R. D. Hawkins, and E. R. Kandel. Differential classical conditioning of a defensive withdrawal reflex in Aplysia californica. Science 219: 397–400, 1983.
[Dreyfus, 1992] H. L. Dreyfus. What Computers Still Can’t Do: A Critique of Artificial Reason. MIT Press: Cambridge, Mass., 1992.
[Earman, 1978] J. Earman. Fairy Tales vs. an Ongoing Story: Ramsey’s Neglected Argument for Scientific Realism. Philosophical Studies 33: 195–202, 1978.
[Forster, 1988] M. R. Forster. Unification, Explanation, and the Composition of Causes in Newtonian Mechanics. Studies in History and Philosophy of Science 19: 55–101, 1988.
[Forster, 2006] M. R. Forster. Counterexamples to a Likelihood Theory of Evidence. Minds and Machines 16: 319–338, 2006.
[Forster, 2007] M. R. Forster. A Philosopher’s Guide to Empirical Success. Philosophy of Science 74: 588–600, 2007.
[Forster and Sober, 1994] M. R. Forster and E. Sober. How to Tell when Simpler, More Unified, or Less Ad Hoc Theories will Provide More Accurate Predictions.
British Journal for the Philosophy of Science 45: 1–35, 1994.
[Friedman, 1981] M. Friedman. Theoretical Explanation. In R. A. Healey (ed.), Time, Reduction and Reality. Cambridge: Cambridge University Press, 1–16, 1981.
[Friedman, 1983] M. Friedman. Foundations of Space-Time Theories. Princeton, NJ: Princeton University Press, 1983.
[Glymour, 1980] C. Glymour. Explanations, Tests, Unity and Necessity. Noûs 14: 31–50, 1980.
[Hanson, 1973] N. R. Hanson. Constellations and Conjectures. W. C. Humphreys, Jr. (ed.). D. Reidel: Dordrecht, Holland, 1973.
[Harper, 1989] W. L. Harper. Consilience and Natural Kind Reasoning. In J. R. Brown and J. Mittelstrass (eds.), An Intimate Relation: 115–152. Dordrecht: Kluwer Academic Publishers, 1989.
[Harper, 1993] W. L. Harper. Reasoning from Phenomena: Newton’s Argument for Universal Gravitation and the Practice of Science. In Paul Theerman and Adele F. Seeff (eds.), Action and Reaction. Newark: University of Delaware Press, 144–182, 1993.
[Harper, 2002] W. L. Harper. Howard Stein on Isaac Newton: Beyond Hypotheses. In David B. Malament (ed.), Reading Natural Philosophy: Essays in the History and Philosophy of Science and Mathematics. Chicago and La Salle, Illinois: Open Court, 71–112, 2002.
[Harper, 2007] W. L. Harper. Newton’s Method and Mercury’s Perihelion before and after Einstein. Philosophy of Science 74: 932–942, 2007.
[Harper et al., 1994] W. L. Harper, B. H. Bennett and S. Valluri. Unification and Support: Harmonic Law Ratios Measure the Mass of the Sun. In D. Prawitz and D. Westerståhl (eds.), Logic and Philosophy of Science in Uppsala: 131–146. Dordrecht: Kluwer Academic Publishers, 1994.
[Hinton and Anderson, 1981] G. E. Hinton and J. A. Anderson, eds. Parallel Models of Associative Memory. Hillsdale, NJ: Lawrence Erlbaum Associates, 1981.
[Hempel, 1945] C. G. Hempel. Studies in the Logic of Confirmation. Mind 54, 1945. Reprinted in Hempel [1965].
[Hempel, 1965] C. G. Hempel. Aspects of Scientific Explanation and Other Essays in the Philosophy of Science. New York: The Free Press, 1965.
[Hesse, 1968] M. Hesse. Consilience of Inductions. In I. Lakatos (ed.), Inductive Logic. North-Holland, Amsterdam, 1968.
[Hesse, 1971] M. Hesse. Whewell’s Consilience of Inductions and Predictions. The Monist 55: 520–524, 1971.
[Humphreys and Freedman, 1996] P. Humphreys and D. Freedman. The Grand Leap. British Journal for the Philosophy of Science 47: 113–123, 1996.
[Korb and Wallace, 1997] K. B. Korb and C. S. Wallace. In Search of the Philosopher’s Stone: Remarks on Humphreys and Freedman’s Critique of Causal Discovery. British Journal for the Philosophy of Science 48: 543–553, 1997.
[Kuhn, 1970] T. Kuhn. The Structure of Scientific Revolutions, Second Edition. Chicago: University of Chicago Press, 1970.
[Langley et al., 1987] P. H. Langley, H. A. Simon, G. L. Bradshaw, and J. M. Zytkow. Scientific Discovery: Computational Explorations of the Creative Process. MIT Press, Cambridge, Mass., 1987.
[Langley and Bridewell, in press] P. H. Langley and W. Bridewell. Processes and Constraints in Explanatory Scientific Discovery. Proceedings of the Thirtieth Annual Meeting of the Cognitive Science Society. Washington, D.C., in press.
[Macphail, 1993] E. M. Macphail. The Neuroscience of Animal Intelligence: From the Seahare to the Seahorse. Columbia University Press, New York, 1993.
[Mayo, 1996] D. G. Mayo. Error and the Growth of Experimental Knowledge. Chicago and London: The University of Chicago Press, 1996.
[Mill, 1872] J. S. Mill. A System of Logic, Ratiocinative and Inductive: Being a Connected View of the Principles of Evidence and the Methods of Scientific Investigation, 1872.
Eighth Edition (Toronto: University of Toronto Press, 1974).
[Myrvold, 2003] W. Myrvold. A Bayesian Account of the Virtue of Unification. Philosophy of Science 70: 399–423, 2003.
[Myrvold and Harper, 2002] W. Myrvold and W. L. Harper. Model Selection, Simplicity, and Scientific Inference. Philosophy of Science 69: S135–S149, 2002.
[Nauta and Feirtag, 1986] W. J. H. Nauta and M. Feirtag. Fundamental Neuroanatomy. New York: W. H. Freeman, 1986.
[Norton, 2000a] J. D. Norton. The Determination of Theory by Evidence: The Case for Quantum Discontinuity, 1900–1915. Synthese 97: 1–31, 2000.
[Norton, 2000b] J. D. Norton. How We Know about Electrons. In Robert Nola and Howard Sankey (eds.), After Popper, Kuhn and Feyerabend. Kluwer Academic Press, 67–97, 2000.
[Rumelhart et al., 1986] D. E. Rumelhart, J. McClelland, et al. Parallel Distributed Processing, Volumes 1 and 2. MIT Press, Cambridge, Mass., 1986.
[Rumelhart et al., 1986a] D. E. Rumelhart, G. Hinton, and R. J. Williams. Learning Representations by Back-propagating Errors. Nature 323: 533–536, 1986.
[Schwarz, 1978] G. Schwarz. Estimating the Dimension of a Model. Annals of Statistics 6: 461–465, 1978.
[Spirtes et al., 1993] P. Spirtes, C. Glymour and R. Scheines. Causation, Prediction and Search. New York: Springer-Verlag, 1993.
[Whewell, 1840] W. Whewell. The Philosophy of the Inductive Sciences (1967 edition). London: Frank Cass & Co. Ltd., 1840.
[Whewell, 1847] W. Whewell. Philosophy of the Inductive Sciences, 2 vols. London: John W. Parker, 1847.
[Whewell, 1858] W. Whewell. Novum Organon Renovatum, Part II of the third edition of The Philosophy of the Inductive Sciences. London: Cass, (1858) 1967.
[Whewell, 1989] W. Whewell. William Whewell: Theory of Scientific Method. Edited by Robert Butts. Hackett Publishing Company, Indianapolis/Cambridge, 1989.
AN EXPLORER UPON UNTRODDEN GROUND: PEIRCE ON ABDUCTION
Stathis Psillos
Abduction, in the sense I give the word, is any reasoning of a large class of which the provisional adoption of an explanatory hypothesis is the type. But it includes processes of thought which lead only to the suggestion of questions to be considered, and includes much besides.
Charles Peirce (2.544, note)
1 INTRODUCTION
Charles Sanders Peirce (1839-1914), the founder of American pragmatism, spent a good deal of his intellectual energy and time trying to categorise kinds of reasoning, examine their properties and their mutual relations. During this intellectual adventure, he was constantly breaking new ground. One of his major achievements was that he clearly delineated a space for non-deductive, that is, ampliative, reasoning. In particular, he took it to be the case that there are three basic, irreducible and indispensable forms of reasoning. Deduction and Induction are two of them. The third is what he came to call abduction, whose study animated most of Peirce’s intellectual life. In his fifth lecture on Pragmatism, in 1903, Peirce claimed that “abduction consists in studying facts and devising a theory to explain them” (5.145).1 And in the sixth lecture, he noted that “abduction is the process of forming an explanatory hypothesis” (5.171). He took abduction to be the only kind of reasoning by means of which new ideas can be introduced (cf. 5.171). In fact, he also thought that abduction is the mode of reasoning by means of which new ideas have actually been introduced: “All the ideas of science come to it by the way of Abduction” (5.145). “Abduction”, he added, “consists in studying facts and devising a theory to explain them. Its only justification is that if we are ever to understand things at all, it must be in that way” (5.145).
1 All references to Peirce’s works are to his Collected Papers, and are standardly cited by volume and paragraph number. The Collected Papers are not in chronological order. Every effort has been made to make clear the year in which the cited passages appeared.
Handbook of the History of Logic. Volume 10: Inductive Logic. Volume editors: Dov M. Gabbay, Stephan Hartmann and John Woods. General editors: Dov M. Gabbay and John Woods. © 2011 Elsevier B.V. All rights reserved.
Peirce never doubted the reality, importance, pervasiveness and reasonableness of explanatory reasoning. And yet he thought that explanatory reasoning had been understudied: its character as a distinct logical operation had not been understood. Nor had it been sufficiently distinguished from other basic forms of reasoning. In 1902, in the middle of his unfinished manuscript Minute Logic, he made it clear that he was fully aware of the unprecedented character of the task he had set himself. In his study of ‘Hypothetic inference’, as he put it, he was “an explorer upon untrodden ground” (2.102). In this chapter I will narrate the philosophical tale of this exploration. Section 2 will recount Peirce’s debts to Kant and Aristotle. Section 3 will articulate and present Peirce’s own two-dimensional framework for the study of reasoning and set out Peirce’s key aim, viz., the study of the mode of reasoning that is both ampliative and generative of new content. Section 4 explains Peirce’s early syllogistic approach to inference and discusses his division of ampliative reasoning into Hypothesis and Induction. Section 5 examines Peirce’s mature approach to abduction. Section 6 focuses on the issue of the legitimacy of abduction qua mode of reasoning and relates it to Peirce’s pragmatism. Section 7 presents Peirce’s conception of inquiry as a three-stage project which brings together all three basic and ineliminable modes of reasoning, viz., abduction, deduction and induction. The chapter concludes with some observations about Peirce’s legacy.
2 IDEAS FROM KANT AND ARISTOTLE In setting out for the exploration of the untrodden ground, Peirce had in his philosophical baggage two important ideas; one came from Kant and the other from Aristotle. From Kant he took the division of all reasoning into two broad types: explicative (or necessary) and ampliative. In his Critique of Pure Reason, Kant famously drew a distinction between analytic and synthetic judgements (A7/B11). He took it that analytic judgements are such that the predicate adds nothing to the concept of the subject, but merely breaks this concept up into “those constituent concepts that have all along been thought in it, although confusedly”. For this reason, he added that analytic judgements can also be called “explicative”. Synthetic judgements, on the other hand, “add to the concept of the subject a predicate which has not been in any wise thought in it, and which no analysis could possibly extract from it; and they may therefore be entitled ampliative”. Peirce (cf. 5.176) thought that Kant’s conception of explicative reasoning was flawed, if only because it was restricted to judgements of the subject-predicate form. Consequently, he thought that though Kant was surely right to draw the distinction between explicative and ampliative judgements, the distinction was not properly drawn.2 He then took it upon himself to rectify this problem. One way in which Kant’s distinction was reformed was by a further division 2 For Peirce’s critique of Kant’s conception of analytic judgements, see his The Logic of Quantity (4.85-4.93), which is chapter 17 of the Grand Logic, in 1893.
of ampliative reasoning into Induction and Hypothesis. In fact, Peirce found in Aristotle the idea that there is a mode of reasoning which is different from both Deduction and Induction. In his Prior Analytics, chapter 25 (69a20ff), Aristotle introduced an inferential mode which he entitled apagōgē, translated into English as ‘reduction’. Aristotle characterised apagōgē as follows: “We have Reduction (1) when it is obvious that the first term applies to the middle, but that the middle applies to the last term is not obvious, yet nevertheless is more probable or not less probable than the conclusion”. This is rather opaque,3 but the example Aristotle used may help us see what he intended to say.4 Let A stand for ‘being taught’ or ‘teachable’; B for ‘knowledge’ and C for ‘morality’. Is morality knowledge? That is, is it the case that C is B? This is not clear. What is evidently true, Aristotle says, is that knowledge can be taught, i.e., B is A. From this nothing much can be inferred. But if we hypothesise or assume that C is B (that morality is knowledge), we can reason as follows:
C is B
B is A
Therefore, C is A.
That is: Morality is knowledge; Knowledge can be taught; therefore, morality can be taught. If the minor premise (C is B; morality is knowledge) is not less probable or is more probable than the conclusion (C is A; morality can be taught), Aristotle says, we have apagōgē: “for we are nearer to knowledge for having introduced an additional term, whereas before we had no knowledge that [C is A] is true”. The additional term is, clearly, B and this, it can be argued, is introduced on the basis of explanatory considerations.5 In the uncompleted manuscript titled Lessons from the History of Science (c.
1896), Peirce noted that “There are in science three fundamentally different kinds of reasoning, Deduction (called by Aristotle {synag¯og¯e} or {anag¯ og¯e}, Induction (Aristotle’s and Plato’s {epag¯ og¯e}) and Retroduction (Aristotle’s {apag¯ og¯e}, but misunderstood because of corrupt text, and as misunderstood usually translated abduction)” (1.65). Peirce formed the hypothesis that Aristotle’s text was corrupt in some crucial respects and that Aristotle had in fact another kind of inference in mind than the one reconstructed above (and was also acknowledged by Peirce himself; cf. 7.2507.252).6 He took it that Aristotle was after an inference according to which the 3 There
is a second clause in Aristotle’s definition, but it need not concern us here. is actually some controversy over the exact rendering of Aristotle’s text and the interpretation of ‘reduction’. For two differing views, see W. D. Ross (1949, 480-91) and Smith (1989, 223-4). 5 Ross [1949, p. 489] claims that apag¯ og¯ e is a perfect syllogism and works on the assumption that if a proposition (which is not known to be true) is admitted, then a certain conclusion follows, which would not have followed otherwise. Smith [1989, p. 223] notes that apag¯ og¯ e “is a matter of finding premises from which something may be proved”. 6 Here is how he explained things in his fifth lecture on Pragmatism in 1903: “[I]t is necessary to recognize three radically different kinds of arguments which I signalized in 1867 and which 4 There
120
Stathis Psillos
minor premise (Case) of a syllogism is “inferred from its other two propositions as data” (7.249), viz., the major premise (Rule) and the conclusion (Result). Peirce took it that the proper form of hypothetical reasoning (the one that, according to Peirce, Aristotle was after in Prior Analytics, chapter 25) must be:7 Rule — M is P Result — S is P Case — S is M which amounts to a re-organisation of the premises and the conclusion of the following deductive argument: Rule — M is P Case — S is M Result — S is P . But then it transpired to Peirce that there is yet another re-organisation of the propositions of this argument, viz., Case — S is M Result — S is P Rule — M is P which, in his early period, he took it to characterise induction. We shall discuss the details of Peirce’s account of ampliative inference in the sequel. For the time being, let us just rest with the note that Peirce’s studies of ampliative reasoning were shaped by his re-evaluation and critique of ideas present in Kant and Aristotle. As we are about to see in the next section, Peirce created his own framework for the philosophical study of reasoning, which was essentially two-dimensional. 3
PEIRCE’S TWO-DIMENSIONAL FRAMEWORK
In one of his last writings on the modes of reasoning, a letter he sent to Dr Woods in November 1913, Peirce summed up the framework within which he examined had been recognized by the logicians of the eighteenth century, although [those] logicians quite pardonably failed to recognize the inferential character of one of them. Indeed, I suppose that the three were given by Aristotle in the Prior Analytics, although the unfortunate illegibility of a single word in his MS. and its replacement by a wrong word by his first editor, the stupid [Apellicon], has completely altered the sense of the chapter on Abduction. At any rate, even if my conjecture is wrong, and the text must stand as it is, still Aristotle, in that chapter on Abduction, was even in that case evidently groping for that mode of inference which I call by the otherwise quite useless name of Abduction—a word which is only employed in logic to translate the [{apag¯ og¯ e}] of that chapter” (5.144). 7 Peirce said of Aristotle: “Certainly, he would not be Aristotle, to have overlooked that question [whether the minor premise of a syllogism is not sometimes inferred from its major premise and the conclusion]; and it would no sooner be asked than he would perceive that such inferences are very common” (7.249). In 8.209 (c. 1905), Peirce expressed doubts concerning his earlier view that the text in chapter 25 of Prior Analytics was corrupt.
An Explorer upon Untrodden Ground: Peirce on Abduction
121
reasoning and its properties. As he explained (8.383-8.388), there are two kinds of desiderata or aims that logicians should strive for when they study types of reasoning: uberty and security. Uberty is the property of a mode of reasoning in virtue of which it is capable of producing extra content; its “value in productiveness”, as he (8.384) put it. Security is the property of a mode of reasoning in virtue of which the conclusion of the reasoning is at least as certain as its premises. These two desiderata delineate a two-dimensional framework within which reasoning is studied. Peirce’s complaint is that traditional studies of reasoning have focused almost exclusively on its “correctness”, that is on “its leaving an absolute inability to doubt the truth of the conclusion as long as the premises are assumed to be true” (8.383). By doing so — by focusing only on security — traditional approaches have tended to neglect non-deductive modes of reasoning. They have confined their attention to deduction, which is the only mode of reasoning that guarantees security. This one-dimensional approach has, however, obscured the fact that there are different types of reasoning, which have different forms and require independent and special investigation. Induction and abduction are such types of reasoning. What is more, together with deduction they constitute the three ultimate, basic and independent modes of reasoning. This is a view that runs through the corpus of the Peircean work. Peirce’s two-dimensional framework suggests a clear way to classify the three modes of reasoning. Deduction scores best in security but worst in uberty. Abduction scores best in uberty and worst in security. Induction is between the two (cf. 8. 387). But why is uberty needed? One of Peirce’s stable views was that reasoning should be able to generate new ideas; or new content. 
The conclusion of a piece of reasoning should be able to exceed in information and content what is already stated in the premises. Deduction cannot possibly do that. So either all attempts to generate new content should be relegated to processes that do not constitute reasoning, or there must be reasoning processes which are non-deductive. The latter option is the one consistently taken by Peirce. The further issue then is the logical form of non-deductive reasoning. Peirce was adamant that there are two basic modes of non-deductive reasoning. Throughout his intellectual life he strove to articulate these two distinct modes, to separate them from each other and to relate them to deductive reasoning. 'Abduction', 'Retroduction', 'Hypothetic Inference', 'Hypothesis' and 'Presumption' are appellations aiming to capture a distinct mode of reasoning — distinct from both induction and deduction. What makes this kind of reasoning distinctive is that it relies on explanation — and hence it carves out a space in which explanatory considerations guide inference. In his letter to Dr Woods, he said: "I don't think the adoption of a hypothesis on probation can properly be called induction; and yet it is reasoning and though its security is low, its uberty is high" (8.388). What exactly is reasoning? In his entry on ampliative reasoning in the Dictionary of Philosophy and Psychology (1901-2), Peirce wrote:

Reasoning is a process in which the reasoner is conscious that a judgment, the conclusion, is determined by other judgment or judgments,
the premisses, according to a general habit of thought, which he may not be able precisely to formulate, but which he approves as conducive to true knowledge. By true knowledge he means, though he is not usually able to analyse his meaning, the ultimate knowledge in which he hopes that belief may ultimately rest, undisturbed by doubt, in regard to the particular subject to which his conclusion relates. Without this logical approval, the process, although it may be closely analogous to reasoning in other respects, lacks the essence of reasoning (2.773).

This is instructive in many respects. Reasoning is a directed movement of thought (from the premises to the conclusion) that occurs according to a rule, though the rule (what Peirce calls "a general habit of thought") might not be explicitly formulated. Actually, it is the task of the logician, broadly understood, to specify, and hence make explicit, these rules. There is more, however, to reasoning than being a directed movement of thought according to a rule. The rule itself — the general pattern under which a piece of reasoning falls — must be truth-conducive. Or at least, the reasoner should hold it to be truth-conducive. Here again, it is the task of the logician to show how and in virtue of what a reasoning pattern (a rule) is truth-conducive. Peirce puts the matter in terms of knowledge: reasoning should lead to knowledge (and hence to truth). This is important because the reasoning process confers justification on the conclusion: it has been arrived at by a knowledge-conducive process. But Peirce's account of knowledge is thoroughly pragmatic. Knowledge is not just true belief — justification is also needed. And yet, a belief is justified if it is immune to doubt; that is, if it is such that further information relevant to it will not defeat whatever warrant there has been for it.
If justification amounts to resistance to doubt, it is clear that there are two broad ways in which a process of reasoning can confer justification on a belief. The first is by making it the case that if the premises are true, the conclusion has to be true. The second is by rendering a belief plausible and, in particular, by making a belief available for further testing, which — ideally at least — will be able to render this belief immune to revision. The security of deductive reasoning (and the justification it offers to its conclusions) is related to the fact that no new ideas are generated by deduction. But no new ideas are generated by induction either (cf. 5.145). Induction, understood as enumerative induction, generalises an observed correlation from a (fair) sample to a population. It moves from 'All observed As have been B' to 'All As are B'. This is clearly non-demonstrative reasoning; and it is equally clearly ampliative, or content-increasing. But it is true that no new ideas are generated by this kind of reasoning. The reason is simple. The extra content generated by induction is simply a generalisation of the content of the premises; it amounts to what one may call 'horizontal extrapolation'. Enumerative induction, pretty much like deduction, operates with the principle 'garbage in, garbage out': the descriptive vocabulary of the conclusion cannot be different from that of the premises. Hence with enumerative induction, although we may arguably gain knowledge of hitherto unobserved correlations between instances of the attributes
involved, we cannot gain ‘novel’ knowledge, i.e., knowledge of entities and causes that operate behind the phenomena. Peirce was quite clear on this: “[Induction] never can originate any idea whatever. No more can deduction. All the ideas of science come to it by the way of Abduction” (5.145). For Peirce then there must be a mode of reasoning which is both ampliative and generates new ideas. How this could be possible preoccupied him throughout his intellectual life. The key idea was that new content is generated by explanation — better put, explanatory reasoning (viz., reasoning that is based on searching for and evaluating explanations) is both ampliative and has the resources to generate new content, or new ideas. But in line with his overall approach to reasoning, it has to have a rather definite logical form. What then is this form? Peirce changed his mind over this at least once in his life. The reason is that his first attempt to characterise explanatory reasoning was constrained by his overall syllogistic conception of inference. This conception did not leave a lot of room for manoeuvre when it came to the formal properties of reasoning. A rather adequate formal characterisation of explanatory reasoning required a broadening of Peirce’s conception of logic, as the logic of inquiry. In fact, in 1882, Peirce came to see logic as “the method of methods”, or “the art of devising methods of research” (7.59). In his later writings, he was more inclined to equate logic with the scientific method, broadly understood as “the general method of successful scientific research” (7.79). We are about to see all this in detail in the next section. The general point, if you wish, is that Peirce’s view of explanatory reasoning has gone through two distinct phases. In the first phase, he took explanatory reasoning to be a specific inferential pattern that stands on its own and is meant to capture the formation and acceptance of explanatory hypotheses. 
In the second phase, explanatory reasoning was taken to be part of a broader three-stage methodological pattern (the method of inquiry). Explanatory reasoning no longer stands on its own; and though Peirce's later view of explanatory reasoning shares a lot with his earlier view (for instance, the thought that explanatory reasoning is the sole generator of new content), the key difference is that explanatory reasoning yields conclusions that need further justification, which is achieved by means of deduction and induction. Abduction — Peirce's settled term for explanatory reasoning — leads to hypotheses with excess/fresh content. These hypotheses do not bear their justification on their sleeves. They need to be further justified — by deduction of predictions and their confirmation (which is Peirce's settled view of induction) — because it is only this further process that can render them part of a body of beliefs that, though fallible, cannot be overturned by experience. Abductively generated beliefs should be subjected to further testing (encounters with possibly recalcitrant experience) and if they withstand it successfully (especially in the long run) they become true — in the sense that Peirce thinks of truths as doubt-resistant (or permanently settled) beliefs (truth: "a state of belief unassailable by doubt" 5.416).8

8 Two of Peirce's commentators, Arthur Burks [1946, p. 301] and K. T. Fann [1970, pp. 9-10], have rightly contrasted the two phases of Peirce's views of explanatory reasoning along the following lines: in the first phase, Hypothesis is an evidencing process, while in the second phase it is a methodological as well as an evidencing process.

4 HYPOTHESIS VS INDUCTION

"The chief business of the logician", Peirce said in 1878, "is to classify arguments; for all testing clearly depends on classification. The classes of the logicians are defined by certain typical forms called syllogisms" (2.619). As noted already in section 2, given this syllogistic conception of argument, typically exemplified by Barbara (S is M; M is P; therefore S is P), there is a division of ampliative types of argument into Induction and Hypothesis. Deduction is captured by a syllogism of the form: D: {All As are B; a is A; therefore, a is B}. There are two re-organisations of the premises and the conclusion of this syllogism: I: {a is A; a is B; therefore All As are B}; and H: {a is B; All As are B; therefore a is A}. Here is Peirce's own example (2.623):

DEDUCTION
Rule. — All the beans from this bag are white.
Case. — These beans are from this bag.
∴ Result. — These beans are white.

INDUCTION
Case. — These beans are from this bag.
Result. — These beans are white.
∴ Rule. — All the beans from this bag are white.

HYPOTHESIS
Rule. — All the beans from this bag are white.
Result. — These beans are white.
∴ Case. — These beans are from this bag.

The crucial thing here is that Peirce took both I and H to be formal argument patterns, which characterise "synthetic" reasoning. So, making a hypothesis falls under an inferential pattern. Peirce said:

Suppose I enter a room and there find a number of bags, containing different kinds of beans. On the table there is a handful of white beans; and, after some searching, I find one of the bags contains white beans only. I at once infer as a probability, or as a fair guess, that this handful was taken out of that bag. This sort of inference is called making an hypothesis.
The intended contrast here, I take it, is with the statement "These beans are from the white-beans bag" (the conclusion of H) being a mere guess, or a wild conjecture and the like. Though it is clear that the conclusion of H does not logically follow from the premises (this is simply to say that H is not D), it is a conclusion, that is, the result of an inferential process — a movement of thought according to a rule. Already in 1867, in On the Natural Classification of Arguments (Proceedings of the American Academy of Arts and Sciences, vol. 7, April 9, 1867, pp. 261-87), he took it to be the case that the adoption of a hypothesis is an inference "because it is adopted for some reason, good or bad, and that reason, in being regarded as such, is regarded as lending the hypothesis some plausibility" (2.511, note). More generally, although both H and I are logically invalid, they are not meant to be explicative inferences but ampliative ones. Their conclusion is adopted on the basis that the premises offer some reason to accept it as plausible: were it not for the premises, the conclusion would not be considered, even prima facie, plausible; it would have been a mere guess. The difference between I and H is, as Peirce put it, that induction "classifies", whereas hypothesis "explains" (2.636). Classification is not explanation, and this implies that Induction and Hypothesis are not species of the same genus of ampliative reasoning, viz., explanatory reasoning. Induction is a more-of-the-same type of inference.9 As noted already, the conclusion of an induction is a generalisation (or a rule) over the individuals mentioned in the premises. Hypothesis (or hypothetical inference) is different from induction in that the conclusion is a hypothesis which, if true, explains the evidence (or facts) mentioned in the premises.
Here is how Peirce put it: "Hypothesis is where we find some very curious circumstance, which would be explained by the supposition that it was a case of a certain general rule, and thereupon adopt that supposition" (2.624). The supposition is adopted for a reason: it explains 'the curious circumstance'. The centrality of explanation in (at least a mode of) ampliative reasoning is a thought that Peirce kept throughout his intellectual engagement with the forms of reasoning. A key idea that Peirce had already in the 1870s10 was that Hypothesis is different from Induction in that the conclusion of H is typically a new kind of fact, or something "of a different kind from what we have directly observed, and frequently something which it would be impossible for us to observe directly" (2.640). The excess content (the new ideas) generated by Hypothesis concerns, in a host of typical cases, unobservable entities that cause (and hence causally explain) some observable phenomena. Indeed, Peirce took this aspect of hypothetical reasoning to be one of the three reasons which suggest that Hypothesis and Induction are distinct modes of ampliative reasoning. As he put it in 1878: "Hypothetic reasoning infers very frequently a fact not capable of direct observation" (2.642). Induction lacks this capacity because it is constrained by considerations of similarity. Induction works by generalisation, and hence it presupposes that the facts mentioned in the conclusion of an inductive argument are similar to the facts mentioned in the premises. Hypothesis, on the other hand, is not constrained by similarity. It is perfectly possible, Peirce noted, that there are facts which support a hypothetically inferred conclusion, which are totally dissimilar to the facts that suggested it in the first place. The role played by similarity in Induction but not in Hypothesis is the second reason why they are distinct modes of ampliative reasoning. As he put it: "(...) the essence of an induction is that it infers from one set of facts another set of similar facts, whereas hypothesis infers from facts of one kind to facts of another" (2.642).11 The example Peirce used to illustrate this feature of hypothetical reasoning is quite instructive. The existence of Napoleon Bonaparte is based on a hypothetical inference. It is accepted on the basis that it accounts for a host of historical records, where these records serve as the ground for the belief in the historical reality of Napoleon. There is no way, Peirce thought, that this kind of inference could be turned into an induction. If there were, we would have to be committed to the view that all further facts that may become available in the future and confirm the historical reality of Napoleon will be similar to those that have already been available.

9 In the seventh of his Lowell Lectures in 1903, Peirce developed a rather elaborate theory of the several types of induction (cf. 7.110-7.130). He distinguished between three types of induction: Rudimentary or crude induction, which is a form of default reasoning, viz., if there is no evidence for A, we should assume that A is not the case; Predictive induction, where some predictions are drawn from a hypothesis and are further tested; and Statistical or Quantitative induction, where a definite value is assigned to a quantity, viz., it moves from a quantitative correlation in a sample to the entire class "by the aid of the doctrine of chances". Peirce goes on to distinguish between several subtypes of induction. The characteristic of all types is that their justification is that they will lead to convergence to the truth in the limit. Rudimentary induction is the weakest type, while statistical induction is the strongest one. See also his The Variety and Validity of Induction (2.755-2.760), from manuscript 'G', c. 1905.
10 In a series of six articles published in Popular Science Monthly between 1877 and 1878.
But it is certainly possible, Peirce suggested, that evidence that may become available in the future might well be of a radically different sort than the evidence already available. To illustrate this, he envisaged the possibility that "some ingenious creature on a neighboring planet was photographing the earth [when Napoleon was around], and that these pictures on a sufficiently large scale may some time come into our possession, or that some mirror upon a distant star will, when the light reaches it, reflect the whole story back to earth" (2.642). There is clearly no commitment to similarity in the case of Hypothesis. Further facts that may confirm the hypothesis that Napoleon was historically real may be of any sort whatever. Actually, this thought tallies well with his claim that hypotheses can gain extra strength by unifying hitherto unrelated domains of fact (see 2.639). A case like this is the kinetic theory of gases, which Peirce took to have gained in strength by relating (unifying) a "considerable number of observed facts of different kinds" (2.639).12

From the various examples of hypothetical reasoning Peirce offers,13 he clearly thinks that Hypothesis (meaning: hypothetical inference) is pervasive. It is invariably employed in everyday life as well as in science. But as noted already, Peirce's approach to ampliative reasoning has been two-dimensional. Hypothesis might well score high in uberty, but it scores quite low in security. It is clear that Hypothesis does give us reasons to accept a conclusion, but the reasons might well be fairly weak. Here is Peirce again: "As a general rule, hypothesis is a weak kind of argument. It often inclines our judgment so slightly toward its conclusion that we cannot say that we believe the latter to be true; we only surmise that it may be so" (2.625). The point, clearly, is that the fact that a hypothesis might explain some facts is not, on its own, a conclusive reason to think that this hypothesis is true. Perhaps Peirce's careful wording (compare: "it often inclines our judgement so slightly . . . ") implies that some conclusions of hypothetical inferences are stronger than others — that is, there might be some further reasons which enhance our degree of belief in the truth of the conclusion of a hypothetical inference. Indeed, Peirce went on to offer some rules as to how "the process of making an hypothesis should lead to a probable result" (2.633). This is very important because it makes clear that from quite early on, Peirce took it that hypothetical reasoning needs some, as it were, external support. It may stand on its own as a mode of reasoning (meaning: as offering grounds or reasons for a conclusion), but its strength (meaning: how likely the conclusion is) comes, at least partly, from the further testing that the adopted hypothesis should be subjected to. The three rules are (cf. 2.634):

1. Further predictions should be drawn from the adopted hypothesis.
2. The testing of the hypothesis should be severe. That is, the hypothesis should be tested not just against data for which it is known to do well but also against data that would prove it wrong, were it false.
3. The testing should be fair, viz., both the failures and the successes of the hypotheses should be noted.

11 Another salient point that distinguishes Induction from Hypothesis is that Hypothesis does not involve an enumeration of instances (see 2.632).
12 The third reason for the distinction between Hypothesis and Induction is somewhat obscure. He takes it that Induction, yielding as it does a general rule, leads to the formation of a habit. Hypothesis, by contrast, yielding as it does an explanation, leads to an emotion. It is not quite clear what Peirce intends to say here. The analogy he uses is this: "Thus, the various sounds made by the instruments of an orchestra strike upon the ear, and the result is a peculiar musical emotion, quite distinct from the sounds themselves" (2.643). "This emotion", he carries on saying, "is essentially the same thing as an hypothetic inference, and every hypothetic inference involves the formation of such an emotion". My guess, motivated by the example above, is that Peirce means to highlight the fact that Hypothesis generates beliefs with extra content or new ideas, which exceed the content of whatever beliefs they were meant to explain. This might be taken to generate in the mind a feeling of comprehension that was not there before. In a later piece, he seemed to equate the emotion involved in hypothetical inference with the fact that the adopted explanation removes the surprise of the explanandum (cf. 7.197).
13 (A) "I once landed at a seaport in a Turkish province; and, as I was walking up to the house which I was to visit, I met a man upon horseback, surrounded by four horsemen holding a canopy over his head. As the governor of the province was the only personage I could think of who would be so greatly honored, I inferred that this was he. This was an hypothesis." (B) "Fossils are found; say, remains like those of fishes, but far in the interior of the country. To explain the phenomenon, we suppose the sea once washed over this land. This is another hypothesis." (C) "Numberless documents and monuments refer to a conqueror called Napoleon Bonaparte. Though we have not seen the man, yet we cannot explain what we have seen, namely, all these documents and monuments, without supposing that he really existed. Hypothesis again." (2.625).
Peirce did also suggest — somewhat in passing — that the proper ground for hypothetical inference involves comparison and elimination of alternative hypotheses: "When we adopt a certain hypothesis, it is not alone because it will explain the observed facts, but also because the contrary hypothesis would probably lead to results contrary to those observed" (2.628). But he did not say much about this. Later on, in 1901 (in The Logic of Drawing History from Ancient Documents), he did say a lot more on the proper ground for explanatory inference (by then called abduction). For instance, he insisted that the hypothesis that is adopted should be "likely in itself, and render the facts likely" (7.202). Part of the ground for stronger hypothetical inferences (abductions) comes from eliminating alternative and competing hypotheses that, if true, would account for the facts to be explained. One interesting issue concerns the very nature of a hypothesis. The very term 'hypothesis' alludes to claims that are put forward as conjectures or as suppositions, or are only weakly supported by the evidence. Peirce was alive to this problem, but he nonetheless stuck with the term 'hypothesis' in order to stress two things: first, hypotheses are explanatory; and second, they admit of various degrees of strength.14 In his discussion of the case of the kinetic theory of gases (2.639 ff), Peirce makes clear that the outcome of hypothetical reasoning might vary from a "pure" hypothesis to a "theory". Its status does not vary with respect to how it is adopted — it is always adopted on the basis of explanatory considerations. But it varies with respect to its explanatory power, which may well change over time. For instance, the kinetic theory of gases was a "pure hypothesis" when it merely explained Boyle's law, but it became a "theory" when it unified a number of empirical laws of different kinds and received independent support from the mechanical theory of heat.
The idea here is that a hypothesis gains in explanatory strength when it unifies various phenomena and when it gets itself unified with other background theories (like the principles of mechanics). "The successful theories", as Peirce put it, "are not pure guesses, but are guided by reasons" (2.638). In a move later made famous by Wilfrid Sellars, Peirce stressed that hypotheses gain in explanatory strength when they do not just explain an empirical law, but when they also explain "the deviations" from the law (2.638). Empirical laws are typically inexact and approximate, and genuinely explanatory hypotheses replace them with stricter and more accurate theoretical models. Already in the 1870s, Peirce took it to be the case that hypothetical reasoning is indispensable. Immediately after he noted that Hypothesis is a weak type of argument, he added: "But there is no difference except one of degree between such an inference and that by which we are led to believe that we remember the occurrences of yesterday from our feeling as if we did so" (2.625). Bringing memory in as an instance of hypothetical inference suggests that if hypothetical inference cannot be relied upon, there is no way to form any kind of beliefs that exceed what we now perceive. Hence, belief with any content that exceeds what is immediately given in experience requires and relies upon hypothetical reasoning. The very issue of the trustworthiness of memory is tricky, but this much should be clear. There is no way to justify the reliability of memory without presupposing that it is reliable. Even if an experiment were made to determine the reliability of memory, the results of the experiment would themselves have to be correctly remembered before they could play any role in the determination of the reliability of memory. All this might well imply that memory (and hence hypothetical inference) is too basic a mode of belief formation either to be fully doubted or to be justifiable on the basis of even more basic inferential modes. Indeed, to say that hypothetical reasoning is weak (or low in security, as Peirce would later on put it) is not to say that it is unjustified or unjustifiable. Rather, it implies that its justification is a more complex affair than the justification of deduction. It also implies that there can be better or worse hypothetical inferences, and the task of the logician is to specify the conditions under which a hypothetical inference is good.

14 Concerning his own use of the term 'hypothesis', he said: "That the sense in which I have used 'hypothesis' is supported by good usage, I could prove by a hundred authorities. The following is from Kant: 'An hypothesis is the holding for true of the judgment of the truth of a reason on account of the sufficiency of its consequents.' Mill's definition (Logic, Book III, Ch. XIV §4) also nearly coincides with mine" (2.511, note).

Given Peirce's syllogistic conception of inference, H and I clearly have different logical forms. Besides, Peirce has insisted that it is only Hypothesis that explains — that is, that it is based on explanatory considerations. But, it may be argued, presented as above, the difference between H and I is rather superficial.15 The conclusions of both H and I are hypotheses, even though H and I have different logical forms. Besides, both types of conclusion seem to be explanatory. As Peirce himself put it on one occasion, already in 1878: "(...)
when we make an induction, it is drawn not only because it explains the distribution of characters in the sample, but also because a different rule would probably have led to the sample being other than it is” (2.628). This is a clear statement to the effect that Induction, too, is based on explanatory considerations and is guided by them. More generally, that laws (law-like generalisations) are explanatory of their instances has been part of the traditional view of explanation that Peirce clearly shared. Peirce took it that explanation and prediction are the two sides of the same coin. Actually, his overall conception of explanation is that it amounts to “rationalisation”, that is, to rendering a phenomenon rational (rationally explicable), where this rationalisation consists in finding a reason why the phenomenon is the way it is, the reason being such that, were it taken into account beforehand, the phenomenon would have been predicted with certainty or high probability. Here is how he put the matter: “(. . . ) what an explanation of a phenomenon does is to supply a proposition which, if it had been known to be true before the phenomenon presented itself, would have rendered that phenomenon predictable, if not with certainty, at least as something very likely to occur. It thus renders that phenomenon rational, — that is, makes it a logical consequence, necessary
15 If I read him correctly, Nagel [1938, p. 385] makes this point, but from a different angle.
Stathis Psillos
or probable” (7.192).16 It should be obvious then that law-like generalisations do explain their instances, and in particular, that they do explain the observed correlation between two properties or factors. Peirce went so far as to argue that regularisation is a type of rationalisation (cf. 7.199), where a regularisation makes some facts less isolated than before by subsuming them under a generalisation: why are these As B? Because all As are B. If laws are explanatory and if law-like statements (qua generalisations) are the products of Induction, it seems that H and I are closer to each other than Peirce thought. Hence, it might be argued, both H and I are modes of generation and acceptance of explanatory hypotheses, be they about singular facts (e.g., causes) or about generalisations (e.g., All As are B). Besides, it seems that H involves (at least in many typical cases) a law-like generalisation in its premises, since, in the syllogistic guise in which it has been presented thus far, the claim is that what explains a certain singular fact is another singular fact and a general fact in tandem, viz., H: {a is B; All As are B; therefore a is A}. As Peirce acknowledged: “By hypothesis, we conclude the existence of a fact quite different from anything observed, from which, according to known laws, something observed would necessarily result” (2.636). It seems reasonable to claim that the chief difference between H and I is that Induction involves what we have called ‘horizontal extrapolation’, whilst Hypothesis involves (or allows for) ‘vertical extrapolation’, viz., hypotheses whose content is about unobservable causes of the phenomena. Indeed, as has been stressed already, the very rationale for Hypothesis is that it makes possible the generation of new content or new ideas. It turns out, however, that if Hypothesis is constrained by its syllogistic form, it cannot play its intended role as a creator of new content. 
Think of H presented as above: H: {All As are B; a is B; therefore a is A}. The conclusion of H might well be a hypothesis, but its content is not really new: it is already contained in the major premise. So the inference does not create new content; rather, it unpacks content that is already present in the premises. The very syllogistic character of H leaves no choice here: premises and conclusion must share vocabulary; otherwise the conclusion cannot be inferred in the way H suggests. The inference process is such that the antecedent of the major premise is detached and is stated as the conclusion. This might be an illegitimate move in deductive inference, but it captures the essence of H. In this process, no new content is created; instead, some of the content of the premises is detached and is asserted. This creates a certain tension in Peirce’s account. Hypothesis is ampliative and the sole generator of new ideas or content. And yet, in the syllogistic conception
16 The similarity with the standard Deductive-Nomological model of explanation developed by Hempel [1965] is quite striking. On a different occasion, Peirce noted that an explanation is “a syllogism exhibiting the surprising fact as necessarily consequent upon the circumstances of its occurrence together with the truth of the credible conjecture, as premises” (6.469).
of hypothetic inference, the new ideas or content must already be there before they are accepted as the conclusion of the inference. If Hypothesis is the sole generator of new content and ideas, and if this is the reason why it is, in the end, indispensable despite its insecurity, it must have been a great problem for Peirce that the syllogistic conception of reasoning, and of Hypothesis in particular, obscured this fact. Perhaps this was part of the reason why Peirce abandoned the syllogistic conception of explanatory reasoning. Another part of the reason is that, as noted above, Peirce came to think that the difference between the logical forms of Induction and Hypothesis is not as fundamental as he initially thought. In 1902, he offered the following honest diagnosis of his earlier thinking about explanatory reasoning: (M)y capital error was a negative one, in not perceiving that, according to my own principles, the reasoning with which I was there dealing [‘Hypothetic Inference’] could not be the reasoning by which we are led to adopt a hypothesis, although I all but stated as much. But I was too much taken up in considering syllogistic forms and the doctrine of logical extension and comprehension, both of which I made more fundamental than they really are. As long as I held that opinion, my conceptions of Abduction necessarily confused two different kinds of reasoning. When, after repeated attempts, I finally succeeded in clearing the matter up, the fact shone out that probability proper had nothing to do with the validity of Abduction, unless in a doubly indirect manner (2.102). What then are the two different kinds of reasoning that Peirce’s earlier syllogistic approach confused? It seems clear that the confusion was between the reasoning process by means of which hypotheses (with new and extra content) are being formulated and adopted on the basis of explanatory considerations and the reasoning process by means of which these hypotheses are rendered likely. 
In Peirce’s later writings hypothetical inference is liberalised. It is no longer constrained by the syllogistic conception of inference. It becomes part of a broader methodological process of inquiry. Induction, on the other hand, is given the role of confirmation. In his eighth Lowell lecture in 1903, Peirce took abduction to be “any mode or degree of acceptance of a proposition as a truth, because a fact or facts have been ascertained whose occurrence would necessarily or probably result in case that proposition were true” (5.603).
5 THE ROAD TO ABDUCTION

In his unfinished manuscript of 1896, Lessons from the History of Science, Peirce employed the term ‘retroduction’ (or ‘retroductive inference’) to capture hypothetical reasoning — or perhaps, the species of it where the hypothesis concerns things
past. Here too, it is explanation that makes this mode of inference distinctive. As he put it: Now a retroductive conclusion is only justified by its explaining an observed fact. An explanation is a syllogism of which the major premiss, or rule, is a known law or rule of nature, or other general truth; the minor premise, or case, is the hypothesis or retroductive conclusion, and the conclusion, or result, is the observed (or otherwise established) fact (1.89). As he explained, he took ‘retroduction’ to render into English the Aristotelian term apagōgē, which, as Peirce noted (and as we have already seen in section 2), was “misunderstood because of corrupt text, and as misunderstood [it was] usually translated abduction” (1.65). But he did opt for the term ‘abduction’ in the end, though he also toyed with the term ‘Presumption’. As he put it in 1903: “Presumption, or, more precisely, abduction (. . . ), furnishes the reasoner with the problematic theory which induction verifies” (2.776). In the same context, he noted that “Logical or philosophical presumption is non-deductive probable inference which involves a hypothesis. It might very advantageously replace hypothesis in the sense of something supposed to be true because of certain facts which it would account for” (2.786). It should be clear that abduction inherits some of the characteristics of Hypothesis, while it forfeits others. The two main points of contact are that a) it is explanatory considerations that guide abduction, qua an inference; and b) abduction is “the only kind of reasoning which supplies new ideas, the only kind which is, in this sense, synthetic” (2.777). But, unlike Hypothesis, abduction c) is not constrained by syllogistic formulations; d) any kind of hypothesis can be adopted on its basis, provided it plays an explanatory role. As Peirce notes: “Abduction must cover all the operations by which theories and hypotheses are engendered” (5.590). 
Besides, Peirce took it that his shift to abduction laid further emphasis on the fact that abduction is an insecure mode of reasoning; that the abductively adopted hypothesis is problematic; that it needs further testing. The rationale for abduction, then, is that if rational explanation is possible at all, it can only be achieved by abduction. As he put it: “Its only justification is that its method is the only way in which there can be any hope of attaining a rational explanation” (2.777). Peirce’s classic characterisation of abduction qua inference, in his seventh lecture on Pragmatism, titled Pragmatism and Abduction, in 1903, is this (cf. 5.189):

(CC) The surprising fact, C, is observed;
But if A were true, C would be a matter of course;
Hence, there is reason to suspect that A is true.

Immediately before (CC) he noted that abduction is “the operation of adopting an explanatory hypothesis” and that “the hypothesis cannot be admitted, even as
a hypothesis, unless it be supposed that it would account for the facts or some of them”. The emphasis on C being a matter of course relates to Peirce’s conception of explanation as rational expectability. What follows the classic characterisation is even more interesting. Peirce says: “Thus, A cannot be abductively inferred, or if you prefer the expression, cannot be abductively conjectured until its entire content is already present in the premise, ‘If A were true, C would be a matter of course”’. This claim captures the way in which abduction (in contradistinction to the earlier Hypothesis) can be genuinely ampliative and generative of new ideas and content. What Peirce implies, and what seems right anyway, is that the abductive inference both generates the major premise ‘If A were true, C would be a matter of course’ and licenses the conclusion that there is reason to accept A as true. Though A may be familiar in itself, it does not follow that it is the case that ‘If A were true, C would be a matter of course’. Qua hypothesis, A has excess and new content vis-à-vis C precisely because it offers a reason (explains) why C holds. The following then sounds plausible. Abduction is a dual process of reasoning. It involves the generation of some hypothesis A with excess content in virtue of which the explanandum C is accounted for, where the explanatory connection between A and C is captured by the counterfactual conditional ‘If A were true, C would be a matter of course’. But it also allows the detachment of the antecedent A from the conditional and hence its acceptance in its own right. The detachment of the antecedent A requires reasons, and these are offered by the explanatory connection between the antecedent and the consequent. Peirce had a rather broad conception of a surprising fact. 
He took it that the very presence of regularities in nature is quite surprising, in that irregularity (the absence of regularity) is much more common than regularity in nature. Hence the presence of regular patterns under which sequences of events fall is, for Peirce, unexpected and requires explanation (cf. 7.189; 7.195). This suggests that Peirce would not take the regularities there are in nature as crude facts — facts which admit or require no further explanation. Regularities hold for a reason, and their explanation amounts to finding the reason for which they hold, thereby rendering the phenomena rational (cf. 7.192). But aren’t deviations from a regularity also surprising? Peirce insists that if an existing regularity is breached, this does require explanation (cf. 7.191). So both the regularity and the deviations from it require explanation, though the explanations offered are at different levels. A key feature of explanation, according to Peirce, is that it renders the explananda less “isolated” than they would have been in the absence of an explanation (cf. 7.199). This feature follows from the fact that explanation amounts to rational expectability. For a fact is isolated if it does not fall under a pattern. And if it does not fall under a pattern, in its presence we “do not know what to expect” (7.201). Actually, abduction is justified, Peirce claimed, because it is the “only possible hope of regulating our future conduct rationally” (2.270), and clearly this rational regulation comes from devising explanations which render the facts less isolated.
A stable element of Peirce’s thought on explanatory reasoning is that it is a reasoning process — it obeys a rule of a sort. In The Logic of Drawing History from Ancient Documents (1901), he insisted that abduction amounts to an adoption of a hypothesis “which is likely in itself, and renders the facts likely”. He noted: “I reckon it as a form of inference, however problematical the hypothesis may be held” (7.202). And he inquired after the “logical rules” to which abduction should conform. But why should he think that abduction is a reasoning process? Recall that for him reasoning is a conscious activity by means of which a conclusion is drawn on the basis of reasons. In his Short Logic (1893), he emphasised that reasoning (the making of inferences) amounts to the “conscious and controlled adoption of a belief as a consequence of other knowledge” (2.442). Reasoning is a voluntary activity which, among other things, involves considering and eliminating options (cf. 7.187). This is what abduction is and does. The point is brought home if we consider abduction as an eliminative inference. Not all possible explanatory hypotheses are considered. In answering the objection that abduction is not reasoning proper because one is free to examine whatever theories one likes, Peirce noted: The answer [to the question of what need of reasoning was there?] is that it is a question of economy. If he examines all the foolish theories he might imagine, he never will (short of a miracle) light upon the true one. Indeed, even with the most rational procedure, he never would do so, were there not an affinity between his ideas and nature’s ways. However, if there be any attainable truth, as he hopes, it is plain that the only way in which it is to be attained is by trying the hypotheses which seem reasonable and which lead to such consequences as are observed (2.776). What exactly is this criterion of reasonableness? Peirce took it that abduction is not a topic-neutral inferential pattern. 
It operates within a framework of background beliefs, depends on them and capitalises on them. It is these beliefs that determine reasonableness or plausibility. Here is Peirce, in 1901: Of course, if we know any positive facts which render a given hypothesis objectively probable, they recommend it for inductive testing. When this is not the case, but the hypothesis seems to us likely, or unlikely, this likelihood is an indication that the hypothesis accords or discords with our preconceived ideas; and since those ideas are presumably based upon some experience, it follows that, other things being equal, there will be, in the long run, some economy in giving the hypothesis a place in the order of precedence in accordance with this indication (7.220). Background beliefs play a dual role in abduction. Their first role is to eliminate a number of candidates as “foolish”. What hypotheses will count as foolish will surely depend on how strong and well supported the background beliefs are. But
Peirce also insisted that some hypotheses must be discarded from further consideration ab initio. They are the hypotheses that, by their very nature, are untestable (cf. 6.524). This call for testability is a hallmark of Peirce’s pragmatism. In his Lectures on Pragmatism, he famously noted that “the question of pragmatism” is “nothing else than the question of the logic of abduction” (5.196). As he went on to explain, the link between the two is testability. The Maxim of Pragmatism is that admissible hypotheses must be such that their truth makes a difference in experience. In the present context, Peirce put the Maxim thus: (. . . ) [T]he maxim of pragmatism is that a conception can have no logical effect or import differing from that of a second conception except so far as, taken in connection with other conceptions and intentions, it might conceivably modify our practical conduct differently from that second conception (5.196). In other words, there can be no logical difference between two hypotheses that results in no difference in experience. Abduction, according to Peirce, honours this maxim because it “puts a limit upon admissible hypotheses” (5.196). And the limit is set by the logical form of abduction CC, which has already been noted. The major premise of an abductive inference is: ‘If A were true, C would be a matter of course’. For A to be admissible at all it must be the case that it renders C explicable and expectable. We have already seen that Peirce equated explanation with rational expectability and this clearly implies that he took it that explanation yields predictions (or even that explanation and prediction are the two sides of the same coin). Predictions are, ultimately, what differentiates between admissible (qua testable) hypotheses and inadmissible (qua untestable) ones. Here is Peirce’s own way to put the point: Admitting, then, that the question of Pragmatism is the question of Abduction, let us consider it under that form. 
What is good abduction? What should an explanatory hypothesis be to be worthy to rank as a hypothesis? Of course, it must explain the facts. But what other conditions ought it to fulfil to be good? The question of the goodness of anything is whether that thing fulfils its end. What, then, is the end of an explanatory hypothesis? Its end is, through subjection to the test of experiment, to lead to the avoidance of all surprise and to the establishment of a habit of positive expectation that shall not be disappointed. Any hypothesis, therefore, may be admissible, in the absence of any special reasons to the contrary, provided it be capable of experimental verification, and only insofar as it is capable of such verification. This is approximately the doctrine of pragmatism (5.197). The second role that background beliefs play in abduction concerns the ranking of the admissible candidates in an “order of preference”; or the selection of hypotheses. Accordingly, the search for explanatory hypotheses is not blind but guided by reasons. The search aims to create, as Peirce (5.592) nicely put it,
“good” hypotheses. So the search will be accompanied by an evaluation of hypotheses, and by their placement in an order of preference according to how good an explanation they offer. In Hume on Miracles (1901) Peirce put this point as follows: The first starting of a hypothesis and the entertaining of it, whether as a simple interrogation or with any degree of confidence, is an inferential step which I propose to call abduction. This will include a preference for any one hypothesis over others which would equally explain the facts, so long as this preference is not based upon any previous knowledge bearing upon the truth of the hypotheses, nor on any testing of any of the hypotheses, after having admitted them on probation (cf. 6.525). It is clear from this passage that the preferential ranking of competing hypotheses that would explain the facts, were they true, cannot be based on judgements concerning their truth, since if we already knew which hypothesis is the true one, it would be an almost trivial matter to infer this as against its rivals. So the ranking should be based on different criteria. What are they? The closest Peirce comes to offering a systematic treatment of this subject is in his Abduction, which is part of The Logic of Drawing History from Ancient Documents (1901). The principles which should “guide us in abduction or the process of choosing a hypothesis” include:

A. Hypotheses should explain all relevant facts;17
B. Hypotheses should be licensed by the existing background beliefs;
C. Hypotheses should be, as far as possible, simple (“incomplex” (7.220-1));
D. Hypotheses should have unifying power (“breadth” (7.220-1));18
E. Hypotheses should be further testable, and preferably entail novel predictions (7.220).19
17 “Still, before admitting the hypothesis to probation, we must ask whether it would explain all the principal facts” (7.235).
18 Peirce characterised unifying power thus: “The purpose of a theory may be said to be to embrace the manifold of observed facts in one statement, and other things being equal that theory best fulfils its function which brings the most facts under a single formula” (7.410).
19 Peirce stressed that “the strength of any argument of the Second Order depends upon how much the confirmation of the prediction runs counter to what our expectation would have been without the hypothesis” (7.115).

6 FROM THE INSTINCTIVE TO THE REASONED MARKS OF TRUTH

The picture of abduction that Peirce has painted is quite complex. On the face of it, there may be a question of its coherence. Abduction is an inference by
means of which explanatory hypotheses are admitted, but it is not clear what this admission amounts to. Nor is it clear whether there are rules that this mode of inference is subject to. In a rather astonishing passage that preceded his classic characterisation of abduction noted above, Peirce stressed: “It must be remembered that abduction, although it is very little hampered by logical rules, nevertheless is logical inference, asserting its conclusion only problematically or conjecturally, it is true, but nevertheless having a definite logical form” (5.188). How can it be that abduction has a definite logical form (the one suggested by CC above) and yet not be hampered by logical rules? Besides, Peirce made the seemingly strange point that “(. . . ) abduction commits us to nothing. It merely causes a hypothesis to be set down upon our docket of cases to be tried” (5.602). To resolve the possible tensions here we need to take into account Peirce’s overall approach to ampliative reasoning. Peirce was adamant that the conclusion of an abductive inference can be accepted only on “probation” or “provisionally”. Here is one of the many ways in which Peirce expressed this key thought of his: “Abduction, in the sense I give the word, is any reasoning of a large class of which the provisional adoption of an explanatory hypothesis is the type” (2.544, note). One important reason why explanatory hypotheses can only be accepted on probation comes from the history of science itself, which is a history of actual abductions. Though it is reasonable to accept a hypothesis as true on the basis that “it seems to render the world reasonable”, a closer look at the fate of explanatory hypotheses suggests that many of them were subsequently controverted by wrong predictions. 
“Ultimately”, Peirce said, the circumstance that a hypothesis, although it may lead us to expect some facts to be as they are, may in the future lead us to erroneous expectations about other facts, — this circumstance, which anybody must have admitted as soon as it was brought home to him, was brought home to scientific men so forcibly, first in astronomy, and then in other sciences, that it became axiomatical that a hypothesis adopted by abduction could only be adopted on probation, and must be tested (7.202). Peirce did consider the claim that abduction might admit of a strict logical form based on Bayes’s theorem. Well, he did not put it quite that way, but this is what he clearly meant when he said that, according to a common theory, reasoning should be “guided by balancing probabilities, according to the doctrine of inverse probability” (2.777). The idea here is that one updates one’s degree of belief in a proposition by using Bayes’s theorem: Prob_new(H) = Prob_old(H|e), where Prob(H|e) = Prob(e|H) × Prob(H) / Prob(e). “Inverse probabilities” are what later on became known as likelihoods, viz., Prob(e|H). As Peirce immediately added, this approach to reasoning relies “upon knowing antecedent probabilities”, that is, prior probabilities. But he was entirely clear that this Bayesian approach could not capture the logical form of abduction, because he thought that prior probabilities in the case of hypotheses were not available. Peirce was totally
unwilling to admit subjective prior probabilities — if there were well-defined prior probabilities applied to hypotheses, they would have to be “solid” statistical probabilities “like those upon which the insurance business rests”. But when it comes to hypotheses, the hope for solid statistical facts is futile:
But they are not and cannot, in the nature of things, be statistical facts. What is the antecedent probability that matter should be composed of atoms? Can we take statistics of a multitude of different universes? An objective probability is the ratio of frequency of a specific to a generic event in the ordinary course of experience. Of a fact per se it is absurd to speak of objective probability. All that is attainable are subjective probabilities, or likelihoods, which express nothing but the conformity of a new suggestion to our prepossessions; and these are the source of most of the errors into which man falls, and of all the worst of them (2.777). It might be objected here that Peirce’s last point is unfair to probabilists, since he himself has tied the adoption of explanatory hypotheses to background beliefs, which could well be the sources of most of the errors into which man falls. Fair enough. Peirce would not think this is an objection to his views, precisely because the dependence of abduction on background beliefs is the reason he thought that it could not, in the first instance and on its own, yield probable results, and that its conclusion should be accepted on probation, subject to further testing. So his point is that those who think that, by relying on subjective prior probabilities, they can have a conception of inference which yields likely hypotheses delude themselves. What all this means is that abductive inference per se is not the kind of inference that can or does lead to likely conclusions. It is not as if we feed a topic-neutral and algorithmic rule with suitable premises and it returns likely conclusions. Abduction is not like that at all. Peirce insisted that when it comes to abduction “yielding to judgments of likelihood is a fertile source of waste of time and energy” (6.534). So abduction is not in the business of conferring probabilities on its conclusions. 
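The “doctrine of inverse probability” that Peirce refused to treat as the logic of abduction can be illustrated with a minimal sketch. The hypothesis names and the numerical priors and likelihoods below are illustrative assumptions of ours, not anything found in Peirce or in the text above.

```python
# A sketch of updating by "inverse probability" (Bayes's theorem):
# Prob_new(H) = Prob_old(H | e), where
# Prob(H | e) = Prob(e | H) * Prob(H) / Prob(e).

def bayes_update(priors, likelihoods):
    """Return the posterior Prob(H | e) for each hypothesis H.

    priors      -- {hypothesis: Prob(H)}, the "antecedent probabilities"
    likelihoods -- {hypothesis: Prob(e | H)}, Peirce's "inverse probabilities"
    """
    # Prob(e) by total probability over the (assumed exhaustive) hypotheses
    prob_e = sum(priors[h] * likelihoods[h] for h in priors)
    return {h: priors[h] * likelihoods[h] / prob_e for h in priors}

# Illustrative numbers only: two rival hypotheses with equal priors.
posteriors = bayes_update(
    priors={"H1": 0.5, "H2": 0.5},
    likelihoods={"H1": 0.8, "H2": 0.2},
)
# posteriors: H1 comes out at about 0.8, H2 at about 0.2
```

Peirce’s complaint is precisely about where the `priors` in such a calculation could come from: for theoretical hypotheses (“matter is composed of atoms”) there are no “solid” statistical facts to supply them, so they can only register subjective prepossession.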
But this is not to imply that abduction is neither an inference nor a means, qua inference, of yielding reasonably held conclusions. In his fifth lecture on Pragmatism (1903), Peirce drew a distinction between validity and strength (which is different from the one between uberty and security noted in section 3). An argument is valid, Peirce suggested, “if it possesses the sort of strength that it professes and tends toward the establishment of the conclusion in the way in which it pretends to do this” (5.192). This might sound opaque, but the underlying idea is that different sorts of inference aim at different things and hence cannot be lumped together. Deduction aims at truth-preservation or truth-maintenance: if the premises are true, the conclusion has to be true. In deduction, validity and strength coincide because the conclusion of a deductive argument is at least as secure as its premises. But this is peculiar to deductive inference. Other inferential patterns may be such that validity and strength do not coincide. An inference may be weak and yet valid. It may be weak in that the conclusion of the inference might not be strongly supported by the premises, and yet it may be valid in Peirce’s sense above: the inference does not pretend to license stronger conclusions than it actually does. As he put it: “An argument is none the less logical for being weak, provided it does not pretend to a strength that it does not possess” (5.192). Abduction is a weak inference, but it can be reasonable
nonetheless (or “valid”, as Peirce would put it). Unlike deduction, abduction does not advertise itself as truth-preserving. Its aim is the generation of extra content and the provision of reasons for its adoption (based on explanatory considerations). Here is Peirce’s own way of putting the point: “The conclusion of an abduction is problematic or conjectural, but is not necessarily at the weakest grade of surmise, and what we call assertoric judgments are, accurately, problematic judgments of a high grade of hopefulness” (5.192). Peirce had actually examined this issue in his earlier Notes on Ampliative Reasoning (1901-2). There, after noting that “an argument may be perfectly valid and yet excessively weak” (2.780), he went on to suggest that the strength of abduction is a function of its eliminative power.20 The strength of an abductively inferred hypothesis depends on “the absence of any other hypothesis”. But this would suggest that abduction is very weak, since how can it possibly be asserted that all other potentially explanatory hypotheses have been eliminated? To avoid rendering abduction excessively weak, Peirce suggested that strength might be measured in terms of “the amount of wealth, in time, thought, money, etc., that we ought to have at our disposal before it would be worth while to take up that hypothesis for examination”. This introduces a new factor into reasoning — over and above the requirements of explanation and testability noted above. We can call this factor ‘economy’, echoing Peirce’s own characterisation of it. Peirce tied economy to a number of features of abductive reasoning. In his eighth Lowell lecture in 1903, he stressed that “the leading consideration in Abduction” is “the question of Economy–Economy of money, time, thought, and energy” (5.600). Economy is related to the range of potential explanations that may be entertained and be subjected to further testing (cf. 6.528). It is related to the eliminative nature of abduction. 
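The eliminative character of abduction lends itself to a simple computational sketch, ours rather than Peirce’s: if each experiment is predicted to come out one way by half the surviving hypotheses and the other way by the rest, a handful of observations suffices to reject all but one of a large field. The bit-indexing “oracle” below is an illustrative assumption, not a claim about how real experiments discriminate.

```python
# Sketch of eliminative halving: with 2**k rival hypotheses and perfectly
# discriminating experiments, k observations leave a single survivor.
import math

def experiments_needed(n_hypotheses):
    """Halving experiments needed to cut n hypotheses down to one."""
    return math.ceil(math.log2(n_hypotheses))

# Toy oracle: experiment k observes bit k of the index of the (unknown)
# true hypothesis -- here, hypothesis number 11 out of 32.
true_hypothesis = 11
survivors = list(range(32))
for k in range(experiments_needed(32)):  # five experiments
    observed = (true_hypothesis >> k) & 1
    survivors = [h for h in survivors if (h >> k) & 1 == observed]
# survivors is now [11]: thirty-one rivals rejected in five trials
```

Peirce’s own illustration at 6.529 uses exactly this arithmetic: thirty-two candidate explanations, and an experiment that “at once halves the number of hypotheses”.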
Economy dictates that when there is need for choice between competing hypotheses which explain a set of phenomena, some crucial experiment should be devised which eliminates many or most of the competitors.21 Economy is also related to the use of Ockham’s razor (cf. 6.535). The demand for, and the preference for, simple explanations is “a sound economic principle” because simpler explanations are more easily tested. The general point here is that abduction — qua reasoning — is subject to criteria that do not admit a precise logical formulation. Nonetheless, these criteria are necessary for the characterisation of abduction if abduction is to be humanly possible. In contradistinction to Descartes, Peirce was surely not interested in the project of pure enquiry. Inference, in particular ampliative inference, does not operate in a vacuum; it is subject to constraints other than the search for truth, and it does not occur in an environment of unlimited resources of time and energy. Principles of economy govern abductive inference precisely because abduction has to work its way through a space of hypotheses that is virtually inexhaustible. So either no abductive inference would be possible or there should be criteria that cut down the space of hypotheses to a reasonable size (cf. 2.776). It might be thought that these considerations of economy render abduction totally whimsical. For, one may wonder, what possibly could be the relation between abduction and truth? Note, however, that this worry is overstated. Principles of economy are principles which facilitate the further testing of the selected hypotheses. Hence, they can facilitate finding out whether a hypothesis is true in the only sense in which Peirce can accept this, viz., in the sense of making a hypothesis doubt-resistant. But there is a residual worry that is more serious. If abduction does not operate within a network of true background beliefs, there is no way in which it can return hypotheses which have a good chance of being true. How can these true background beliefs emerge? In at least two different places, Peirce argues that the human mind has had the power to imagine correct theories, where this power is a “natural adaptation” (5.591). On one of these two occasions, he clearly associated this power of the human mind (“the guessing instinct”) with the principles of economy in abductive reasoning. These principles work because the mind has the power to hit upon the truth in a relatively small number of trials.
20 In the case of Induction, Peirce noted that the larger the number of instances that form the inductive basis, the stronger the induction. But, he added, weak inductions (based on small numbers of instances) are perfectly valid (cf. 2.780).
21 Here is how Peirce put it: “Let us suppose that there are thirty-two different possible ways of explaining a set of phenomena. Then, thirty-one hypotheses must be rejected. The most economical procedure, when it is practicable, will be to find some observable fact which, under conditions easily brought about, would result from sixteen of the hypotheses and not from any of the other sixteen. Such an experiment, if it can be devised, at once halves the number of hypotheses” (6.529).
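The eliminative arithmetic behind the experiment Peirce imagines in note 21 is that of binary elimination: an experiment whose outcome is consistent with exactly half of the surviving hypotheses is maximally economical, since $n$ such experiments sift one hypothesis out of $2^n$ candidates.

```latex
% Peirce's example (6.529): thirty-two candidate explanations
32 \;\xrightarrow{\text{exp.\ 1}}\; 16
   \;\xrightarrow{\text{exp.\ 2}}\; 8
   \;\xrightarrow{\text{exp.\ 3}}\; 4
   \;\xrightarrow{\text{exp.\ 4}}\; 2
   \;\xrightarrow{\text{exp.\ 5}}\; 1,
\qquad \text{i.e.}\quad \log_2 32 = 5 .
```

Five halving experiments thus suffice where testing the thirty-two hypotheses one by one could require thirty-one rejections.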
Here is how he put it: In very many questions, the situation before us is this: We shall do better to abandon the whole attempt to learn the truth, however urgent may be our need of ascertaining it, unless we can trust to the human mind’s having such a power of guessing right that before very many hypotheses shall have been tried, intelligent guessing may be expected to lead us to the one which will support all tests, leaving the vast majority of possible hypotheses unexamined (6.530). Peirce does not prove this claim; how could he? He does say in its support that truth has survival value (cf. 5.591). But it is not clear that this is anything other than speculation. A more likely ground for Peirce’s claim is quasi-transcendental, viz., that unless we accept that the human mind has had this power to guess right, there can be no rational explanation of why it has come up with some true theories in the first place. Peirce tries to substantiate this claim by means of a further argument. True theories cannot be a matter of chance because, given all possible theories that could have been entertained, stumbling over a true one is extremely unlikely. The possible theories, Peirce said, “if not strictly innumerable, at any rate exceed a trillion – or the third power of a million; and therefore the chances are too overwhelmingly against the single true theory in the twenty or thirty thousand years during which man has been a thinking animal, ever having
come into any man’s head” (5.591). Note that this kind of argument is based on a statistical claim, which might be contentious: how can we come up with such statistics in the first place? Be that as it may, Peirce’s key point here is that though abduction does not wear its justification on its sleeve, it is reasonable to think that abduction does tend to operate within networks of true background beliefs. It is fair to say that though abduction cannot have a foundational role, its products cannot be doubted en masse, either. Its justification, qua mode of inference, comes from the need for rational explanation and in particular from the commitment to the view that rational explanation is possible; that the facts “admit of rationalization, and of rationalization by us” (7.219). Interestingly, Peirce claims that this commitment embodies another hypothesis, and as such it is the product of “a fundamental and primary abduction”. As Peirce put it: “it is a primary hypothesis underlying all abduction that the human mind is akin to the truth in the sense that in a finite number of guesses it will light upon the correct hypothesis”. This creates an air of circularity, of course. In essence, a grand abduction is employed to justify the possibility of abductive inference. Peirce does not address this problem directly. For him it seems that this circularity is the inevitable price that needs to be paid if human understanding is at all possible. Explanation aims at (and offers) understanding, but unless it is assumed that the human mind has a capacity or power to reach the truth in a finite number of trials, hitting the right explanations would be a random walk. It is no surprise, then, that Peirce brings in instinct once more. He draws a distinction between two kinds of considerations “which tend toward an expectation that a given hypothesis may be true”: the purely instinctive and the reasoned ones (7.220). 
The instinctive considerations kick in when it comes to the primary hypothesis that the human mind has a power to hit upon the truth. This is not reasoned, though it is supported by an induction on the past record of abductive inferences. As Peirce put it, “it has seldom been necessary to try more than two or three hypotheses made by clear genius before the right one was found”. The reasoned considerations kick in when a body of background beliefs has emerged which has some measure of truth in it. Then, the choice among competing hypotheses is guided by the criteria noted above, e.g., breadth and incomplexity. For Peirce, however, it would be folly to try to hide the claim that “the existence of a natural instinct for truth is, after all, the sheet-anchor of science. From the instinctive, we pass to reasoned, marks of truth in the hypothesis” (7.220).22 In one of the first systematic treatments of Peirce’s views of abduction, Harry Frankfurt (1958, 594) raised what might be called Peirce’s paradox. This is that Peirce appears to want to have it both ways, viz., “that hypotheses are the products of a wonderful imaginative faculty in man and that they are products of a certain sort of logical inference”. It should be clear by now that this paradox is only apparent. Abduction involves a guessing instinct and is a reasoned process, but for Peirce these two elements operate at different levels. The guessing instinct is required for the very possibility of a trustworthy abductive inference. The reasoned process operates within an environment of background beliefs and aims to select among competing hypotheses on the basis of explanatory considerations.
22 The role of instinct in abduction is raised and discussed, in more or less the same way, in Peirce’s fifth lecture on Pragmatism, in 1903 (5.171-5.174). There, Peirce expresses his view that the instinct of guessing right is accounted for by evolution.
7 THE THREE STAGES OF INQUIRY
In Peirce’s mature thought, we have already seen, abduction covers a cluster of operations that generate and evaluate explanatory hypotheses. Peirce was adamant, it was noted, that abduction is not the kind of inference that returns likely hypotheses. It is not in the business of producing judgements of likelihood. This is not to say, we have stressed, that abduction is not trustworthy. Rather, its trustworthiness is a function of the background beliefs within which it operates. But is it not the case that, at the end of the day, we want theories or hypotheses that are likely to be true? Peirce never doubted this. In 1901 he summed this up by saying: “A hypothesis then has to be adopted which is likely in itself and renders the facts likely. This process of adopting a hypothesis as being suggested by the facts is what I call abduction” (7.202). But how can abduction lead to likely hypotheses if it is not meant to do so? In his mature writings Peirce treated abduction as the first part of a three-stage methodological process, the other two stages being deduction and induction. The burden of likelihood is carried not by abduction in and of itself but by the other two methods which complement abduction in the overall method of inquiry. Abduction might confer plausibility or reasonableness on its conclusions, but their probability is determined by their further testing.
Here is a long but nice summary by Peirce himself, offered in 1908: The whole series of mental performances between the notice of the wonderful phenomenon and the acceptance of the hypothesis, during which the usually docile understanding seems to hold the bit between its teeth and to have us at its mercy, the search for pertinent circumstances and the laying hold of them, sometimes without our cognizance, the scrutiny of them, the dark laboring, the bursting out of the startling conjecture, the remarking of its smooth fitting to the anomaly, as it is turned back and forth like a key in a lock, and the final estimation of its Plausibility, I reckon as composing the First Stage of Inquiry. Its characteristic formula of reasoning I term Retroduction, i.e. reasoning from consequent to antecedent (6.469). Retroduction does not afford security. The hypothesis must be tested. This testing, to be logically valid, must honestly start, not as Retroduction starts, with scrutiny of the phenomena, but with examination of the hypothesis, and a muster of all sorts of conditional experiential consequences which would follow from its truth. This constitutes the
Second Stage of Inquiry. For its characteristic form of reasoning our language has, for two centuries, been happily provided with the name Deduction (6.470). The purpose of Deduction, that of collecting consequents of the hypothesis, having been sufficiently carried out, the inquiry enters upon its Third Stage, that of ascertaining how far those consequents accord with Experience, and of judging accordingly whether the hypothesis is sensibly correct, or requires some inessential modification, or must be entirely rejected. Its characteristic way of reasoning is Induction (6.472). Abduction is the sole method by means of which new ideas are introduced. It is the only method by means of which the phenomena are ‘rationalised’ by being explained. But to get from an abductively inferred hypothesis to a judgement of probability, this hypothesis should be subjected to further testing. According to Peirce: The validity of a presumptive adoption of a hypothesis for examination consists in this, that the hypothesis being such that its consequences are capable of being tested by experimentation, and being such that the observed facts would follow from it as necessary conclusions, that hypothesis is selected according to a method which must ultimately lead to the discovery of the truth, so far as the truth is capable of being discovered, with an indefinite approximation to accuracy (2.781). Taken on its own, abduction is the method of generation and ranking of hypotheses which potentially explain a certain explanandum. Peirce says: “The first starting and the entertaining of [a hypothesis], whether as a simple interrogation or with any degree of confidence, is an inferential step which I propose to call abduction” (6.525). But these hypotheses should be subjected to further testing which will determine, ultimately, their degree of confirmation. 
Accordingly, Peirce suggests that abduction should be embedded in a broader framework of inquiry so that the hypotheses generated and evaluated by abduction can be further tested. The result of this testing is the confirmation or disconfirmation of the hypotheses. So, Peirce sees abduction as the first stage of the reasoners’ attempt to add reasonable beliefs to their belief-corpus in the light of new phenomena or observations. The process of generation and first evaluation of hypotheses (abduction) is followed by deduction — i.e., by deriving further predictions from the abduced hypotheses — and then by induction, which Peirce now understands as the process of testing these predictions and hence the process of confirming the abduced hypothesis (cf. 7.202ff). “As soon as a hypothesis has been adopted”, Peirce (7.203) says, the next step “will be to trace out its necessary and probable experiential consequences. This step is deduction”. And he adds:
Having, then, by means of deduction, drawn from a hypothesis predictions as to what the results of experiment will be, we proceed to test the hypothesis by making the experiments and comparing those predictions with the actual results of the experiment. (...) When (...) we find that prediction after prediction, notwithstanding a preference for putting the most unlikely ones to the test, is verified by experiment, whether without modification or with a merely quantitative modification, we begin to accord to the hypothesis a standing among scientific results (7.206). Induction, then, is given an altogether different role from the one it had in his earlier thinking. It now captures the methods by means of which hypotheses are confirmed. Hence, in the transition from his earlier views to his later ones, what really changed is not Peirce’s conception of explanatory reasoning, but rather his views on induction. Induction changed status: from a distinct mode of ampliative reasoning with a definite syllogistic form which leads to the acceptance of a generalisation as opposed to a fact (early phase) to the general process of testing a hypothesis. As Peirce put it: “This sort of inference it is, from experiments testing predictions based on a hypothesis, that is alone properly entitled to be called induction” (7.206). Induction is a process “for testing hypotheses already in hand. The induction adds nothing” (7.217). Induction is no less indispensable than abduction in the overall process of inquiry — but its role is clearly different from the role of abduction. Peirce put this point in a picturesque way when he said that our knowledge of nature consists in building a “cantilever bridge of inductions” over the “chasm that yawns between the ultimate goal of science and such ideas of Man’s environment”, but that “every plank of [this bridge] is first laid by Retroduction alone” (6.475). Peirce kept his view that abduction and induction are distinct modes of reasoning.
In The Logic of Drawing History from Ancient Documents (1901), he noted that abduction and induction are “the opposite poles of reason, the one the most ineffective, the other the most effective of arguments” (7.218). Abduction is “the first step of scientific reasoning, as induction is the concluding step”. Abduction is “merely preparatory”. Abduction makes its start from the facts, without, at the outset, having any particular theory in view, though it is motived by the feeling that a theory is needed to explain the surprising facts. Induction makes its start from a hypothesis which seems to recommend itself, without at the outset having any particular facts in view, though it feels the need of facts to support the theory. Abduction seeks a theory. Induction seeks for facts. In abduction the consideration of the facts suggests the hypothesis. In induction the study of the hypothesis suggests the experiments which bring to light the very facts to which the hypothesis had pointed. Nonetheless, abduction and induction have a common feature: “that both lead to
the acceptance of a hypothesis because observed facts are such as would necessarily or probably result as consequences of that hypothesis”. Hence, Peirce has moved a long way from his earlier view on induction. Abduction covers all kinds of explanatory reasoning (including explanation by subsumption under a generalisation), while induction is confirmation. What is important to note is that Peirce took it that induction is justified in a way radically distinct from the way abduction is justified. He thought that induction is, essentially, a self-corrective method,23 viz., that “although the conclusion [of induction] at any stage of the investigation may be more or less erroneous, yet the further application of the same method must correct the error” (5.145). Being a frequentist about probabilities, Peirce clearly thought that a consistent application of the straight rule of induction will converge in the limit to the true relative frequency of a certain factor A in a class of events B. In one of the most interesting studies of Peirce’s abduction, Douglas R. Anderson (1986, 162) noted that Peircean abduction “is a possibilistic inference whose test is in futuro”. This claim goes a long way in capturing the essence of Peircean abduction. Peirce employed the Aristotelian idea of “esse in futuro” to capture a mode of being which is potential, and not actual. For him, potentialities as well as laws of nature have their esse in futuro. Abduction, it might be claimed, has its justification in futuro — or, better put, it has its full justification in futuro. This means that although a hypothesis might be reasonably accepted as plausible based on explanatory considerations (abduction), the degree of confidence in this hypothesis is not thereby settled. Rather it is tied to the degree of confirmation of this hypothesis, where the latter depends, ultimately, on the future performance of the hypothesis, viz., on how well-confirmed it becomes by further evidence.
This conception of justification in futuro tallies well with Peirce’s account of knowledge and truth. The aim of inquiry is to get doubt-resistant beliefs. As noted already, truth itself boils down to doubt-resistant belief. In What Pragmatism Is, in 1905, Peirce said: You only puzzle yourself by talking of this metaphysical ‘truth’ and metaphysical ‘falsity,’ that you know nothing about. All you have any dealings with are your doubts and beliefs, with the course of life that forces new beliefs upon you and gives you power to doubt old beliefs. If your terms ‘truth’ and ‘falsity’ are taken in such senses as to be definable in terms of doubt and belief and the course of experience (as for example they would be, if you were to define the ‘truth’ as that to a belief in which belief would tend if it were to tend indefinitely toward absolute fixity), well and good: in that case, you are only talking about doubt and belief. But if by truth and falsity you mean something not definable in terms of doubt and belief in any way, then you are talking of entities of whose existence you can know nothing, and which Ockham’s razor would clean shave off. Your problems would be greatly simplified, if, instead of saying that you want to know the ‘Truth,’ you were simply to say that you want to attain a state of belief unassailable by doubt (5.416). All beliefs, then, which are not certain should be subjected to further testing — it is only this further testing (or, at least, the openness to further testing) that can render beliefs permanently settled and hence doubt-resistant. The justification of all fallible beliefs is in futuro. Abduction generates and recommends beliefs; but the process of their becoming doubt-resistant is external to abduction — this is where induction rules.
23 “That Induction tends to correct itself, is obvious enough” (5.776).
8 LOOKING AHEAD
Peirce had the intellectual courage to explore uncharted territories, but this exploration did not leave behind a full and comprehensive map. Despite his expressed wish to write a short book on “the real nature” of explanatory reasoning, he left behind papers, notes and unfinished manuscripts and, with them, a big challenge to his followers to reconstruct his thinking and put together a coherent and comprehensive theory of ampliative reasoning. A few decades passed after Peirce’s death in 1914 before philosophers started to appreciate the depth, richness and complexity of Peirce’s views of abduction. It was not until the publication of the first two volumes of his collected papers in 1931-2 that philosophers started to pay more systematic attention to Peirce’s philosophy, and to his writings on abduction in particular. In a paper published a few years after Peirce’s death, Professor Josiah Royce (who bequeathed Peirce’s manuscripts to the Harvard philosophy department and the Harvard library) drew attention to Peirce’s Lectures on Pragmatism as well as his Lowell Lectures on Logic in 1903-4 and made the following characteristic comment [Royce, 1916, p. 708]: It was these latter [the Lowell Lectures] which James described as ‘flashes of brilliant light relieved against Cimmerian darkness — ‘darkness’ indeed to James as to many others must have seemed those portions on ‘Existential Graphs’ or ‘Abduction’. William James’s reported view of Peirce’s writings on abduction was far from atypical. There is virtually no attempt at a reconstruction or exegesis of Peirce’s views of abduction before Arthur Burks’s (1946). In his long and instructive review of the first two volumes of Peirce’s Collected Papers, Ernest Nagel (1933, 382) devoted only a few lines to abduction, noting that “Presumptive reasoning (...) (also called abduction, retroduction, hypothesis), consists in inferring an explanation, cause, or hypothesis from some fact which can be taken as a consequence of the hypothesis”. And Hans Reichenbach made the following passing note in his [1938, p. 36]:
I admire Charles Peirce as one of the few men who saw the relations between induction and probability at an early time; but just his remarks concerning what he calls ‘abduction’ suffer from an unfortunate obscurity which I must ascribe to his confounding the psychology of scientific discovery with the logical situation of theories in relation to observed facts. When Peirce’s views were studied more carefully, there were two broad ways in which they were developed. The first focused on the issue of justification and reliability of ampliative reasoning; the second focused on the process of discovery of explanatory theories. Gilbert Harman’s [1965] paper on Inference to the Best Explanation (IBE) argued that the best way to conceive of abduction qua inferential method was to see it as the method of inferring to the truth of the best among a number of rival explanations of a set of phenomena. On Harman’s view, abduction is the mode of inference in which a hypothesis H is accepted on the basis that a) it explains the evidence and b) no other hypothesis explains the evidence as well as H does. In a sense, IBE ends up being a liberalised version of Peircean abduction; it is defended as the mode of ampliative reasoning that can encompass hypothetico-deductivism and enumerative induction as special cases.24 One important issue in this way of thinking about abduction concerns its justification: why should it be taken to be the case that IBE is truth-conducive? Here the issue of the justification of IBE has been tied to the prospects of the defence of scientific realism in the philosophy of science.25 Another important issue concerns the virtues of hypotheses that make up goodness of explanation, or measure explanatory power. The identification of these virtues has not gone much further than what Peirce suggested (see, for instance, [Thagard, 1978]).
But the justification of the truth-conducive character of these virtues has become a subject of intense debate (see [McMullin, 1992]). A third issue concerns the relationship between abduction, qua IBE, and the Bayesian theory of confirmation and belief updating (see [Lipton, 2004]). It was Norwood Russell Hanson in the 1950s who suggested that Peirce’s abduction is best seen as a logic of discovery. The then dominant tradition was shaped by Reichenbach’s distinction between the context of discovery and the context of justification, and the key thought (shared by Karl Popper and others as well) was that discovery was not subject to rules — it obeyed no logic; it was amenable only to psychological study (see Reichenbach’s comment on Peirce above). Hanson suggested that discovery falls under rational patterns and argued that this was Peirce’s key idea behind abduction. He took it that a logic of discovery is shaped by the following type of structure: it proceeds retroductively, from an anomaly to the delineation of a kind of explanation H which fits into an organised pattern of concepts [1965, p. 50].
24 For more on this, see Psillos [2002].
25 For more on this, see Psillos [1999, chapter 4].
An Explorer upon Untrodden Ground: Peirce on Abduction
In the 1980s, the study of abduction found a new home in Artificial Intelligence. The study of reasoning by computer scientists unveiled, among other things, a variety of modes of reasoning which tend to capture the defeasible, non-monotonic and uncertain character of human reasoning. The study of abduction became a prominent aspect of this new focus on reasoning. In this respect, pioneering among the researchers in AI has been Bob Kowalski. Together with his collaborators, Kowalski attempted to offer a systematic treatment of both the syntax and the semantics of abduction within the framework of Logic Programming. The aim of an abductive problem is to assimilate a new datum O into a knowledge-base (KB). So, KB is suitably extended by a certain hypothesis H into KB’ such that KB’ incorporates the datum O. Abduction is the process through which a hypothesis H is chosen (see [Kakas et al., 1992; 1997]). Others, notably Bylander and his collaborators (1991), have aimed to offer computational models of abduction which capture its evaluative element.26 Abduction has been used in a host of areas such as fault diagnosis (where abduction is used for the derivation of a set of faults that are likely to cause a certain problem); belief revision (where abduction is used in the incorporation of new information into a belief corpus); scientific discovery; legal reasoning; natural language understanding; and model-based reasoning. In these areas, there have been attempts to advance formal models of abductive reasoning so that its computational properties are clearly understood and its relations to other kinds of reasoning become more precise. A rich map of the conceptual and computational models of abduction is offered in Gabbay and Woods [2005]. In this work, Gabbay and Woods advance their own formal model of abduction that aims to capture some of the nuances of Peirce’s later account.
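The knowledge-base picture sketched above (extend KB by a hypothesis H so that the extended knowledge base entails the new datum O) can be illustrated with a toy propositional sketch. The rules, atoms, and candidate (“abducible”) hypotheses below are invented for illustration; real abductive logic programming systems handle far more (variables, negation, integrity constraints).

```python
# Toy propositional sketch of abduction in a logic-programming spirit:
# a knowledge base of definite rules, a set of abducible hypotheses,
# and a new datum to be assimilated.  All names are invented examples.

def entails(facts, rules, goal):
    """Forward-chain over definite rules; check whether goal is derived."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for body, head in rules:
            if head not in derived and all(b in derived for b in body):
                derived.add(head)
                changed = True
    return goal in derived

def abduce(facts, rules, abducibles, observation):
    """Return each hypothesis H such that KB extended by H entails the datum."""
    return [h for h in abducibles
            if entails(facts | {h}, rules, observation)]

# KB: "wet_grass if rained", "wet_grass if sprinkler_on"
rules = [({"rained"}, "wet_grass"), ({"sprinkler_on"}, "wet_grass")]
facts = set()
hypotheses = abduce(facts, rules,
                    {"rained", "sprinkler_on", "earthquake"}, "wet_grass")
print(sorted(hypotheses))   # → ['rained', 'sprinkler_on']
```

Note that the sketch returns every hypothesis that would account for the datum; ranking the survivors (by simplicity, testability, economy) is exactly the further, Peircean, evaluative element that Bylander and others tried to model.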
Gabbay and Woods treat abduction as a method for solving an ignorance-problem, where the latter is a problem not solvable by presently available cognitive resources. Given a choice between surrender (leaving the problem unsolved) and subduance (looking for novel cognitive resources), they promote abduction as a middle way: ignorance is not (fully) removed, but becomes the basis for looking for resources upon which reasoned action can be based. The abduced hypothesis does not become known, but it is still the basis for further exploration and action. Circa 1897, Peirce wrote this: The development of my ideas has been the industry of thirty years. I did not know as I ever should get to publish them, their ripening seemed so slow. But the harvest time has come, at last, and to me that harvest seems a wild one, but of course it is not I who have to pass judgment. It is not quite you, either, individual reader; it is experience and history (1.12). Both experience and history have now spoken. Peirce’s theory of abduction still yields fruits and promises good harvests for many years to come.
26 See also [Josephson and Josephson, 1994]. For a good survey of the role of abduction in AI, see [Konolige, 1996].
FURTHER READING
Perhaps the most important early writings on Peirce’s theory of abduction are by Burks [1946], Frankfurt [1958], and Fann [1970]. A very significant more recent article is Anderson [1986]. More recent works that discuss aspects of Peirce’s views of abduction are Hofmann [1999] and Paavola [2007]. An excellent, brief but comprehensive account of Peirce’s philosophy of pragmatism is given in Misak [1999]. Thagard’s [1981] is a brief but suggestive account of the relation between abduction and hypothesis, while his [1977] explains the relation between Induction and Hypothesis. On Peirce’s account of Induction, see Goudge [1940], Jessup [1974] and Sharpe [1970]. On issues related to abduction as a logic of discovery, see Hanson [1965]. The classic book-length treatment of Inference to the Best Explanation is by Lipton [1991]. A recent thorough discussion of the rival interpretations of Peirce (IBE vs logic of discovery) is given in McKaughan [2008]. For an emphasis on computational aspects of abduction, see Aliseda [2006]. The role of abduction in science is discussed in Magnani [2001]. On the relation between abduction and Bayesian confirmation, see the symposium on Peter Lipton’s Inference to the Best Explanation in Philosophy and Phenomenological Research, 74: 421-462, 2007 (symposiasts: Alexander Bird, Christopher Hitchcock and Stathis Psillos). For a development of the Peircean two-dimensional framework see Psillos [2002] and [2009].
BIBLIOGRAPHY
[Aliseda, 2006] A. Aliseda. Abductive Reasoning. Logical Investigations into Discovery and Explanation, Synthese Library vol. 330, Springer, 2006.
[Anderson, 1986] D. R. Anderson. The Evolution of Peirce’s Concept of Abduction, Transactions of the Charles S. Peirce Society 22: 145-64, 1986.
[Burks, 1946] A. Burks. Peirce’s Theory of Abduction, Philosophy of Science 13: 301-306, 1946.
[Bylander et al., 1991] T. Bylander, D. Allemang, M. C. Tanner, and J. R. Josephson.
The Computational Complexity of Abduction, Artificial Intelligence 49: 25-60, 1991.
[Fann, 1970] K. T. Fann. Peirce’s Theory of Abduction, The Hague: Martinus Nijhoff, 1970.
[Frankfurt, 1958] H. Frankfurt. Peirce’s Notion of Abduction, The Journal of Philosophy 55: 593-7, 1958.
[Gabbay and Woods, 2005] D. M. Gabbay and J. Woods. The Reach of Abduction: Insight and Trial, volume 2 of A Practical Logic of Cognitive Systems, Amsterdam: North-Holland, 2005.
[Goudge, 1940] T. A. Goudge. Peirce’s Treatment of Induction, Philosophy of Science 7: 56-68, 1940.
[Hanson, 1965] N. R. Hanson. Notes Towards a Logic of Discovery, in R. J. Bernstein (ed.) Critical Essays on C. S. Peirce, Yale University Press, 1965.
[Harman, 1965] G. Harman. The Inference to the Best Explanation, The Philosophical Review 74: 88-95, 1965.
[Hempel, 1965] C. G. Hempel. Aspects of Scientific Explanation, New York: The Free Press, 1965.
[Hofmann, 1999] M. Hofmann. Problems with Peirce’s Concept of Abduction, Foundations of Science 4: 271-305, 1999.
[Josephson and Josephson, 1994] J. R. Josephson and S. G. Josephson, eds. Abductive Inference, Cambridge: Cambridge University Press, 1994.
[Jessup, 1974] J. A. Jessup. Peirce’s Early Account of Induction, Transactions of the Charles S. Peirce Society 10: 224-34, 1974.
[Kakas et al., 1992] A. C. Kakas, R. A. Kowalski, and F. Toni. Abductive Logic Programming, Journal of Logic and Computation 2: 719-70, 1992.
[Kakas et al., 1997] A. C. Kakas, R. A. Kowalski, and F. Toni. The Role of Abduction in Logic Programming, in D. Gabbay et al. (eds) Handbook of Logic in Artificial Intelligence and Logic Programming, Oxford: Oxford University Press, 1997.
[Konolige, 1996] K. Konolige. Abductive Theories in Artificial Intelligence, in G. Brewka (ed.) Principles of Knowledge Representation, CSLI Publications, 1996.
[Lipton, 1991] P. Lipton. Inference to the Best Explanation (2nd enlarged edition, 2004), London: Routledge, 1991.
[Magnani, 2001] L. Magnani. Abduction, Reason and Science. Processes of Discovery and Explanation, New York: Kluwer Academic, 2001.
[McKaughan, 2008] D. J. McKaughan. From Ugly Duckling to Swan: C. S. Peirce, Abduction, and the Pursuit of Scientific Theories, Transactions of the Charles S. Peirce Society 44: 446-68, 2008.
[McMullin, 1992] E. McMullin. The Inference that Makes Science, Milwaukee: Marquette University Press, 1992.
[Misak, 1999] C. Misak. American Pragmatism — Peirce, in C. L. Ten (ed.) Routledge History of Philosophy, Volume 7, The Nineteenth Century, London: Routledge, 1999.
[Nagel, 1933] E. Nagel. Charles Peirce’s Guesses at the Riddle, The Journal of Philosophy 30: 365-86, 1933.
[Paavola, 2007] S. Paavola. On the Origin of Ideas: An Abductivist Approach to Discovery, Philosophical Studies from the University of Helsinki 15, 2007.
[Peirce, 1931-1958] C. S. Peirce. Collected Papers of Charles Sanders Peirce, C. Hartshorne & P. Weiss (eds) (volumes 1-6) and A. Burks (volumes 7 and 8), Cambridge MA: Belknap Press, 1931-1958.
[Psillos, 1999] S. Psillos. Scientific Realism: How Science Tracks Truth, London: Routledge, 1999.
[Psillos, 2002] S. Psillos. Simply the Best: A Case for Abduction, in A. C. Kakas and F. Sadri (eds) Computational Logic: From Logic Programming into the Future, LNAI 2408, Berlin-Heidelberg: Springer-Verlag, pp. 605-25, 2002.
[Psillos, 2009] S. Psillos. Knowing the Structure of Nature, Palgrave-Macmillan, 2009.
[Reichenbach, 1938] H. Reichenbach. On Probability and Induction, Philosophy of Science 5: 21-45, 1938.
[Ross, 1949] W. D. Ross. Aristotle’s Prior and Posterior Analytics (with intr. and commentary), Oxford: Clarendon Press, 1949.
[Royce, 1916] J. Royce. Charles Sanders Peirce, The Journal of Philosophy 13: 701-9, 1916.
[Sharpe, 1970] R. Sharpe. Induction, Abduction and the Evolution of Science, Transactions of the Charles S. Peirce Society 6: 17-33, 1970.
[Smith, 1989] R. Smith. Aristotle — Prior Analytics (translation, with intr., notes and commentary), Indianapolis: Hackett Publishing Company, 1989.
[Thagard, 1977] P. Thagard. On the Unity of Peirce’s Theory of Hypothesis, Transactions of the Charles S. Peirce Society 13: 112-21, 1977.
[Thagard, 1978] P. Thagard. The Best Explanation: Criteria for Theory Choice, The Journal of Philosophy 75: 76-92, 1978.
[Thagard, 1981] P. Thagard. Peirce on Hypothesis and Abduction, in K. Ketner et al., eds., Proceedings of the C. S. Peirce Bicentennial International Congress, Texas: Texas Tech University Press, 271-4, 1981.
THE MODERN EPISTEMIC INTERPRETATIONS OF PROBABILITY: LOGICISM AND SUBJECTIVISM

Maria Carla Galavotti

This chapter will focus on the modern epistemic interpretations of probability, namely logicism and subjectivism. The qualification “modern” is meant to oppose the “classical” interpretation of probability developed by Pierre Simon de Laplace (1749-1827). With respect to Laplace’s definition, modern epistemic interpretations do not retain the strict linkage with the doctrine of determinism. Moreover, Laplace’s “Principle of insufficient reason”, by which equal probability is assigned to all possible outcomes of a given experiment (uniform prior distribution), has been called into question by modern epistemic interpretations and gradually superseded by other criteria. In the following pages the main traits of the logical and subjective interpretations of probability will be outlined, together with the positions of a number of authors who developed different versions of such viewpoints. The work of Rudolf Carnap, who is widely recognised as the most prominent representative of logicism, will not be dealt with here as it is the topic of another chapter in the present volume.1

1 THE LOGICAL INTERPRETATION OF PROBABILITY

1.1 Forefathers
The logical interpretation regards probability as an epistemic notion pertaining to our knowledge of facts rather than to facts themselves. Compared to the “classical” epistemic view of probability forged by Pierre Simon de Laplace, this approach stresses the logical aspect of probability, and regards the theory of probability as part of logic. According to Ian Hacking, the logical interpretation can be traced back to Leibniz, who entertained the idea of a logic of probability comparable to deductive logic, and regarded probability as a relational notion to be valued in relation to the available data. More particularly, Leibniz is seen by Hacking as anticipating Carnap’s programme of inductive logic.2
1 See the chapter by Sandy Zabell in this volume. For a more extensive treatment of the topics discussed here, see Galavotti [2005].
2 See Hacking [1971] and [1975].
Handbook of the History of Logic. Volume 10: Inductive Logic. Volume editors: Dov M. Gabbay, Stephan Hartmann and John Woods. General editors: Dov M. Gabbay and John Woods. © 2011 Elsevier B.V. All rights reserved.
The idea that probability represents a sort of degree of certainty, more precisely the degree to which a hypothesis is supported by a given amount of information, was later worked out in some detail by the Czech mathematician and logician Bernard Bolzano (1781–1848). Author of the treatise Wissenschaftslehre (1837), which is reputed to herald contemporary analytical philosophy,3 Bolzano defines probability as the “degree of validity” (Grad der Gültigkeit) relative to a proposition expressing a hypothesis, with respect to other propositions, expressing the possibilities open to it. Probability is seen as an objective notion, exactly like truth, from which probability derives.4 The main ingredients of logicism, namely the idea that probability is a logical relation between propositions and that it is endowed with an objective character, are found in Bolzano’s conception, which can be seen as a direct ancestor of Carnap’s theory of probability as partial implication.
1.2 Nineteenth century British logicists
In the nineteenth century the interpretation of probability was widely debated in Great Britain, and opposite viewpoints were upheld. Both the empirical and the epistemic views of probability counted followers. The empirical viewpoint imprinted the frequentist interpretation forged by two Cambridge scholars: Robert Leslie Ellis (1817–1859) and John Venn (1834–1923), author of The Logic of Chance, which appeared in three editions in 1866, 1876 and 1888. The epistemic viewpoint inspired the logical interpretation embraced by George Boole, Augustus De Morgan and Stanley Jevons, whose work is analysed in volume IV of the Handbook of the History of Logic, devoted to British Logic in the Nineteenth Century.5 Therefore, the present account will be limited to a brief outline of these authors’ views on probability. George Boole (1815–1864) is the author of the renowned An Investigation of the Laws of Thought, on Which are Founded the Mathematical Theories of Logic and Probabilities (1854). Although his name is mostly associated with (Boolean) algebra, Boole made important contributions to differential and integral calculus, and also to probability. According to biographer Desmond MacHale, Boole’s work on probability was “greatly encouraged by W.F. Donkin, Savilian Professor of Astronomy in Oxford, who had himself written some important papers on the subject of probability. Boole was gratified that Donkin agreed with his results” [MacHale, 1985, p. 215]. William Donkin, on whom something will be added in the second part of this chapter, shared with Boole an epistemic view of probability, although he was himself closer to the subjective outlook. According to Boole “probability is expectation founded upon partial knowledge” [Boole, 1854a; 1916, p. 258]. In other words, probability gives grounds for
3 See Dummett [1993].
4 See Bolzano [1837].
5 See Gabbay and Woods, eds. [2008], in particular the chapters by Dale Jacquette on “Boole’s Logic” (pp. 331–379); Michael E. Hobart and Joan L. Richards on “De Morgan’s Logic” (pp. 283–329); and Bert Mosselmans and Ard van Moer on “William Stanley Jevons and the Substitution of Similars” (pp. 515–531).
expectation, based on the information available to those who evaluate it. However, probability is not itself a degree of expectation: “The rules which we employ in life-assurance, and in the other statistical applications of the theory of probabilities, are altogether independent of the mental phaenomena of expectation. They are founded on the assumption that the future will bear a resemblance to the past; that under the same circumstances the same event will tend to recur with a definite numerical frequency; not upon any attempt to submit to calculation the strength of human hopes and fears”. [Boole, 1854a; 1916, pp. 258-259] Boole summarizes his own attitude thus: “probability I conceive to be not so much expectation, as a rational ground for expectation” [Boole, 1854b; 1952, p. 292]. The accent on rationality reflects a distinctive trait of the logical interpretation, which takes a normative attitude towards the theory of probability. As we shall see, this marks a major difference from subjectivism. Within Boole’s perspective, the normative character of probability derives from that of logic, to which it belongs. The “laws of thought” investigated in his most famous book are not meant to describe how the mind works, but rather how it should work in order to be rational: “the mathematical laws of reasoning are, properly speaking, the laws of right reasoning only” [Boole, 1854a, 1916, p. 428].6 According to Boole’s logical perspective, probability does not represent a property of events, being rather a relationship between propositions describing events.
In Boole’s words: “Although the immediate business of the theory of probability is with the frequency of the occurrence of events, and although it therefore borrows some of its elements from the science of number, yet as the expression of the occurrence of those events, and also of their relations, of whatever kind, which connect them, is the office of language, the common instrument of reason, so the theory of probabilities must bear some definite relation to logic. The events of which it takes account are expressed by propositions; their relations are involved in the relations of propositions. Regarded in this light, the object of the theory of probabilities may be thus stated: Given the separate probabilities of any propositions to find the probability of another proposition. By the probability of a proposition, I here mean [...] the probability that in any particular instance, arbitrarily chosen, the event or condition which it affirms will come to pass”. [Boole, 1851, 1952, pp. 250-251] Accordingly, the theory of probability is “coextensive with that of logic, and [...] it recognizes no relations among events but such as are capable of being expressed by propositions” [Boole, 1851, 1952, p. 251].
6 According to some authors, Boole combines a normative attitude towards logic with psychologism. See Kneale [1948] and the “Introduction” (Part I by Ivor Grattan-Guinness and Part II by Gérard Bornet) in Boole [1997].
Two kinds of objects of interest fall within the realm of probability: games of chance and observable phenomena belonging to the natural and social sciences. Games of chance confront us with a peculiar kind of problem, where the ascertainment of data is in itself a way of measuring probabilities. Events of this kind are called simple. Sometimes such events are combined to form a compound event, as when it is asked what is the probability of obtaining a six twice in two successive throws of a die. By contrast, the probability of phenomena encountered in nature can only be measured by means of frequencies, and then we face compound events. Simple events are described by simple propositions, and compound events are described by compounded propositions. Simple propositions are combined to form compounded propositions by means of the logical relations of conjunction and disjunction, and the dependence of the occurrence of certain events upon others can be represented by conditional propositions. Once the events subject to probability are described by propositions, these can be handled by using the methods of logic. The fundamental rules for calculating compounded probabilities are presented by Boole in such a way as to show their intimate relation with logic, and more precisely with his algebra. The conclusion attained is that there is a “natural bearing and dependence” [Boole, 1854a, 1916, p. 287] between the numerical measure of probability and the algebraic representation of the values of logical expressions. The task Boole sets himself is “to obtain a general method by which, given the probabilities of any events whatsoever, be they simple or compound, dependent or independent, conditioned or not, one can find the probability of some other event connected with them, the connection being either expressed by, or implicit in, a set of data given by logical equations”. [Boole, 1854a, 1916, p.
287] In so doing Boole sets forth the logicist programme, to be resumed by Carnap a hundred years later.7 Another representative of nineteenth century logicism is the mathematician Augustus De Morgan (1806–1871), who greatly influenced Boole.8 De Morgan’s major work in logic is the treatise Formal Logic: or, The Calculus of Inference, Necessary and Probable (1847) in which he claims that “by degree of probability we really mean, or ought to mean, degree of belief” [De Morgan, 1847, 1926, p. 198]. De Morgan strongly opposed the tenet that probability is an objective feature of objects, like their physical properties: “I throw away objective probability altogether, and consider the word as meaning the state of the mind with respect to an assertion, a coming event, or any other matter on which absolute knowledge does not exist” [De Morgan, 1847, 1926, p. 199]. However, when making these claims, De Morgan does not refer to actual belief, entertained by individual persons, but
7 The reader is referred to Hailperin [1976] for a detailed account of Boole’s theory of probability.
8 On De Morgan’s life see the memoir written by his wife, in De Morgan, Sophia Elizabeth [1882], also containing some correspondence.
rather to the kind of belief a rational agent ought to adopt when evaluating probability. Therefore, to say that the probability of a certain event is three to one should be taken to mean “that in the universal opinion of those who examine the subject, the state of mind to which a person ought to be able to bring himself is to look three times as confidently upon the arrival as upon the non-arrival” [De Morgan, 1847, 1926, p. 200]. De Morgan also wrote some essays specifically devoted to probability, including Theory of Probabilities (1837), and An Essay on Probabilities, and on their Applications to Life, Contingencies and Insurance Offices (1838), where he maintains that “the quantities which we propose to compare are the forces of the different impressions produced by different circumstances” [De Morgan, 1838, p. 6], and that “probability is the feeling of the mind, not the inherent property of a set of circumstances” [De Morgan, 1838, p. 7]. At first glance, De Morgan’s description of probability as a “degree of belief” and “state of mind” associates him with subjectivism. But his insistence upon referring to the human mind as transcending individuals, not to the minds of single agents who evaluate probabilities, sets him apart from modern subjectivists like Bruno de Finetti. The logicist attitude towards probability also characterizes the work of the economist and logician William Stanley Jevons (1835-1882).9 In The Principles of Science (1873) Jevons claims that “probability belongs wholly to the mind” [Jevons, 1873, 1877, p. 198]. While embracing an epistemic approach, Jevons does not define probability as a “degree of belief”, because he finds this terminology ambiguous. Against Augustus De Morgan, his teacher at University College London, he maintains that “the nature of belief is not more clear [...] than the notion which it is used to define.
But an all-sufficient objection is, that the theory does not measure what the belief is, but what it ought to be” [Jevons, 1873, 1877, p. 199]. Jevons prefers “to dispense altogether with this obscure word belief, and to say that the theory of probability deals with quantity of knowledge” [Jevons, 1873, 1877, p. 199]. So defined, probability is seen as a suitable guide of belief and action. In Jevons’ words: “the value of the theory consists in correcting and guiding our belief, and rendering one’s states of mind and consequent actions harmonious with our knowledge of exterior conditions” [Jevons, 1873, 1877, p. 199]. Deeply convinced of the utility and power of probability, Jevons established a close link between probability and induction, arguing “that it is impossible to expound the methods of induction in a sound manner, without resting them upon the theory of probability” [Jevons, 1873, 1877, p. 197]. In this connection he praises Bayes’ method: “No inductive conclusions are more than probable, and [...] the theory of probability is an essential part of logical method, so that the logical value of every inductive result must be determined consciously or unconsciously, according to the principle of the inverse method of probability”. [Jevons, 1873, 1877, p. xxix]
9 See Keynes [1936, 1972] for a biographical sketch of Jevons.
A controversial aspect of Jevons’ work is his defence of Laplace against various criticisms raised by a number of authors including Boole. While granting Laplace’s critics that the principle of insufficient reason is to a certain extent arbitrary, he still regards it as the best solution available: “It must be allowed that the hypothesis adopted by Laplace is in some degree arbitrary, so that there was some opening for the doubt which Boole has cast upon it. [...] But it may be replied [...] that the supposition of an infinite number of balls treated in the manner of Laplace is less arbitrary and more comprehensive than any other that can be suggested”. [Jevons, 1873, 1877, p. 256] According to Jevons, Laplace’s method is of great help in situations characterized by lack of knowledge, so it “is only to be accepted in the absence of all better means, but like other results of the calculus of probability, it comes to our aid when knowledge is at an end and ignorance begins, and it prevents us from over-estimating the knowledge we possess”. [Jevons, 1873, 1877, p. 269] When reading Jevons, one is impressed by his deeply probabilistic attitude, as testified by statements like the following: “the certainty of our scientific inferences [is] to a great extent a delusion” [Jevons, 1873, 1877, p. xxxi], and “the truth or untruth of a natural law, when carefully investigated, resolves itself into a high or low degree of probability” [Jevons, 1873, 1877, p. 217]. Jevons regards knowledge as intrinsically incomplete and calls attention to the shaky foundation of science, which is based on the assumption of the uniformity of nature. He argues that “those who so frequently use the expression Uniformity of Nature seem to forget that the Universe might exist consistently with the laws of nature in the most different conditions” [Jevons, 1873, 1877, p. 749]. In view of all this, appeal to probability is mandatory.
Although probability does not tell us much about what happens in the short run, it represents our best tool for facing the future: “All that the calculus of probability pretends to give, is the result in the long run, as it is called, and this really means in an infinity of cases. During any finite experience, however long, chances may be against us. Nevertheless the theory is the best guide we can have”. [Jevons, 1873, 1877, p. 261] This suggests that for Jevons the ultimate justification of inductive inference is to be sought on pragmatic grounds.
1.3 William Ernest Johnson
William Ernest Johnson (1858–1931), mathematician, philosopher and logician, Fellow of King’s College and lecturer in the University of Cambridge, greatly
influenced outstanding personalities such as John Maynard Keynes, Frank Plumpton Ramsey and Harold Jeffreys. His most important work is Logic, published in three volumes between 1921 and 1924. By the time of his death he had been working on a fourth volume of Logic, dealing with probability. The drafts of the first three chapters were published posthumously in Mind in 1932, under the title: “Probability: The Relations of Proposal to Supposal”; “Probability: Axioms” and “Probability: The Deductive and the Inductive Problems”. The “Appendix on Eduction”, closing the third volume of Logic, also focuses on probability. Johnson adopts a “philosophical” approach to logic, stressing its epistemic aspects. He regards logic as “the analysis and criticism of thought” [Johnson, 1921, 1964, p. xiii], and takes a critical attitude towards formal approaches. By doing so, he set himself apart from the mainstream of the period. In a sympathetic spirit, Keynes observes that Johnson “was the first to exercise the epistemic side of logic, the links between logic and the psychology of thought. In a school of thought whose natural leanings were towards formal logic, he was exceptionally well equipped to write formal logic himself and to criticize everything which was being contributed to the subject along formal lines”. [Keynes, 1931, 1972, p. 349] Johnson makes a sharp distinction between “the epistemic aspect of thought”, connected with “the variable conditions and capacities for its acquisition”, and its “constitutive aspect”, referring to “the content of knowledge which has in itself a logically analysable form” [Johnson, 1921, 1964, pp. xxxiii-xxxiv]. The epistemic and grammatical aspects of logic are the two distinct albeit strictly intertwined components along which logic is to be analysed. Regarding probability, Johnson embraces a logical attitude that attaches probability to propositions. 
While taking this standpoint, he rejects the conception of probability as a property of events: “familiarly we speak of the probability of an event — he writes — but [...] such an expression is not justifiable” [Johnson, 1932, p. 2]. By contrast, “Probability is a character, variable in quantity or degree, which may be predicated of a proposition considered in its relation to some other proposition. The proposition to which the probability is assigned is called the proposal, and the proposition to which the probability of the proposal refers is called the supposal”. [Johnson, 1932, p. 8] The terms “proposal” and “supposal” stand for what are usually called “hypothesis” and “evidence”. As Johnson puts it, a peculiar feature of the theory of probability is that when dealing with it “we have to recognise not only the two assertive attitudes of acceptance and rejection of a given assertum, but also a third attitude, in which judgment as to its truth or falsity is suspended; and [...] probability can only be expounded by reference to such an attitude towards a given assertum” [Johnson, 1932, p. 2]. If the act of suspending judgment is a mental
fact, and as such falls within the competence of psychology, the treatment of probability taken in reference to that act is also strongly connected to logic, because logic provides the norms to be imposed on it. The following passage describes in what sense for Johnson probability falls within the realm of logic: “The logical treatment of probability is related to the psychological treatment of suspense of judgment in the same way as the logical treatment of the proposition is related to the psychological treatment of belief. Just as logic lays down some conditions for correct belief, so also it lays down conditions for correcting the attitude of suspense of judgment. In both cases we hold that logic is normative, in the sense that it imposes imperatives which have significance only in relation to presumed errors in the processes of thinking: thus, if there are criteria of truth, it is because belief sometimes errs. Similarly, if there are principles for the measurement of probability, it is because the attitude of suspense towards an assertum involves a mental measurable element, which is capable of correction. We therefore put forward the view, that probability is quantitative because there is a quantitative element in the attitude of suspense of judgment”. [Johnson, 1932, pp. 2-3] Johnson distinguishes three types of probability statements according to their form. These three types should not be confused, because they give rise to different problems. They are: “(1) The singular proposition, e.g., that the next throw will be heads, or that this applicant for insurance will die within a year; (2) The class-fractional proposition, e.g., that, of the applicants to an insurance office, 3/4 of consumptives will die within a year; or that 1/2 of a large number of throws will be heads; (3) The universal proposition, e.g., that all men die before the age of 150 years”. [Johnson, 1932, p.
2] In more familiar terminology, Johnson’s concern is to distinguish between propositions referring to (1) a generic individual randomly chosen from a population, (2) a finite sample or population, (3) an infinite population. The distinction is important for both understanding and evaluating statistical inference, and Johnson has the merit of having called attention to it.10 Closely related is Johnson’s view that probability, conceived as the relation between proposal and supposal, presents two distinct aspects: constructional and inferential. Grasping the constructional relation between any two given propositions means that “both the form of each proposition taken by itself, and the process by which one proposition is constructed from the other” [Johnson, 1932, p. 4] are taken into account. In the case of probability, the form of the propositions involved and the way in which the proposal is constructed by modification of the
10 Some remarks on the relevance of the distinction made by Johnson are to be found in Costantini and Galavotti [1987].
supposal will determine the constructional relation between them. On this constructional relation is in turn based the inferential relation, “namely, the measure of probability that should be assigned to the proposal as based upon assurance with respect to the truth of the supposal” [Johnson, 1932, p. 4]. A couple of examples, taken from Johnson’s exposition, will illustrate the point: “Let the proposal be that ‘The next throw of a certain coin will give heads’. Let the supposal be that ‘the next throw of the coin will give heads or tails’. Then the relation of probability in which the proposal stands to the supposal is determined by the relation of the predication ‘heads’ to the predication ‘heads or tails’. Or, to take another example, let the proposal be that ‘the next man we meet will be tall and red-haired’, and the supposal that ‘the next man we meet will be tall’. Then the relation of predication ‘tall and red-haired’ to the predication ‘tall’ will determine the probability to be assigned to the proposal as depending on the supposal. These two cases illustrate the way in which the logical conjunctions ‘or’ and ‘and’ enter into the calculus of probability”. [Johnson, 1932, p. 8] Building on these concepts, Johnson developed a theory of logical probability that is ultimately based on a relation of partial implication between propositions. This brings Johnson’s theory close to Carnap’s, with the fundamental difference that Carnap adopted a definition of the “content” of a proposition that relies on the more sophisticated tools of formal semantics. A major aspect of Johnson’s work on probability concerns the introduction of the so-called Permutation postulate, which corresponds to the property better known as exchangeability. This can be described by saying that exchangeable probability functions assign probability in a way that depends on the number of experienced cases, irrespective of the order in which they have been observed.
In other words, under exchangeability probability is invariant with respect to permutation of individuals. This property plays a crucial role within Carnap’s inductive logic — where it is named symmetry — and de Finetti’s subjective theory of probability, which will be examined in the second part of this chapter. As we shall see, Johnson’s discovery of this result left some traces in Ramsey’s work. Johnson’s accomplishment was explicitly acknowledged by the Bayesian statistician Irving John Good, whose monograph The Estimation of Probabilities. An Essay on Modern Bayesian Methods opens with the following words: “This monograph is dedicated to William Ernest Johnson, the teacher of John Maynard Keynes and Harold Jeffreys” [Good, 1965, p. v]. In that book Good makes extensive use of what he calls “Johnson’s sufficiency postulate”, a label that he later modified by substituting the term “sufficiency” with “sufficientness”. Sandy Zabell’s article “W.E. Johnson’s ‘Sufficientness’ Postulate” offers an accurate reconstruction of Johnson’s argument, giving a generalisation of it and calling attention to its relevance for Bayesian statistics.11
11 See Zabell [1982]. Also relevant are Zabell [1988] and [1989]. All three papers are reprinted in Zabell [2005].
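The order-invariance expressed by the Permutation postulate can be seen in a small computational sketch. The example below is not drawn from Johnson's text: it uses Laplace's rule of succession, a simple special case of the predictive rules satisfying Johnson's postulate, purely as an illustration, and the function name `seq_prob` is hypothetical.

```python
from math import factorial

def seq_prob(seq):
    """Probability of a 0/1 sequence under Laplace's rule of succession
    (uniform prior over the unknown chance of success).

    Each predictive probability depends only on the counts observed so far,
    so the resulting sequence probability is exchangeable: it depends on how
    many 1s and 0s occur, not on the order in which they occur."""
    p, s, f = 1.0, 0, 0          # running probability, successes, failures
    for x in seq:
        n = s + f
        # predictive probability of the next outcome given the counts so far
        p *= (s + 1) / (n + 2) if x else (f + 1) / (n + 2)
        if x:
            s += 1
        else:
            f += 1
    return p

# Two orderings with the same counts (three 1s, one 0) receive the same
# probability, matching the closed form s! * f! / (n + 1)!
a = seq_prob([1, 1, 0, 1])
b = seq_prob([0, 1, 1, 1])
c = factorial(3) * factorial(1) / factorial(5)
print(a, b, c)  # all three equal 0.05
```

The closed form makes the invariance explicit: any sequence with the same numbers of successes and failures receives the same probability, which is exactly the permutation invariance that Johnson's postulate, Carnap's symmetry, and de Finetti's exchangeability all turn on.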
By contrast, the insight of Johnson’s treatment of probability was not grasped by his contemporaries, and his contribution, including the exchangeability result, remained almost completely ignored. Charlie Dunbar Broad’s comment on Johnson’s “Appendix on Eduction” testifies to this attitude: “about the Appendix all I can do is, with the utmost respect to Mr Johnson, to parody Mr Hobbes’ remark about the treatises of Milton and Salmasius: ‘very good mathematics; I have rarely seen better. And very bad probability; I have rarely seen worse’” [Broad, 1924, p. 379].
1.4 John Maynard Keynes: a logicist with a human face
The economist John Maynard Keynes (1883–1946), one of the leading celebrities of the last century, embraced the logical view in his A Treatise on Probability (1921). Son of the logician John Neville Keynes, Maynard was educated at Eton and Cambridge, where he later became a scholar and member of King’s College. Besides playing a crucial role in public life as a political advisor, Keynes was an indefatigable supporter of the arts, as testified, among other things, by his contribution to the establishment of the Cambridge Arts Theatre. In “A Cunning Purchase: the Life and Work of Maynard Keynes” Roger Backhouse and Bradley Bateman observe that “Keynes’ role as an economic problem-solver and a patron of the arts would continue through his last decade, despite his poor health” (Backhouse and Bateman [2006], p. 4). In Cambridge, Keynes was a member of the “Apostles” discussion society — also known as “The Society”12 — together with personalities of the calibre of Lytton Strachey, Leonard Woolf, Henry Sidgwick, John McTaggart, Alfred North Whitehead, Bertrand Russell, Frank Ramsey and, last but not least, George Edward Moore. The latter exercised a great influence on the group, as well as on the partly overlapping “Bloomsbury group”, of which Maynard was also part. It was in this atmosphere deeply imbued with philosophy that the young Keynes wrote his book on probability. In The Life of John Maynard Keynes Roy Forbes Harrod maintains that the Treatise was written in the years 1906-1911.13 Although by that time the book was all but completed, Keynes could not attend to its final revision until 1920, due to his political commitments.
When it finally appeared in print in 1921, the book was very well received, partly because of the fame that Keynes had by that time gained as an economist and political adviser, partly because it was the first systematic work on probability by an English writer after John Venn’s The Logic of Chance, whose first edition had been published forty-five years earlier, namely in 1866. A review of the Treatise by Charlie Dunbar Broad opens with this passage: “Mr Keynes’ long awaited work on Probability is now published, and will at once take its place as the best treatise on the logical foundations of the subject” [Broad, 1922, p. 72], and closes as follows: “I can only conclude by congratulating Mr Keynes
12 See Levy [1979] and Harrod [1951] for more details on the Apostles Society.
13 See Harrod [1951]. On Keynes’ life see also Skidelsky [1983-1992].
on finding time, amidst so many public duties, to complete this book, and the philosophical public on getting the best work on Probability that they are likely to see in this generation” [Broad, 1922, p. 85]. Referring to this statement, in an obituary of Keynes Richard Bevan Braithwaite observes that “Broad’s prophecy has proved correct” [Braithwaite, 1946, p. 284]. More evidence of the success attained by the Treatise is offered by Braithwaite in the portrait “Keynes as a Philosopher”, included in the collection Essays on John Maynard Keynes, edited by Maynard’s nephew Milo Keynes, where he maintains that “The Treatise was enthusiastically received by philosophers in the empiricist tradition. [...] The welcome given to Keynes’ book was largely due to the fact that his doctrine of probability filled an obvious gap in the empiricist theory of knowledge. Empiricists had divided knowledge into that which is ‘intuitive’ and that which is ‘derivative’ (to use Russell’s terms), and he regarded the latter as being based upon the former by virtue of there being logical relationships between them. Keynes extended the notion of logical relation to include probability relations, which enabled a similar account to be given of how intuitive knowledge could form the basis for rational belief which fell short of knowledge”. [Braithwaite, 1975, pp. 237-238] Braithwaite’s remarks remind us that at the time when Keynes’ Treatise was published, empiricist philosophers, under the spell of works like Russell’s and Whitehead’s Principia Mathematica, paid more attention to the deductive aspects of knowledge than to probability. Nevertheless, one should not forget that, as we have seen, the logical approach to probability already counted a number of supporters in Great Britain.
Besides, in the same years a similar approach was embraced in Austria by Ludwig Wittgenstein and Friedrich Waismann.14 In the “Preface” to the Treatise Keynes acknowledges his debt to William Ernest Johnson, and more generally to the Cambridge philosophical setting, regarded as an ideal continuation of the great empiricist tradition “of Locke and Berkeley and Hume, of Mill and Sidgwick, who, in spite of their divergencies of doctrine, are united in a preference for what is matter of fact, and have conceived their subject as a branch rather of science than of creative imagination” [Keynes, 1921, pp. v-vi]. Keynes takes the theory of probability to be a branch of logic, more precisely that part of logic dealing with arguments that are not conclusive, but can be said to have a greater or lesser degree of inconclusiveness. In Keynes’ words: “Part of our knowledge we obtain direct; and part by argument. The Theory of Probability is concerned with that part which we obtain by argument, and treats of the different degrees in which the results so obtained are conclusive or inconclusive” [Keynes, 1921, p. 3]. Like the logic of conclusive arguments, the logic of probability investigates the general principles of inconclusive arguments. Both certainty and probability depend on the amount of knowledge that the premisses of an argument convey to support the conclusion, the difference being that certainty obtains when the amount of available knowledge authorizes full belief, while in all other cases one obtains degrees of belief. Certainty is therefore seen as the limiting case of probability. While regarding probability as the expression of partial belief, or degree of belief, Keynes points out that it is an intrinsically relational notion, because it depends on the information available: “The terms certain and probable describe the various degrees of rational belief about a proposition which different amounts of knowledge authorize us to entertain. All propositions are true or false, but the knowledge we have of them depends on our circumstances; and while it is often convenient to speak of propositions as certain or probable, this expresses strictly a relationship in which they stand to a corpus of knowledge, actual or hypothetical, and not a characteristic of the propositions in themselves. A proposition is capable at the same time of varying degrees of this relationship, depending upon the knowledge to which it is related, so that it is without significance to call a proposition probable unless we specify the knowledge to which we are relating it”. [Keynes, 1921, pp. 3-4] Another passage states the same idea even more plainly: “No proposition is in itself either probable or improbable, just as no place can be intrinsically distant; and the probability of the same statement varies with the evidence presented, which is, as it were, its origin of reference” [Keynes, 1921, p. 7]. The corpus of knowledge on which probability assessments are based is described by a set of propositions that constitute the premisses of an argument, standing in a logical relationship with the conclusion, which describes a hypothesis.

14 See Galavotti [2005], Chapter 6, for more on Wittgenstein and Waismann.
Probability resides in this logical relationship, and its value is determined by the information conveyed by the premisses of the arguments involved: “As our knowledge or our hypothesis changes, our conclusions have new probabilities, not in themselves, but relatively to these new premisses. New logical relations have now become important, namely those between the conclusions which we are investigating and our new assumptions; but the old relations between the conclusions and the former assumptions still exist and are just as real as these new ones” [Keynes, 1921, p. 7]. On this basis, Keynes developed a theory of comparative probability, in which conditional probabilities are ordered in terms of a relation of “more” or “less” probable, and are combined into compound probabilities. Like Boole, Keynes aimed to develop a theory of the reasonableness of degrees of belief on logical grounds. Within his perspective the logical character of probability goes hand in hand with its rational character. This element is pointed out by Keynes, who maintains that the theory of probability as a logical relation
“is concerned with the degree of belief which it is rational to entertain in given conditions, and not merely with the actual beliefs of particular individuals, which may or may not be rational” [Keynes, 1921, p. 4]. In other words, Keynes’ logical interpretation gives the theory of probability a normative value: “we assert that we ought on the evidence to prefer such and such a belief. We claim rational grounds for assertions which are not conclusively demonstrated” [Keynes, 1921, p. 5]. The kernel of the logical interpretation of probability lies precisely in the idea, which Keynes states with great clarity, that in the light of the same amount of information the logical relation representing probability is the same for anyone. So conceived, probability is objective, its objectivity being warranted by its logical character: “What particular propositions we select as the premisses of our argument naturally depends on subjective factors peculiar to ourselves; but the relations, in which other propositions stand to these, and which entitle us to probable beliefs, are objective and logical”. [Keynes, 1921, p. 4] It is precisely because the logical relations between the premisses and the conclusion of inconclusive arguments provide objective grounds for belief that belief based on them can qualify as rational. As to the character of the logical relations themselves, Keynes says that “we cannot analyse the probability-relation in terms of simpler ideas” [Keynes, 1921, p. 8]. They are therefore taken as primitive, and their justification is left to our intuition. Keynes’ conception of the objectivity of probability relations and his use of intuition in that connection have been ascribed by a number of authors to Moore’s influence. Commenting on Keynes’ claim that “what is probable or improbable” in the light of a given amount of information is “fixed objectively, and is independent of our opinion” [Keynes, 1921, p.
4], Donald Gillies observes that when “Keynes speaks of probabilities as being fixed objectively [...] he means objective in the Platonic sense, referring to something in a supposed Platonic world of abstract ideas”, and adds that “we can see here clearly the influence of G. E. Moore. [...] In fact, there is a very notable similarity between the Platonic world as postulated by Cambridge philosophers in the Edwardian era and the Platonic world as originally described by Plato. Plato’s world of objective ideas contained the ethical qualities with the idea of the Good holding the principal place, but it also contained mathematical objects. The Cambridge philosophers thought that they had reduced mathematics to logic. So their Platonic world contained, as well as ethical qualities such as ‘good’, logical relations”. [Gillies, 2000, p. 33] The attitude just described is responsible for a most controversial feature of Keynes’ theory, namely his tenet that probability relations are not always measurable, nor comparable. He writes:
“By saying that not all probabilities are measurable, I mean that it is not possible to say of every pair of conclusions, about which we have some knowledge, that the degree of our rational belief in one bears any numerical relation to the degree of our rational belief in the other; and by saying that not all probabilities are comparable in respect of more and less, I mean that it is not always possible to say that the degree of our rational belief in one conclusion is either equal to, greater than, or less than the degree of our belief in another”. [Keynes, 1921, p. 34] In other words, Keynes allows for probability relations that are intractable by the calculus of probabilities. Far from worrying him, this aspect testifies to the high value Keynes attached to intuition. On the same basis, Keynes is suspicious of a purely formal treatment of probability, and of the adoption of mechanical rules for the evaluation of probabilities. Keynes believes that the numerical measurement of probability rests on the equidistribution of priors: “In order that numerical measurement may be possible, we must be given a number of equally probable alternatives” [Keynes, 1921, p. 41]. This admission notwithstanding, Keynes sharply criticizes Laplace’s principle of insufficient reason, which he prefers to call the “Principle of Indifference” to stress the role of individual judgment in the ascription of equal probability to all possible alternatives “if there is an absence of positive ground for assigning unequal ones” [Keynes, 1921, p. 42]. To Laplace he objects that “the rule that there must be no ground for preferring one alternative to another, involves, if it is to be a guiding rule at all, and not a petitio principii, an appeal to judgments of irrelevance” [Keynes, 1921, pp. 54-55]. The judgment of indifference among various alternatives has to be supported by the assumption that no further information could arise on account of which one might revise that judgment.
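The kind of paradox Keynes catalogues in Chapter 4 of the Treatise can be made concrete with the classic wine/water example; the interval and the event below are illustrative assumptions chosen for the sketch, not figures from Keynes’ text.

```python
from fractions import Fraction

# The classic "wine/water" paradox of the Principle of Indifference.
# Suppose all we know is that the ratio r of wine to water lies in [1/3, 3],
# and we ask for P(r <= 2).

lo, hi = Fraction(1, 3), Fraction(3)

# Indifference applied to r itself: treat r as uniform on [1/3, 3].
p_direct = (Fraction(2) - lo) / (hi - lo)       # = 5/8

# Indifference applied to the inverse ratio s = 1/r, which also ranges
# over [1/3, 3]; the event r <= 2 is exactly the event s >= 1/2.
p_inverse = (hi - Fraction(1, 2)) / (hi - lo)   # = 15/16

# The same question receives two incompatible answers, depending only on
# how the possibilities were parametrized before "indifference" was applied.
print(p_direct, p_inverse)
```

The discrepancy shows why, for Keynes, a bare “absence of positive ground” cannot by itself license equiprobability: a judgment of irrelevance about the parametrization is doing the real work.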
While in the case of games of chance this kind of assumption can be made without problems, most situations encountered in everyday life are characterized by a complexity that makes it arbitrary. For Keynes, the extension of the principle of insufficient reason to cover all possible applications is the expression of a superficial way of addressing probability, regarding it as a product of ignorance rather than knowledge. By contrast, Keynes maintains that the judgment of indifference among available alternatives should not be grounded on ignorance, but rather on knowledge, and recommends that the application of the principle in question always be preceded by an act of discrimination between relevant and irrelevant elements of the available information, and by the decision to neglect certain pieces of evidence. A most interesting aspect of Keynes’ treatment is his discussion of the paradoxes raised by Laplace’s principle. As observed by Gillies: “It is greatly to Keynes’ credit that, although he advocates the Principle of Indifference, he gives the best statement in the literature [Keynes, 1921, Chapter 4] of the paradoxes to which it gives rise” [Gillies, 2000, p. 37]. The reader is referred to Chapter 3 of Gillies’ Philosophical Theories of Probability for a critical account of Keynes’ treatment of the matter. Keynes’ distrust of the practice of unrestrictedly applying principles holding
within a restricted domain concerns not only the Principle of Indifference, but extends to the “Principle of Induction”, taken as the method of establishing empirical knowledge from a multitude of observed cases. Keynes is suspicious of the inference of general principles on an inductive basis, including causal laws and the principle of uniformity of nature. He distinguishes two kinds of generalizations “arising out of empirical argument”. First, there are universal generalizations corresponding to “universal induction”, of which he says that “although such inductions are themselves susceptible of any degree of probability, they affirm invariable relations”. Second, there are generalizations which assert probable connections, corresponding to “inductive correlation” [Keynes, 1921, p. 220]. Both types are discussed at length, the first in Part III of the Treatise and the second in Part V. Keynes stresses the importance of the connection between probability and induction, a relationship that was clearly seen by Thomas Bayes and Richard Price in the eighteenth century, but was underrated by the subsequent literature. After mentioning Jevons, and also “Laplace and his followers”, as representatives of the tendency to use probability to address inductive problems, Keynes adds: “But it has been seldom apprehended clearly, either by these writers or by others, that the validity of every induction, strictly interpreted, depends, not on a matter of fact, but on the existence of a relation of probability. An inductive argument affirms, not that a certain matter of fact is so, but that relative to certain evidence there is a probability in its favour”. [Keynes, 1921, p. 221] In other words, probability is not reducible to an empirical matter: “The validity and reasonable nature of inductive generalisation is [...] a question of logic and not of experience, of formal and not of material laws.
The actual constitution of the phenomenal universe determines the character of our evidence; but it cannot determine what conclusions given evidence rationally supports”. [Keynes, 1921, p. 221] Granted that induction has to be based on probability, the objectivity and rationality of probabilistic reasoning rest on the logical character of probability, taken as the relation between a proposition expressing a given body of evidence and a proposition expressing a given hypothesis. On the same basis, Keynes criticizes inferential methods entirely grounded on repeated observations, like the calculation of frequencies. Against this attitude, he claims that the similarities and dissimilarities among events must be carefully considered before quantitative methods can be applied. In this connection, a crucial role is played by analogy, which becomes a prerequisite of statistical inductive methods based on frequencies. In Keynes’ words: “To argue from the mere fact that a given event has occurred invariably in a thousand instances under observation, without any analysis of the circumstances accompanying the individual instances, that it is likely
to occur invariably in future instances, is a feeble inductive argument, because it takes no account of the Analogy”. [Keynes, 1921, p. 407] The insistence upon analogy is a central feature of the perspective taken by Keynes, who devotes Part III of the Treatise to “Induction and analogy”. In an attempt to provide a logical foundation for analogy, Keynes finds it necessary to assume that the variety encountered in the world has to be of a limited kind: “As a logical foundation for Analogy, [...] we seem to need some such assumption as that the amount of variety in the universe is limited in such a way that there is no one object so complex that its qualities fall into an infinite number of independent groups (i.e., groups which might exist independently as well as in conjunction); or rather that none of the objects about which we generalise are as complex as this; or at least that, though some objects may be infinitely complex, we sometimes have a finite probability that an object about which we seek to generalise is not infinitely complex”. [Keynes, 1921, p. 258] This assumption confers a finitistic character on Keynes’ approach, criticized, among others, by Rudolf Carnap.15 The principle of limited variety is attacked on a different basis by Ramsey. The topic is addressed in two notes included in the collection Notes on Philosophy, Probability and Mathematics, namely “On the Hypothesis of Limited Variety” and “Induction: Keynes and Wittgenstein”, where Ramsey claims to see “no logical reason for believing any such hypotheses; they are not the sort of things of which we could be supposed to have a priori knowledge, for they are complicated generalizations about the world which evidently may not be true” [Ramsey, 1991a, p. 297]. Another important ingredient of Keynes’ theory is the notion of weight of arguments. Like probability, the weight of inductive arguments varies according to the amount of evidence.
But while probability is affected by the proportion between favourable and unfavourable evidence, weight increases as relevant evidence, taken as the sum of positive and negative observations, increases. In Keynes’ words: “As the relevant evidence at our disposal increases, the magnitude of the probability of the argument may either decrease or increase, according as the new knowledge strengthens the unfavourable or the favourable evidence; but something seems to have increased in either case — we have a more substantial basis upon which to rest our conclusion. I express this by saying that an accession of new evidence increases the weight of an argument. New evidence will sometimes decrease the probability of an argument, but it will always increase its ‘weight’”. [Keynes, 1921, p. 71] The concept of weight mingles with that of relevance, because to say that a piece of evidence is relevant is the same as saying that it increases the weight of an argument. Therefore, Keynes’ stress on weight underscores the importance of the notion of relevance within his theory of probability. Keynes addresses the issue of whether the weight of arguments should be made to bear upon the choice of action. As he puts it: “the question comes to this — if two probabilities are equal in degree, ought we, in choosing our course of action, to prefer that one which is based on a greater body of knowledge?” [Keynes, 1921, p. 313]. This issue, he claims, has been neglected by the literature on action choice, essentially based on the notion of mathematical expectation. However, Keynes admits to finding the question “highly perplexing”, adding that “it is difficult to say much that is useful about it” [Keynes, 1921, p. 313]. The discussion of these topics leads to a sceptical conclusion, reflecting Keynes’ distrust of a strictly mathematical treatment of the matter, motivated by the desire to leave room for individual judgment and intuition. He maintains that: “The hope, which sustained many investigators in the course of the nineteenth century, of gradually bringing the moral sciences under the sway of mathematical reasoning, steadily recedes — if we mean, as they meant, by mathematics the introduction of precise numerical methods. The old assumptions, that all quantity is numerical and that all quantitative characteristics are additive, can be no longer sustained. Mathematical reasoning now appears as an aid in its symbolic rather than in its numerical character”. [Keynes, 1921, p. 316] Keynes’ notion of weight is the object of a vast literature. Some authors think that such a notion is at odds with the logicist notion of probability put forward by Keynes. For instance, in “Keynes’ Theory of Probability and its Relevance to his Economics” Allin Cottrell argues that “the perplexities surrounding ‘weight’... are important as the symptom of an internal difficulty in the notion of probability Keynes wishes to promote” [Cottrell, 1993, p. 35].

15 See Carnap [1950], § 62.
More precisely, Cottrell believes that the idea that some probability judgments are more reliable than others by virtue of being grounded on a larger weight requires that probabilities of probabilities be admitted, while Keynes does not contemplate them. Cottrell thinks that the frequency notion of probability could do the job. As a matter of fact, this clutch of problems has been extensively dealt with by a number of authors operating under the label of “Bayesianism”, mostly of subjective orientation. Keynes’ views on the objectivity of probability relations involve the tenet that the validity of inductive arguments cannot be made to depend on their success, and is not undermined by the fact that some events which have been predicted do not actually take place. Induction allows us to say that on the basis of a certain piece of evidence a certain conclusion is reasonable, not that it is true. Awareness of this fact should inspire caution towards inductive predictions, and Keynes warns against the danger of making predictions obtained by detaching the conclusion of an inductive argument. This is a typical feature of the logical interpretation of probability, one that has been at the centre of a vast debate in which Rudolf Carnap
also took part.16 The refusal of any attempt to ground probabilistic inference on success, which goes along with Keynes’ insistence on the logical and non-empirical character of probability relations, is stressed by Anna Carabelli, who writes that “Keynes was [...] critical of the positivist a posteriori criterion of the validity of induction, by which the inductive generalization was valid as far as the prevision based on it will prove successful, that is, will be confirmed by subsequent facts. [...] On the contrary, the validity of inductive method, according to Keynes, did not depend on the success of its prediction, or on its empirical confirmation”. [Carabelli, 1988, p. 66] However, Carabelli adds that “notably, that was what made the difference between Keynes’ position and that of those later logico-empiricists, like R. Carnap, who analysed induction from what he called the ‘confirmation theory’ point of view” [Carabelli, 1988, p. 66]. This claim is misleading, because Carnap’s confirmation theory is not so closely linked to the criterion of success as Carabelli claims. In his late writings Carnap appealed to “inductive intuition” to justify induction, thereby embracing a position not so distant from that of Keynes.17 But it should be added that Keynes assigns to intuition a much more substantial role than Carnap does. A further difference between these two authors lies in their different attitudes towards formalization. Keynes, as we have seen, distrusted the pervasive use of mathematics and formal methods, whereas Carnap embraced a strictly formal approach. Carnap’s work on probability, which culminated with the publication in 1950 of Logical Foundations of Probability and occupied the last twenty years of his life, until his death in 1970, is no exception.
Although his perspective underwent significant changes, Carnap never abandoned the programme of developing an inductive logic aimed at providing a rational reconstruction of probability within a formalized logical system. In Richard Jeffrey’s colourful phrase, Carnap “died with his logical boots on, at work on the project” [Jeffrey, 1991, p. 259]. In this enterprise, Carnap was inspired by an unwavering faith in the powers of formal logic on the one side, and of experience on the other, in compliance with the logical empiricist creed. By contrast, Keynes embraced a moderate version of logicism, a logicism “with a human face”, imbued with a deeply felt need not to lose sight of ordinary speech and practice, and to assign an essential role to intuition and individual judgment. To conclude this presentation of Keynes’ views on probability, it is worth mentioning the long debated issue of Ramsey’s criticism and Keynes’ reaction to it. Soon after the publication of the Treatise, Ramsey published a critical review in The Cambridge Magazine challenging some of the central tenets of the Treatise, like the conviction that there are unknown probabilities, the principle of limited variety, and the very idea that probability is a logical relation.18

16 See, for instance, Kyburg [1968] and the discussion following it, with comments by Y. Bar-Hillel, P. Suppes, K. R. Popper, W. C. Salmon, J. Hintikka, R. Carnap, and H. Kyburg Jr.
17 See Carnap [1968].
18 See Ramsey [1922].

As will be argued in more detail in what follows, Ramsey is very critical of this last point, which also reappears in other writings. For instance, in “Truth and Probability” he objects to Keynes that “there really do not seem to be any such things as the probability relations he describes” [Ramsey, 1990a, p. 57], and in another note called “Criticism of Keynes” he maintains that “there are no such things as these relations” [Ramsey, 1991a, p. 273]. After Ramsey’s premature death in 1930, Keynes wrote an obituary containing an explicit concession to Ramsey’s criticism. There he writes: “Ramsey argues, as against the view which I had put forward, that probability is concerned not with objective relations between propositions but (in some sense) with degrees of belief, and he succeeds in showing that the calculus of probabilities simply amounts to a set of rules for ensuring that the system of degrees of belief which we hold shall be a consistent system. Thus the calculus of probabilities belongs to formal logic. But the basis of our degrees of belief — or the a priori probabilities, as they used to be called — is part of our human outfit, perhaps given us merely by natural selection, analogous to our perceptions and our memories rather than to formal logic. So far I yield to Ramsey — I think he is right”. [Keynes, 1930, 1972, p. 339] Before adding some comments, it is worth recalling how the above quoted passage continues: “But in attempting to distinguish ‘rational’ degrees of belief from belief in general he [Ramsey] was not yet, I think, quite successful. It is not getting to the bottom of the principle of induction merely to say that it is a useful mental habit”. [Keynes, 1930, 1972, p. 339] As one can see, some ten years after the publication of the Treatise, Keynes was still concerned with drawing a sharp boundary between rational belief and actual belief. Undeniably, such an attitude sides him with logicism, as opposed to subjectivism. The literature is divided between those who believe that after Ramsey’s criticism Keynes changed his attitude towards probability, and those who are instead convinced that Keynes never changed his mind in a substantial way. Among others, Anna Carabelli believes that “Keynes did not change substantially his view on probability” [Carabelli, 1988, p. 255]. By contrast, Bradley Bateman in “Keynes’ Changing Conception of Probability” holds that the views on probability presented in the Treatise “underwent at least two significant changes in subsequent years. Keynes first advocated an objective epistemic theory of probability, but later advocated both subjective epistemic and objective aleatory theories of probability” [Bateman, 1987, p. 113]. A still different viewpoint is taken by Donald Gillies,
who agrees with Bateman that Keynes changed his conception of probability as a consequence of Ramsey’s criticism, but disagrees as to the nature of this change. Gillies argues that “Keynes did realize, in the light of Ramsey’s criticism, that his earlier views on probability needed to be changed, and he may well have had some rough ideas about how this should be done, but he never settled down to work out a new interpretation of probability in detail. What we have to do therefore is not so much try to reconstruct, on the basis of rather fragmentary evidence, Keynes’ exact views on probability in the 1930s. I don’t believe that Keynes had very exact views on probability at that time. I suggest therefore that we should switch to trying to develop an interpretation of probability that fits the economic theory that Keynes presented in 1936 and 1937, but without necessarily claiming that his theory was precisely what Keynes himself had in mind”. [Gillies, 2006, p. 210] The Keynes works to which Gillies refers are the well-known book The General Theory of Employment, Interest and Money published in 1936, and the article “The General Theory of Employment” published in 1937. According to Gillies, Keynes accepted Ramsey’s criticisms to some extent, and moved to a theory of probability that Gillies labels “intersubjective” and describes as intermediate between logicism and subjectivism. Its distinctive feature is that it ascribes degrees of belief not to single individuals, as subjectivists do, but rather to groups. Gillies presents the theory as an extension of the subjective viewpoint, by demonstrating a Dutch Book Theorem holding for groups, which shows the following: “Let B be some social group. Then it is in the interest of B as a whole if its members agree, perhaps as a result of rational discussion, on a common betting quotient rather than each member of the group choosing his or her own betting quotient.
If a group does in fact agree on a common betting quotient, this will be called the intersubjective or consensus probability of the social group”. [Gillies, 2006, p. 212]19 Gillies argues that this interpretation “fits perfectly with Keynes’ theory of long-term expectation developed in his 1936 and 1937 publications” [Gillies, 2006, p. 212]. The issue of Keynes’ reaction to Ramsey’s criticisms, and the relationship between his conception of probability and his views on economic theory, is the object of ongoing debate.
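The force of the group Dutch Book argument can be sketched with a toy calculation; the betting quotients below are hypothetical numbers chosen for illustration, not figures from Gillies’ own proof.

```python
from fractions import Fraction

# Toy Dutch Book against a group whose members disagree (assumed numbers).
# Two members of group B post different betting quotients on the same event E.
q_a = Fraction(3, 5)   # member A's betting quotient for E
q_b = Fraction(2, 5)   # member B's betting quotient for E
stake = Fraction(1)

def bookie_profit(e_occurs: bool) -> Fraction:
    # The bookie sells A a bet paying `stake` if E occurs, at A's price
    # q_a * stake, and buys the very same bet from B at B's lower price
    # q_b * stake. The price difference is banked up front.
    upfront = (q_a - q_b) * stake
    # If E occurs the bookie pays A but collects from B, so payouts cancel;
    # if E fails, nothing is paid either way.
    settlement = (-stake + stake) if e_occurs else Fraction(0)
    return upfront + settlement

# The bookie nets (q_a - q_b) * stake = 1/5 whatever happens.
print(bookie_profit(True), bookie_profit(False))
```

Whenever two members’ quotients differ, the spread can be extracted from the group risk-free, which is why, on this account, agreeing on a common (intersubjective) betting quotient is in the group’s collective interest.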
1.5 Harold Jeffreys between logicism and subjectivism

Professor of astronomy and experimental philosophy at Cambridge University, Harold Jeffreys (1891–1989) is reputedly one of the last century’s most prominent geophysicists and a pioneer of the study of the Earth. As described by Alan Cook in a memoir of Jeffreys written for the Royal Society, “the major spherically symmetrical elements of the structure of the Earth that he [Jeffreys] did so much to elucidate, are the basis for all subsequent elaboration, and generations of students learnt their geophysics from his book The Earth” [Cook, 1990, p. 303]. Jeffreys’ work also left a mark in other fields, like seismology and meteorology, and, last but not least, probability.20 His interest in probability and scientific method led to the publication of the book Scientific Inference in 1931, followed in 1939 by Theory of Probability. In addition, he published a number of articles on the topic. Jeffreys was a wholehearted inductivist who used to say that Bayes’ theorem “is to the theory of probability what Pythagoras’ theorem is to geometry” [Jeffreys, 1931, p. 7]. He was led to embrace Bayesianism by his own work in geophysics, where he only had access to scarce data, and needed a method for assessing hypotheses regarding unknown situations, like the composition of the Earth. As a practising scientist he was faced with problems of inverse probability, having to explain experimental data by means of different hypotheses, or to evaluate general hypotheses in the light of changing data. This made it natural for Jeffreys to adopt both an epistemic notion of probability and Bayesian methodology, although at the time he started working on this kind of problem the Bayesian method was in disrepute among scientists and statisticians, who were for the most part supporters of frequentism. But as David Howie observed, “restricted to repeated sampling from a well-behaved population, and largely reserved for data reduction”, frequentism “could apply neither to the diverse pool of data Jeffreys drew upon nor directly to the sorts of questions he was attempting to address” [Howie, 2002, p. 127].

19 See also Gillies [2000], Chapter 8, for more on the intersubjective theory of probability.
Jeffreys’ refusal to embrace frequentism is responsible for the fact that his contribution to probability was not fully appreciated by his contemporaries, and he engaged a debate with various authors, including the physicist Norman Campbell, and the statistician Ronald Fisher.21 Against frequentism, Jeffreys holds that “no ‘objective’ definition of probability in terms of actual or possible observations, or possible properties of the world, is admissible” [Jeffreys, 1939, 1961, p. 11].22 As will be argued in what follows, this relationship is actually reversed within Jeffreys’ epistemology, where probability comes before the notions of objectivity, reality and the external world [Jeffreys, 1936a, p. 325]. Jeffreys started working on probability together with Dorothy Wrinch, a mathematician and scientist, at the time fellow of Girton College Cambridge, who had approached epistemological questions under the influence of Johnson and Russell. In three papers written between 1919 and 192323 Jeffreys and Wrinch draw the lines of an inductivist programme that Jeffreys kept firm throughout his long life, and put at the core of a genuinely probabilistic epistemology revolving around the 20 For a scientific portrait of Harold Jeffreys centred on probability and statistics see Lindley [1991]. 21 See Howie [2002] for a detailed reconstruction of the genesis of Jeffreys’ Bayesianism and the polemics he entertained with Fisher. 22 In connection with Jeffreys’ criticism of frequentism see also Jeffreys [1933] and [1934]. 23 See Jeffreys and Wrinch [1919], [1921] and [1923].
174
Maria Carla Galavotti
idea that probability is “the most fundamental and general guiding principle of the whole of science” [Jeffreys, 1931, p. 7]. Jeffreys and Wrinch made the assumption that all quantitative laws form an enumerable set, and that their probabilities form a convergent series. This assumption allows for the assignment of significant prior probabilities to general hypotheses. In addition, Jeffreys and Wrinch formulated a simplicity postulate, according to which simpler laws are assigned a greater prior probability.24 According to its proponents, this principle corresponds to the practice of testing possible laws in order of decreasing simplicity. This machinery allows for the adoption of the Bayesian method. Jeffreys’ inductivism is grounded in an epistemic view of probability that shares the main features of logicism, but in certain respects comes closer to subjectivism. According to Jeffreys probability “expresses a relation between a proposition and a set of data” [Jeffreys, 1931, p. 9]. Probability is deemed “a purely epistemological notion” [Jeffreys, 1955, p. 283], corresponding to the reasonable degree of belief that is warranted by a certain body of evidence, by which it is uniquely determined. Given a set of data, Jeffreys claims, “a proposition q has in relation to these data one and only one probability. If any person assigns a different probability, he is simply wrong” [Jeffreys, 1931, p. 10]. The conviction that there exist “unique reasonable degrees of belief” [Jeffreys, 1939, p. 36] puts him in line with logicism, while marking a crucial divergence from subjectivism, a divergence described by Bruno de Finetti as that between “necessarists”, who affirm, and subjectivists, who deny, “that there are logical grounds for picking out one single evaluation of probability as being objectively special and ‘correct’” [de Finetti, 1970, English edition 1975, vol. 2, p. 40]. For Jeffreys, the need to define probability objectively is imposed by science itself.
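The convergent-series assumption and the simplicity postulate can be illustrated with a toy calculation; the 2⁻ᵏ weighting and the likelihood numbers below are my own illustration, not Jeffreys and Wrinch’s:

```python
def simplicity_priors(n_laws):
    """Give the k-th law (k = 1, ..., n_laws) raw prior 2**-k, so that
    simpler laws (lower k) get greater prior probability and the full
    series converges; renormalize the finite truncation to sum to 1."""
    raw = [2.0 ** -k for k in range(1, n_laws + 1)]
    total = sum(raw)
    return [p / total for p in raw]

def bayes_update(priors, likelihoods):
    """Ordinary Bayesian updating: posterior proportional to prior times likelihood."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    norm = sum(joint)
    return [j / norm for j in joint]

priors = simplicity_priors(4)
# Hypothetical likelihoods: the data fit the third (less simple) law best.
posteriors = bayes_update(priors, [0.1, 0.1, 0.9, 0.5])
```

With significant prior mass on every law in the enumeration, data favouring a less simple law can overcome its lower prior, which is what makes the Bayesian assessment of general hypotheses possible.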
He aimed to define probability in a “pure” way suited for scientific applications. This led Jeffreys to criticize the subjective interpretation of probability put forward by Frank Ramsey, with whom he consorted and shared various interests but apparently never discussed probability.25 To Jeffreys’ eyes subjectivism is a theory of expectation rather than one of “pure probability” [Jeffreys, 1936a, p. 326]. For a scientist like Jeffreys subjective probability is a theory “for business men”. This is not meant as an expression of contempt, for “we have habitually to decide on the best course of action in given circumstances, in other words to compare the expectations of the benefits that may arise from different actions; hence a theory of expectation is possibly more needed than one of pure probability” [Jeffreys, 1939, 1961, p. 326]. But what science requires is a notion of “pure probability”, not the subjective notion in terms of preferences based on expectations.
24 For a discussion of the simplicity postulate see Howson [1988].
25 According to Howie and Lindley, Jeffreys found out about Ramsey’s work on probability only after Ramsey’s death in 1930; see Howie [2002], p. 117 and Lindley [1991], p. 13. However, Howie provides evidence that both Ramsey and Jeffreys took part in a group discussing psychoanalysis, whose activity is described in Cameron and Forrester [2000]. Strangely enough, during those meetings they did not discuss probability.
The Modern Epistemic Interpretations of Probability: Logicism and Subjectivism
In order to define probability in a “pure” way, Jeffreys grounds it on a principle, stated by way of an axiom, which says that probabilities are comparable: “given p, q is either more, equally, or less probable than r, and no two of these alternatives can be true” [Jeffreys, 1939, 1961, p. 16]. He then shows that the fundamental properties of probability functions follow from this assumption. By so doing, Jeffreys qualifies as one of the first to establish the rules of probability from basic presuppositions. Although admitting an affinity with Keynes’ perspective, Jeffreys is careful to keep his own position separate from that of Keynes. In the “Preface” to the second edition of his Theory of Probability, Jeffreys complains about having been labelled a “follower” of Keynes, and draws attention to the fact that Keynes’ Treatise on Probability appeared after he and Dorothy Wrinch had published their first contributions to the theory of probability, outlining an epistemic approach akin to Keynes’ logicism. He also points out that the resemblance between his own theory and that of Keynes depends on the fact that both attended the lectures of William Ernest Johnson [Jeffreys, 1939, 1961, p. v], thereby providing further evidence of Johnson’s influence on his contemporaries. A major disagreement with Keynes concerns Keynes’ refusal “to admit that all probabilities are expressible by numbers” [Jeffreys, 1931, p. 223].26 On this point, Jeffreys’ viewpoint coincides with subjectivism. A most interesting aspect of Jeffreys’ thought is his development of an original epistemology, which is deeply probabilistic in character.27 This is rooted in a phenomenalistic view of knowledge of the kind upheld by Ernst Mach and Karl Pearson. However, for Jeffreys “the pure phenomenalistic attitude is not adequate for scientific needs. It requires development, and in some cases modification, before it can deal with the problems of inference” [Jeffreys, 1931, p. 225].
The crucial innovation to be made with respect to Mach’s phenomenalism amounts to the introduction of probability, or, to be more precise, probabilistic inference. Jeffreys’ epistemology is constructivist, in the sense that such crucial ingredients of scientific knowledge as the notions of “empirical law”, “objectivity”, “reality”, and “causality” are established by inference from experience. This is made possible by statistical methodology, seen as the fundamental tool of science. Concerning objectivity, in the “Addenda” to the 1937 edition of Scientific Inference, Jeffreys writes that “the introduction of the word ‘objective’ at the outset seems [...] a fundamental confusion. The whole problem of scientific method is to find out what is objective” [Jeffreys, 1931, 1973, p. 255]. The same idea is expressed in Theory of Probability, where he states: “I should query whether any meaning can be attached to ‘objective’ without a previous analysis of the process of finding out what is objective” [Jeffreys, 1939, p. 336]. Such a process is inductive and probabilistic: it originates in our sensations and proceeds step by step to the construction of abstract notions lying beyond phenomena. Such notions
26 Additional points of disagreement between Jeffreys and Keynes are described in Jeffreys [1922].
27 This is described in more detail in Galavotti [2003].
cannot be described in terms of observables, but are nonetheless admissible and useful, because they permit “co-ordination of a large number of sensations that cannot be achieved so compactly in any other way” [Jeffreys, 1931, 1973, p. 190]. In this way empirical laws, or “objective statements”, are established. To this end, an inductive passage is needed, for it is only after the rules of induction “have compared it with experience and attached a high probability to it as a result of that comparison” that a general proposition can become a law. In this procedure lies “the only scientifically useful meaning of ‘objectivity’” [Jeffreys, 1939, p. 336]. Similar considerations apply to the notion of reality. According to Jeffreys, a useful notion of reality obtains when some scientific hypotheses receive from the data a probability so high that on their basis one can draw inferences whose probabilities are practically the same as if the hypotheses in question were certain. Hypotheses of this kind are taken as certain in the sense that all their parameters “acquire a permanent status”. In such cases, we can assert the associations expressed by the hypotheses in question “as an approximate rule”. Jeffreys holds a similarly empirical and constructivist view of causality. His proposal is to replace the general formulation of the “principle of causality” with “causal analysis”, as performed within statistical methodology. This starts by regarding all the variations observed in a given phenomenon as random, and proceeds to detect correlations which allow for predictions and descriptions that are the more precise, the better their agreement with observations. This procedure leads to asserting laws, which are eventually accepted because “the agreement (with observations) is too good to be accidental” [Jeffreys, 1937, p. 62].
Within scientific practice, the principle of causality is “inverted”: “instead of saying that every event has a cause, we recognize that observations vary and regard scientific method as a procedure for analysing the variation” [Jeffreys, 1931, 1957, p. 78]. The deterministic version of the principle of causality is thereby discarded, for “it expresses a wish for exactness, which is always frustrated, and nothing more” [Jeffreys, 1937, pp. 63-64]. Jeffreys’ position regarding scientific laws, reality and causality reveals the same pragmatic attitude underpinning Ramsey’s views on general propositions and causality, the main difference being that Ramsey’s approach is more strictly analytic, whereas Jeffreys grounds his arguments on probabilistic inference and statistical methodology alone. Furthermore, Jeffreys and Ramsey share the conviction that within an epistemic interpretation of probability there is room for notions like chance and physical probability. Jeffreys regards the notion of chance as the “limiting case” of everyday probability assignments. Chance occurs in those situations in which “given certain parameters, the probability of an event is the same at every trial, no matter what may have happened at previous trials” [Jeffreys, 1931, 1957, p. 46]. For instance, chance “will apply to the throw of a coin or a die that we previously know to be unbiased, but not if we are throwing it with the object of determining the degree of bias. It will apply to measurements when we know the true value and the law of error already. [...] It is not numerically assessable except when we know so much about the system already that we need to know no more”
[Jeffreys, 1936a, p. 329]. Jeffreys also contemplates the possibility of extending the realm of epistemic probability to a robust notion of “physical probability” of the kind encountered in quantum mechanics. He calls attention to those fields where “some scientific laws may contain an element of probability that is intrinsic to the system and has nothing to do with our knowledge of it” [Jeffreys, 1955, p. 284]. This is the case with quantum mechanics, whose account of phenomena is irreducibly probabilistic. Unlike the probability (chance) that a fair coin falls heads, intrinsic probabilities do not belong to our description of phenomena, but to the theory itself. Jeffreys claims to be “inclined to think that there may be such a thing as intrinsic probability. [...] Whether there is or not”, he adds, “it can be discussed in the language of epistemological probability” [Jeffreys, 1955, p. 284]. We will find similar ideas expressed by Ramsey. The pragmatic attitude that characterizes Jeffreys’ epistemology brings him close to subjectivism, and so does his conviction that science is fallible, together with his admission that empirical information can be “vague and half-forgotten”, a fact that “has possibly led to more trouble than has received explicit mention” [Jeffreys, 1931, 1973, p. 406]. These features of his perspective are somewhat at odds with his definition of probability as a degree of rational belief uniquely determined by experience, and with the idea that the evaluation of probability is an objective procedure, whose application to experimental evidence obeys rules having the status of logical principles.
2 THE SUBJECTIVE INTERPRETATION OF PROBABILITY

Modern subjectivism, sometimes also called “personalism”, shares with logicism the conviction that probability is an epistemic notion. As already pointed out, the crucial point of disagreement between the two interpretations is that, unlike logicists, subjectivists do not believe that probability evaluations are univocally determined by a given body of evidence.
2.1 The starters
William Fishburn Donkin (1814–1869), professor of astronomy at Oxford, fostered a subjective interpretation of probability in “On Certain Questions Relating to the Theory of Probabilities”, published in 1851. There he writes that “the ‘probability’ which is estimated numerically means merely ‘quantity of belief’, and is nothing inherent in the hypothesis to which it refers” [Donkin, 1851, p. 355]. This claim impressed Frank Ramsey, who recorded it in his notes.28 Donkin’s position is actually quite similar to that of De Morgan, especially when he maintains that probability is “relative to a particular state of knowledge or ignorance; but [...] it is absolute in the sense of not being relative to any individual mind; since, the
28 See document 003-13-01 of the Ramsey Collection, held at the Hillman Library, University of Pittsburgh.
same information being presupposed, all minds ought to distribute their belief in the same way” [Donkin, 1851, p. 355]. If in view of claims of this kind Donkin qualifies more as a logicist than as a subjectivist, the appearance of his name in the present section on subjectivism is justified by the fact that he addressed the issue of belief conditioning in a way that anticipated the work of Richard Jeffrey a century later. Donkin formulated a principle imposing a symmetry restriction on updating belief as new information is obtained. In a nutshell, the principle states that changing opinion on the probabilities assigned to a set of hypotheses, after new information has been acquired, has to preserve the proportionality among the probabilities assigned to the considered options. Under this condition, the new and old opinions are comparable. The principle is introduced by Donkin as follows: “Theorem. If there be any number of mutually exclusive hypotheses, h1, h2, h3, ..., of which the probabilities relative to a particular state of information are p1, p2, p3, ..., and if new information be gained which changes the probabilities of some of them, suppose of hm+1 and all that follow, without having otherwise any reference to the rest, then the probabilities of these latter have the same ratios to one another, after the new information, that they had before; that is, p′1 : p′2 : p′3 : ... : p′m = p1 : p2 : p3 : ... : pm, where the accented letters denote the values after the new information has been acquired”. [Donkin, 1851, p. 356] The method of conditioning known as Jeffrey conditionalization reflects precisely the intuition behind Donkin’s principle.29

The French mathematician Émile Borel (1871–1956), who made outstanding contributions to the study of the mathematical properties of probability, can be considered a pioneer of the subjective interpretation.
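Donkin’s proportionality principle, and the Jeffrey conditionalization that reflects it, can be sketched numerically; the function and the numbers below are my own illustration:

```python
def donkin_update(probs, changed):
    """Update probabilities of mutually exclusive hypotheses.

    probs   -- current probabilities p1, ..., pn (summing to 1)
    changed -- {index: new probability} for the hypotheses whose
               probabilities the new information fixes directly
    The untouched hypotheses share the remaining probability mass in
    their old proportions, as Donkin's principle requires.
    """
    fixed_mass = sum(changed.values())
    old_rest = sum(p for i, p in enumerate(probs) if i not in changed)
    scale = (1.0 - fixed_mass) / old_rest
    return [changed.get(i, p * scale) for i, p in enumerate(probs)]

old = [0.4, 0.3, 0.2, 0.1]          # made-up prior over h1, ..., h4
new = donkin_update(old, {3: 0.4})  # new information fixes P(h4) = 0.4
# h1 : h2 : h3 keep the ratio 4 : 3 : 2 while sharing the remaining 0.6.
```

The untouched hypotheses are rescaled by a common factor, so every ratio among them survives the update, which is exactly the symmetry restriction Donkin imposes.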
In a review of Keynes’ Treatise, originally published in 1924 and later reprinted in the last volume of the series of monographs edited by Borel under the title Traité du calcul des probabilités et ses applications (1939),30 Borel raises various objections to Keynes, who is blamed for overlooking the applications of probability to science in order to focus only on the probability of judgments. Borel takes this to be a distinctive feature of the English literature as opposed to the continental literature, which he regards as more aware of the developments of science, particularly physics. When making such claims, Borel is likely to have in mind above all Henri Poincaré, whose ideas exercised a certain influence on him.31
29 See Jeffrey [1965], [1992a] and [2004].
30 The Traité includes 18 issues, collected in 4 volumes. The review of Keynes’ Treatise appears in the last issue, under the title “Valeur pratique et philosophie des probabilités”.
31 See von Plato [1994, p. 36], where Borel is described as a successor of Poincaré “in an intellectual sense”. The book by von Plato contains a detailed exposition of Borel’s ideas on probability. See also [Knobloch, 1987].
While agreeing with Keynes in taking probability in its epistemic sense, Borel claims that probability acquires a different meaning depending on the context in which it occurs. Probability has a different value in situations characterized by a different state of information, and is endowed with a “more objective” meaning in science, where its assessment is grounded on a strong body of information shared by the scientific community. Borel is definitely a subjectivist when he admits that two people, given the same information, can come up with different probability evaluations. This is most common in everyday applications of probability, like horse races or weather forecasts. In all such cases, probability judgments are of necessity relative to “a certain body of knowledge”, which is not the kind of information shared by everyone, like scientific theories at a certain time. Remarkably, Borel maintains that when talking of this kind of probability the “body of knowledge” in question should be thought of as “necessarily included in a determinate human mind, but not such that the same abstract knowledge constitutes the same body of knowledge in two distinct human minds” [Borel, 1924, English edition 1964, p. 51]. Probability evaluations made at different times, based on different information, ought not to be taken as refinements of previous judgments, but as totally new ones. Borel disagrees with Keynes’ claim that there are probabilities which cannot be evaluated numerically. In connection with the evaluation of probability Borel appeals to the method of betting, which “permits us in the majority of cases a numerical evaluation of probabilities” [Borel, 1924, English edition 1964, p. 57]. This method, which dates back to the origin of the numerical notion of probability in the seventeenth century, is regarded by Borel as having “exactly the same characteristics as the evaluation of prices by the method of exchange.
If one desires to know the price of a ton of coal, it suffices to offer successively greater and greater sums to the person who possesses the coal; at a certain sum he will decide to sell it. Inversely if the possessor of the coal offers his coal, he will find it sold if he lowers his demands sufficiently”. [Borel, 1924, English edition 1964, p. 57] At the end of a discussion of the method of bets, where he takes into account some of the traditional objections against it, Borel concludes that this method seems good enough in the light of ordinary experience. Borel’s conception of epistemic probability has a strong affinity with the subjective interpretation developed by Ramsey and de Finetti. In a brief note on Borel’s work, de Finetti praises Borel for holding that probability must be referred to the single case, and that this kind of probability is always measurable sufficiently well by means of the betting method. At the same time, de Finetti strongly disagrees with the eclectic attitude taken by Borel, more particularly with his admission of an objective meaning of probability in addition to the subjective.32
32 De Finetti’s commentary on Borel is to be found in de Finetti [1939].
2.2 Ramsey and the principle of coherence

Frank Plumpton Ramsey (1903–1930), Fellow of King’s College and lecturer in mathematics at Cambridge, made outstanding contributions to a number of different fields, including mathematics, logic, philosophy, probability, and economics.33 In his obituary, Keynes refers to Ramsey as “one of the brightest minds of our generation” and praises him for the “amazing, easy efficiency of the intellectual machine which ground away behind his wide temples and broad, smiling face” [Keynes, 1930, 1972, p. 336]. A regular attender at the meetings of the Moral Sciences Club and the Apostles, Ramsey actively interacted with his contemporaries, including Keynes, Moore, Russell and Wittgenstein — whose Tractatus he translated into English — often influencing their ideas. Ramsey is considered the starter of modern subjectivism with his paper “Truth and Probability”, read at the Moral Sciences Club in 1926, and published in 1931 in the collection The Foundations of Mathematics and Other Logical Essays, edited by Richard Bevan Braithwaite shortly after Ramsey’s death. Other sources are to be found in the same book, as well as in the other collection edited by Hugh Mellor, Philosophical Papers (largely overlapping Braithwaite’s), and in addition in the volumes Notes on Philosophy, Probability and Mathematics, edited by Maria Carla Galavotti, and On Truth, edited by Nicholas Rescher and Ulrich Majer. Ramsey regards probability as a degree of belief, and probability theory as a logic of partial belief. Degree of belief is taken as a primitive notion having “no precise meaning unless we specify more exactly how it is to be measured” [Ramsey, 1990a, p. 63]; in other words, degree of belief requires an operational definition that specifies how it can be measured.
A “classical” way of measuring degree of belief is the method of bets, endowed with a long-standing tradition dating back to the birth of probability in the seventeenth century with the work of Blaise Pascal, Pierre Fermat and Christiaan Huygens. In Ramsey’s words: “the old established way of measuring a person’s belief is to propose a bet, and see what are the lowest odds which he will accept” [Ramsey, 1990a, p. 68]. Such a method, however, suffers from well known problems, like the diminishing marginal utility of money, and is to a certain extent arbitrary, due to personal “eagerness or reluctance to bet”, and the fact that “the proposal of a bet may inevitably alter” a person’s “state of opinion” [Ramsey, 1990a, p. 68]. To avoid such difficulties, Ramsey adopted an alternative method based on the notion of preference, grounded in a “general psychological theory” asserting that “we act in the way we think most likely to realize the objects of our desires, so that a person’s actions are completely determined by his desires and opinions” [Ramsey, 1990a, p. 69]. Attention is called to the fact that
33 On Ramsey’s life, see [Taylor, 2006] and the last chapter of [Sahlin, 1990]. See also “Better than the Stars”, a radio portrait of Frank Ramsey written and presented by Hugh Mellor, with Alfred J. Ayer, Richard B. Braithwaite, Richard C. Jeffrey, Michael Ramsey (Archbishop of Canterbury and Frank’s brother), Lettice Ramsey (Frank’s widow), and Ivor A. Richards, originally recorded in 1978, and later published in Mellor, ed. [1995]. More is to be found in the Ramsey Archive of King’s College, Cambridge.
“this theory is not to be identified with the psychology of the Utilitarians, in which pleasure had a dominant position. The theory I propose to adopt is that we seek things which we want, which may be our own or other people’s pleasure, or anything else whatever, and our actions are such as we think most likely to realize these goods.” [Ramsey, 1990a, p. 69] After clarifying that “good” and “bad” are not to be taken in an ethical sense, “but simply as denoting that to which a given person feels desire and aversion” [Ramsey, 1990a, p. 70], Ramsey introduces the notion of quantity of belief, by assuming that goods are measurable as well as additive, and that an agent “will always choose the course of action which will lead in his opinion to the greatest sum of good” [Ramsey, 1990a, p. 70]. The fact that people hardly ever entertain a belief with certainty, and usually act under uncertainty, is accounted for by appealing to the principle of mathematical expectation, which Ramsey introduces “as a law of psychology”. Given a person who is prepared to act in order to achieve some good, “if p is a proposition about which he is doubtful, any goods or bads for whose realization p is in his view a necessary and sufficient condition enter into his calculation multiplied by the same fraction, which is called the ‘degree of his belief in p’. We thus define degree of belief in a way which presupposes the use of mathematical expectation”. [Ramsey, 1990a, p. 70] An alternative definition of degree of belief is also suggested along the following lines: “Suppose [the] degree of belief [of a certain person] in p is m/n; then his action is such as he would choose it to be if he had to repeat it exactly n times, in m of which p was true, and in the others false” [Ramsey, 1990a, p. 70]. The two accounts capture two different, albeit strictly intertwined, aspects of the same concept, and are taken to be equivalent.
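That the two accounts agree, given the stated assumption that goods are additive, can be checked on a toy case; the goods and the numbers m, n below are my own illustration:

```python
good_if_true, good_if_false = 9.0, 3.0   # made-up goods attached to p's truth value
m, n = 3, 4                              # p turns out true in m of n repetitions
belief = m / n                           # degree of belief in p

# Account 1: mathematical expectation -- goods conditional on p enter
# the calculation multiplied by the degree of belief in p.
expected_good = belief * good_if_true + (1 - belief) * good_if_false

# Account 2: total good over n exact repetitions of the act, with p true
# in m of them and false in the others, averaged per act.
average_good = (m * good_if_true + (n - m) * good_if_false) / n

# With additive goods the two accounts coincide (both 7.5 here).
```

The equality holds for any additive goods, which is why Ramsey can treat the expectation multiplier and the m-out-of-n repetition story as two faces of the same definition.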
Ramsey exemplifies a typical situation involving a choice of action that depends on belief as follows: “I am at a cross-roads and do not know the way; but I rather think one of the two ways is right. I propose therefore to go that way but keep my eyes open for someone to ask; if now I see someone half a mile away over the fields, whether I turn aside to ask him will depend on the relative inconvenience of going out of my way to cross the fields or of continuing on the wrong road if it is the wrong road. But it will also depend on how confident I am that I am right; and clearly the more confident I am of this the less distance I should be willing to go from the road to check my opinion. I propose therefore to use the distance I would be prepared to go to ask, as a measure of the confidence of my opinion”. [Ramsey, 1990a, pp. 70-71]
Denoting by f(x) the disadvantage of walking x metres, r the advantage of reaching the right destination, and w the disadvantage of arriving at a wrong destination, if I were ready to go a distance d to ask, the degree of belief that I am on the right road is p = 1 − (f(d)/(r − w)). Choosing an action of this kind can be considered advantageous if, were I to act n times in the same way, np times out of these n I was on the right road (otherwise I was on the wrong one). In fact, the total good of not asking each time is npr + n(1 − p)w = nw + np(r − w); while the total good of asking each time (in which case I would never go wrong) is nr − nf(x). The total good of asking is greater than the total good of not asking provided that f(x) < (r − w)(1 − p). Ramsey concludes that the distance d is connected with my degree of belief, p, by the relation f(d) = (r − w)(1 − p), which amounts to p = 1 − (f(d)/(r − w)), as stated above. He then observes that “It is easy to see that this way of measuring beliefs gives results agreeing with ordinary ideas. [...] Further, it allows validity to betting as means of measuring beliefs. By proposing to bet on p we give the subject a possible course of action from which so much extra good will result to him if p is true and so much extra bad if p is false”. [Ramsey, 1990a, p. 72] However, given the already mentioned difficulties connected with the betting scheme, Ramsey turns to a more general notion of preference. Degree of belief is then operationally defined in terms of personal preferences, determined on the basis of the expectation of an individual of obtaining certain goods, not necessarily of a monetary kind. The value of such goods is intrinsically relative, because they are defined with reference to a set of alternatives. The definition of degree of belief rests on a set of axioms, which provide a way of representing its values by means of real numbers. Degrees of belief obeying such axioms are called consistent.
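The cross-roads arithmetic above can be checked with concrete numbers (mine, chosen for illustration; the linear form of f is an assumption):

```python
def f(x):
    """Disadvantage of walking x metres (assumed linear for illustration)."""
    return 0.004 * x

r, w = 10.0, 2.0   # made-up advantage of the right / wrong destination
d = 500.0          # distance I would just be willing to go to ask

# Ramsey's relation: the implied degree of belief that I am on the right road.
p = 1 - f(d) / (r - w)   # 0.75 with these numbers

n = 100
# Total good over n repetitions of not asking: right np times, wrong the rest.
good_not_asking = n * p * r + n * (1 - p) * w   # = n*w + n*p*(r - w)

def good_asking(x):
    """Total good over n repetitions of walking x metres to ask
    (in which case I never end up on the wrong road)."""
    return n * r - n * f(x)

# Asking beats not asking exactly when f(x) < (r - w)(1 - p);
# at x = d the two policies are exactly balanced, as Ramsey's relation says.
```

Running the comparison at distances below and above d confirms the threshold: shorter detours are worth making, longer ones are not, and d itself is the indifference point.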
The laws of probability are then spelled out in terms of degrees of belief, and it is argued that consistent sets of degrees of belief satisfy the laws of probability. Additivity is assumed in a finite sense, since the set of alternatives taken into account is finite. In this connection Ramsey observes that the human mind is only capable of contemplating a finite number of alternatives open to action, and even when a question is conceived, allowing for an infinite number of answers, these have to be lumped “into a finite number of groups” [Ramsey, 1990a, p. 79]. The crucial feature of Ramsey’s theory of probability is the link between probability and degree of belief established by consistency, or coherence — to use the term that is commonly adopted today. Consistency guarantees the applicability of the notion of degree of belief, which can therefore qualify as an admissible interpretation of probability. In Ramsey’s words, the laws of probability can be shown to be “necessarily true of any consistent set of degrees of belief. Any definite set of degrees of belief which broke them would be inconsistent in the sense that it violated the laws of preference between options. [...] If anyone’s mental condition violated these laws, his choice would depend
on the precise form in which the options were offered him, which would be absurd. He could have a book made against him by a cunning better and would then stand to lose in any event. We find, therefore, that a precise account of the nature of partial belief reveals that the laws of probability are laws of consistency. [...] Having any definite degree of belief implies a certain measure of consistency, namely willingness to bet on a given proposition at the same odds for any stake, the stakes being measured in terms of ultimate values. Having degrees of belief obeying the laws of probability implies a further measure of consistency, namely such a consistency between the odds acceptable on different propositions as shall prevent a book being made against you”. [Ramsey, 1990a, p. 78] By arguing that the laws of probability can be derived from the assumption of coherence, Ramsey paved the way to a fully-fledged subjectivism. Remarkably, within this perspective the laws of probability “do not depend for their meaning on any degree of belief in a proposition being uniquely determined as the rational one; they merely distinguish those sets of beliefs which obey them as consistent ones” [Ramsey, 1990a, p. 78]. This claim brings us to the core of subjectivism, for which coherence is the only condition that degrees of belief should obey, or, to put it slightly differently, insofar as a set of degrees of belief is coherent there is no further demand of rationality to be met. Having adopted a notion of probability in terms of coherent degrees of belief, Ramsey does not need to rely on the principle of indifference. In his words: “the Principle of Indifference can now be altogether dispensed with” [Ramsey, 1990a, p. 85]. This is a decisive step in the moulding of modern subjectivism.
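The “book made against him” can be made concrete with a small sketch (the betting quotients and stake are my own numbers): an agent whose degrees of belief in a proposition and in its negation sum to more than 1 accepts, at the corresponding betting quotients, a pair of bets that loses whatever happens.

```python
def bet_payoff(belief, stake, event_occurs):
    """Net payoff of buying a bet on an event at betting quotient `belief`:
    pay belief*stake up front, receive stake if the event occurs."""
    return (1 - belief) * stake if event_occurs else -belief * stake

# Incoherent degrees of belief: P(A) + P(not-A) = 1.2 > 1 (numbers mine).
belief_A, belief_not_A = 0.7, 0.5
stake = 10.0

# The agent finds both bets acceptable at these quotients; total the
# net payoff in each of the two possible worlds.
payoffs = {}
for a_is_true in (True, False):
    payoffs[a_is_true] = (bet_payoff(belief_A, stake, a_is_true)
                          + bet_payoff(belief_not_A, stake, not a_is_true))
# payoffs[True] == payoffs[False] == -2.0: a sure loss, i.e. a Dutch book.
```

Had the two degrees of belief summed to exactly 1, as the additivity law requires, the agent would break even in both worlds; the guaranteed loss is exactly the excess over 1 times the stake.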
As we will see in the next Section, a further step was made by Bruno de Finetti, who supplied the “static” definition of subjective probability in terms of coherent degrees of belief with a “dynamic” dimension, obtained by joining subjective probability with exchangeability within the framework of the Bayesian method.34 Although this crucial step was actually made by de Finetti, there is evidence that Ramsey knew the property of exchangeability, of which he must have heard in Johnson’s lectures. Evidence for this claim is found in his note “Rule of Succession”, where use is made of the notion of exchangeability, named “equiprobability of all permutations”.35 What apparently Ramsey did not see, and was instead grasped by de Finetti, is the usefulness of applying exchangeability to the inductive procedure, modelled upon Bayes’ rule. Remarkably, in another note called “Weight or the Value of Knowledge”,36 Ramsey was able to prove that collecting evidence pays in expectation, provided that acquiring the new information is free, and he shows how much the increase in weight is. This shows he had a dynamic view at least of this
34 This terminology is borrowed from Zabell [1991], containing useful remarks on Ramsey’s contribution to subjectivism. For a comparison between Ramsey and de Finetti on subjective probability, see [Galavotti, 1991].
35 See Ramsey [1991a], pp. 279-281. For a detailed commentary see Di Maio [1994].
36 See Ramsey [1990b]; also included in [1991a, pp. 285-287].
Maria Carla Galavotti
important process. As pointed out by Nils-Eric Sahlin and Brian Skyrms, Ramsey's note on weight anticipates subsequent work by Savage, Good, and others.37 Ramsey put forward his theory of probability in open contrast with Keynes. In particular, Ramsey did not share Keynes' claim that "a probability may [...] be unknown to us through lack of skill in arguing from given evidence" [Ramsey, 1922, 1989, p. 220]. For a subjectivist, the notion of unknown probability does not make much sense, as repeatedly emphasized also by de Finetti. Moreover, Ramsey criticized the logical relations on which Keynes' theory rests. In "Criticism of Keynes" he writes: "There are no such things as these relations. a) Do we really perceive them? Least of all in the simplest cases when they should be clearest; can we really know them so little and yet be so certain of the laws which they testify? [...] c) They would stand in such a strange correspondence with degrees of belief" [Ramsey, 1991a, pp. 273-274]. Like Keynes, Ramsey believed that probability is the object of logic, but they disagreed on the nature of that logic. Ramsey distinguished between a "lesser logic, which is the logic of consistency, or formal logic", and a "larger logic, which is the logic of discovery, or inductive logic" [Ramsey, 1990a, p. 82]. The "lesser" logic, which is the logic of tautologies in Wittgenstein's sense, can be "interpreted as an objective science consisting of objectively necessary propositions". By contrast, the "larger" logic, which includes probability, does not share this feature, because "when we extend formal logic to include partial beliefs this direct objective interpretation is lost" [Ramsey, 1990a, p. 83], and can only be endowed with a psychological foundation.38 Ramsey's move towards psychologism was inspired by Wittgenstein.
This is manifest in a paper read to the Apostles in 1922, called "Induction: Keynes and Wittgenstein", where Wittgenstein's psychologism is contrasted with Keynes' logicism. At the beginning of that paper, Ramsey mentions propositions 6.363 and 6.3631 of the Tractatus, where it is maintained that the process of induction "has no logical foundation but only a psychological one" [Ramsey, 1991a, p. 296]. After praising Wittgenstein for his appeal to psychology in order to justify the inductive procedure, Ramsey discusses Keynes' approach at length, expressing serious doubts about his attempt at grounding induction on logical relations and hypotheses. At the end of the paper, after recalling Hume's celebrated argument, Ramsey puts forward, by way of a conjecture of which he claims to be too tired "to see clearly if it is sensible or absurd", the idea that induction could be justified by saying that "a type of inference is reasonable or unreasonable according to the relative frequencies with which it leads to truth and falsehood. Induction is reasonable because it produces predictions which are generally verified, not because of any logical relation between its premisses and conclusions. On this view we should establish by induction that induction was reasonable, and induction being reasonable this would be a

37 See Nils-Eric Sahlin's "Preamble" to Ramsey [1990b], and Skyrms [1990] and [2006]. See in addition Savage [1954] and Good [1967].
38 For some remarks on Ramsey's psychological theory of belief see Suppes [2006].
The Modern Epistemic Interpretations of Probability: Logicism and Subjectivism
reasonable argument". [Ramsey, 1991a, p. 301] This passage suggests that Ramsey had in mind a pragmatic justification of the inductive procedure. A similar attitude reappears at the end of "Truth and Probability", where he describes his own position as "a kind of pragmatism", holding that "we judge mental habits by whether they work, i.e. whether the opinions they lead to are for the most part true, or more often true than those which alternative habits would lead to. Induction is such a useful habit, and so to adopt it is reasonable. All that philosophy can do is to analyse it, determine the degree of its utility, and find on what characteristics of nature it depends. An indispensable means for investigating these problems is induction itself, without which we should be helpless. In this circle lies nothing vicious. It is only through memory that we can determine the degree of accuracy of memory; for if we make experiments to determine this effect, they will be useless unless we remember them". [Ramsey, 1990a, pp. 93-94] As testified by a number of Ramsey's references to William James and Charles Sanders Peirce, pragmatism is a major feature of his philosophy in general, and his views on probability are no exception. A puzzling aspect of Ramsey's theory of probability is the relation between degree of belief and frequency. In "Truth and Probability" he writes that "it is natural [...] that we should expect some intimate connection between these two interpretations, some explanation of the possibility of applying the same mathematical calculus to two such different sets of phenomena" [Ramsey, 1990a, p. 83]. Such a connection is identified with the fact that "the very idea of partial belief involves reference to a hypothetical or ideal frequency [...] belief of degree m/n is the sort of belief which leads to the action which would be best if repeated n times in m of which the proposition is true" [Ramsey, 1990a, p. 84].
This passage — echoing the previously mentioned conjecture from “Induction: Keynes and Wittgenstein” — reaffirms Ramsey’s pragmatical tendency to associate belief with action, and to justify inductive behaviour with reference to successful conduct. The argument is pushed even further when Ramsey says that “It is this connection between partial belief and frequency which enables us to use the calculus of frequencies as a calculus of consistent partial belief. And in a sense we may say that the two interpretations are the objective and subjective aspects of the same inner meaning, just as formal logic can be interpreted objectively as a body of tautology and subjectively as the laws of consistent thought”. [Ramsey, 1990a, p. 84] However, in other passages the connection between these two “aspects” is not quite so strict:
"experienced frequencies often lead to corresponding partial beliefs, and partial beliefs lead to the expectation of corresponding frequencies in accordance with Bernoulli's Theorem. But neither of these is exactly the connection we want; a partial belief cannot in general be connected uniquely with any actual frequency". [Ramsey, 1990a, p. 84] Evidence that Ramsey was intrigued by the relation between frequency and degree of belief is found in some remarks contained in the note "Miscellaneous Notes on Probability", written in 1928. There, four kinds of connection are pointed out, namely: "(1) if degree of belief = γ, most prob((able)) frequency is γ (if instances independent). This is Bernoulli's theorem; (2) if freq((uency)) has been γ we tend to believe with degree γ; (3) if freq((uency)) is γ, degree γ of belief is justified. This is Peirce's definition; (4) degree γ of belief means acting appropriately to a frequency γ" [Ramsey, 1991a, p. 275]. After calling attention to such possible connections, Ramsey reaches the conclusion that "it is this last which makes calculus of frequencies applicable to degrees of belief". Remarkably, the result known as de Finetti's "representation theorem" tells us precisely how to treat relation (4).
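A small numerical sketch may help here. The discrete prior below is an arbitrary stand-in for de Finetti's mixing distribution (treated in Section 2.3 below), chosen purely for illustration: a mixture of Bernoulli probabilities yields sequence probabilities that depend only on the number of successes, not on their order, so the trials are exchangeable without being independent.

```python
from itertools import permutations

# Arbitrary discrete stand-in for the mixing distribution dF(p):
# three possible "objective" success probabilities with prior masses.
p_values = (0.2, 0.5, 0.8)
weights  = (0.3, 0.4, 0.3)          # masses assigned by F; they sum to 1

def seq_probability(outcomes):
    """P(e) = sum over p of p^h (1-p)^(n-h), weighted by the prior masses."""
    h, n = sum(outcomes), len(outcomes)
    return sum(w * p**h * (1 - p)**(n - h) for p, w in zip(p_values, weights))

# Exchangeability: every ordering of two successes in four trials is equiprobable.
probs = {round(seq_probability(perm), 12) for perm in set(permutations((1, 1, 0, 0)))}
assert len(probs) == 1

# Yet the trials are not independent: a success raises the probability of another.
p_one = seq_probability((1,))       # P(success) = prior mean of p
p_two = seq_probability((1, 1))     # P(two successes in a row)
print(p_two / p_one > p_one)        # prints True: observing a success is informative
```

The last two lines show why exchangeability, unlike independence, leaves room for learning from experience: past outcomes shift the weights over the possible values of p.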
One might speculate that Ramsey would have found an answer to at least part of what he was looking for in this result, which de Finetti obtained in those very same years but which was not available to him.39 Claims like the one mentioned above, to the effect that partial belief and frequency "are the two objective and subjective aspects of the same inner meaning", might be taken to suggest that Ramsey admitted two notions of probability: one epistemic (the subjective view) and one empirical (the frequency view).40 This emerges again at the very beginning of "Truth and Probability", where Ramsey claims that although the paper deals with the logic of partial belief, "there is no intention of implying that this is the only or even the most important aspect of the subject", adding that "probability is of fundamental importance not only in logic but also in statistical and physical science, and we cannot be sure beforehand that the most useful interpretation of it in logic will be appropriate in physics also" [Ramsey, 1990a, p. 53]. It can be argued that in spite of these claims Ramsey trusted that the subjective interpretation has the resources to account for all uses of probability. His writings offer plenty of evidence for this thesis. There is no doubt that Ramsey took seriously the problem of what kind of probability is employed in science. We know from Braithwaite's "Introduction" to The Foundations of Mathematics that he had planned to write a final section of "Truth and Probability", dealing with probability in science. We also know from Ramsey's unpublished notes that by the time of his death he was working on a book bearing the title "On Truth and Probability", of which he left a number of tables of contents.41 Of the projected book he only wrote the first part, dealing with the notion of truth, which was published in 1991 under the title On Truth. It can be conjectured that he meant to include in the second part of the book the

39 On this point, see Galavotti [1991] and [1995].
40 For instance, this opinion is upheld in Good [1965, p. 8].
41 See the "Ramsey Collection" held by the Hillman Library of the University of Pittsburgh.
content of the paper "Truth and Probability", plus some additional material on probability in science. The notes published in The Foundations of Mathematics under the heading "Further Considerations",42 and a few more published in the volume Notes on Philosophy, Probability and Mathematics, contain evidence that in the years 1928-29 Ramsey was actively thinking about such problems as theories, laws, causality and chance, all of which he regarded as intertwined. A careful analysis of such writings shows that — contrary to the widespread opinion that he was a dualist with regard to probability — in the last years of his life Ramsey was developing a view of chance and probability in physics fully compatible with his subjective interpretation of probability as degree of belief. Ramsey's view of chance revolves around the idea that this notion requires some reference to scientific theories. Chance cannot be defined simply in terms of laws (empirical regularities) or frequencies — though the specification of chances involves reference to laws, in a way that will soon be clarified. In "Reasonable Degree of Belief" Ramsey writes: "We sometimes really assume a theory of the world with laws and chances and mean not the proportion of actual cases but what is chance on our theory" [Ramsey, 1990a, p. 97]. The same point is emphasized in the note "Chance", also written in 1928, where the frequency-based views of chance put forward by authors like Norman Campbell are criticized. The point is interesting, because it highlights Ramsey's attitude to frequentism, which he deemed inadequate rather than a viable interpretation of probability. As Ramsey puts it: "There is, for instance, no empirically established fact of the form 'In n consecutive throws the number of heads lies between n/2 ± ε(n)'. On the contrary we have good reason to believe that any such law would be broken if we took enough instances of it.
Nor is there any fact established empirically about infinite series of throws; this formulation is only adopted to avoid contradiction by experience; and what no experience can contradict, none can confirm, let alone establish". [Ramsey, 1990a, p. 104] To Campbell's frequentist view, Ramsey opposed a notion of chance ultimately based on degrees of belief. He defines it as follows: "Chances are degrees of belief within a certain system of beliefs and degrees of belief; not those of any actual person, but in a simplified system to which those of actual people, especially the speaker, in part approximate. [...] This system of beliefs consists, firstly, of natural laws, which are in it believed for certain, although, of course, people are not really quite certain of them". [Ramsey, 1990a, p. 104] In addition, the system will contain statements of the form: "when knowing ψx and nothing else relevant, always expect φx with degree of belief p (what is or

42 In Ramsey [1931, pp. 199-211]. These are the notes called "Reasonable Degree of Belief", "Statistics" and "Chance", all reprinted in [1990a, pp. 97-109].
is not relevant is also specified in the system)" [Ramsey, 1990a, p. 104]. Such statements together with the laws "form a deductive system according to the rules of probability, and the actual beliefs of a user of the system should approximate to those deduced from a combination of the system and the particular knowledge of fact possessed by the user, this last being (inexactly) taken as certain" [Ramsey, 1990a, p. 105]. To put it differently, chance is defined with reference to systems of beliefs that typically contain accepted laws. Ramsey stresses that chances "must not be confounded with frequencies", for the frequencies actually observed do not necessarily coincide with them. Unlike frequencies, chances can be said to be "objective" in two ways. First, to say that a system includes a chance value referred to a phenomenon means that the system itself cannot be modified so as to include a pair of deterministic laws, ruling the occurrence and non-occurrence of the same phenomenon. As explicitly admitted by Ramsey, this characterization of objective chance is reminiscent of Poincaré's treatment of the matter, and typically applies "when small causes produce large effects" [Ramsey, 1990a, p. 106]. Second, chances can be said to be objective "in that everyone agrees about them, as opposed e.g. to odds on horses" [Ramsey, 1990a, p. 106]. On the basis of this general definition of chance, Ramsey qualifies probability in physics as chance referred to a more complex system, namely a system making reference to scientific theories. In other words, probabilities occurring in physics are derived from physical theories. They can be taken as ultimate chances, meaning that within the theoretical framework in which they occur there is no way of replacing them with deterministic laws. The objective character of chances derives from the objectivity peculiarly ascribed to theories that are universally accepted.
Ramsey's view of chance and probability in physics is obviously intertwined with his conception of theories, truth and knowledge in general. Within Ramsey's philosophy the "truth" of theories is accounted for in pragmatical terms. In this connection Ramsey holds the view, whose paternity is usually attributed to Charles Sanders Peirce but which is also found in Campbell's work, that theories which gain "universal assent" in the long run are accepted by the scientific community and taken as true. Along similar lines he characterized a "true scientific system" with reference to a system to which the opinion of everyone, grounded on experimental evidence, will eventually converge. According to this pragmatically oriented view, chance attributions, like all general propositions belonging to theories — including causal laws — are not to be taken as propositions, but rather as "variable hypotheticals", or "rules for judging", apt to provide a tool with which the user meets the future.43 To sum up, for Ramsey chances are theoretical constructs, but they do not express realistic properties of "physical objects", whatever meaning be attached to this expression. Chance attributions indicate a way in which beliefs in various facts belonging to science are guided by scientific theories. Ramsey's idea that

43 See especially "General Propositions and Causality" (1929) in Ramsey [1931] and [1990a].
within the framework of subjective probability one can make sense of an “objective” notion of physical probability has passed almost unnoticed. It is, instead, an important contribution to the subjective interpretation and its possible applications to science.
2.3 de Finetti and exchangeability
With the Italian Bruno de Finetti (1906-1985) the subjective interpretation of probability came to completion. Working in the same years as Ramsey, but independently, de Finetti forged a similar view of probability as degree of belief, subject to the only constraint of coherence. To such a definition he added the notion of exchangeability, which can be regarded as the decisive step towards the construction of modern subjectivism. In fact exchangeability, combined with Bayes' rule, gives rise to the inferential methodology which is at the root of so-called neo-Bayesianism. This result was the object of the paper "Funzione caratteristica di un fenomeno aleatorio" that de Finetti read at the International Congress of Mathematicians, held in Bologna in 1928. In 1935, at Maurice Fréchet's invitation, de Finetti gave a series of lectures at the Institut Henri Poincaré in Paris, whose text was published in 1937 under the title "La prévision: ses lois logiques, ses sources subjectives". This article, which is one of de Finetti's best known, allowed the dissemination of his ideas in the French-speaking community of probabilists. However, de Finetti's work came to be known to the English-speaking community only in the 1950s, thanks to Leonard Jimmie Savage, with whom he maintained a fruitful collaboration. In addition to making a contribution to probability theory and statistics which is universally recognized as seminal, de Finetti put forward an original philosophy of probability, which can be described as a blend of pragmatism, operationalism and what we would today call "anti-realism".44 Richard Jeffrey labelled de Finetti's philosophical position "radical probabilism"45 to stress the fact that for de Finetti probability imbues the whole edifice of human knowledge, and that scientific knowledge is a product of human activity ruled by (subjective) probability, rather than truth or objectivity.
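The inferential methodology just mentioned, exchangeability combined with Bayes' rule, can be sketched in a few lines. One particular choice of prior, a uniform distribution over the unknown success probability (an assumption made here purely for illustration; the general framework fixes no prior), yields the classical Bayes-Laplace rule of succession, and the resulting degrees of belief track observed frequencies:

```python
from fractions import Fraction

# Sketch under an illustrative assumption: a uniform prior over the success
# probability. With exchangeable trials and Bayes' rule, the probability of
# success at the next trial, after h successes in n trials, is (h + 1) / (n + 2).
def rule_of_succession(h, n):
    return Fraction(h + 1, n + 2)

# Updated degrees of belief converge toward the observed frequency.
for n in (10, 100, 1000):
    h = int(0.7 * n)                       # suppose 70% of the trials succeeded
    print(n, float(rule_of_succession(h, n)))
```

With no evidence at all the rule gives 1/2, and as evidence accumulates the predictive probability approaches the observed relative frequency, which is the convergence between belief and frequency discussed below.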
De Finetti outlined his philosophy of probability in the article "Probabilismo" (1931), which he regarded as his philosophical manifesto. Yet another philosophical text, bearing the title L'invenzione della verità and originally written by de Finetti in 1934 to take part in a competition for a grant from the Royal Academy of Italy, was published in 2006. The two main sources of de Finetti's philosophy are Mach's phenomenalism and pragmatism, namely the version upheld by the so-called Italian pragmatists, including Giovanni Vailati, Antonio Aliotta and Mario Calderoni. The starting point of de Finetti's probabilism is the rejection of the notion of truth, and of the related view that there are "immutable and necessary" laws. In "Probabilismo" he

44 This is outlined in some detail in Galavotti [1989]. For an autobiographical sketch of de Finetti, the reader is referred to de Finetti [1982].
45 See Jeffrey [1992b] and [1992c].
writes: "no science will permit us to say: this fact will come about, it will be thus and so because it follows from a certain law, and that law is an absolute truth. Still less will it lead us to conclude skeptically: the absolute truth does not exist, and so this fact might or might not come about, it may go like this or in a totally different way, I know nothing about it. What we can say is this: I foresee that such a fact will come about, and that it will happen in such and such a way, because past experience and its scientific elaboration by human thought make this forecast seem reasonable to me". [de Finetti, 1931a, English edition 1989, p. 170] Probability makes forecasting possible, and since a forecast is always relative to a subject, being the product of his experience and convictions, the instrument we need is the subjective theory of probability. For de Finetti probabilism is the way out of the antithesis between absolutism and skepticism, and at its core lies the subjective notion of probability. Probability "means degree of belief (as actually held by someone, on the ground of his whole knowledge, experience, information) regarding the truth of a sentence, or event E (a fully specified 'single' event or sentence, whose truth or falsity is, for whatever reason, unknown to the person)" [de Finetti, 1968, p. 45]. Of this notion, de Finetti wants to show not only that it is the only non-contradictory one, but also that it covers all uses of probability in science and everyday life. This programme is accomplished in two steps: first, an operational definition of probability is worked out; second, it is argued that the notion of objective probability is reducible to that of subjective probability. As we have seen discussing Ramsey's theory of probability, the obvious option to define probability in an operational fashion is in terms of betting quotients.
Accordingly, the degree of probability assigned by an individual to a certain event is identified with the betting quotient at which he would be ready to bet a certain sum on its occurrence. The individual in question should be thought of as ready to bet any sum against any gambler whatsoever, and free to choose the betting conditions, like someone holding the bank at a gambling casino. Probability is defined as the fair betting quotient he would attach to his bets. De Finetti adopts this method, with the proviso that in the case of monetary gains only small sums should be considered, to avoid the problem of marginal utility. Like Ramsey, de Finetti takes coherence as the fundamental and only criterion to be obeyed in order to avoid a sure loss, and spells out an argument to the effect that coherence is a sufficient condition for the fairness of a betting system, showing that coherent gambling behaviour satisfies the principles of the probability calculus, which can be derived from the notion of coherence itself. This is known in the literature as the Dutch book argument. It is worth noting that for de Finetti the scheme of bets is just a convenient way of making probability readily understandable; he always held that there are other ways of defining probability. In "Sul significato soggettivo della probabilità"
[de Finetti, 1931b], after giving an operational definition of probability in terms of coherent betting systems, de Finetti introduces a qualitative definition of subjective probability based on the relation "at least as probable as". He then argues that it is not essential to embrace a quantitative notion of probability, and that, while betting quotients are apt devices for measuring and defining probability in an operational fashion, they are by no means an essential component of the notion of probability, which is in itself a primitive notion, expressing "an individual's psychological perception" [de Finetti, 1931b, English edition 1992, p. 302]. The same point is stressed in Teoria delle probabilità, where de Finetti describes the betting scheme as a handy tool leading to "simple and useful insights" [de Finetti, 1970, English edition 1975, vol. 1, p. 180], but introduces another method of measuring probability, making use of scoring rules based on penalties. Remarkably, de Finetti assigns probability an autonomous value independent of the notion of utility, thereby marking a difference between his position and that of Ramsey and other supporters of subjectivism, like Savage. The second step of de Finetti's programme, namely the reduction of objective to subjective probability, relies on what is known as the "representation theorem". The pivotal notion in this context is that of exchangeability, which corresponds to Johnson's "permutation postulate" and Carnap's "symmetry".46 Summarizing de Finetti, events belonging to a sequence are exchangeable if the probability of h successes in n events is the same for whatever permutation of the n events, and for every n and h ≤ n. The representation theorem says that the probability of exchangeable events can be represented as follows. Imagine the events were probabilistically independent, with a common probability of occurrence p. Then the probability of a sequence e, with h occurrences in n, would be $p^h(1-p)^{n-h}$.
But if the events are exchangeable, the sequence has a probability P(e), represented according to de Finetti's representation theorem as a mixture over the $p^h(1-p)^{n-h}$ with varying values of p:

$$P(e) = \int_0^1 p^h (1-p)^{n-h} \, dF(p)$$
where the distribution function F(p) is unique. The above equation involves two kinds of probability, namely the subjective probability P(e) and the "objective" (or "unknown") probability p of the events considered. The latter enters into the mixture with the weights assigned by the function F(p), which represents a probability distribution over the possible values of p. Assuming exchangeability then amounts to assuming that the events considered are equally distributed and independent, given any value of p. In order to understand de Finetti's position, it is useful to start by considering how an objectivist would proceed when assessing the probability of an unknown

46 In his "farewell lecture", delivered at the University of Rome before his retirement, de Finetti says that the term "exchangeability" was suggested to him by Maurice Fréchet in 1939. Before adopting this terminology, de Finetti had made use of the term "equivalence". See de Finetti [1976, p. 283].
event. An objectivist would assume an objective success probability p. But its value would in general remain unknown. One could give weights to the possible values of p, and determine the weighted average. The same applies to the probability of a sequence e, with h successes in n independent repetitions. Note that because of independence it does not matter where the successes appear. De Finetti focuses on the latter point, calling exchangeable those sequences where the places of the successes make no difference in probability. These need not be independent sequences. An objectivist who wanted to explain subjective probability would say that the weighted averages are precisely the subjective probabilities. But de Finetti proceeds in the opposite direction with his representation theorem: starting from the subjective judgment of exchangeability, one can show that there is only one way of giving weights to the possible values of the unknown objective probabilities. On this interpretation, objective probabilities become useless and subjective probability can do the whole job. De Finetti holds that exchangeability represents the correct way of expressing the idea that is usually conveyed by the expression "independent events with constant but unknown probability". If we take an urn of unknown composition, says de Finetti, the above phrase means that, relative to each of all possible compositions of the urn, the events can be seen as independent with constant probability. Then he points out that "what is unknown here is the composition of the urn, not the probability: this latter is always known and depends on the subjective opinion on the composition, an opinion which changes as new draws are made and the observed frequency is taken into account". [de Finetti, 1995, English edition 2008, p. 163] It should not pass unnoticed that for the subjectivist de Finetti probability, being the expression of the feelings of the subjects who evaluate it, is always definite and known. From a philosophical point of view, de Finetti's reduction of objective to subjective probability is to be seen pragmatically; it follows the same pragmatic spirit inspiring the operational definition of subjective probability, and complements it. From a more general viewpoint, the representation theorem gives applicability to subjective probability, by bridging the gap between degrees of belief and observed frequencies. Taken in connection with Bayes' rule, exchangeability provides a model of how to proceed so as to allow for an interplay between the information on frequencies and degrees of belief. By showing that the adoption of Bayes' method, taken in conjunction with exchangeability, leads to a convergence between degrees of belief and frequencies, de Finetti indicates how subjective probability can be applied to statistical inference. According to de Finetti, the representation theorem answers Hume's problem because it justifies "why we are also intuitively inclined to expect that frequency observed in the future will be close to frequency observed in the past" [de Finetti, 1972a, p. 34]. De Finetti's argument is pragmatic and revolves around the task of induction: to guide inductive reasoning and behaviour in a coherent way. Like
Hume, de Finetti thinks that it is impossible to give a logical justification of induction, and answers the problem in a psychologistic fashion. De Finetti's probabilism is deeply Bayesian: in his eyes statistical inference can be entirely performed by exchangeability in combination with Bayes' rule. From this perspective, the shift from prior to posterior, or, as he preferred to say, from initial to final probabilities, becomes the cornerstone of statistical inference. In a paper entitled "Initial Probabilities: a Prerequisite for any Valid Induction" de Finetti takes a "radical approach" by which "all the assumptions of an inference ought to be interpreted as an overall assignment of initial probabilities" [de Finetti, 1969, p. 9]. The shift from initial to final probabilities receives a subjective interpretation, in the sense that it means going from one subjective probability to another, although objective factors, like frequencies, are obviously taken into account, when available. As repeatedly pointed out by de Finetti, updating one's mind in view of new evidence does not mean changing opinion: "If we reason according to Bayes' theorem, we do not change our opinion. We keep the same opinion, yet updated to the new situation. If yesterday I was saying "It is Wednesday", today I would say "It is Thursday". However I have not changed my mind, for the day after Wednesday is indeed Thursday" [de Finetti, 1995, English edition 2008, p. 43]. In other words, the idea of correcting previous opinions is alien to his perspective, and so is the notion of a self-correcting procedure, upheld by other authors, such as Hans Reichenbach. The following passage from the book Filosofia della probabilità, recently published in English under the title Philosophical Lectures on Probability, highlights de Finetti's deeply felt conviction that subjective Bayesianism is the only acceptable way of addressing probabilistic inference, and the whole of statistics.
The passage also gives the flavour of de Finetti’s incisive prose: “The whole of subjectivistic statistics is based on this simple theorem of calculus of probability [Bayes’ theorem]. This provides subjectivistic statistics with a very simple and general foundation. Moreover, by grounding itself on the basic probability axioms, subjectivistic statistics does not depend on those definitions of probability that would restrict its field of application (like, e.g., those based on the idea of equally probable events). Nor, for the characterization of inductive reasoning, is there any need — if we accept this framework — to resort to empirical formulae. Objectivistic statisticians, on the other hand, make copious use of empirical formulae. The necessity to resort to them only derives from their refusal to allow the use of the initial probability. [...] they reject the use of the initial probability because they reject the idea that probability depends on a state of information. However, by doing so, they distort everything: not only as they turn probability into an objective thing [...] but they go so far as to turn it into a theological entity: they pretend that the ‘true’ probability exists, outside ourselves, independently of a person’s own judgement”.
Maria Carla Galavotti
[de Finetti, 1995, English edition 2008, p. 43] For de Finetti objective probability is not only useless, but meaningless, like all metaphysical notions. This attitude is epitomized by the statement "probability does not exist", printed in capital letters in the "Preface" to the English edition of Teoria delle probabilità. A similar statement opens the article "Probabilità" in the Enciclopedia Einaudi: "Is it true that probability 'exists'? What could it be? I would say no, it does not exist" [de Finetti, 1980, p. 1146]. Such aversion to the ascription of an objective meaning to probability is a direct consequence of de Finetti's anti-realism, and is inspired by the desire to keep the notion of probability free from metaphysics. Unfortunately, de Finetti's statement has fostered the feeling that subjectivism is surrounded by a halo of arbitrariness. Against this suspicion, it must be stressed that de Finetti's attack on objective probability did not prevent him from taking seriously the issue of objectivity. In fact he struggled against the "distortion" of "identifying objectivity and objectivism", deemed a "dangerous mirage" [de Finetti, 1962a, p. 344], but did not deny the problem of the objectivity of probability evaluations. To clarify de Finetti's position, it is crucial to keep in mind his distinction between the definition and the evaluation of probability. These are seen by de Finetti as utterly different concepts which should not be conflated. To his eyes, the confusion between the definition and the evaluation of probability mars all the other interpretations of probability, namely frequentism, logicism and the classical approach. Upholders of these viewpoints look for a unique criterion — be it frequency, or symmetry — and use it as grounds for both the definition and the evaluation of probability.
In so doing, they embrace a "rigid" attitude towards probability, which consists "in defining (in whatever way, according to whatever conception) the probability of an event, and in univocally determining a function" [de Finetti, 1933, p. 740]. By contrast, subjectivists take an "elastic" attitude, according to which the choice of one particular function is not committed to a single rule or method: "the subjective theory [...] does not contend that the opinions about probability are uniquely determined and justifiable. Probability does not correspond to a self-proclaimed 'rational' belief, but to the effective personal belief of anyone" [de Finetti, 1951, p. 218]. For subjectivists there are no "correct" probability assignments, and all coherent functions are admissible. The choice of one particular function is regarded as the result of a complex and largely context-dependent procedure. To be sure, the evaluation of probability should take into account all available evidence, including frequencies and symmetries. However, it would be a mistake to put these elements, which are useful ingredients of the evaluation of probability, at the basis of its definition. De Finetti calls attention to the fact that the evaluation of probability involves both objective and subjective elements. In his words: "Every probability evaluation essentially depends on two components: (1) the objective component, consisting of the evidence of known data and facts; and (2) the subjective component, consisting of the opinion concerning unknown facts based on known evidence" [de Finetti, 1974, p. 7]. The subjective component is seen as unavoidable, and for
The Modern Epistemic Interpretations of Probability: Logicism and Subjectivism
de Finetti the explicit recognition of its role is a prerequisite for the appraisal of objective elements. Subjective elements in no way "destroy the objective elements nor put them aside, but bring forth the implications that originate only after the conjunction of both objective and subjective elements at our disposal" [de Finetti, 1973, p. 366]. De Finetti calls attention to the fact that the collection and exploitation of factual evidence, the objective component of probability judgments, involves subjective elements of various kinds, like the judgment as to what elements are relevant to the problem under consideration and should enter into the evaluation of probabilities. In practical situations a number of other factors influence probability evaluations, including the degree of competence of the evaluator, his optimistic or pessimistic attitudes, the influence exercised by the most recent facts, and the like. Equally subjective for de Finetti is the decision on how to let belief be influenced by objective elements. Typically, when evaluating probability one relies on information regarding frequencies. Within de Finetti's perspective, the interaction between degrees of belief and frequencies rests on exchangeability. Assuming exchangeability, whenever a considerable amount of information on frequencies is available this will strongly constrain probability assignments. But information on frequencies is often scant, and in this case the problem of how to obtain good probability evaluations becomes crucial. This problem is addressed by de Finetti in a number of writings, partly the fruit of his cooperation with Savage.47 The approach adopted is based on penalty methods of the kind of the well-known "Brier's rule". Scoring rules like Brier's are devised to oblige those who make probability evaluations to be as accurate as they can and, if they have to compete with others, to be honest. Such rules play a twofold role within de Finetti's approach.
In the first place, they offer a suitable tool for an operational definition of probability, which is in fact adopted by de Finetti in his late works. In addition, these rules offer a method for improving probability evaluations made both by a single person and by several people, because they can be employed as methods for exercising "self-control", as well as "comparative control", over probability evaluations [de Finetti, 1980, p. 1151].48 The use of such methods finds a simple interpretation within de Finetti's subjectivism: "though maintaining the subjectivist idea that no fact can prove or disprove belief" — he writes — "I find no difficulty in admitting that any form of comparison between probability evaluations (of myself, of other people) and actual events may be an element influencing my further judgment, of the same status as any other kind of information" [de Finetti, 1962a, p. 360]. De Finetti's work in this connection is in tune with a widespread attitude, especially among Bayesian statisticians, that has given rise to a vast literature on "well-calibrated" estimation methods. Having clarified that de Finetti's refusal of objective probability is not tantamount to a denial of objectivity, it should be added that this refusal led him to overlook notions like "chance" and "physical probability". Having embraced
47 See Savage [1971], where this cooperation is mentioned.
48 For further details, the reader is referred to Dawid and Galavotti [2009].
the pragmatist conviction that science is just a continuation of everyday life, de Finetti never paid much attention to the use made of probability in science, and held that subjective probability can do the whole job. Only the volume Filosofia della probabilità includes a few remarks that are relevant to the point. There de Finetti admits that probability distributions belonging to scientific theories — he refers specifically to statistical mechanics — can be taken as "more solid grounds for subjective opinions" [de Finetti, 1995, English edition 2008, p. 63]. This allows for the conjecture that late in his life de Finetti must have entertained the idea that probabilities encountered in science derive a peculiar "robustness" from scientific theories.49 Unlike Ramsey, however, de Finetti did not feel the need to include in his theory a notion of probability specifically devised for application in science. With de Finetti's subjectivism, the epistemic conception of probability is committed to a theory that could not be more distant from Laplace's perspective. Unsurprisingly, de Finetti holds that "the belief that the a priori probabilities are distributed uniformly is a well defined opinion and is just as specific as the belief that these probabilities are distributed in any other perfectly specified manner" [de Finetti, 1951, p. 222]. But what is more important is that the weaker assumption of exchangeability allows for a more flexible inferential method than Laplace's method based on independence. Last but not least, unlike Laplace, de Finetti is not a determinist. He believes that in the light of modern science we have to admit that events are not determined with certainty, and therefore determinism is untenable.50 For an empiricist and pragmatist like de Finetti, both determinism and indeterminism are unacceptable when taken as physical, or even metaphysical, hypotheses; they can at best be useful ways of describing certain facts.
In other words, the alternative between determinism and indeterminism “is undecidable and (I should like to say) illusory. These are metaphysical diatribes over ‘things in themselves’; science is concerned with what ‘appears to us’, and it is not strange that, in order to study these phenomena it may in some cases seem more useful to imagine them from this or that standpoint” [de Finetti, 1976, p. 299].
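De Finetti's central technical claim, that statistical inference is simply the shift from initial to final probabilities under exchangeability and Bayes' rule, can be made concrete with a toy sketch. The numbers and function name below are my own illustration, not de Finetti's: with a uniform Beta(1, 1) initial distribution over the unknown frequency of an exchangeable 0/1 sequence, the final predictive probability of a further success after s successes in n trials is (s + 1)/(n + 2), which is exactly Laplace's rule of succession, recovered here as a special case of Bayesian updating.

```python
# Illustrative sketch (my own toy numbers, not de Finetti's): for an
# exchangeable sequence of 0/1 outcomes, the shift from initial to final
# probabilities is plain Bayesian bookkeeping. Under a Beta(a, b) initial
# distribution over the unknown frequency, the final predictive probability
# after `successes` successes in `trials` trials is (a + s) / (a + b + n).

def predictive_probability(successes, trials, a=1.0, b=1.0):
    """Final (posterior predictive) probability of a success under a Beta(a, b) initial distribution."""
    return (a + successes) / (a + b + trials)

print(predictive_probability(0, 0))       # 0.5: the initial probability, no data yet
print(predictive_probability(7, 10))      # pulled toward the observed frequency 7/10
print(predictive_probability(700, 1000))  # with ample data, close to the frequency 0.7
```

The only assumption doing real work here is exchangeability: the order of successes and failures is irrelevant, so the counts alone determine the final probability, and accumulating frequency information comes to constrain the subjective assignment, as the text describes.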
CONCLUDING REMARKS

The epistemic approach is a strong trend in the current debate on probability. Of the two interpretations that have been outlined, namely logicism and subjectivism, subjectivism seems by far the more popular, at least within economics and more generally in the social sciences. This can be attributed to a number of reasons, the most obvious being that in the social sciences and economics personal opinions
49 This is argued in some detail in Galavotti [2001] and [2005].
50 The issue of determinism is addressed in de Finetti [1931c] and in the "Appendix" contained in de Finetti [1970]. Some comments on de Finetti's attitude towards determinism are to be found in Suppes [2009] and Zabell [2009].
and expectations enter directly into the information used to support forecasts, forge hypotheses and build models. The work of Ramsey and de Finetti has exercised a formidable influence on subsequent literature. Under the spell of their ideas, novel research fields have been explored, including the theory of decision and the so-called dynamics of belief developed by authors like Richard Jeffrey, Brian Skyrms and many others.51 Equally impressive is the impact of Ramsey and de Finetti on the literature on exchangeability and Bayesian inference, with the work of L. J. Savage, I. J. Good, Dennis Lindley52 and many others working in their wake. From a philosophical point of view, the pragmatism and pluralism characterizing the subjective approach, especially its insistence on the role of various contextual factors, including the individual judgment of experts in the evaluation of probability, have gained considerable consensus. In the realm of natural sciences the prevailing tendency has always been to regard probability as an empirical notion and to assign it a frequentist interpretation. Exceptions to this tendency are the authors who have sided with logicism. One such exception is Harold Jeffreys, whose perspective was considered in Part I. In a similar vein, under the influence of Boole and Keynes the physicist Richard T. Cox derived the laws of probability from a set of postulates formulated in algebraic terms, introduced as plausibility conditions.53 In addition, Cox investigated the possibility of relating probability to entropy, taken as a measure of information and uncertainty, an idea shared by another physicist, namely Edwin T. Jaynes. A strong supporter of Bayesianism and an admirer of Jeffreys’ work, Jaynes put forward a “principle of maximum entropy” as an “objective” criterion for the choice of priors.54 The problem of suggesting objective criteria for the choice of prior probabilities is a burning topic within recent debate revolving around Bayesianism. 
This has given rise to a specific trend of research, labelled "objective Bayesianism".55 Work in this connection tends to transpose to the framework of Bayesianism the fundamental divergence between logicism and subjectivism, namely the tenet, shared by logicism but not by subjectivism, that a degree of belief should be univocally determined by a given body of evidence. It is on this ground that the influence of logicism on contemporary debates seems more tangible.
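Jaynes's "principle of maximum entropy", mentioned above as an "objective" criterion for the choice of priors, admits a minimal sketch. The example below is a toy illustration of my own, not drawn from Jaynes's writings: among the distributions compatible with one's constraints, choose the one of greatest Shannon entropy; with no constraint beyond normalization, that choice is the uniform distribution.

```python
import math

def entropy(p):
    """Shannon entropy, the quantity the maximum-entropy principle maximizes."""
    return -sum(x * math.log(x) for x in p if x > 0)

# With no constraint beyond summing to 1, the uniform distribution over six
# outcomes has higher entropy than any skewed alternative (one example shown).
uniform = [1 / 6] * 6
skewed = [0.5, 0.1, 0.1, 0.1, 0.1, 0.1]
print(entropy(uniform))                   # log 6, the maximum attainable
print(entropy(uniform) > entropy(skewed)) # True
```

Constrained cases (say, a fixed expected value) require maximizing entropy subject to those constraints, but the unconstrained case already conveys the idea: maximum entropy recovers the uniform prior of the classical approach as a special case, while claiming an objective rationale for it.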
51 In addition to the references reported in footnotes 29 and 37, see Skyrms [1996] and Jeffrey [2004].
52 See Savage [1954], Good [1965] and [1983], and Lindley [1965].
53 See Cox [1946] and [1961].
54 See Jaynes [1983] and [2003].
55 See Williamson [2009].

BIBLIOGRAPHY

[Backhouse and Bateman, 2006] R. E. Backhouse and B. Bateman. A Cunning Purchase: the Life and Work of Maynard Keynes. In [Backhouse and Bateman, 2006, pp. 1-18].
[Backhouse and Bateman, 2006] R. E. Backhouse and B. Bateman, eds. The Cambridge Companion to Keynes. Cambridge: Cambridge University Press, 2006.
[Bateman, 1987] B. Bateman. Keynes' Changing Conception of Probability. Economics and Philosophy III, pp. 97-120, 1987.
[Bolzano, 1837] B. Bolzano. Wissenschaftslehre. Sulzbach: Seidel, 1837. English partial edition Theory of Science, ed. by Jan Berg. Dordrecht: Reidel, 1973.
[Boole, 1851] G. Boole. On the Theory of Probabilities, and in Particular on Mitchell's Problem of the Distribution of Fixed Stars. The Philosophical Magazine, Series 4, I, pp. 521-530, 1851. Reprinted in Boole [1952], pp. 247-259.
[Boole, 1854a] G. Boole. An Investigation of the Laws of Thought, on which are Founded the Mathematical Theories of Logic and Probabilities. London: Walton and Maberly, 1854. Reprinted as George Boole's Collected Works, vol. 2. Chicago-New York: Open Court, 1916. Reprinted New York: Dover, 1951.
[Boole, 1854b] G. Boole. On a General Method in the Theory of Probabilities. The Philosophical Magazine, Series 4, VIII, pp. 431-44, 1854. Reprinted in Boole [1952], pp. 291-307.
[Boole, 1952] G. Boole. Studies in Logic and Probability, ed. by Rush Rhees. London: Watts and Co, 1952.
[Boole, 1997] G. Boole. Selected Manuscripts on Logic and its Philosophy, ed. by Ivor Grattan-Guinness and Gérard Bornet. Berlin: Birkhäuser, 1997.
[Borel, 1924] É. Borel. À propos d'un traité des probabilités. Revue Philosophique XCVIII, pp. 321-36, 1924. Reprinted in Borel [1972], vol. 4, pp. 2169-2184. English edition Apropos of a Treatise on Probability. In Kyburg and Smokler, eds. [1964], pp. 45-60 (not included in the 1980 edition).
[Borel, 1972] É. Borel. Oeuvres de Émile Borel. 4 volumes. Paris: Éditions du CNRS, 1972.
[Braithwaite, 1946] R. B. Braithwaite. John Maynard Keynes, First Baron Keynes of Tilton. Mind LV, pp. 283-284, 1946.
[Braithwaite, 1975] R. B. Braithwaite. Keynes as a Philosopher. In [Keynes, 1975, pp. 237-246].
[Broad, 1922] C. D. Broad. Critical Notices: A Treatise on Probability by J. M. Keynes. Mind XXXI, pp. 72-85, 1922.
[Broad, 1924] C. D. Broad. Mr. Johnson on the Logical Foundations of Science. Mind XXXIII, pp. 242-269 (part 1), pp. 367-384 (part 2), 1924.
[Cameron and Forrester, 2000] L. Cameron and J. Forrester. Tansley's Psychoanalytic Network: An Episode out of the Early History of Psychoanalysis in England. Psychoanalysis and History II, pp. 189-256, 2000.
[Carabelli, 1988] A. Carabelli. On Keynes' Method. London: Macmillan, 1988.
[Carnap, 1950] R. Carnap. Logical Foundations of Probability. Chicago: Chicago University Press, 1950. Second edition with modifications 1962, reprinted 1967.
[Carnap, 1968] R. Carnap. Inductive Logic and Inductive Intuition. In [Lakatos, 1968, pp. 258-267].
[Cook, 1990] A. Cook. Sir Harold Jeffreys. Biographical Memoirs of Fellows of the Royal Society XXXVI, pp. 303-333, 1990.
[Costantini and Galavotti, 1987] D. Costantini and M. C. Galavotti. Johnson e l'interpretazione degli enunciati probabilistici. In L'epistemologia di Cambridge 1850-1950, ed. by Raffaella Simili. Bologna: Il Mulino, pp. 245-62, 1987.
[Costantini and Galavotti, 1997] D. Costantini and M. C. Galavotti, eds. Probability, Dynamics and Causality. Dordrecht-Boston: Kluwer, 1997.
[Cottrell, 1993] A. Cottrell. Keynes' Theory of Probability and its Relevance to his Economics. Economics and Philosophy IX, pp. 25-51, 1993.
[Cox, 1946] R. T. Cox. Probability, Frequency, and Reasonable Expectation. American Journal of Physics XIV, pp. 1-13, 1946.
[Cox, 1961] R. T. Cox. The Algebra of Probable Inference. Baltimore: The Johns Hopkins University Press, 1961.
[Dawid and Galavotti, 2009] A. P. Dawid and M. C. Galavotti. De Finetti's Subjectivism, Objective Probability, and the Empirical Validation of Probability Assessments. In [Galavotti, 2009, pp. 97-114].
[de Finetti, 1929] B. de Finetti. Funzione caratteristica di un fenomeno aleatorio. In Atti del Congresso Internazionale dei Matematici. Bologna: Zanichelli, pp. 179-190, 1929. Also in de Finetti [1981], pp. 97-108.
[de Finetti, 1931a] B. de Finetti. Probabilismo. Logos, pp. 163-219, 1931. English edition Probabilism. Erkenntnis XXXI, pp. 169-223, 1989.
[de Finetti, 1931b] B. de Finetti. Sul significato soggettivo della probabilità. Fundamenta mathematicae XVII, pp. 298-329, 1931. English edition On the Subjective Meaning of Probability. In de Finetti [1992], pp. 291-321.
[de Finetti, 1931c] B. de Finetti. Le leggi differenziali e la rinuncia al determinismo. Rendiconti del Seminario Matematico della R. Università di Roma, serie 2, VII, pp. 63-74, 1931. English edition Differential Laws and the Renunciation of Determinism. In de Finetti [1992], pp. 323-334.
[de Finetti, 1933] B. de Finetti. Sul concetto di probabilità. Rivista italiana di statistica, economia e finanza V, pp. 723-47, 1933. English edition On the Probability Concept. In de Finetti [1992], pp. 335-352.
[de Finetti, 1937] B. de Finetti. La prévision: ses lois logiques, ses sources subjectives. Annales de l'Institut Henri Poincaré VII, pp. 1-68, 1937. English edition Foresight: its Logical Laws, its Subjective Sources. In Kyburg and Smokler, eds. [1964], pp. 95-158. Also in the second edition (1980), pp. 53-118.
[de Finetti, 1939] B. de Finetti. Punti di vista: Émile Borel. Supplemento statistico ai Nuovi problemi di Politica, Storia, ed Economia V, pp. 61-71, 1939.
[de Finetti, 1951] B. de Finetti. Recent Suggestions for the Reconciliation of Theories of Probability. In Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability, ed. by Jerzy Neyman. Berkeley: University of California Press, pp. 217-225, 1951.
[de Finetti, 1962a] B. de Finetti. Obiettività e oggettività: critica a un miraggio. La Rivista Trimestrale I, pp. 343-367, 1962.
[de Finetti, 1962b] B. de Finetti. Does it Make Sense to Speak of 'Good Probability Appraisers'? In The Scientist Speculates. An Anthology of Partly-Baked Ideas, ed. by Irving John Good et al. New York: Basic Books, pp. 357-364, 1962.
[de Finetti, 1968] B. de Finetti. Probability: the Subjectivistic Approach. In La philosophie contemporaine, ed. by Raymond Klibansky. Florence: La Nuova Italia, pp. 45-53, 1968.
[de Finetti, 1969] B. de Finetti. Initial Probabilities: a Prerequisite for any Valid Induction. Synthèse XX, pp. 2-16, 1969.
[de Finetti, 1970] B. de Finetti. Teoria delle probabilità. Torino: Einaudi, 1970. English edition Theory of Probability. New York: Wiley, 1975.
[de Finetti, 1972a] B. de Finetti. Subjective or Objective Probability: is the Dispute Undecidable? Symposia Mathematica IX, pp. 21-36, 1972.
[de Finetti, 1972b] B. de Finetti. Probability, Induction and Statistics. New York: Wiley, 1972.
[de Finetti, 1973] B. de Finetti. Bayesianism: Its Unifying Role for Both the Foundations and the Applications of Statistics. Bulletin of the International Statistical Institute, Proceedings of the 39th Session, pp. 349-68, 1973.
[de Finetti, 1974] B. de Finetti. The Value of Studying Subjective Evaluations of Probability. In The Concept of Probability in Psychological Experiments, ed. by Carl-Axel Staël von Holstein. Dordrecht-Boston: Reidel, pp. 1-14, 1974.
[de Finetti, 1976] B. de Finetti. Probability: Beware of Falsifications! Scientia LXX, pp. 283-303, 1976. Reprinted in Kyburg and Smokler, eds. [1964], second edition 1980, pp. 194-224 (not in the first edition).
[de Finetti, 1980] B. de Finetti. Probabilità. In Enciclopedia Einaudi, vol. 10. Torino: Einaudi, pp. 1146-87, 1980.
[de Finetti, 1981] B. de Finetti. Scritti (1926-1930). Padua: CEDAM, 1981.
[de Finetti, 1982] B. de Finetti. Probability and my Life. In The Making of Statisticians, ed. by Joseph Gani. New York: Springer, pp. 4-12, 1982.
[de Finetti, 1992] B. de Finetti. Probabilità e induzione (Induction and Probability), ed. by Paola Monari and Daniela Cocchi. Bologna: CLUEB, 1992. (A collection of de Finetti's papers both in Italian and in English.)
[de Finetti, 1995] B. de Finetti. Filosofia della probabilità, ed. by Alberto Mura. Milan: Il Saggiatore, 1995. English edition Philosophical Lectures on Probability, ed. by Alberto Mura. Dordrecht: Springer, 2008.
[de Finetti, 2006] B. de Finetti. L'invenzione della verità. Milan: Cortina, 2006.
[De Morgan, 1837] A. De Morgan. Theory of Probabilities. In Encyclopaedia Metropolitana, 1837.
[De Morgan, 1838] A. De Morgan. An Essay on Probabilities, and on their Applications to Life Contingencies and Insurance Offices. London: Longman, 1838.
[De Morgan, 1847] A. De Morgan. Formal Logic: or, The Calculus of Inference, Necessary and Probable. London: Taylor and Walton, 1847. Reprinted London: Open Court, 1926.
[De Morgan, 1882] S. E. De Morgan. Memoir of Augustus De Morgan. London: Longman, 1882.
[Di Maio, 1994] M. C. Di Maio. Review of F. P. Ramsey, Notes on Philosophy, Probability and Mathematics. Philosophy of Science LXI, pp. 487-489, 1994.
[Donkin, 1851] W. Donkin. On Certain Questions Relating to the Theory of Probabilities. The Philosophical Magazine, Series 4, I, pp. 353-368, 458-466; II, pp. 55-60, 1851.
[Dummett, 1993] M. Dummett. Origins of Analytical Philosophy. London: Duckworth, 1993.
[Gabbay and Woods, 2008] D. Gabbay and J. Woods, eds. Handbook of the History of Logic. Volume IV: British Logic in the Nineteenth Century. Amsterdam: Elsevier, 2008.
[Galavotti, 1989] M. C. Galavotti. Anti-realism in the Philosophy of Probability: Bruno de Finetti's Subjectivism. Erkenntnis XXXI, pp. 239-261, 1989.
[Galavotti, 1991] M. C. Galavotti. The Notion of Subjective Probability in the Work of Ramsey and de Finetti. Theoria LVII, pp. 239-259, 1991.
[Galavotti, 1995] M. C. Galavotti. F. P. Ramsey and the Notion of 'Chance'. In The British Tradition in the 20th Century Philosophy. Proceedings of the 17th International Wittgenstein Symposium, ed. by Jaakko Hintikka and Klaus Puhl. Vienna: Hölder-Pichler-Tempsky, pp. 330-340, 1995.
[Galavotti, 1999] M. C. Galavotti. Some Remarks on Objective Chance (F. P. Ramsey, K. R. Popper and N. R. Campbell). In Language, Quantum, Music, ed. by Maria Luisa Dalla Chiara, Roberto Giuntini and Federico Laudisa. Dordrecht-Boston: Kluwer, pp. 73-82, 1999.
[Galavotti, 2001] M. C. Galavotti. Subjectivism, Objectivism and Objectivity in Bruno de Finetti's Bayesianism. In Foundations of Bayesianism, ed. by David Corfield and Jon Williamson. Dordrecht-Boston: Kluwer, pp. 161-174, 2001.
[Galavotti, 2003] M. C. Galavotti. Harold Jeffreys' Probabilistic Epistemology: Between Logicism and Subjectivism. British Journal for the Philosophy of Science LIV, pp. 43-57, 2003.
[Galavotti, 2005] M. C. Galavotti. Philosophical Introduction to Probability. Stanford: CSLI, 2005.
[Galavotti, 2006] M. C. Galavotti, ed. Cambridge and Vienna. Frank P. Ramsey and the Vienna Circle. Dordrecht: Springer, 2006.
[Galavotti, 2009] M. C. Galavotti, ed. Bruno de Finetti, Radical Probabilist. London: College Publications, 2009.
[Gillies, 2000] D. Gillies. Philosophical Theories of Probability. London-New York: Routledge, 2000.
[Gillies, 2006] D. Gillies. Keynes and Probability. In [Backhouse and Bateman, 2006, pp. 199-216].
[Good, 1965] I. J. Good. The Estimation of Probabilities. An Essay on Modern Bayesian Methods. Cambridge, Mass.: MIT Press, 1965.
[Good, 1967] I. J. Good. On the Principle of Total Evidence. British Journal for the Philosophy of Science XVIII, pp. 319-321, 1967. Reprinted in Good [1983], pp. 178-180.
[Good, 1983] I. J. Good. Good Thinking. The Foundations of Probability and its Applications. Minneapolis: University of Minnesota Press, 1983.
[Hacking, 1971] I. Hacking. The Leibniz-Carnap Program for Inductive Logic. The Journal of Philosophy LXVIII, pp. 597-610, 1971.
[Hacking, 1975] I. Hacking. The Emergence of Probability. Cambridge: Cambridge University Press, 1975.
[Hailperin, 1976] T. Hailperin. Boole's Logic and Probability. Amsterdam: North Holland, 1976.
[Harrod, 1951] R. F. Harrod. The Life of John Maynard Keynes. London: Macmillan, 1951.
[Howie, 2002] D. Howie. Interpreting Probability. Cambridge: Cambridge University Press, 2002.
[Howson, 1988] C. Howson. On the Consistency of Jeffreys' Simplicity Postulate, and its Role in Bayesian Inference. The Philosophical Quarterly XXXVIII, pp. 68-83, 1988.
[Howson, 2006] C. Howson. Scientific Reasoning and the Bayesian Interpretation of Probability. In Contemporary Perspectives in Philosophy and Methodology of Science, ed. by Wenceslao J. Gonzalez and Jesus Alcolea. La Coruña: Netbiblo, pp. 31-45, 2006.
[Jaynes, 1983] E. T. Jaynes. Papers on Probability, Statistics and Statistical Physics, ed. by R. Rosenkrantz. Dordrecht: Reidel, 1983.
[Jaynes, 2003] E. T. Jaynes. Probability Theory: The Logic of Science, ed. by G. L. Bretthorst. Cambridge: Cambridge University Press, 2003. http://bayes.wustl.edu.
[Jeffrey, 1965] R. C. Jeffrey. The Logic of Decision. Chicago: The University of Chicago Press, 1965. Second edition Chicago: The University of Chicago Press, 1983.
[Jeffrey, 1991] R. C. Jeffrey. After Carnap. Erkenntnis XXXV, pp. 255-62, 1991.
[Jeffrey, 1992a] R. C. Jeffrey. Probability and the Art of Judgment. Cambridge: Cambridge University Press, 1992.
[Jeffrey, 1992b] R. C. Jeffrey. Radical Probabilism (Prospectus for a User's Manual). In Rationality in Epistemology, ed. by Enrique Villanueva. Atascadero, Cal.: Ridgeview, pp. 193-204, 1992.
[Jeffrey, 1992c] R. C. Jeffrey. De Finetti's Radical Probabilism. In [de Finetti, 1992, pp. 263-275].
[Jeffrey, 2004] R. C. Jeffrey. Subjective Probability: The Real Thing. Cambridge: Cambridge University Press, 2004.
[Jeffreys, 1922] H. Jeffreys. Review of J. M. Keynes, A Treatise on Probability. Nature CIX, pp. 132-3, 1922. Also in Collected Papers VI, pp. 253-6.
[Jeffreys, 1931] H. Jeffreys. Scientific Inference. Cambridge: Cambridge University Press, 1931. Reprinted with Addenda 1937; 2nd modified edition 1957, 1973.
[Jeffreys, 1933] H. Jeffreys. Probability, Statistics and the Theory of Errors. Proceedings of the Royal Society, Series A, CXL, pp. 523-535, 1933.
[Jeffreys, 1934] H. Jeffreys. Probability and Scientific Method. Proceedings of the Royal Society, Series A, CXLVI, pp. 9-16, 1934.
[Jeffreys, 1936a] H. Jeffreys. The Problem of Inference. Mind XLV, pp. 324-333, 1936.
[Jeffreys, 1936b] H. Jeffreys. On Some Criticisms of the Theory of Probability. Philosophical Magazine XXII, pp. 337-359, 1936.
[Jeffreys, 1937] H. Jeffreys. Scientific Method, Causality, and Reality. Proceedings of the Aristotelian Society, New Series, XXXVII, pp. 61-70, 1937.
[Jeffreys, 1939] H. Jeffreys. Theory of Probability. Oxford: Clarendon Press, 1939. 2nd modified edition 1948; further editions 1961, 1983.
[Jeffreys, 1955] H. Jeffreys. The Present Position in Probability Theory. The British Journal for the Philosophy of Science V, pp. 275-89, 1955. Also in Collected Papers VI, pp. 421-435.
[Jeffreys and Wrinch, 1919] H. Jeffreys and D. Wrinch. On Certain Aspects of the Theory of Probability. Philosophical Magazine XXXVIII, pp. 715-731, 1919.
[Jeffreys and Wrinch, 1921] H. Jeffreys and D. Wrinch. On Certain Fundamental Principles of Scientific Inquiry. Philosophical Magazine XLII, pp. 369-90 (part I); XLV, pp. 368-74 (part II), 1921.
[Jeffreys and Wrinch, 1923] H. Jeffreys and D. Wrinch. The Theory of Mensuration. Philosophical Magazine XLVI, pp. 1-22, 1923.
[Jeffreys and Swirles, 1971-1977] H. Jeffreys and B. Swirles, eds. Collected Papers of Sir Harold Jeffreys on Geophysics and Other Sciences. 6 volumes. London-Paris-New York: Gordon and Breach Science Publishers, 1971-1977.
[Jevons, 1873] W. S. Jevons. The Principles of Science. London: Macmillan, 1873. Second enlarged edition 1877. Reprinted New York, 1958.
[Johnson, 1921, 1922, 1924] W. E. Johnson. Logic. Cambridge: Cambridge University Press. Part I, 1921; Part II, 1922; Part III, 1924. Reprinted New York: Dover, 1964.
[Johnson, 1932] W. E. Johnson. Probability: The Relations of Proposal to Supposal; Probability: Axioms; Probability: The Deductive and the Inductive Problems. Mind XLI, pp. 1-16, 281-296, 409-423, 1932.
[Keynes, 1921] J. M. Keynes. A Treatise on Probability. London: Macmillan, 1921. Reprinted in Keynes [1972], vol. 8.
[Keynes, 1930] J. M. Keynes. Frank Plumpton Ramsey. The Economic Journal, 1930. Reprinted in Keynes [1933] and [1972], pp. 335-346.
[Keynes, 1931] J. M. Keynes. W. E. Johnson. The Times, 15 January 1931. Reprinted in Keynes [1933] and [1972], pp. 349-350.
[Keynes, 1933] J. M. Keynes. Essays in Biography. London: Macmillan, 1933. Second modified edition 1951. Third modified edition in Keynes [1972], vol. 10.
[Keynes, 1936] J. M. Keynes. William Stanley Jevons. Journal of the Royal Statistical Society, Part III, 1936. Reprinted in the second edition of Keynes [1933] and in [1972], pp. 109-160.
[Keynes, 1972] J. M. Keynes. The Collected Writings of John Maynard Keynes. Cambridge: Macmillan, 1972.
[Keynes, 1975] M. Keynes, ed. Essays on John Maynard Keynes. Cambridge: Cambridge University Press, 1975.
[Kneale, 1948] W. Kneale. Boole and the Revival of Logic. Mind LVII, pp. 149-175, 1948.
[Knobloch, 1987] E. Knobloch. Émile Borel as a Probabilist. In [Krüger et al., 1987, vol. 1, pp. 215-33].
[Krüger et al., 1987] L. Krüger, G. Gigerenzer, and M. Morgan, eds. The Probabilistic Revolution. 2 volumes. Cambridge, Mass.: MIT Press, 1987.
[Kyburg, 1968] H. Kyburg, Jr. The Rule of Detachment in Inductive Logic. In [Lakatos, 1968, pp. 98-165].
[Kyburg and Smokler, 1964] H. Kyburg, Jr. and H. Smokler, eds. Studies in Subjective Probability. New York-London-Sydney: Wiley, 1964. Second modified edition Huntington (N.Y.): Krieger, 1980.
[Lakatos, 1968] I. Lakatos, ed. The Problem of Inductive Logic. Amsterdam: North-Holland, 1968.
[Levy, 1979] P. Levy. G. E. Moore and the Cambridge Apostles. Oxford-New York: Oxford University Press, 1979.
[Lindley, 1965] D. Lindley. Introduction to Probability and Statistics. Cambridge: Cambridge University Press, 1965.
[Lindley, 1991] D. Lindley. Sir Harold Jeffreys. Chance IV, pp. 10-21, 1991.
[MacHale, 1985] D. MacHale. George Boole. His Life and Work. Dublin: Boole Press, 1985.
[Mellor, 1995] H. Mellor, ed. Better than the Stars. Philosophy LXX, pp. 243-262, 1995.
[von Plato, 1994] J. von Plato. Creating Modern Probability. Cambridge-New York: Cambridge University Press, 1994.
[Ramsey, 1922] F. P. Ramsey. Mr. Keynes on Probability. The Cambridge Magazine XI, pp. 3-5, 1922. Reprinted in The British Journal for the Philosophy of Science XL (1989), pp. 219-222.
[Ramsey, 1931] F. P. Ramsey. The Foundations of Mathematics and Other Logical Essays, ed. by Richard Bevan Braithwaite. London: Routledge and Kegan Paul, 1931.
[Ramsey, 1990a] F. P. Ramsey. Philosophical Papers, ed. by Hugh Mellor. Cambridge: Cambridge University Press, 1990.
[Ramsey, 1990b] F. P. Ramsey. Weight or the Value of Knowledge. British Journal for the Philosophy of Science XLI, pp. 1-4, 1990.
[Ramsey, 1991a] F. P. Ramsey. Notes on Philosophy, Probability and Mathematics, ed. by Maria Carla Galavotti. Naples: Bibliopolis, 1991.
[Ramsey, 1991b] F. P. Ramsey. On Truth, ed. by Nicholas Rescher and Ulrich Majer. Dordrecht-Boston: Kluwer, 1991.
[Sahlin, 1990] N.-E. Sahlin. The Philosophy of F. P. Ramsey. Cambridge: Cambridge University Press, 1990.
[Savage, 1954] L. J. Savage. Foundations of Statistics. New York: Wiley, 1954.
[Savage, 1971] L. J. Savage. Elicitation of Personal Probabilities and Expectations. Journal of the American Statistical Association LXVI, pp. 783-801, 1971.
[Skidelsky, 1983-1992] R. Skidelsky. John Maynard Keynes. 2 volumes. London: Macmillan, 1983-1992.
[Skyrms, 1990] B. Skyrms. The Dynamics of Rational Deliberation. Cambridge, Mass.: Harvard University Press, 1990.
[Skyrms, 1996] B. Skyrms. The Structure of Radical Probabilism. Erkenntnis XLV, pp. 286-297, 1996. Reprinted in [Costantini and Galavotti, 1997, pp. 145-157].
[Skyrms, 2006] B. Skyrms. Discovering 'Weight, or the Value of Knowledge'. In [Galavotti, 2006, pp. 55-66].
[Skyrms and Harper, 1988] B. Skyrms and W. L. Harper, eds. Causation, Chance, and Credence. 2 volumes. Dordrecht-Boston: Kluwer, 1988.
[Suppes, 2006] P. Suppes. Ramsey's Psychological Theory of Belief. In [Galavotti, 2006, pp. 55-66].
[Suppes, 2009] P. Suppes. Some Philosophical Reflections on de Finetti's Thought. In [Galavotti, 2009, pp. 19-40].
[Taylor, 2006] G. Taylor. Frank Ramsey — A Biographical Sketch. In [Galavotti, 2006, pp. 1-18].
The Modern Epistemic Interpretations of Probability: Logicism and Subjectivism
POPPER AND HYPOTHETICO-DEDUCTIVISM

Alan Musgrave

Popper famously declared that induction is a myth. This thesis, if true, makes nonsense of the current volume. But is the thesis true? And, before we get to that, what precisely does it mean? Popper is a deductivist. He thinks that whenever we reason, we reason deductively or are best reconstructed as reasoning deductively. Most philosophers disagree. Most philosophers think that most reasoning is non-deductive. To understand why most philosophers think this, we have to look at the functions of reason or argument, and see that deduction seems quite unsuited to serve some of those functions. What are the functions of argument? Why do people reason or argue? One function of reason or argument is to form new beliefs or come up with new hypotheses. Another is to prove or justify or give reasons for the beliefs or hypotheses that we have formed. A third is to explore the consequences of our beliefs or hypotheses in order to try to criticise them. We need a logic of discovery, a logic of justification, and a logic of criticism. It is usually accepted that deductive logic is fine so far as the logic of criticism goes. "Exploring the consequences of our hypotheses" means exploring the deductive consequences of our hypotheses. Criticism proceeds by deducing some conclusion, showing that it is not true (because it does not square with experience, experiment, or something else that we believe), and arguing that some premise must therefore be false as well. Criticism only works if the reasoning is deductively valid, if the conclusion is 'contained in' the premises, if the reasoning is not 'ampliative'. That is what entitles us to say that if the conclusion is false some premise must be false as well. If our argument were ampliative, criticism would not work. Showing that the conclusion is false would not entitle us to say that some premise must be false as well.
But deduction's strength so far as criticism is concerned seems to be a weakness as far as discovery and justification are concerned. In a valid deduction the conclusion is contained in the premises, does not 'amplify' them, says nothing new. If we want to come up with new beliefs or hypotheses, deduction obviously cannot help us. And if we want to justify a belief, deduction cannot help us once more. Deducing the belief we want to justify from another stronger belief is bound to be question-begging. The logics of discovery and justification must be non-deductive or ampliative. The conclusions of the arguments involved cannot be contained in the premises, but must 'amplify' them and say something new. Or so said the critics of deductive logic, down the ages.

Handbook of the History of Logic. Volume 10: Inductive Logic. Volume editors: Dov M. Gabbay, Stephan Hartmann and John Woods. General editors: Dov M. Gabbay and John Woods. © 2011 Elsevier B.V. All rights reserved.
ENTHYMEMES AND THEIR DEDUCTIVIST RECONSTRUCTIONS

Deductivists deem ampliative reasoning invalid. If most reasoning is ampliative, then deductivists deem most reasoning invalid. That unpleasant consequence may seem reason enough to reject deductivism. To be sure, logic has a critical function. The task of the logician is not just to describe or 'model' how people do in fact reason, but also to prescribe how people ought to reason if they are to reason well. But if most reasoning is ampliative, then deductivists seem committed to the view that most of the time we do not reason well. Deductivism is a utopian ethic according to which most ordinary logical behaviour is thoroughly immoral! Deductivists have a way of avoiding this unpleasant consequence. Folk seldom spell out all the premises of their arguments. Most reasoning, including most everyday reasoning, is in enthymemes, arguments with unstated or 'missing' premises. An argument which is invalid as stated can often be validated by spelling out its missing premise. But deductivists must be careful here. Any invalid argument from premise(s) P to conclusion C can be validated if we count it an enthymeme and add the missing premise "If P then C". If deductivists are not careful, they will end up saying that people never argue invalidly at all! Deductive logic will be deprived of any critical function! Deductivists will now have an ethical gospel of relaxation — "Reason how you will, I will validate it". There is an obvious deductivist response to this. Not every argument should be counted as having a missing premise. It must be clear from the context of production of an argument that it has a missing premise and what that is. Suppose somebody argues that it must be raining, because when it rains the streets get wet and the streets are wet. Deductivists do not validate this argument. They say it is a fallacy, an example of the fallacy of affirming the consequent.
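The deductive notions in play here can be checked mechanically. The following sketch (my own illustration, not from the text; the propositional encoding and all names are mine) brute-forces truth tables to show that affirming the consequent is invalid, and that adding the converse conditional as a "missing premise" validates it:

```python
from itertools import product

def entails(premises, conclusion, atoms):
    """True iff every truth assignment making all premises true
    also makes the conclusion true (classical propositional validity)."""
    for values in product([False, True], repeat=len(atoms)):
        env = dict(zip(atoms, values))
        if all(p(env) for p in premises) and not conclusion(env):
            return False  # counterexample: premises true, conclusion false
    return True

implies = lambda a, b: (not a) or b

# "When it rains the streets get wet; the streets are wet; so it is raining."
rains_to_wet = lambda e: implies(e["rain"], e["wet"])
wet = lambda e: e["wet"]
rain = lambda e: e["rain"]

# Affirming the consequent: invalid
print(entails([rains_to_wet, wet], rain, ["rain", "wet"]))  # False

# Add the converse conditional as a "missing premise": now valid
wet_to_rain = lambda e: implies(e["wet"], e["rain"])
print(entails([rains_to_wet, wet, wet_to_rain], rain, ["rain", "wet"]))  # True
```

The second call also illustrates the trivialising worry just raised: any invalid argument from P to C can be "validated" by adding "If P then C" as a premise, which is why the deductivist must demand contextual evidence that a premise really was suppressed.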
As to the uncertainty about what the missing premise of an argument is, if there is one, logic teachers have for generations asked students to supply the missing premises of arguments presented to them — and have marked their answers right or wrong. Still, this is not a logical issue, but a pragmatic one (for want of a better word). It must be admitted that in some cases it may not be clear from the context whether there is a missing premise and what it is. But then, if it matters, we can try to find out. Most philosophers do not like deductivist reconstructions. Most philosophers say that ampliative reasoning is not to be validated in this way. Ampliative reasoning is deductively invalid, to be sure, but it is perfectly good reasoning nevertheless. Or at least, some of it is. All of it is invalid, but some of it is cogent and some of it is not. Thus the inductive logicians, or at least the best of them, try to work out when an ampliative argument is cogent and when not. Here they face the same problem as the deductivists. If it is not clear what the missing premise is that would convert a real-life argument into a valid deduction, then it is equally unclear what the ampliative rule or principle is that the real-life arguer is supposed to be employing. Unclarity about enthymemes cannot comfort inductivists.
Furthermore, inductivists do not think that all invalid arguments are ampliative. They agree with deductivists that you commit a fallacy if you argue that it must be raining because when it rains the streets get wet and the streets are wet. They do not say that this argument, although it is not deductively valid, is a perfectly cogent argument in some fancy ampliative logic. Yet when it comes to other deductively invalid arguments, this is precisely what inductivists do say. This is puzzling.

'AUTOMOBILE LOGIC'

To see how puzzling it is, suppose somebody produces the following argument:

    American cars are better than Japanese cars.
    Therefore, Cadillacs are better than Japanese cars.

What to say of this argument? You might quarrel with the premise, or with the conclusion, or with both. But what to say if you are a logician? Obviously, if you are a deductive logician, you will say that the argument is invalid, that the conclusion does not follow from the premise. That verdict may seem a bit harsh. You might soften it by suspecting an enthymeme. Perhaps our arguer also has an unstated or missing premise to the effect that Cadillacs are American cars. If we spell that premise out, we get the perfectly valid argument:

    [Cadillacs are American cars.]
    American cars are better than Japanese cars.
    Therefore, Cadillacs are better than Japanese cars.

Having reconstructed the argument this way, we can turn to the more interesting question of whether the premises and conclusion of the argument are true. This is all trivial, and familiar to countless generations of logic students. [By the way, in my deductivist reconstructions of enthymemes, I place the missing premise in square brackets. As we will see, these missing premises are often general claims. But it is not to be assumed, as it often is, that missing premises are always general. Ordinary reasoners often suppress particular premises as well as general ones.
A real-life arguer might well say “We are all mortal — so one day George Bush will die” as well as “George Bush is only human — so one day he will die”. In the former case the suppressed premise (“George Bush is human”) is particular rather than general.] There is another way to soften the harsh verdict that the argument with which we started is invalid. You might say that though the argument is deductively invalid, it is a perfectly cogent argument in a special non-deductive or inductive or ampliative logic that deals with arguments about automobiles. This automobile logic is characterised by special rules of inference such as “Cadillacs are American cars”. (Do not object that “Cadillacs are American cars” is not a rule of inference,
but a sociological hypothesis about automobile manufacture. Recast it as a rule of inference: "From a premise of the form 'x is a Cadillac', infer a conclusion of the form 'x is an American car'.") Unlike the formal or topic-neutral rules of deductive logic, the rules of automobile logic are material or topic-specific. The formally valid arguments of deductive logic are boring, for their conclusions can contain nothing new. But the 'materially valid' or cogent arguments of automobile logic are exciting, because their conclusions do contain something new. So, the deductivist ploy regarding an invalid argument she wishes to appropriate is to reconstruct it as an enthymeme and supply its missing premise. And the inductivist ploy regarding a valid argument he wishes to appropriate is to reconstruct it (perhaps 'deconstruct it' would be better) as an ampliative argument where some necessary premise becomes a material rule of inference. Both ploys risk being applied trivially, as the example of automobile logic makes clear. Enough! Automobile logic is silly. Has any serious thinker ever gone in for anything like automobile logic? Yes, they have, and on a massive scale. But before I document that claim, a complication needs to be considered.

FORMAL AND SEMANTIC VALIDITY

Suppose somebody produces the following argument:

    Herbert is a bachelor.
    Therefore, he is unhappy.

This argument is invalid. As before, deductivists will treat it as an enthymeme with a missing premise to the effect that bachelors are unhappy, to obtain:

    [Bachelors are unhappy.]
    Herbert is a bachelor.
    Therefore, he is unhappy.

This argument is valid, and has a dubious hypothesis about the virtues of matrimony as its missing premise. As with automobile logic, we would think ill of a philosopher who said that the original argument, though formally invalid, is a materially valid or cogent argument in matrimonial logic, which has "Bachelors are unhappy" as one of its interesting inference-licenses.
But now suppose somebody produces the argument:

    Herbert is a bachelor.
    Therefore, he is unmarried.

As before, deductivists might insist upon treating this as an enthymeme and supplying its missing premise to obtain:

    [Bachelors are unmarried.]
    Herbert is a bachelor.
    Therefore, Herbert is unmarried.
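The contrast between the bare argument and its enthymematic reconstruction can be made concrete with a small model-checker (an illustrative sketch of my own; the domain and all names are invented). It approximates formal validity by quantifying over all interpretations of the predicates on a small finite domain, which suffices for monadic forms like these; note that it treats 'bachelor' and 'unmarried' as uninterpreted predicates, so it tracks validity in virtue of form alone, with no meaning postulates:

```python
from itertools import product

DOMAIN = ["herbert", "alice"]

def formally_valid(premises, conclusion):
    """True iff no interpretation of the predicates over DOMAIN makes
    all premises true and the conclusion false."""
    extensions = lambda: product([False, True], repeat=len(DOMAIN))
    for bachelor_ext, unmarried_ext in product(extensions(), extensions()):
        interp = {
            "bachelor": dict(zip(DOMAIN, bachelor_ext)),
            "unmarried": dict(zip(DOMAIN, unmarried_ext)),
        }
        if all(p(interp) for p in premises) and not conclusion(interp):
            return False  # counter-model found
    return True

herbert_bachelor = lambda i: i["bachelor"]["herbert"]
herbert_unmarried = lambda i: i["unmarried"]["herbert"]
all_bachelors_unmarried = lambda i: all(
    (not i["bachelor"][x]) or i["unmarried"][x] for x in DOMAIN
)

# Bare argument: formally invalid (nothing ties 'bachelor' to 'unmarried')
print(formally_valid([herbert_bachelor], herbert_unmarried))  # False

# With the missing premise spelled out: formally valid
print(formally_valid([all_bachelors_unmarried, herbert_bachelor],
                     herbert_unmarried))  # True
```

The checker's verdict on the bare argument is the purist's verdict: without the premise "Bachelors are unmarried" spelled out, there is a counter-model, which is just what it means to say the argument is not formally valid.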
But it may be objected that the original argument is already valid: it is impossible for its premise to be true and its conclusion false. This is because its missing premise is analytically or necessarily true, true by virtue of the meaning of the word ‘bachelor’. The argument is not formally valid, to be sure, but it is semantically valid. Deductivists, it may be objected, simply overlook the fact that there are semantically valid arguments as well as formally valid ones. This objection raises the vexed question of whether the analytic/synthetic distinction is viable. Are there, as well as logical truths that are true in virtue of their logical form, also analytic truths that are true in virtue of the meanings of their non-logical terms? Quine said NO. He was a logical purist who insisted that so-called ‘semantically valid’ arguments are to be treated as enthymemes and converted into formally valid arguments by spelling out their missing premises. This purist policy can be defended by pointing to the notorious vagueness of the category of analytic truths. There are a few clear cases (like “Bachelors are unmarried”?), many clear non-cases (like “Bachelors are unhappy”?), and an enormous grey area in between. Or so most philosophers think. Quine thinks that the grey area is all encompassing, and that there is no analytic/synthetic distinction to be drawn, not even a vague one. Quine thinks the purist policy is the only game in town. But this purist policy hides a deep problem. What are formal validity and logical truth? They are validity and truth in virtue of logical form. And what is logical form? You get at it by distinguishing logical words from non-logical or descriptive words. But formal validity and logical truth turn on the meanings of the logical words. Formal validity is just a species of semantic validity, and logical truth just a species of analytic truth. 
(As is well-known, formal validity of the argument "P, therefore C" is tantamount to logical truth of the conditional statement "If P, then C" corresponding to the argument.) The deep problem is that there seems no sharp distinction between logical and descriptive words. There are a few clear cases of logical words, and the deductive logician's usual list gives them: propositional connectives, quantifiers, the 'is' of predication, and the 'is' of identity. There are many clear cases of non-logical or descriptive words. But then there is a grey area, the most notorious examples being comparatives — 'taller than', 'older than', generally 'X-er than'. These seem to be mixed cases, with a descriptive component, the X, and a logical component, the '-er'. What are we to say of statements expressing the transitivity of comparatives, statements of the form "If a is X-er than b, and b is X-er than c, then a is X-er than c"? Do we count these logical truths, and the arguments corresponding to them formally valid? Or do we count them analytic truths, and the arguments corresponding to them semantically valid? Or do we count them as synthetic truths, and the arguments corresponding to them simply invalid unless we add the transitivity of 'X-er than' as a suppressed premise? Quine favours the last option. His famous attack on the analytic/synthetic distinction proceeds for the most part by taking the notion of logical truth for granted. Logical truth is truth by virtue of the meanings of . . . (here comes
the usual list of logical words). Quine then contrasts the precision of this notion with the vagueness of the notion of analytic truth. But then, all of a sudden and without much argument, we are told that all truths depend for their truth on the way the world is, that the whole of our knowledge faces the tribunal of experience as a corporate body, and that experience might teach us that even so-called 'logical truths' like "Either P or it is not the case that P" are false. Yet elsewhere Quine tells us that the deviant logician who wants to reject the law of excluded middle has a problem: he wants to convict us of an empirical error, but all he can really do is propose a new concept of negation. In short, the logical truths are restored: the law of excluded middle is true by virtue of the meanings of 'or' and 'not', and to reject it is to change the meaning of one of these terms. Fortunately, these foundational issues are irrelevant to our main concern, which is with non-deductive or inductive logic. Recognition of the category of analytic but non-logical truth, and of semantic but non-logical validity, does not take us outside the realm of deduction. It does not get us to inductive or ampliative inferences. So let us return to that issue. (Later I will consider the view that inductive arguments are actually semantically valid deductive arguments.)

HISTORICAL INTERLUDE: MILL VERSUS ARISTOTLE

As we all know, deductive logic was founded by Aristotle, who worked out the logic of categorical propositions and the theory of the syllogism. Aristotle found out that there were 256 possible syllogisms that folk might use, and determined that only 24 of these were valid. So precious were these valid syllogisms that each was given its own name, and in the 13th century Pope John XXI put all their names into a rhyme. Thereafter every educated person had to learn the rhyme and remember the valid syllogisms so that they might reason well.
Yet down the centuries people often complained that Aristotle's logic was trivial, uninformative or useless — precisely because valid syllogisms were not ampliative. The grumbling grew loud during the Scientific Revolution. The philosophers of the Scientific Revolution wanted to put Aristotle behind them. They dreamt of inferences that would not be trivial or uninformative or useless, ampliative inferences that would lead to something new. And they dreamt of a logic or method that might systematise such inferences, and tell us which of them were good ones, just as Aristotle had told us which syllogisms were good ones. Bacon and Descartes and Locke are but three examples of philosophers who criticised Aristotle's logic in this way. In the nineteenth century, John Stuart Mill was another. Aristotle's critics had a point. Aristotelian logic is trivial in the sense that it deals only with relatively trivial arguments. There are lots of valid deductive arguments that Aristotelian logic cannot deal with. Perhaps the most famous everyday example is "All horses are animals. Therefore, all heads of horses are heads of animals". More important, it is hopeless to try to capture mathematical reasoning in syllogistic form. But beginning in about 1850 with Boole, deductive logic progressed way beyond Aristotelian logic. There is an irony of history here. While deductive logic was poor, one could forgive people for thinking that it needed supplementing with some non-deductive logic. But after deductive logic became rich, as it has in the last 150 years, one might suppose that anti-deductivist tendencies might have withered away. But nothing could be further from the truth. Which suggests that the belief in non-deductive or ampliative inference stems from a deeper source than Aristotle's inability to deal adequately with "All horses are animals. Therefore, all heads of horses are heads of animals". It does indeed stem from a deeper source — it stems from the idea that the logics of discovery and justification must be ampliative. Mill argued (syllogistically, by the way!) that all genuine inferences lead to conclusions that are new, while syllogisms lead to nothing new, so that syllogisms are not genuine inferences at all. All genuine inferences, according to Mill, are inductive or ampliative inferences:

    All inference is from particulars to particulars. General propositions are merely registers of such inferences already made and short formulae for making more. The major premise of a syllogism, consequently, is a formula of this description; the conclusion is not an inference drawn from the formula, but an inference drawn according to the formula; the real logical antecedent or premise being the particular facts from which the general proposition was collected by induction. [Mill, 1843: II, iii, 4]

What did Mill mean by induction? He meant arguments from experience. He was an empiricist, who thought that knowledge came from experience. But knowledge transcends experience. So we need ampliative or inductive reasoning to get us from premises that experience supplies, to conclusions that transcend those premises. The paradigmatic kind of inductive reasoning is inductive generalisation. Here is an example:

    All observed emeralds were green.
    Therefore, all emeralds are green.

This argument is invalid. But if that verdict seems harsh, deductivists might soften it by reflecting that people seldom state all of their premises. Perhaps this argument has a missing premise, to the effect that unobserved cases resemble observed cases. If we spell that premise out, we get the perfectly valid argument:

    [Unobserved cases resemble observed cases.]
    All observed emeralds were green.
    Therefore, all emeralds are green.

(By the way, if we change the conclusion from a generalisation about all emeralds to a prediction about the next case, we get so-called singular predictive inductive inference. That is what Mill had in mind when he said "All inference is from particulars to particulars".)
I said earlier that automobile logic is silly. But nearly everybody thinks that inductive logic is not silly. Why? The situation with the emeralds argument is symmetrical with the situation with the argument about Cadillacs. Yet most philosophers insist that they be treated differently. Most philosophers would not touch automobile logic with a barge-pole, yet insist that inductive logic must exist. Why? What is the difference between the two cases? One obvious difference is that the missing premise of the Cadillac argument is true, while the missing premise of the emeralds argument is false. Another difference, connected with the first, is that "Unobserved cases resemble observed cases" is a much more general hypothesis than "Cadillacs are American cars". So what? It will hardly do to say that arguments with true missing premises are to be reconstructed as deductive, while arguments with false missing premises are not. Where, in the continuum of generality, will we draw a line below which we have empirical hypotheses, and above which we have material rules of inductive logic? And what do we gain by disguising a bit of false and human chauvinistic metaphysics like "Unobserved cases [always] resemble observed cases" as a principle of some fancy ampliative inductive logic? We gain nothing, of course, and inductive logicians are smart enough to realise this. So they warm to the task of getting more plausible inductive rules than "Unobserved cases resemble observed cases". With characteristic clarity, Mill put his finger precisely on one big problem they face. Mill said that in some cases a single observation is "sufficient for a complete induction" (as he put it), while in other cases a great many observations are not sufficient. Why? Mill wrote:

    Whoever can answer this question knows more of the philosophy of logic than the wisest of the ancients, and has solved the problem of induction. [1843: III, iii, 3]

The answer to Mill's question is obvious.
In the first kind of case, we are assuming that what goes for one instance goes for all, whereas in the second kind of case we are not. But philosophers do not like this obvious answer. Peter Achinstein discusses Mill's own example as follows:

    . . . we may need only one observed instance of a chemical fact about a substance to validly generalise to all instances of that substance, whereas many observed instances of black crows are required to [validly?] generalise about all crows. Presumably this is due to the empirical fact that instances of chemical properties of substances tend to be uniform, whereas bird coloration, even in the same species, tend[s] not to be. [Achinstein, 2009: 8]

Quite so. And if we write these empirical facts (or rather, empirical assumptions or hypotheses) as explicit premises, then our arguments become deductions. In the first case we have a valid deduction from premises we think true, in the second case we have a valid deduction from premises one of which we think false (namely,
that what goes for the colour of one or many birds of a kind goes for all of them). Mill's question is answered, and his so-called 'problem of induction' is solved. Other inductive logicians say that an inductive generalisation will only be 'cogent' if the observed cases are typical cases, or only if the observed cases are a representative sample of all the cases. But reflection on what 'typical' or 'representative' mean just yields a more plausible deductivist reconstruction of the emeralds argument. To say that the emeralds we have observed are 'typical' or 'representative' is just to say that their shared features are common to all emeralds. Spelling that out yields:

    [Observed emeralds are typical or representative emeralds: their shared features are common to all emeralds.]
    All observed emeralds were green.
    Therefore, all emeralds are green.

Or perhaps the generalizer about emeralds had something even more restricted in mind. Perhaps the hidden assumption was that emeralds form a 'natural kind', and that colour is one of the essential or 'defining' features of things of that kind. Spelling that out yields:

    [All emeralds have the same colour.]
    All observed emeralds were green.
    Therefore, all emeralds are green.

Of course, the second premise of this argument is now much stronger than it need be. Once we have assumed that all emeralds have the same colour, we need observe only one emerald and then can argue thus:

    [All emeralds have the same colour.]
    This emerald is green.
    Therefore, all emeralds are green.

This is an example of what old logic books called demonstrative induction. It is not induction, but (valid) deduction. Aristotle called it epagoge. He insisted that one observed case is enough for you to "intuit the general principle" provided that the observation yields the essence of the thing [Prior Analytics, 67a22]. Aristotle also described this as "a valid syllogism which springs out of induction [that is, observation]" [Prior Analytics, 68b15].
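The demonstrative-induction pattern can be verified exhaustively over finite colour assignments (again a sketch of my own, not from the text; the three-colour palette and the four-emerald domain are arbitrary assumptions). With the suppressed uniformity premise the argument admits no counterexample; without it, a counterexample appears at once, which is just to say the bare argument is ampliative:

```python
from itertools import product

COLOURS = ["green", "blue", "red"]

def entails_all_green(uniform_premise, n=4):
    """Check, over every assignment of COLOURS to n emeralds, whether
    'emerald 0 is green' (plus, optionally, 'all emeralds have the same
    colour') deductively entails 'all emeralds are green'."""
    for colours in product(COLOURS, repeat=n):
        same_colour = len(set(colours)) == 1          # the suppressed premise
        first_green = colours[0] == "green"           # the single observation
        all_green = all(c == "green" for c in colours)  # the conclusion
        premises_hold = first_green and (same_colour or not uniform_premise)
        if premises_hold and not all_green:
            return False  # counterexample: premises true, conclusion false
    return True

# With the uniformity premise: valid deduction (no counterexample)
print(entails_all_green(uniform_premise=True))   # True

# Without it: the single observation does not entail the generalisation
print(entails_all_green(uniform_premise=False))  # False
```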
There is also what people call 'perfect' or 'complete' or 'enumerative' induction, where it is tacitly assumed that we have observed all the instances. Again, this is not induction but deduction. An example is:

    [The observed emeralds are all the emeralds.]
    All the observed emeralds were green.
    Therefore, all emeralds are green.

Popper mistakenly takes Aristotle's epagoge to be a complete induction [1963: 12, footnote 7].
Finally, there is so-called 'eliminative induction'. Once again, this is not induction, but just a special kind of deduction. An example is:

    [Either the wife or the mistress or the butler committed the murder.]
    The wife did not do it.
    The mistress did not do it.
    Therefore, the butler did it.

As the example indicates, this is a typical pattern of argument from detective stories. Sherlock Holmes argues that way all the time. So does your car mechanic to find out what is wrong with your car. So does your doctor to find out what is wrong with you. It is also the 'form of induction' advocated by Francis Bacon, High Priest of the experimental method of science, to find out the causes of things. Of course, an eliminative induction, though perfectly valid, is only as good as its major and often suppressed premise. You may get a false conclusion, if that premise does not enumerate all the 'suspects'. As you can see, it is child's play, philosophically speaking, to reconstruct patterns of so-called 'inductive reasoning' as valid deductions. We have done it with inductive generalisation, singular predictive inference, enumerative induction, demonstrative induction, and eliminative induction. (I did not even mention mathematical induction — everybody agrees that that is deduction.) As I shall show later, one can also do it with abduction, and with its intellectual descendant, inference to the best explanation. Yet most philosophers do not like these deductivist reconstructions of so-called inductive or ampliative arguments. They prefer to say, with Mill, that whether an 'inductive generalisation' is valid or cogent depends on the way the world is. This means that inductive logic, which sorts out which inductive arguments are cogent and which not, becomes an empirical science. Deductive logic is not empirical. Empirical inquiry can tell you that the conclusion of a valid argument is false (and hence that some premise is false as well).
Empirical inquiry can tell you that the premises of a valid argument are false (though not necessarily, of course, its conclusion as well). But neither finding out that the premises are false nor finding out that the conclusion is false, shows that the argument is invalid. Empirical research can produce ‘premise defeaters’ and/or ‘conclusion defeaters’, but it cannot produce ‘argument defeaters’. However, when it comes to inductive arguments, empirical research can provide ‘argument defeaters’ as well. An argument defeater casts doubt on the cogency of the argument, without necessarily impugning the truth of either the premises or the conclusion. The error of ‘psychologism’ was to suppose that logic describes how people think and is a part of empirical psychology. Despite its invocation of ‘cogency’ to parallel the notion of validity, inductive logic is also descriptive, part of empirical science in general, since whether an inductive argument is cogent depends on the way the world is. Deductive validity is an all-or-nothing business, it does not come in degrees. If you have a valid argument, you cannot make it more valid by adding premises. Inductive logic is different. Inductive cogency does come in degrees. You can make
Popper and Hypothetico-deductivism
215
an inductive argument more cogent or less cogent by adding premises. If I have observed ten green emeralds, I can pretty cogently conclude that all emeralds are green. But my inference will be more cogent if I add a further premise about having observed ten more green emeralds. However, if I add instead the premise that I observed my ten green emeralds in the collections of a friend of mine who has a fetish about collecting green things (green bottles, green postage stamps, green gemstones, and so forth), then my argument becomes less cogent or perhaps not cogent at all.

Deductive logic, as well as being non-empirical, is monotonic. You cannot make a valid argument invalid by adding a premise. (Here I ignore the relevance logicians, who think that any valid argument can be invalidated by adding the negation of a premise as another premise.) Inductive logic is non-monotonic. An ‘inductively valid’ or cogent argument can be made invalid or non-cogent by adding a premise. Deductivists prefer to keep logic and empirical science separate. They stick to deductive logic, monotonous and monotonic though it may be. Are they just stick-in-the-muds, or worse, closet logical positivists? Why not go in for inductive logic, which is exciting rather than monotonous and monotonic? Yet as we have seen, it is child’s play to do without induction, and to reconstruct so-called inductive arguments as hypothetico-deductive ones. So why is everybody except Karl Popper reluctant to do that? Why does everybody believe in induction, and in ampliative reasoning?
The answer lies in the fact that our deductivist reconstructions of so-called inductive or ampliative arguments turn them into hypothetico-deductive arguments, whose missing premises are hypotheses of one kind or another — Cadillacs are American cars, Bachelors are unhappy, Unobserved cases resemble observed cases, All emeralds have the same colour, Instances of chemical properties of substances are uniform, Observed Xs are typical or representative Xs, and so forth. Hypothetico-deductive reasoning is no use to us if we want to justify the conclusions we reach. (It is perfectly good, however, if we want just to arrive at interesting new hypothetical conclusions, if we want a logic of discovery. I shall return to this.) If our interest is justification, then why not render hypotheses invisible by resisting deductivist reconstructions? But this is just to hide the problem of induction, not to solve it.

WITTGENSTEINIAN INSTRUMENTALISM

Mill inaugurated the view that general hypotheses are not premises of our arguments, but rules by which we infer particulars from particulars. The logical positivists said the same thing. They read in Wittgenstein’s Tractatus (1921):

Suppose I am given all elementary propositions: then I can simply ask what propositions I can construct out of them. And then I have all propositions, and that fixes their limits. (4.51)
216
Alan Musgrave
A proposition is a truth-function of elementary propositions. (5) All propositions are the results of truth-operations on elementary propositions. (5.3) All truth-functions are the results of successive applications to elementary propositions of a finite number of truth-operations. (5.32)

Schlick took the ‘elementary propositions’ to be particular observation statements. As usual, there is some dispute whether this reading was correct. But given that reading, general propositions are not genuine propositions at all. They are not (finite) truth-functions of particular observation statements, and so are not verifiable by observation. Given the verifiability theory of meaning, general propositions are meaningless. Thus Schlick on the general laws of science:

It has often been remarked that, strictly, we can never speak of the absolute verification of a law. . . the above-mentioned fact means that a natural law, in principle, does not have the logical character of a statement, but is, rather, a prescription for the formation of statements. The problem of induction consists in asking for a logical justification of universal statements about reality . . . We recognise, with Hume, that there is no logical justification: there can be none, simply because they are not genuine statements. (Schlick, as translated by Popper 1959: 37, note 7)

And Ramsey:

Variable hypotheticals are not judgements but rules for judging. . . . when we assert a causal law we are asserting not a fact, nor an infinite conjunction, nor a connection of universals, but a variable hypothetical which is not strictly a proposition at all but a formula from which we derive propositions. [Ramsey 1931: 241, 251]

If general statements, including general principles and (putative) laws of science, are not true or false propositions, what are they? What exactly are “prescriptions for the formation of statements” (Schlick) or “rules for judging” (Ramsey)?
Wittgenstein’s Tractatus did not help much, with its vaguely Kantian suggestions: Newtonian mechanics, for example, imposes a unified form on the description of the world. . . . Mechanics determines one form of description of the world . . . (6.341) . . . the possibility of describing the world by means of Newtonian mechanics tells us nothing about the world: but what it does tell us something about is the precise way in which it is possible to describe it by those means. (6.342) The whole modern conception of the world is founded on the illusion that the so-called laws of nature are the explanations of natural phenomena. (6.371)
W. H. Watson, a physicist who sat at the master’s feet, put it thus: It should be clear that the laws of mechanics are the laws of our method of representing mechanical phenomena, and since we actually choose a method of representation when we describe the world, it cannot be that the laws of mechanics say anything about the world. [Watson 1938: 52]; this is parroted by [Hanson, 1969: 325] Toulmin and Hanson attempt to clarify this Kantian view by saying that theories are like maps, and that general laws are like the conventions of map-making or ‘laws of projection’: Our rules of projection control what lines it is permissible to draw on the [map]. Our rules of mechanics control what formulae it is permissible to construct as representing phenomena . . . Perhaps what we have called “the laws of nature” are only the laws of our method of representing nature. Perhaps laws show nothing about the natural world. But it does show something about the world that we have found by experience how accurate pictures of the world . . . can be made with the methods we have learned to use. [Hanson, 1969: 325-6]; this parrots [Toulmin, 1953: 108-9] As well as flirting with these vaguely Kantian suggestions, the Wittgensteinians revert (without knowing it) to Mill’s idea that general statements, including the general principles or laws of science, are ‘material’ rules of non-deductive inference. Ryle insisted that “the most ‘meaty’ and determinate hypothetical statements” like “If today is Monday, then tomorrow is Tuesday” or “Ravens are black” are not premises of arguments but material rules of inference or ‘inference-licences’ [Ryle, 1950, 328]. Harre said the same of empirical generalisations: The natural process of prediction of an instance is to state the instance as a consequence of another instance, for example, that a creature is herbivorous follows from the fact that it’s a rabbit. The justification of this move . . . 
takes us back to the generalization or its corresponding conditional . . . These are not premises since they validate but do not belong in the argument that expresses the deduction. It is natural to call them the rules of the deduction. We infer a particular not from a generalization but in accordance with it. [Harre, 1960: 79-80] Toulmin and Hanson say the same of the hypotheses or principles or (putative) laws of science: . . . the role of deduction in physics is not to take us from the more abstract levels of theory to the more concrete . . . Where we make strict, rule-guided inferences in physics is in working out, for instance, where a planet will be next week from a knowledge of its present position, velocity, and so on: this inference is not deduced from the laws of
motion, but drawn in accordance with them, that is, as an application of them. [Toulmin, 1953, 84-5]; see also [Hanson, 1969, 337-8]

The idea that universal statements about reality are not genuine statements at all enabled Schlick to solve, or rather sidestep, the problem of induction. Watson agreed:

It seems that the expression ‘the correct law of nature’ is not a proper grammatical expression because, not knowing how to establish the truth of a statement employing this form of speech, we have not given it a meaning. [Watson, 1938, 51]; see also [Hanson, 1969, 324]

But this just hides the problem — it does not solve it. If observation cannot establish the truth of a universal statement, then neither can observation establish the soundness of a material rule of inference. Humean sceptical questions about the certainty of general hypotheses or the reliability of predictions drawn from them can simply be rephrased as sceptical questions about the usefulness of inference-licences or the reliability of predictions drawn according to them. If answering Hume was the aim, it has not been achieved. The other arguments or motivations for the inference-licence view are equally broken-backed (as I show in my [1980]). I said earlier that automobile logic is silly, and asked whether any serious philosopher has gone in for anything like it. Well, as we have seen, Wittgenstein and his followers went in for it, on a massive scale. (I leave the reader to judge whether they count as serious philosophers.) I also said earlier that the widespread belief in non-deductive logic stems from the view that deductive logic is useless either as a logic of discovery or as a logic of justification. Well, let us see.
‘LOGIC OF DISCOVERY’ — DEDUCTIVE OR INDUCTIVE?

The distinction between the contexts of discovery and justification is due to the logical positivists and Popper. They were sceptical about there being any logic of discovery. They regarded the ‘context of discovery’ as belonging to the province of psychology rather than logic. Popper famously declared (1959: 31): “The initial stage, the act of conceiving or inventing a theory, seems to me neither to call for logical analysis nor to be susceptible of it”. That this statement occurs in a book called The Logic of Scientific Discovery has astonished many readers. The oddity can be partially relieved. ‘Discover’ is a success-word. One cannot discover that the moon is made of green cheese, because it isn’t. To discover that p one must come up with the hypothesis that p, or guess that p, and then show that p is true. It is consistent to maintain that the initial ‘guessing’ stage is not susceptible of logical analysis, and that logic only plays a role in the second stage, where we show that p is true, prove it or justify it. This only partially relieves the oddity of Popper’s claim, because he famously claims that there is no proving or justifying
our hypotheses either. All he gives us in The Logic of Scientific Discovery is a logical analysis of the process of empirical testing. Because ‘discover’ is a success word, it is odd to speak of discovering a false hypothesis. It would be better to speak, not of the context of discovery, but of the context of invention. Then we can separate the question of inventing a hypothesis from the question of justifying it. But ‘context of justification’ is not a happy phrase either, at least in Popper’s case. He thinks that while scientists can rationally evaluate or appraise hypotheses, they can never justify or prove them. So as not to beg the question against that view, it would be better to speak of the context of appraisal. These terminological suggestions are due to Robert McLaughlin [1982, p. 71]. Were the positivists and Popper right that there is no logic of invention, no logical analysis of the initial stage of inventing a hypothesis? No. People do not typically invent hypotheses at random or through flashes of mystical intuition or in their dreams. People typically invent new hypotheses by reason or argument. But, the pervasive thought is, these reasonings or arguments cannot be deductive, for the conclusion of a valid deduction contains nothing new. Hence we need an inductive or ampliative logic of invention (discovery). But that, too, is wrong. We already saw that it is child’s play to reconstruct inductive generalisation, singular predictive inference, enumerative induction, demonstrative induction, and eliminative induction as valid deductions. And when we did that, we did not say whether we were reconstructing ‘discovery arguments’ or ‘justification arguments’. Let us suppose the former, and revisit one trivial example. Suppose you want to know what colour emeralds are. Do you lie on your couch, close your eyes, and somehow dream up conjectures that you will then subject to test? No. 
You observe an emerald and perform a trivial ‘demonstrative induction’ (deduction): [Emeralds share a colour.] This emerald is green. Therefore, all emeralds are green. Your major premise, perhaps left unstated, is a presupposition of the question “What colour are emeralds?”. Here is another trivial example of the same thing. Suppose you want to know what the relationship is between two measurable quantities P and Q. You have the hunch that it might be linear, or decide to try a linear relationship first. Do you lie on your couch, think up some linear equations between P and Q (there are infinitely many of them!), and then put them to the test? No. You make a couple of measurements and perform a trivial deduction: [P = aQ + b, for some a and b.] When Q = 0, P = 3, so that b = 3. When Q = 1, P = 10, so that a = 7. Therefore, P = 7Q + 3.
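The trivial deduction above can be sketched in code. This is my own illustration, not the author's: with the suppressed major premise P = aQ + b in place, two measurements fix a and b deductively, and nothing ampliative happens.

```python
# A minimal sketch (not in the text) of 'curve-fitting' as deduction:
# the hidden major premise is that P = a*Q + b for some a and b; two
# (Q, P) measurements then determine a and b by mere calculation.

def fit_linear(q1, p1, q2, p2):
    """Deduce a and b in P = a*Q + b from two (Q, P) measurements."""
    a = (p2 - p1) / (q2 - q1)  # slope fixed by the two observations
    b = p1 - a * q1            # intercept then follows
    return a, b

# The text's example: when Q = 0, P = 3; when Q = 1, P = 10.
a, b = fit_linear(0, 3, 1, 10)
print(f"P = {a:g}Q + {b:g}")  # P = 7Q + 3
```

The inventive work, on the deductivist reading, lies entirely in choosing the linear form of the major premise, not in the calculation that follows.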
This is called ‘curve-fitting’. It is supposed to be induction. But of course, it is really deduction. These are trivial examples of the ‘logic of invention (discovery)’. Other examples are less trivial. Newton spoke of arriving at scientific theories by deduction from the phenomena. Newton was right to speak of deduction here, not of induction, abduction, or anything like that. He was wrong to speak of deduction from phenomena alone. The premises of his arguments do not just contain statements of the observed phenomena. They also contain general metaphysical principles, heuristic principles, hunches. Newton first called them ‘Hypotheses’. Then, anxious to make it seem that there was nothing hypothetical in his work, he rechristened them ‘Rules of Reasoning in Philosophy’. (As we can see, disguising hypothetical premises as rules of ampliative reasoning, as in automobile logic, has a fine pedigree: it goes all the way back to Newton!) Newton had four ‘Rules of Reasoning’:

RULE I
We are to admit no more causes of natural things than such as are both true and sufficient to explain their appearances. To this purpose the philosophers say that Nature does nothing in vain, and more is vain, when less will serve; for Nature is pleased with simplicity, and affects not the pomp of superfluous causes.

RULE II
Therefore to the same natural effects we must, as far as possible, assign the same causes.

RULE III
The qualities of bodies, which neither admit intensification nor remission of degrees, and which are found to belong to all bodies within the reach of our experiments, are to be esteemed the universal qualities of all bodies whatsoever.

RULE IV
In experimental philosophy we are to look upon propositions inferred by general induction from phenomena as accurately or very nearly true, notwithstanding any contrary hypotheses that may be imagined, till such time as other phenomena occur, by which they may be made either more accurate, or liable to exceptions.
This rule we must follow, that the argument of induction may not be evaded by hypotheses. (Principia, Book III; Newton 1934, 398-400.) Rule III enables Newton to arrive at the law of universal gravitation: Lastly, if it universally appears, by experiments and astronomical observations, that all bodies about the earth gravitate towards the earth,
and that in proportion to the quantity of matter that they severally contain; that the moon likewise, according to the quantity of its matter, gravitates toward the earth; that, on the other hand, our sea gravitates toward the moon; and all the planets one toward another; and the comets in like manner toward the sun: we must, in consequence of this rule, universally allow that all bodies whatsoever are endowed with a principle of mutual gravitation . . . [Newton, 1934, 399] Here Newton lists the ‘phenomena’ that experiment and astronomical observation have revealed to him. (These ‘phenomena’ are highly theory-laden, of course, but that is not the issue here.) He then applies Rule III “The qualities . . . which are found to belong to all bodies within the reach of our experiments, are . . . the universal qualities of all bodies whatsoever” and deduces “all bodies whatsoever are endowed with a principle of mutual gravitation”. So, Newton did deduce things, but not just from ‘phenomena’, also from general metaphysical principles disguised as ‘Rules of Reasoning’. Or so it seems — as we will see shortly, careful reading reveals another interpretation, in which the ‘Rules of Reasoning’ are not metaphysical principles at all but rather epistemic principles. Here what matters is that once we spell out Newton’s so-called ‘Rules of Reasoning’ as explicit premises, whether metaphysical or epistemic, his arguments all become deductive. Newtonian deduction from the phenomena is ubiquitous in science. There is now quite a body of literature analysing real episodes from the history of science and demonstrating the fact. 
Examples include Cavendish’s deduction of the electrostatic inverse square law (see [Dorling, 1973a; 1973b]), Einstein’s deduction of the photon hypothesis (see [Dorling, 1971]), Rutherford’s deduction of the Rutherford model of the atom (see [McLaughlin, 1982; Musgrave, 1989]), and Einstein’s deductions of the special and general theories of relativity (see [Zahar, 1973; 1983]). Sometimes the major and often ‘missing’ premises of these deductions are general metaphysical principles like Newton’s. Sometimes they are more specific hypotheses that make up the ‘hard core’ of a particular scientific research programme. Imre Lakatos and his followers have produced many case-studies of the latter kind (see the papers collected in [Howson, 1973; Latsis, 1976]). What all the examples show is that there is a logic of discovery, despite positivist/Popperian orthodoxy, and that it is deductive logic, despite philosophic orthodoxy in general. What of the argument that logic of discovery must be non-deductive or ampliative because discovery is by definition coming up with something new and the conclusion of a valid deduction contains nothing new? Here we must distinguish logical novelty from psychological novelty. True, the conclusion of a valid deduction is not ‘logically new’, which is just a fancy way of saying that it is logically contained in the premises. But the conclusion of a valid deduction can be psychologically new. We can be surprised to discover the consequences of our assumptions. When Wittgenstein said that in logic there are no surprises, he was just wrong: Hobbes was astonished that Pythagoras’s Theorem could be deduced from Euclid’s axioms. Moreover, the conclusion of a valid deduction can have
interesting new properties not possessed by any of the premises taken singly — being empirically falsifiable, for example. Nor, finally, do deductivist reconstructions of inventive arguments take all the inventiveness out of them and render hypothesis-generation a matter of dull routine. The originality or inventiveness lies in assembling a new combination of premises that may yield a surprising conclusion. It also lies in obtaining that conclusion, which in interesting cases is no trivial or routine task. The positivists and Popper were wrong. There is a logic of invention (discovery). And it is deductive logic, or is best reconstructed as such. It will be objected that hypothetico-deductive inventive arguments are only as good as their premises. And further, that the ‘missing’ premises of inventive arguments, the general metaphysical or heuristic principles that lie behind them, are very often false. It is not true that unobserved cases always resemble observed cases, that the relationship between two measurable quantities is always linear, that Nature is simple, that like causes have like effects, and like effects like causes, and so forth. Moreover, the objector might continue, scientists know this. Is it plausible to think that scientists argue from premises that they know to be false? There is no evading this objection by viewing the arguments as non-deductive arguments which proceed according to rules of inductive reasoning. It is equally implausible to think that scientists reason according to rules that they know to be unsound. Of course, that a hypothetico-deductive inventive argument contains a premise that is false, or at least not known to be true, is fatal to the idea that the argument proves or establishes its conclusion. But we should not mix up discovery and proof, or the logic of invention and the logic of justification. There is nothing wrong with getting new hypotheses from general heuristic principles that are not known to be true.
There is not even anything wrong with getting new hypotheses from general principles that are known to be false, though they have some true cases. This may fail to convince. If so, we can if we wish recast hypothetico-deductive inventive arguments so that they become arguments that are not only sound but known to be so (at least so far as their general heuristic principles are concerned). To see how, let us return to Newton. So far we have had Newton deducing things, not just from ‘phenomena’, but also from general metaphysical principles disguised as ‘Rules of Reasoning’. But careful reading reveals another interpretation, in which the ‘Rules of Reasoning’ are not metaphysical principles at all but rather epistemic principles, about what we ought to admit, assign, esteem, or look upon to be the case. (It was, so far as I know, John Fox who first drew attention to this reading in his 1999.) On this reading, Newton does not deduce the law of universal gravitation (G) — what he deduces is that we must “allow that” or “esteem that” or “look upon it that” G is the case. In short, Newton’s conclusion is that it is reasonable for us to conjecture that G. And his ‘Rules of Reasoning’ are general epistemic or heuristic principles like “It is reasonable for us to conjecture that the qualities . . . which are found to belong to all bodies within the reach of our experiments, are . . . the universal qualities of all bodies whatsoever”. This
epistemic principle is, I submit, true and known to be true. Its truth is not impugned by the fact that a conjecture reached by employing it might subsequently get refuted. If we reasonably conjecture something and later find it to be false, we find out that our conjecture is wrong, not that we were wrong to have conjectured it. This epistemic interpretation makes sense of Newton’s Rule IV, in which Newton admits that any conclusion licensed by or reached from his first three Rules might be refuted. The purpose of Rule IV is to deny that sceptical proliferation of alternative hypotheses counts as genuine criticism (“This rule we must follow, that the argument of induction may not be evaded by hypotheses”). This is not trivial. It is important to see that the sceptical proliferation of alternative hypotheses is no criticism of any hypothesis we might have. It is only an excellent criticism of the claim that the hypothesis we have is proved or established by the data which led us to it. Return to so-called ‘inductive generalisation’, for example: All observed emeralds were green. Therefore, all emeralds are green. We validated this argument by spelling out its missing premise, to obtain: [Unobserved cases resemble observed cases.] All observed emeralds were green. Therefore, all emeralds are green. We then objected that this missing premise is a piece of false and human chauvinistic metaphysics, and that nothing is gained by replacing obvious invalidity by equally obvious unsoundness. But here is a better deductivist reconstruction of the argument: [It is reasonable to conjecture that unobserved cases resemble observed cases.] All observed emeralds were green. Therefore, it is reasonable to conjecture that all emeralds are green. This argument does not, of course, establish that emeralds are all green. But our interest here is invention (discovery), not proof. 
Or consider analogical reasoning, another alleged pattern of inductive inference which in its simplest form goes like this: a and b share property P. a also has property Q. Therefore, b also has property Q. We might validate this by adding an obviously false metaphysical missing premise: [If a and b share property P, and a also has property Q, then b also has property Q.] a and b share property P.
a also has property Q. Therefore, b also has property Q. But if our interest is invention rather than proof, we can reconstruct analogical reasoning as a sound deductive argument: [If a and b share property P, and a also has property Q, then it is reasonable to conjecture that b also has property Q.] a and b share property P. a also has property Q. Therefore, it is reasonable to conjecture that b also has property Q.

I submit that the missing premises of these deductivist reconstructions, premises about what it is reasonable for us to conjecture, are true and known to be true. That might be disputed. These missing premises simply acknowledge that scientists are inveterate ‘generalisers from experience’. The same applies to ordinary folk in the common affairs of life. Small children who have once burned themselves on a hot radiator do not repeat the experiment — they jump to conclusions and avoid touching the radiator again. The same applies to animals, as well. Popper tells the nice story of the anti-smoking puppy who had a lighted cigarette held under his nose. He did not like it, and after that one nasty experience he always ran away sneezing from anything that looked remotely like a cigarette. It seems that ‘jumping to conclusions’ is ‘hard-wired’ into us, part of the hypothesis-generating ‘software’ that Mother Nature (a.k.a. Natural Selection) has provided us with. Sometimes the ‘hard-wiring’ runs deep, being built into the perceptual system, and may be quite specific. A famous example concerns the visual system of the frog, which contains special mechanisms for detecting flies. A fly gets too close to a frog and triggers the mechanism, whereupon the frog catches and eats the fly. The frog’s eye is specially designed (by Natural Selection) for detecting flies. Similar discoveries have been made about the eyes of monkeys. Monkey eyes have special cells or visual pathways that are triggered by monkey hands. Of course, all this is unconscious.
But if we adopt Dennett’s ‘intentional stance’ and attribute ‘as if’ beliefs to frogs and monkeys, we can see that they form beliefs on the basis of visual stimuli combined with general principles that are hard-wired into their visual systems. The beliefs that are formed in this way may be false. Experimenters can fool the frog into trying to eat a small metal object introduced into the visual field and moved around jerkily as a fly might move. Experimenters can fool baby monkeys into reaching out for a cardboard cut-out shaped roughly like a monkey-hand. But is it reasonable for us to proceed in this way? Philosophers have long been aware of inbuilt generalising tendencies. Francis Bacon said: “The human understanding is of its own nature prone to suppose the existence of more order and regularity in the world than it finds” (Novum Organum, Book I, Aphorism xlv). Bacon deplored this and tried to get rid of it. Hume deplored it, thought it could not be got rid of, and deemed us all irrational. But in deploring our generalising tendencies, Bacon and Hume mixed up invention (discovery) and proof (appraisal).
Hypotheses arrived at by ‘jumping to conclusions’ are not thereby shown to be true. But we need to navigate ourselves around the world, and forming beliefs in this way is a perfectly reasonable way to begin. Does this mean that positivist-Popperian orthodoxy was basically correct? Is the context of invention (wrongly ascribed to the province of psychology and deemed incapable of logical analysis, but no matter) irrelevant to the context of appraisal? No. To describe a conjecture as a reasonable conjecture is to make a minimal appraisal of it. And such minimal appraisals are important. Philosophers tediously and correctly point out that infinitely many possible hypotheses are consistent with any finite body of data. Infinitely many curves can be drawn through any finite number of data-points. We observe nothing but green emeralds and hypothesise that all emeralds are green — why not hypothesise that they are grue or grack or grurple? Scientists, not to mention plain folk, are unimpressed with the philosopher’s point, and do not even consider the gruesome hypotheses produced in support of it. What enables them to narrow their intellectual horizons in this way, and is it reasonable for them to do so? What enables them to do it are epistemic principles about reasonable conjecturing, which are, so far as we know, true. These principles are not necessarily true. We can imagine possible worlds in which they would deliver more false hypotheses than true ones, and thus be unreliable. But just as it may be reasonable to persist in a false belief until it is shown to be false, so also it may be reasonable to persist in an unreliable belief-producing mechanism until it is shown to be unreliable. And nobody has shown that the belief-producing mechanisms I have been discussing are unreliable. So much for the context of invention (discovery). I have resisted the idea that the logic of invention (discovery) must be inductive or ampliative.
But what about the context of justification? Surely justification requires ampliative reasoning. Which brings me to the last, and most important, reason for the widespread belief in inductive logic.

‘LOGIC OF JUSTIFICATION’ — DEDUCTIVE OR INDUCTIVE?

People reason or argue not just to arrive at new beliefs, or to invent new hypotheses. People also argue for what they believe, reason in order to give reasons for what they believe. In short, people reason or argue to show that they know stuff. Knowledge is not the same as belief, not even the same as true belief — knowledge is justified true belief. People reason or argue to justify their beliefs. Seen from this perspective, deductive arguments are sadly lacking. To be sure, in a valid deductive argument the premises are a conclusive reason for the conclusion — if the premises are true, the conclusion must be true as well. But if we want to justify a belief, producing a valid deductive argument for that belief is always question-begging. The argument “C, therefore C” is as rigorously valid as an argument can be. But it is circular, and obviously question-begging. Non-circular valid deductive arguments for C simply beg the question in a less obvious way. Moreover, the premises of a non-circular valid argument for C might be false,
226
Alan Musgrave
even if C is true. Such is the case with some of our deductivist reconstructions of arguments from experience. An inveterate generaliser observes a few ravens, notices that they are all black, and jumps to the conclusion that all ravens are black. His argument is invalid. But it does not help to validate it by adding the false inductive principle that unobserved cases always resemble observed cases. He might have tacitly assumed that principle to arrive at his new belief — remember, he is an inveterate generaliser. But if our interest is justification, there is no point replacing obvious invalidity by equally obvious unsoundness. And even where it is not obvious that the heuristic principle is false, spelling it out in a deductivist reconstruction does no good if we want justification. It does no good because it is not known to be true. We want a reason for our conclusion C, and produce a valid deductive argument “P, therefore C” to obtain one. But now we need a reason for the stronger claim P, which logically contains C. And so on, ad infinitum, as sceptics tirelessly and rightly point out. But what if we start from premises for which no further reason is required, premises whose truth we know directly from observation or experience? And what if there are inductive or ampliative arguments from our observational premises to our conclusions? These arguments are not deductively valid, to be sure. Induction is not deduction. But inductive arguments might be cogent; they might give us good though defeasible reasons for their conclusions. We do not need ampliative arguments in the logic of criticism. We might not even need them in the logic of invention (discovery). But we surely do need them in the logic of justification. Or so everybody except Popper and me thinks. We are here confronted with the problem of induction. I think Popper has solved this problem. Let me briefly explain how.
(What follows is a controversial reading of Popper, which is rejected by many self-styled ‘Popperians’. For more details, see my [2004] and [2007].) The key to Popper’s solution is to reject justificationism. What is that? As everybody knows, the term ‘belief’ is ambiguous between the content of a belief, what is believed, the proposition or hypothesis in question, and the act of believing that content. I shall call a belief-content just a ‘belief’, and a belief-act a ‘believing’. Talk of ‘justifying beliefs’ inherits this ambiguity. Do we seek to justify the belief or the believing of it? It is obvious, I think, that we seek to justify believings, not beliefs. One person can be justified in believing what another person is not. I can be justified in believing today what I was not justified in believing yesterday. The ancients were justified in believing that the earth does not move, though of course we are not. Despite these platitudes, justificationism is the view that a justification for a believing must be a justification for the belief. Given justificationism, we must provide some sort of inductive or ampliative logic leading us from evidential premises to evidence-transcending conclusions. At least, we must provide this if any evidence-transcending believings are to be justified believings. But if we reject justificationism, we need no inductive or ampliative reasoning. Our evidence-transcending believings might be justified even though our evidence-transcending beliefs cannot be. Of course, we need a theory of when an evidence-transcending believing is justified. Popper’s general story is
Popper and Hypothetico-deductivism
227
that an evidence-transcending believing is justified if the belief in question has withstood criticism. As we saw, the logic of criticism is entirely deductive. Popper’s critics object that he smuggles in inductive reasoning after all. In saying that having withstood criticism is a reason for believing, Popper must be assuming that it is a reason for belief as well. But these critics smuggle in precisely the justificationist assumption that Popper rejects. This is all terribly abstract. To make it concrete, let us consider abduction, and its intellectual descendant, inference to the best explanation (IBE). Abduction is generally regarded as the second main type of ampliative reasoning (the other being induction). Abduction was first set forth by Charles Sanders Peirce, as follows:

The surprising fact, C, is observed.
But if A were true, C would be a matter of course.
Hence, . . . A is true. [Peirce, 1931-1958, Vol. 5, p. 159]

Here the second premise is a fancy way of saying “A explains C”. By the way, abduction was originally touted, chiefly by Hanson, as a long-neglected contribution to the ‘logic of discovery’. It is no such thing. The explanatory hypothesis A figures in the second premise as well as the conclusion. The argument as a whole does not generate this hypothesis. Rather, it seeks to justify it. The same applies, despite its name, to ‘inference TO the best explanation’ (IBE). Abduction and IBE both belong in the context of appraisal (justification) rather than in the context of invention (discovery). Abduction is invalid. We can validate it by viewing it as an enthymeme and supplying its missing premise “Any explanation of a surprising fact is true”. But this is no use — it merely trades obvious invalidity for equally obvious unsoundness. The missing premise is obviously false.
Nor is any comfort to be derived from weakening it to “Any explanation of a surprising fact is probably true” or to “Any explanation of a surprising fact is approximately true”. (Philosophers have cottage industries devoted to both of these!) It is a surprising fact that marine fossils are found on mountain-tops. One explanation of this is that Martians came and put them there to surprise us. But this explanation is not true, or probably true, or approximately true. IBE attempts to improve upon abduction by requiring that the explanation is the best explanation that we have. It goes like this:

F is a fact.
Hypothesis H explains F.
No available competing hypothesis explains F as well as H does.
Therefore, H is true. [Lycan, 1985, p. 138]

This is better than abduction, but not much better. It is also invalid. We can validate it by viewing it as an enthymeme and supplying its missing premise “The
best available explanation of a (surprising) fact is true”. But this missing premise is also obviously false. Nor, again, will going for probable truth or approximate truth help matters. But wait! Peirce’s original abductive scheme was not quite what we have considered so far. Peirce’s original scheme went like this:

The surprising fact, C, is observed.
But if A were true, C would be a matter of course.
Hence, there is reason to suspect that A is true.

This is also invalid. But to validate it the missing premise we need is “There is reason to suspect that any explanation of a surprising fact is true”. This missing premise is, I suggest, true. After all, the epistemic modifier “There is reason to suspect that . . . ” weakens the claim considerably. In particular, “There is reason to suspect that A is true” can be true even though A is false. So we have not traded obvious invalidity for equally obvious unsoundness. Peirce’s original scheme may be reconstructed so as to be both valid and sound. Why does everybody misread Peirce’s scheme and miss this obvious point? Because everybody accepts justificationism, and assumes that a reason for suspecting that something is true must be a reason for its truth. IBE can be rescued in a similar way. I even suggest a stronger epistemic modifier than “There is reason to suspect that . . . ”, namely “There is reason to believe (tentatively) that . . . ” or equivalently, “It is reasonable to believe (tentatively) that . . . ”. What results when this missing premise is spelled out is:

[It is reasonable to believe that the best available explanation of a fact is true.]
F is a fact.
Hypothesis H explains F.
No available competing hypothesis explains F as well as H does.
Therefore, it is reasonable to believe that H is true.

This is valid, and instances of it might well be sound. Inferences of this kind are employed in the common affairs of life, in detective stories, and in the sciences.
Why does everybody misread IBE and miss this obvious point? Because everybody accepts justificationism, and assumes that a reason for believing that something is true must be a reason for its truth. (The cottage industries devoted to probable truth and approximate truth stem from the same source.) All the criticisms of IBE presuppose justificationism. People object that the best available explanation might be false. Quite so — and so what? It goes without saying that any explanation might be false, in the sense that it is not necessarily true. But it is absurd to suppose that the only things we can reasonably believe are necessary truths. People object that being the best available explanation of a fact does not show that something is true (or probably true or approximately true). Quite so — and again, so what? This assumes the justificationist principle that a reason for believing something must be a reason for what is believed. People
object that the best available explanation might be the “best of a bad lot” and actually be false. Quite so — and again, so what? It can be reasonable to believe a falsehood. Of course, if we subsequently find out that the best available explanation is false, it is no longer reasonable for us to believe it. But what we find out is that what we believed was wrong, not that it was wrong or unreasonable for us to have believed it. What goes for IBE goes for so-called ‘inductive arguments’ in general. They can be turned into sound deductive enthymemes with epistemic principles among their premises and epistemic modifiers prefacing their conclusions. Let us confine ourselves to inductive generalisation or singular predictive inference. In the context of justification we require a stronger epistemic modifier than “It is reasonable to conjecture that . . . ”. We need “It is reasonable to believe that . . . ”. For singular predictive inference we obtain:

[It is reasonable to believe that unobserved cases resemble observed cases.]
All observed emeralds have been green.
Therefore, it is reasonable to believe that the next observed emerald will be green.

Robert Pargetter and John Bigelow (1997) suggest an improved version of this, in which a tacit total evidence assumption is made explicit:

All observed emeralds have been green.
This is all the relevant evidence available.
Therefore, it is reasonable to believe that the next observed emerald will be green.

However, as in the above formulation, Pargetter and Bigelow do not spell out or make explicit the general epistemic principle involved here — “If all observed As have been B, and if this is all the relevant evidence available, then it is reasonable to believe that the next observed A will be B”.
They do not spell this principle out because they regard it as analytically or necessarily true, true by virtue of the meaning of the term ‘reasonable’, so that the argument as it stands is semantically though not logically valid. They say, of arguments like the emeralds argument as set out above:

They are, of course, not formally valid . . . They are valid just in the sense that it is not possible for their premises to be true while their conclusions are false. They are valid in the way that arguments like these are valid: ‘This is red, so it is coloured’, ‘This is square, so it is extended’, and so on. The validity of the emeralds argument rests not just on its logical form but on the nature of rationality. [Pargetter and Bigelow, 1997, p. 70]

Now I do not want to quarrel about whether ‘Anything red is coloured’ or ‘Anything square is extended’ are analytic or necessary truths, as Pargetter and Bigelow
evidently think. But I do wonder whether it is analytic that “If all observed As have been B, and if this is all the relevant evidence available, then it is reasonable to believe that the next observed A will be B”. This principle conflicts with the following Humean justificationist principle: “It is reasonable to believe a conclusion only if your premises establish that it is true or probably true (more likely true than not)”. I do not think this Humean principle is ‘conceptually confused’. So neither do I think the anti-Humean principle an analytic truth. But this is a family quarrel among deductivists, so I shall say no more about it (there is more in my [1999]). Deductivists have a different family quarrel with John Fox. Fox is generally sympathetic to deductivist reconstructions of so-called inductive arguments as deductive arguments with epistemic principles among their premises. He says that one can be a “deductivist without being an extreme inductive sceptic, by holding that the best analysis of why inductive beliefs are rational when they are displays no inferences but deductively valid ones as acceptable”, where the inferences “conclude not to predictions or generalisations, that is, not to inductive beliefs, but to judgements about their reasonableness” [Fox, 1999, pp. 449, 456]. But Fox thinks that this is not enough:

. . . real-life arguers conclude to something further, which is not a deductive consequence of their premises: to the generalisations or predictions themselves. . . . In his primary concern to establish how surprisingly much can be reached simply by deduction, Musgrave seems simply to have overlooked both this further step and its non-deductive character. [Fox, 1999, p. 456]

Fox says that all we need to get from a conclusion of the form “It is reasonable to believe that P” to P itself is a further very simple non-deductive inference which he calls an epistemic syllogism. Examples are:

It is reasonable to believe that P. Therefore, P.
One should accept that P. Therefore, P.

Epistemic syllogisms are obviously invalid, and could only be validated by invoking absurd metaphysical principles like “Anything that it is reasonable to believe is true” or “Anything that one should accept as true is true”. But here Fox makes an ingenious suggestion. He does not try to validate epistemic syllogisms, but he does think that they can be trivially ‘vindicated’. To vindicate an argument is to show that, given its premise(s), it is reasonable to accept its conclusion. The premise of the epistemic syllogism is that it is reasonable to believe that P. If this is correct, then trivially it is reasonable to conclude, further, that P. Which vindicates epistemic syllogisms: “Indeed, precisely if these deductively drawn conclusions are correct, it is reasonable so to conclude” [Fox, 1999, p. 456].
This is clever — but is it correct? The matter turns on the word ‘conclude’, which is ambiguous between inferring and believing. Fox distinguishes a weak sense of ‘infer’, whereby one infers a conclusion from some premises without coming to believe it, from a strong sense of ‘infer’, whereby “to infer a conclusion from premises is to come to accept it on their basis” [Fox, 1999, p. 451]. I say that you infer in the strong sense if you first infer in the weak sense and then, as a result of having made that inference, come to accept or believe the conclusion. This can happen. But being caused to believe a conclusion, by inferring it from premise(s) that you believe, is not some special ‘strong’ kind of inferring. Making the inference is one mental act, coming to believe its conclusion is another. The former can cause the latter. But coming to believe something is not the conclusion of the inference, it is the effect of making it. Aristotle’s so-called ‘practical syllogism’, whose premises are statements and whose conclusion is an action, is an oxymoron. Fox agrees, but thinks his epistemic syllogisms are different:

Aristotle’s ‘practical syllogism’ was not an inference at all. Its ‘premise’ was a proposition, to the effect that one should do x; its conclusion was the action of doing x. When the premise is that one should accept p, coming to accept p is doing just what the premise says one should, the ‘conclusion’ of an Aristotelian practical syllogism. But doing this is precisely (strongly) inferring in accordance with the pattern I vindicated above. Because here inference is involved, the term ‘syllogism’ is more apt than in most practical syllogisms. [Fox, 1999, p. 451]

I can see no difference between Aristotle’s practical syllogism and Fox’s epistemic syllogism. Both involve or are preceded by inferences. In Aristotle’s case, you infer that you should do x from some premise(s).
In the epistemic case, you infer that you should accept or believe P from some premise(s). The further steps, actually doing x or accepting P, are actions rather than the conclusions of arguments. Fox’s ‘vindication’ of his epistemic syllogisms seems trivial: from the premise “It is reasonable to believe that P” the conclusion “It is reasonable to believe that P” trivially follows. But “It is reasonable to believe that P” does not say that any way of arguing for P is reasonable. It says nothing about any way of arguing for P — it speaks only of P. In particular, it does not say that “It is reasonable to believe that P. Therefore, P” is a reasonable way to argue for P. Why does Fox think his obviously invalid epistemic syllogisms are necessary? Why does he think that “real life arguers” need to argue, not just that it is reasonable to believe some evidence-transcending hypothesis, but also for that hypothesis itself? Well, if you assume that a reason for believing P must be a reason for P itself, then you will need to invoke epistemic syllogisms to get you (invalidly) from a reason for believing P to a reason for P. But we should get rid of that justificationist assumption.
GETTING STARTED — ‘FOUNDATIONAL BELIEFS’

In discussing induction, I talked of evidence and evidence-transcending hypotheses. And in discussing abduction and IBE, I talked of having ‘facts’ that require explanation. What is the source of this evidence or of these facts? There are two main sources, sense-experience and testimony. Justificationism bedevils discussion of these matters, too. My nose itches and I scratch it. The itch causes (or helps cause) the scratching. The itch is also a reason for the scratching (or part of the reason). In cases like this, we are happy with the thought that causes of actions are reasons for them. The experience (the itch) is both a cause and a reason for the action (the scratching). I see a tree and I form the belief that there is a tree in front of me. The tree-experience causes (or helps cause) the believing. Is the tree-experience also a reason (or part of the reason) for the believing? The two cases seem symmetrical. Yet many philosophers treat them differently. Many philosophers are unhappy with the thought that the tree-experience is both a cause and a reason for the tree-believing. Why the asymmetry? Justificationism lies behind it. Justificationism says that a reason for believing something must be a reason for what is believed. What is believed is a statement or proposition. Only another proposition can be a reason for a proposition. But perceptual experiences are not propositions, any more than itches or tickles are. So my tree-experience cannot be a reason for my tree-belief, and cannot be a reason for my tree-believing either. If we reject justificationism, we can allow that perceptual experiences are reasons as well as causes of perceptual believings (though not, of course, for the perceptual beliefs, the propositions believed). We can even allow that they are good reasons. They are not conclusive reasons, of course, but defeasible ones. There is the ever-present possibility of illusion or hallucination.
The tree-belief transcends the tree-experience, and future experiences may indicate that it is false. Still, it is reasonable to “trust your senses” unless you have a specific reason not to. In support of this, we can regard perceptual belief as a case of IBE. A simple example, formulated in the usual way, is: “I see a cat in the corner of the room. The best explanation of this is that there is a cat in the corner of the room. Therefore, there is a cat in the corner of the room”. But this formulation is wrong. The question was not “Why is there a cat in the corner of the room?”, but rather “Why do you believe that there is a cat in the corner of the room?”. What we are trying to justify or give a reason for is not the statement that there is a cat in the corner of the room, but rather my coming to believe this. So the conclusion ought to be “It is reasonable to believe that there is a cat in the corner of the room”. And the missing premise required to convert the argument into a perfectly valid deduction is “It is reasonable to believe the best explanation of any fact”. Of course, my reasonable perceptual belief might turn out to be false. If evidence comes in of hallucination or some less radical kind of perceptual error, I may concede that my perceptual belief was wrong — but that does not mean that I was wrong to have believed it.
Much the same applies to testimony. Somebody tells me something and I come to believe it. Is the testimony a reason as well as a cause for my believing? Many philosophers are unhappy with the thought that it is. Again, justificationism lies behind this. My hearing the testimony is not a proposition, any more than an itch or a tickle is. So my hearing the testimony cannot be a reason for my belief, and cannot, if justificationism is right, be a reason for my believing either. If we reject justificationism, we can allow that testimony is a reason as well as a cause of believing (though not, of course, for what is believed). We can even allow that it is a good reason. It is not a conclusive reason, of course, but a defeasible one. There is the ever-present possibility that my informant is misinformed or even lying to me. Future experience may indicate that the belief I acquired from testimony is false. Still, it is reasonable to “trust what other folk tell you” unless you have a specific reason not to. These reflections on the role of sense-experience and testimony are really no more than common sense. These ‘sources of knowledge’ — or rather, sources of reasonable believings — are simply ways of getting started. Sense-experience and testimony yield foundational beliefs, ‘foundational’ not in the sense that they are certain and incorrigible but only in the sense that they do not arise by inference from other beliefs. (For more on all this, see my [2009].) And so I conclude. We do not need inductive or ampliative logic anywhere — not in the context of criticism, not in the context of invention, and not in the context of appraisal either.

BIBLIOGRAPHY

[Achinstein, 2006] P. Achinstein. Mill’s Sins, or Mayo’s Errors?, in D. Mayo and A. Spanos, eds., Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science, London: Cambridge University Press, 2009.
[Dorling, 1971] J. Dorling. Einstein’s Introduction of Photons: Argument by Analogy or Deduction from the Phenomena?, British Journal for the Philosophy of Science, 22, 1-8, 1971.
[Dorling, 1973a] J. Dorling. Henry Cavendish’s Deduction of the Electrostatic Inverse Square Law from the Result of a Single Experiment, Studies in History and Philosophy of Science, 4, 327-348, 1973.
[Dorling, 1973b] J. Dorling. Demonstrative Induction: Its Significant Role in the History of Physics, Philosophy of Science, 40, 360-372, 1973.
[Fox, 1999] J. Fox. Deductivism Surpassed, Australasian Journal of Philosophy, 77, 447-464, 1999.
[Hanson, 1969] N. R. Hanson. Perception and Discovery, San Francisco, CA: Freeman, Cooper & Co, 1969.
[Harre, 1960] R. Harre. An Introduction to the Logic of the Sciences, London: Macmillan & Co, 1960.
[Howson, 1976] C. Howson, ed. Method and Appraisal in the Physical Sciences, London: Cambridge University Press, 1976.
[Latsis, 1976] S. J. Latsis, ed. Method and Appraisal in Economics, London: Cambridge University Press, 1976.
[Lycan, 1985] W. Lycan. Epistemic Value, Synthese, 64, 137-164, 1985.
[McLaughlin, 1982] R. McLaughlin. Invention and Appraisal, in R. McLaughlin (ed), What? Where? When? Why?: Essays on Induction, Space and Time, and Explanation, Dordrecht: Reidel, 69-100, 1982.
[Mill, 1843] J. S. Mill. A System of Logic Ratiocinative and Inductive, London: Longmans, Green and Co, 1843.
[Musgrave, 1980] A. E. Musgrave. Wittgensteinian Instrumentalism, Theoria, 46, 65-105, 1980. [Reprinted in his Essays on Realism and Rationalism, Amsterdam — Atlanta, GA: Rodopi, 1999, 71-105.]
[Musgrave, 1989] A. E. Musgrave. Deductive Heuristics, in K. Gavroglu et al. (eds), Imre Lakatos and Theories of Scientific Change, Dordrecht/Boston/London: Kluwer Academic Publishers, 15-32, 1989.
[Musgrave, 1999] A. E. Musgrave. How To Do Without Inductive Logic, Science and Education, 8, 395-412, 1999.
[Musgrave, 2004] A. E. Musgrave. How Popper [might have] solved the problem of induction, Philosophy, 79, 19-31, 2004. [Reprinted in Karl Popper: Critical Assessments of Leading Philosophers. Anthony O’Hear (ed). London: Routledge (2003), Volume II, 140-151; and in Karl Popper: Critical Appraisals. P. Catton and G. Macdonald (eds). London: Routledge (2004), 16-27.]
[Musgrave, 2007] A. E. Musgrave. Critical Rationalism, in E. Suarez-Iniguez (ed), The Power of Argumentation (Poznan Studies in the Philosophy of the Sciences and the Humanities, vol. 93), Amsterdam/New York, NY: Rodopi, 171-211, 2007.
[Musgrave, 2009] A. E. Musgrave. Experience and Perceptual Belief, in Z. Parusnikova & R. S. Cohen (eds), Rethinking Popper (Boston Studies in the Philosophy of Science), Springer Science & Business Media, 5-19, 2009.
[Newton, 1934] I. Newton. Sir Isaac Newton’s Mathematical Principles of Natural Philosophy and his System of the World, Motte’s translation, revised by Cajori, Berkeley & Los Angeles: University of California Press, 1934.
[Pargetter and Bigelow, 1997] R. Pargetter and J. Bigelow. The Validation of Induction, Australasian Journal of Philosophy, 75, 62-76, 1997.
[Peirce, 1931-1958] C. S. Peirce. The Collected Papers of Charles Sanders Peirce, ed. C. Hartshorne & P. Weiss, Cambridge, MA: Harvard University Press, 1931-1958.
[Popper, 1959] K. R. Popper. The Logic of Scientific Discovery, London: Hutchinson & Sons, 1959.
[Popper, 1963] K. R. Popper. Conjectures and Refutations, London: Routledge & Kegan Paul, 1963.
[Ramsey, 1931] F. P. Ramsey. The Foundations of Mathematics, London: Routledge & Kegan Paul, 1931.
[Ryle, 1950] G. Ryle. “If”, “So”, and “Because”, in M. Black (ed), Philosophical Analysis, New York: Cornell University Press, 323-340, 1950.
[Toulmin, 1953] S. E. Toulmin. Philosophy of Science: An Introduction, London: Hutchinson & Co, 1953.
[Watson, 1938] W. H. Watson. On Understanding Science, London: Cambridge University Press, 1938.
[Wittgenstein, 1961] L. Wittgenstein. Tractatus Logico-Philosophicus, translated by D. F. Pears & B. F. McGuinness, London: Routledge & Kegan Paul, 1961.
[Zahar, 1973] E. G. Zahar. Why did Einstein’s Programme supersede Lorentz’s?, British Journal for the Philosophy of Science, 24, 95-123 & 223-262, 1973.
[Zahar, 1983] E. G. Zahar. Logic of Discovery or Psychology of Invention?, British Journal for the Philosophy of Science, 34, 243-261, 1983.
HEMPEL AND THE PARADOXES OF CONFIRMATION

Jan Sprenger

1 TOWARDS A LOGIC OF CONFIRMATION
The beginning of modern philosophy of science is generally associated with the label of logical empiricism, in particular with the members of the Vienna Circle. Some of them, such as Frank, Hahn and Neurath, were themselves scientists; others, such as Carnap and Schlick, were philosophers, but deeply impressed by the scientific revolutions at the beginning of the 20th century. All of them were united in admiration for the systematicity and enduring success of science. This affected their philosophical views and led to a sharp break with the “metaphysical” philosophical tradition and to a re-invention of empiricist epistemology with a strong emphasis on science, our best source of high-level knowledge. Indeed, the members of the Vienna Circle were scientifically trained and used to the scientific method of test and observation. For them, metaphysical claims were neither verifiable nor falsifiable through empirical methods, and therefore neither true nor false, but meaningless. Proper philosophical analysis had to separate senseless (metaphysical) from meaningful (empirical) claims and to investigate our most reliable source of knowledge: science.1 The latter task included the development of formal frameworks for discovering the logic of scientific method and progress. Rudolf Carnap, who devoted much of his work to this task, was in fact one of the most influential figures of the Vienna Circle. In 1930, Carnap and the Berlin philosopher Hans Reichenbach took over the journal ‘Annalen der Philosophie’ and renamed it ‘Erkenntnis’. Under that name, it became a major publication organ for the works of the logical empiricists. The German-Austrian collaboration in the editorial board was no matter of chance: congenial to the Vienna group, several like-minded researchers based in Berlin gathered in the ‘Berlin Society for Empirical Philosophy’, among them Reichenbach.
It was here that a young German student of mathematics, physics and philosophy — Carl Gustav Hempel — got into contact with empiricist philosophy. At the 1929 conference on the epistemology of the exact sciences in Berlin, he got to know Carnap and soon moved to Vienna himself. Nevertheless, he obtained his doctoral degree in Berlin in 1934, but, faced with Nazi rule, Hempel soon opted for emigration and later became Carnap’s assistant at the University of Chicago. Thus it is no matter of chance that the contents of Carnap’s and Hempel’s philosophy are so close to each other. Similar to Carnap, Hempel was interested in the logic of science and in particular in the problem of inductive inference. Similar to Carnap, Hempel thought that the introduction of rigorous methods would help us to establish a logic of induction and confirmation. Carnap’s life project consisted in developing a probabilistic logic of induction (Carnap [1950], cf. [Zabell, 2009]), similar to a deductive calculus for truth-preserving inferences. Indeed, the success of the calculus of deductive logic suggests a similar calculus for inductive, ampliative inferences that could be applied to the confirmation of scientific hypotheses. Having a logic of confirmation would thus contribute to the central aims of empiricist philosophy: to understand the progress and success of science, and in particular the replacement of old theories by new ones and the testability of abstract hypotheses by empirical observations. While in principle cherishing Carnap’s probabilistic work in that area, Hempel had some subtle methodological reservations: prior to explicating the concept of confirmation in a probabilistic framework, we are supposed to clarify our qualitative concept of confirmation and to develop general adequacy criteria for an explication of confirmation. Therefore my essay deals less with probabilistic than with qualitative approaches to confirmation theory in modern philosophy of science. Hempel’s main contribution, the essay ‘Studies in the Logic of Confirmation’, was published in 1945, right after the first pioneering works in the area (e.g. [Hossiasson-Lindenbaum, 1940]), but before Carnap’s [1950; 1952] major monographs.

1 Cf. [Friedman, 1999; Uebel, 2006].

Handbook of the History of Logic. Volume 10: Inductive Logic. Volume editors: Dov M. Gabbay, Stephan Hartmann and John Woods. General editors: Dov M. Gabbay and John Woods. © 2011 Elsevier B.V. All rights reserved.
On the way, we will also stumble over Hempel's famous paradoxes of confirmation, which pose, or so I will argue, a great challenge for any account of confirmation.2 Let us begin with some preliminary thoughts. In science, confirmation becomes an issue whenever science interacts with the world, especially when scientific hypotheses are subjected to empirical tests. Where exactly can a logic of induction and confirmation help us? Hempel distinguishes three stages of empirical testing [Hempel, 1945/1965, pp. 40-41]: First, we design, set up and carefully conduct scientific experiments, we try to avoid misleading observations, double-check the data, clean them up and finally bring them into a canonical form that we can use in

2 Confirmation is generally thought to hold between a hypothesis and pieces of evidence — a piece of evidence proves, confirms, undermines, refutes or is irrelevant to a hypothesis. At first sight, it sounds plausible to think of confirmation as a semantic relation between a scientific theory on the one side and a real-world object on the other side. For instance, a black raven seems to confirm the hypothesis that all ravens are black. But recall that we would like to assimilate confirmation theory to deductive logic and to find a system of syntactic rules for valid inductive inference. Therefore we should frame the evidence in sentences of a (formal) language, in order to gain access to powerful logical tools, e.g. checking deducibility and consistency relations between evidence and hypothesis. Thus, Hempel argues, a purely semantic account of confirmation is inadequate. We should set up a syntactic relation between hypothesis and evidence where both relata are (sets of) first-order sentences (cf. [Hempel, 1945/1965, pp. 21-22]). When I nevertheless say that 'a black raven confirms hypothesis H', this is just a matter of convenience and is shorthand for the corresponding observation report 'there is a black raven'.
Hempel and the Paradoxes of Confirmation
237
the next stage.3 In the second stage, these data are brought to bear on the hypothesis at stake — do they constitute supporting or undermining evidence? Third and last, the hypothesis is re-assessed on the basis of a judgment of confirmation or disconfirmation: we decide to accept it, to reject it or to suspend judgment and to collect further evidence. Of these three stages, only the second is, or so Hempel argues, accessible to a logical analysis: the first and the third stage are full of pragmatically loaded decisions, e.g. which experiment to conduct, how to screen off the data against external nuisance factors, or which strength of evidence is required for accepting a hypothesis. Evidently, those processes cannot be represented by purely formal means. Things are different for the second stage, which compares observational sentences (in which the evidence is framed) with theoretical sentences which represent the hypothesis or theory. This is the point where logical tools can help to analyze the relation between both kinds of sentences and to set up criteria for successful scientific confirmation. A fundamental objection against a logic of confirmation holds that scientists frequently disagree whether an empirical finding really confirms a theoretical hypothesis, and this phenomenon is too common to ascribe it to irrationality on behalf of the researchers. Common scientific sense may not be able to decide such questions, first because the case under scrutiny might be very complicated and second, because people might have different ideas of common sense in a specific case. Formal criteria of confirmation can help to settle such disputes, and once again, it is helpful to consider the analogy to deductive logic. For each valid deductive inference, there is a deduction of the conclusion from the premises together with the logical axioms (that is the content of the completeness theorem for first-order logic).
Hence, in case there is a disagreement about the validity of a deductive inference, the formal tools can help us to settle the question. In the same way that the validity of a deductive inference can be checked using formal tools (deductions), it is desirable to have formal tools which examine the validity of an inductive inference. Sometimes this project is deemed futile because scientists do not always make their criteria of confirmation explicit. But that objection conflates a logical with a psychological point [Hempel, 1945/1965, pp. 9-10] — the lack of explicit confirmation criteria in scientific practice does not refute their existence. The objection merely shows that if such criteria exist, scientists are often not aware of them. But since scientists in general make consistent judgments on evidential relevance, in spite of all disagreement about special cases, this is still a fruitful project. Confirmation theory thus aims at a rational reconstruction of inductive practice that is not only descriptively adequate, but also able to correct methodological mistakes in science. Thus confirmation theory is vastly more than a remote philosophical subdiscipline; it is actually a proper part of the foundations of science, in the very spirit of logical empiricism. Later works in which debates about proper scientific method intersect with confirmation-theoretic problems (e.g. [Royall, 1997]) vindicate this view. Let us now review Hempel's pioneering work.

3 Suppes [1962/1969] refers to this activity as building 'models of data'.
2 ADEQUACY CRITERIA
Carnap and Hempel both worked on an explication of confirmation, but their methods were quite different. While Carnap connected confirmation to probability by proposing 'degree of confirmation' as an interpretation of probability, Hempel pursued a non-probabilistic approach which precedes the quantitative analysis. His method can be described thus: At the beginning, general considerations yield adequacy criteria for every sensible account of confirmation [Hempel, 1945/1965, pp. 30-33], considerably narrowing down the space of admissible accounts. Out of the remaining accounts, Hempel selects the one that also captures a core intuition about confirmation, namely that hypotheses are confirmed by their instances. Let us now see which criteria Hempel develops. The first criterion which he suggests is the Entailment Condition (EnC): If the observation report E logically implies the hypothesis H, then E confirms H. For example, if the hypothesis reads 'there are white ravens' then, obviously, the observation of a white raven proves it and, a fortiori, confirms it: Logical implication is the strongest possible form of evidential support. So the Entailment Condition sounds very reasonable. Furthermore, if a theory is confirmed by a piece of evidence, it seems strange to deny that consequences of the theory are confirmed by the evidence as well. For instance, if observations confirm Newton's law of gravitation, they should confirm Kepler's laws, too, since the latter's predictions have to agree with the gravitation law. In other words, we demand satisfaction of the Consequence Condition (CC): If an observation report E confirms every member of a set of sentences S, then it confirms every consequence of S (i.e. every sentence H for which S |= H). In fact, the consequence condition is quite powerful, and several natural adequacy criteria follow from it.
For instance, the Equivalence Condition (EC): If H and H′ are logically equivalent sentences, then the observation report E confirms H if and only if E confirms H′.4 It is straightforward to see that (EC) follows from (CC): If a sentence H is confirmed by E and H′ is equivalent to H, then H′ is a logical consequence of {H} and the Consequence Condition can be applied, yielding that H′ is also confirmed by E, and vice versa.

4 This condition can naturally be extended to a condition for the evidence, asserting that the confirmation relation is invariant under replacing the evidence statement by logically equivalent statements.
Certainly, the equivalence condition is a minimal constraint on any account of confirmation. We have already said that scientific hypotheses are usually framed in the logical vocabulary of first-order logic (or a reduct thereof). That allows us to state them in different, but logically equivalent forms.5 The idea of the equivalence condition is that 'saying the same with different words' does not make a difference with regard to relations of confirmation and support: Hypotheses which express the same content in different words are equally supported and undermined by a piece of evidence, independent of the chosen formulation. To see this in more detail, note that for deductive relations, the Equivalence Condition holds by definition: If A logically implies B, A also implies any B′ that is logically equivalent to B. An account of confirmation should contain relations of deduction and entailment as special cases: If an observation entailed the negation of a hypothesis, in other words, if the hypothesis were falsified by actual evidence, this would equally speak against all equivalent versions and formulations of that hypothesis. Deduction and logical entailment do not distinguish between equivalent sentences, and logical and mathematical axiomatizations are typical of the modern exact sciences (e.g. the propagation of sound is described by a general theory of mechanical waves). If the Equivalence Condition did not hold, the degree of support which a hypothesis receives would depend on the specific formulation of the hypothesis. But that would run counter to all efforts to introduce exact mathematical methods into science, thereby making scientific analysis more precise, and ultimately more successful. Obviously, the Consequence Condition also implies the Special Consequence Condition (SCC): If an observation report E confirms a hypothesis H, then it confirms every consequence of H.
However, there is an important confirmation intuition that contradicts (SCC) and stems from the link between prediction, test and confirmation. When a theory makes a prediction and this prediction is indeed observed, those observations lend empirical support to the theory. Abstract theories, like the General Theory of Relativity (GTR), are often not directly testable. We have to focus on parts of them and to use those parts for deriving observational consequences. This agrees with falsificationist methodology (e.g. [Popper, 1963]) — we derive conjectures and predictions from a theory and test them against the empirical world. For instance, Eddington's observations of the solar eclipse in 1919 did not prove GTR, but merely confirmed one of its predictions — namely the bending of light by massive bodies. Had the outcome been different, GTR (or one of the auxiliary assumptions) would have been falsified. Evidently, the stronger a theory, the higher its predictive power. In particular, if the theory T predicts the observation sentence E, E is also a prediction of any stronger theory T′. This line of reasoning suggests the Converse Consequence Condition (CCC): If an observation report E confirms a hypothesis H, then it confirms every hypothesis H′ that logically implies H (i.e. H′ |= H).

5 For instance, the definition of compactness for sets of real numbers can be stated in topological or in analytical terms.

Obviously, the Converse Consequence Condition (CCC) stands in sharp contrast to the Special Consequence Condition (SCC). Indeed, accepting both adequacy conditions at once would trivialize the concept of confirmation: Every observation report E trivially implies itself, so by (EnC), E confirms E. By (CCC), E also confirms E.H for any hypothesis H, since E.H logically implies E. Since E.H implies H and is confirmed by E, E confirms H by (SCC). Note that this derivation holds for an arbitrary hypothesis H and arbitrary observations E! Our paradoxical result reveals that we have to make a decision between the prediction/observation-based scheme of inference (CCC) and the 'conservative' (SCC). Hempel believed that the idea of predictive confirmation expressed in (CCC) is not an adequate image of confirmation in science. True, general laws such as the law of gravitation are tested by observable consequences, such as the planetary motions. Indeed, successful tests of Kepler's three laws are also believed to support the law of gravitation. But the evidence transfers from Kepler's laws to the gravitation law because it is also an instance of the gravitation law — and not because the law of gravitation is logically stronger than Kepler's laws. For instance, even the hypothesis 'There is extraterrestrial life and Kepler's laws hold' is logically stronger than Kepler's laws alone, but we would not like to say that this hypothesis can be confirmed by, let's say, observing the orbit of Jupiter. If (CCC) is accepted, any hypothesis whatsoever (X) can be tacked onto the confirmed hypothesis (H), and the new hypothesis H.X is still confirmed by the evidence. These are the paradoxes of hypothetico-deductive confirmation, the tacking paradoxes.6 Moreover, (CCC) licenses the confirmation of mutually incompatible hypotheses: Let H be confirmed by E. Then both H.X and H.¬X are, according to (CCC), confirmed by E.
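The trivialization argument can be replayed mechanically in a toy propositional semantics (a hedged illustration, not part of Hempel's text: sentences are modelled as truth-functions over two invented atoms, `e` and `h`, and entailment as model inclusion):

```python
from itertools import product

# Toy semantics: a sentence is a truth-function over the atoms; its models
# are the valuations that make it true, and entailment is model inclusion.
ATOMS = ("e", "h")
WORLDS = [dict(zip(ATOMS, vals)) for vals in product([True, False], repeat=len(ATOMS))]

def models(sentence):
    return [w for w in WORLDS if sentence(w)]

def entails(a, b):
    return all(b(w) for w in models(a))

E = lambda w: w["e"]                # the observation report
H = lambda w: w["h"]                # an arbitrary, unrelated hypothesis
EH = lambda w: w["e"] and w["h"]    # the conjunction E.H

# Step 1 (EnC): E |= E, so E confirms E.
assert entails(E, E)
# Step 2 (CCC): E.H |= E, so E also confirms the stronger sentence E.H.
assert entails(EH, E)
# Step 3 (SCC): E.H |= H, so E confirms H -- for any H whatsoever.
assert entails(EH, H)
print("E confirms the arbitrary hypothesis H")
```

Since the atoms `e` and `h` were chosen with no logical connection at all, the three steps go through for any evidence and any hypothesis, which is exactly the trivialization.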
This sounds strange and arbitrary — the content of X is not at all relevant to our case, and if both hypotheses (H.X and H.¬X) are equally confirmed, it is not clear what we should believe in the end. There are now two ways to proceed: Either we restrict (CCC) to logically stronger hypotheses that stand in a relevance relation to the evidence; then the paradoxes vanish. Several authors have tried to mitigate the paradoxes of hypothetico-deductive confirmation along these lines, namely by the additional requirements that the tacked hypothesis H.X or H.¬X be a content part of the hypothesis [Gemes, 1993] or that the inference to the evidence be 'premise-relevant' [Schurz, 1991]. So arbitrary hypotheses are no longer confirmed together with H. Hempel, however, chooses the other way — he rejects (CCC) in favor of (SCC). Contradictory hypotheses should, or so he argues, not be confirmed by one and the same piece of evidence, in opposition to (CCC). We can put this view into another adequacy condition: hypotheses confirmed by a piece of evidence E must be consistent with each other. Consistency Condition (CnC): If an observation report E confirms the hypotheses H and H′, then H is logically consistent with H′ (i.e. there is at least one model of H that is also a model of H′). Finally, we summarize the three conditions that are essential to Hempel's account:

1. Entailment Condition (EnC): If E |= H, then E confirms H.

2. Consequence Condition (CC): If E confirms S and S |= H, then E confirms H. (Note: (CC) contains the Equivalence Condition (EC) and the Special Consequence Condition (SCC) as special cases.)

3. Consistency Condition (CnC): If E confirms H and H′, then H is logically consistent with H′.

6 Cf. [Musgrave, 2009; Weisberg, 2009].
3 THE SATISFACTION CRITERION
What should we demand of a piece of evidence in order to confirm a hypothesis? In general, logical entailment between evidence and hypothesis is too strong as a necessary criterion for confirmation. In particular, if the hypothesis is a universal conditional, no finite set of observations will ever be able to prove the hypothesis. But the evidence should certainly agree with those parts of the hypothesis that it is able to verify. Hempel suggests that, if an observation report says something about the singular terms a, b and c, the claims a hypothesis makes about a, b and c should be satisfied by the evidence. From such an observation report we could conclude that the hypothesis is true of the class of objects that occur in E. That is all we can demand of a confirming observation report, or so Hempel argues. In other words, we gain instances of a hypothesis from the evidence, and such instances confirm the hypothesis. To make this informal idea more precise, we have to introduce some definitions (partly taken from [Gemes, 2006]):

DEFINITION 1. An atomic well-formed formula (wff) β is relevant to a wff α if and only if there is some model M of α such that: if M′ differs from M only in the truth value assigned to β, then M′ is not a model of α.

So intuitively, β is relevant to α if in at least one model of α the truth value of β cannot be changed without making α false. Now we can define the domain (or scope) of a wff:

DEFINITION 2. The domain of a well-formed formula α, denoted by dom(α), is the set of singular terms which occur in the atomic (!) well-formed formulas (wffs) of L that are relevant to α.

For example, the domain of F a.F b is {a, b} whereas the domain of F a.Ga is {a}, and the domain of ∀x : F x is the set of all singular terms of the logical language. In other words, quantifiers are treated substitutionally. The domain of a formula is
thus the set of singular terms about which something is asserted. Those singular terms are said to occur essentially in the formula:

DEFINITION 3. A singular term a occurs essentially in a formula β if and only if a is in the domain of β.

So, for example, a occurs essentially in F a.F b, but not in (F a ∨ ¬F a).F b. Now we are interested in the development of a formula for the domain of another formula.

DEFINITION 4. The development of a formula H for a formula E, H|E, is the restriction of H to the domain of E, i.e. to all singular terms that occur essentially in E.7

For instance, (∀x : F x)|{a,b} is F a.F b, and the development of the formula ∀x : F x for F a.Ga.Gb is F a.F b. Now we have the technical prerequisites for understanding Hempel's satisfaction criterion: The evidence does not entail the hypothesis directly; rather, it entails the restriction of the hypothesis to the domain of the evidence.

DEFINITION 5. (Satisfaction criterion) A piece of evidence E directly Hempel-confirms a hypothesis H if and only if E entails the development of H for the domain of E. In other words, E |= H|dom(E).

DEFINITION 6. (Hempel-confirmation) A piece of evidence E Hempel-confirms a hypothesis H if and only if H is entailed by a set of sentences Γ such that every sentence φ ∈ Γ is directly Hempel-confirmed by E.

There are also formulations of those criteria that refer to a body of knowledge which provides the background for evaluating the confirmation relation (e.g. our current theory of physics). We do not need that for illustrating Hempel's basic idea, but background information plays a crucial role in contrasting a hypothesis with empirical observations, as illustrated by the Duhem problem:8 Does the failure of a scientific test speak against the hypothesis or against the auxiliary assumptions which we need for connecting the evidence to the hypothesis? Therefore we give a formulation of Hempel's satisfaction criterion which includes background knowledge.

DEFINITION 7. (Satisfaction criterion, triadic formulation) A piece of evidence E directly Hempel-confirms a hypothesis H relative to background knowledge K if and only if E and K jointly entail the development of H for the domain of E. In other words, E.K |= H|dom(E).9

DEFINITION 8. (Hempel-confirmation, triadic formulation) A piece of evidence E Hempel-confirms a hypothesis H relative to K if and only if H is entailed by a set of sentences Γ such that every sentence φ ∈ Γ is directly Hempel-confirmed by E relative to K.

7 The development of a formula can be defined precisely by a recursive definition, cf. [Hempel, 1943]. For our purposes, the informal version is sufficient.

8 Cf. [Duhem, 1914].

9 Cf. [Hempel, 1945/1965, pp. 36-37].
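For the monadic fragment, the satisfaction criterion of Definition 5 can be sketched by brute force (an illustrative toy, not Hempel's recursive definition of the development; the predicate letters R, B and the constant names are invented for the example):

```python
from itertools import product

PREDICATES = ("R", "B")  # read: 'is a raven', 'is black'

def development(constants):
    # Development of the hypothesis (all x: Rx -> Bx) for a set of constants:
    # the conjunction of Rc -> Bc for each c, reading the quantifier
    # substitutionally as in Definition 4.
    return lambda v: all((not v[("R", c)]) or v[("B", c)] for c in constants)

def entails(evidence, formula, constants):
    # E |= F iff every valuation of the atoms that satisfies the evidence
    # literals also satisfies F.
    atoms = [(p, c) for p in PREDICATES for c in constants]
    for vals in product([True, False], repeat=len(atoms)):
        v = dict(zip(atoms, vals))
        if all(v[a] == truth for a, truth in evidence.items()) and not formula(v):
            return False
    return True

# A black raven 'a': E = Ra.Ba with dom(E) = {a}.
black_raven = {("R", "a"): True, ("B", "a"): True}
print(entails(black_raven, development({"a"}), {"a"}))   # True: direct Hempel-confirmation

# A non-black raven falsifies the development, so no confirmation.
white_raven = {("R", "a"): True, ("B", "a"): False}
print(entails(white_raven, development({"a"}), {"a"}))   # False
```

The background-knowledge variant of Definition 7 would simply add the literals of K to those of E before the entailment check.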
For example, F a (directly) Hempel-confirms the hypothesis ∀x : F x. Obviously, every piece of evidence that directly Hempel-confirms a hypothesis also Hempel-confirms it, but not vice versa. It is easy to see that any sentence that follows from a set of Hempel-confirmed sentences is Hempel-confirmed, too.10 Hence, Hempel's confirmation criterion satisfies the Consequence Condition. The same holds true of the Consistency Condition. Indeed, Hempel's proposal satisfies his own adequacy conditions. Moreover, many intuitively clear cases of confirmation are successfully reconstructed in Hempel's account. However, one can raise several objections against Hempel, some of which were anticipated by Hempel himself in a postscript to “Studies in the Logic of Confirmation”. First, some hypotheses do not have finite developments and are therefore not confirmable. Take the hypothesis H2 = (∀x : ¬Gxx).(∀x : ∃y : Gxy).(∀x, y, z : Gxy.Gyz → Gxz), which asserts that G is a serial, irreflexive and transitive two-place relation. These properties entail that H2 is not satisfiable in any finite structure and thus not Hempel-confirmable by a finite number of observations. But certainly, H2 is not meaningless — you might interpret G as the 'greater than' relation and then the natural numbers with their ordinary ordinal structure are a model of H2. Read like this, H2 asserts that the 'greater than' relation is transitive and irreflexive, and that for any natural number there is another natural number which is greater than it. It is strange that such hypotheses are not confirmable on Hempel's account. The problem is perhaps purely technical, but it is nevertheless embarrassing. Second, consider c, an individual constant of our predicate language, and the hypotheses H3 = ∀x : Ix and H4 = ∀x : (x ≠ c → ¬Lx). Take the set of all planets of the solar system as the universe of our intended structure and let the individual constant c refer to Planet Earth.
Then H3 might be interpreted as the claim that iron exists on all planets and H4 as the claim that no life exists on other planets. Both are meaningful hypotheses open to empirical investigation. Now, the observation report E = Ic (there is iron on Earth) directly Hempel-confirms H3.H4 (there is iron on all planets and life does not exist on other planets) relative to empty background knowledge.11 While this may still be acceptable, it also follows that H4 is Hempel-confirmed by E = Ic, due to the Special Consequence Condition. This is utterly strange since the actual observation (there is iron on Earth) is completely independent of the hypothesis at stake (no life exists on other planets). Clearly, this conclusion goes beyond what the available evidence entitles us to infer. More embarrassingly, this type of inference can be generalized to other examples, too.12

10 Assume that S |= H where S is Hempel-confirmed by E. Then there is a set Γ such that every element of Γ is directly Hempel-confirmed by E and Γ |= S. Since by assumption S |= H, it follows that Γ |= H, too. Thus H is Hempel-confirmed by E.

11 The development of H3.H4 with regard to c is Ic.

12 Cf. [Earman and Salmon, 1992].
These technical problems may be mitigated in refined formulations of Hempel-confirmation, but there are more fundamental problems, too. They are, in a similar vein, connected to the fact that Hempel-confirmation satisfies the Special Consequence Condition. When a hypothesis H is Hempel-confirmed by a piece of evidence E (relative to K), any arbitrary disjunction X can be tacked onto H while leaving the confirmation relation intact. For example, the hypothesis that all ravens are black or all doves are white is Hempel-confirmed by the observation of a black raven, although it is not clear to what extent that observation is relevant to the hypothesis that all doves are white. Even worse, the same observation also confirms the hypothesis that all ravens are black or no doves are white. The tacked disjunction is completely arbitrary. Evidential relevance for the hypothesis gets lost, but a good account of confirmation should take care of these relations. Finally, consider the following case: A single card is drawn from a standard deck. We do not know which card it is. Compare, however, the two hypotheses that the card is the ace of diamonds (H5) and that the card is a red card (H6). Now, the person who draws the card tells us that the card is either the ace or the king of diamonds. Obviously, the hypothesis H6 is entailed by the evidence (both candidate cards are diamonds, hence red) and thus Hempel-confirmed. But what about H5? We are now much more confident that H5 is true because the evidence favors the hypothesis that the card is the ace of diamonds over the hypothesis that the card is no ace of diamonds, in the usual relative sense of confirmation. However, the observation does not Hempel-confirm the hypothesis that the card is the ace of diamonds. This is so because not all assertions H5 makes about this particular card — that it is an ace and a diamond — are satisfied by the observation report.
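The card example can be made quantitative with elementary probability (a sketch of the relative, probability-raising sense of confirmation; the figures simply follow from a uniform distribution over the deck and are not in Hempel's text):

```python
from fractions import Fraction

DECK = 52  # uniform prior over a standard deck of cards

p_h5 = Fraction(1, DECK)    # H5: the card is the ace of diamonds
p_h6 = Fraction(26, DECK)   # H6: the card is red

# Evidence E: the card is the ace or the king of diamonds (two cards).
p_e = Fraction(2, DECK)

# Conditionalization: P(H | E) = P(H and E) / P(E).
p_h5_given_e = Fraction(1, DECK) / p_e   # the ace of diamonds is one of the two candidates
p_h6_given_e = Fraction(2, DECK) / p_e   # both candidates are diamonds, hence red

print(p_h5, "->", p_h5_given_e)   # 1/52 -> 1/2
print(p_h6, "->", p_h6_given_e)   # 1/2 -> 1
```

So the evidence raises the probability of H5 from 1/52 to 1/2, a dramatic boost in the relative sense, even though H5 is not Hempel-confirmed.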
This behavior of Hempel-confirmation is awkward and stands in contrast to the most popular quantitative account of confirmation, the Bayesian account. Our toy example has analogues in science, too: it is not possible to Hempel-confirm the conjunction of Kepler's three laws by confirming only one of its three components. Any confirming observation report would have to entail each of Kepler's laws (with regard to the planet that is observed). This is strange, to say the least, because we often cannot check each prediction of a theory. To give an example, an observation of the diffraction pattern of light apparently confirms the hypothesis that light is an electromagnetic wave. But waves have more characteristic properties than just exhibiting a diffraction pattern — in particular, properties that are not shown in our particular observation. Partial confirmation thus becomes difficult on a Hempelian view of confirmation. Hence, Hempel's satisfaction criterion is not only liable to severe technical objections, but also fails to reconstruct an important line of thought in scientific observation and experimentation. Thus, the above objections illuminate not only technical shortcomings of Hempel's account, but also a general uneasiness with the Consequence Condition and the Special Consequence Condition. But why did they seem to be so plausible at first sight? I believe, following Carnap [1950], that the missing distinction between the absolute and the relative concept of confirmation is the culprit.13

13 A discussion of that criticism which is more charitable towards Hempel can be found in [Huber, 2008].

We often
say that a certain theory is well confirmed, but we also say that a certain piece of evidence confirms a hypothesis. These two different usages correspond to different meanings of the word 'confirmation'. When we use the former way of speaking — 'theory T is well confirmed' — we say something about a particular theory: T enjoys high confidence, the total available evidence speaks for T and favors it over all serious rivals. To be confirmed or to be well confirmed becomes a property of a particular hypothesis or theory. By contrast, the latter use says something about a relationship between hypothesis and evidence — it is asked whether a piece of evidence supports or undermines a hypothesis. Relative confirmation means that an empirical finding, a piece of evidence, lends support to a hypothesis or theory. This need, however, not imply that on account of the total available evidence, the theory is highly credible. The Consequence Condition is plausible whenever absolute confirmation is examined. When a strong, comprehensive theory is strongly endorsed — in the sense of 'highly plausible' or 'empirically supported beyond all reasonable doubt' — any part of this theory is also highly plausible, in agreement with (CC) and (SCC). Obviously, the less risky a conjecture is, the more confidence we can put in it, and any proper part of a theory is logically weaker and thus less risky than the entire theory. Therefore the Consequence Condition makes perfect sense for degrees of belief and conviction, i.e. when it comes to endorsement and absolute confirmation. It is, however, highly questionable whether the Consequence Condition is also a sensible condition with regard to relative confirmation. Here, the evidence has to be informative with respect to the hypothesis under test. For instance, Eddington's observations of the 1919 eclipse apparently confirmed the hypothesis that light is bent by massive bodies such as the sun.
The General Theory of Relativity (GTR), the overarching theory, was at that time still fiercely contested, and the agreement of Eddington's observations with GTR and their discrepancy from the Newtonian predictions constituted key evidence in favor of GTR. (The bending effect in the GTR predictions was roughly twice as high as in Newtonian theory.) But it would be much more controversial to claim — as (CC) does — that Eddington's observations directly confirmed those parts of GTR that were remote from the bending-of-light effect, e.g. the gravitational redshift which was confirmed in the Pound-Rebka experiment in 1959. Confirmation does not automatically transmit to other sub-parts of an overarching theory, as vindicated in the (probabilistic) analysis by Dietrich and Moretti [2005]. Thus, we are well advised to drop the Consequence Condition. A similar criticism can be directed against the Consistency Condition, since any coherent and unified theory that was in agreement with Eddington's observations would have been confirmed by them, even though such theories may be mutually incompatible. In a postscript to “Studies in the Logic of Confirmation” that appeared in 1965, Hempel admitted some of the problems of his account. In particular, he felt uncomfortable about the Consistency Condition, which he thought to be too strong to figure as a necessary condition for (relative) confirmation. Thus, the satisfaction criterion is too narrow as a qualitative definition of confirmation. This concession suggests that Hempel actually spotted the problem of combining the concepts of
relative and absolute confirmation in a single account (cf. [Huber, 2008]). But Hempel [1945/1965, p. 50] still contends that his adequacy conditions may be sufficient for a definition of confirmation. However, the next section will come up with a telling counterexample — a case of spurious confirmation which the satisfaction criterion fails to discern.
4 THE RAVEN PARADOX
Hypotheses about natural laws and natural kinds are often formulated in the form of universal conditionals. For instance, the assertion that all F's are G's (H = ∀x : F x → Gx) suits hypotheses like 'all planets have elliptical orbits', 'all ravens are black' or 'all cats are predators'. How are such claims confirmed? There is a longstanding tradition in philosophy of science that stresses the importance of instances in the confirmation of universal conditionals, running from Nicod [1925/1961] via Hempel [1945/1965] to Glymour [1980]. A confirming instance consists in the observation of an F that is also a G (F a.Ga), whereas an observation of an F that is not a G (F a.¬Ga) refutes H. According to Nicod, only these two kinds of observation — 'confirmation' and 'infirmation' — are relevant to the hypothesis. L'induction par l'infirmation (induction by infirmation) proceeds by refuting and eliminating other candidate hypotheses; l'induction par la confirmation (induction by confirmation) supports a hypothesis by finding its instances. There is, however, an important asymmetry [Nicod, 1925/1961, pp. 23-25]: while observing a non-black raven refutes the raven hypothesis once and for all, observing a black raven does not permit such a conclusive inference. Nicod adds that it is not the sheer number of instances that is decisive, but the variety of instances which can be accrued in favor of the hypothesis. If we try to put this idea of instance confirmation into a single condition, we might arrive at the following: Nicod Condition (NC): For a hypothesis of the form H = ∀x : Rx → Bx and an individual constant a, an observation report of the form Ra.Ba confirms H. However, this account does not seem to exhaust the ways in which a hypothesis can be confirmed. Recall the Equivalence Condition (EC): If H and H′ are logically equivalent sentences, then E confirms H if and only if E confirms H′. As already argued, the equivalence condition is an uncontroversial constraint on a logic of confirmation.
Combining (EC) with Nicod's condition about instance confirmation leads, however, to paradoxical results: Take the hypothesis that nothing that is non-black can be a raven (H′ = ∀x : ¬Bx → ¬Rx). A white shoe is an instance of that hypothesis; thus, observing it counts as a confirming observation report. By the Equivalence Condition, H′ is equivalent to H = ∀x : Rx → Bx, so that a white shoe also confirms the hypothesis that all ravens are black.
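The contraposition step driving the paradox can be checked mechanically. The sketch below (the helper names are mine, not from the text) enumerates all assignments of the predicates 'raven' and 'black' over a small domain and verifies that ∀x : Rx → Bx and ∀x : ¬Bx → ¬Rx hold in exactly the same worlds, and that a white shoe is a Nicod instance of the contrapositive:

```python
from itertools import product

# A "world" assigns each object a pair (is_raven, is_black).

def all_ravens_black(world):
    # H = ∀x : Rx → Bx
    return all(black for raven, black in world if raven)

def no_nonblack_ravens(world):
    # H′ = ∀x : ¬Bx → ¬Rx
    return all(not raven for raven, black in world if not black)

# The two formulations agree in every possible world over a 3-object domain:
worlds = product(product([False, True], repeat=2), repeat=3)
assert all(all_ravens_black(w) == no_nonblack_ravens(w) for w in worlds)

# A white shoe (non-raven, non-black) is a Nicod instance of H′, so by (EC)
# its observation "confirms" H as well -- the paradoxical conclusion.
white_shoe = (False, False)   # (is_raven, is_black)
assert no_nonblack_ravens([white_shoe])
```

The equivalence holds only for the material conditional; as footnote 14 below notes, contraposition fails for subjunctive conditionals.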
Hempel and the Paradoxes of Confirmation
247
But a white shoe seems to be utterly irrelevant to the color of ravens. Hence, we have three individually plausible, but jointly incompatible claims, at least one of which has to be rejected:
1. Nicod Condition (NC): For a hypothesis of the form H = ∀x : Rx → Bx and any individual constant a, an observation report of the form Ra.Ba confirms H.
2. Equivalence Condition (EC): If H and H′ are logically equivalent sentences then E confirms H relative to K if and only if E confirms H′ relative to K.
3. Confirmation Intuition (CI): A hypothesis of the form H = ∀x : Rx → Bx is not confirmed by an observation report of the form ¬Ra.¬Ba.
This set of jointly inconsistent claims constitutes the paradox of confirmation and was first discussed in detail by Hempel [1945/1965].14 The main conflict consists in the fact that (EC) and (NC) merely consider the logical form of scientific hypotheses, whereas (CI) implicitly assumes that there is an 'intended domain' of a scientific hypothesis. In particular, only ravens seem to be evidentially relevant to the hypothesis that all ravens are black. One option to dissolve the paradox, discussed (and rejected) by Hempel [1945/1965], consists in re-interpreting the hypothesis. General natural laws in the form of universal conditionals apparently confer existential import on the tentative hypotheses: 'All ravens are black' could be read as 'all ravens are black and there exists at least one raven'. Then there is no inconsistency between the above three claims. But that proposal is not convincing. The observation of a single black raven provides conclusive evidence in favor of the second part of the hypothesis. As Alexander [1958, p. 230] has pointed out, we will then focus on confirming or undermining the first part of the hypothesis ('all ravens are black') as soon as a black raven has been observed. Hence, the paradox appears again. Interpreting the raven hypothesis as having existential import does not remove the problem.
Before going into the details of attempted solutions it is interesting to note a line of thought that can be traced back to Hempel himself. "If the given evidence E [...] is black [Ba], then E may reasonably be said to even confirm the hypothesis that all objects are black [∀x : Bx], and a fortiori, E supports the weaker assertion that all ravens are black [H = ∀x : Rx → Bx]."15 We can transfer this argument in a canonical way to non-ravens (cf. [Goodman, 1983, pp. 70-71]): 14 Note that the inconsistency vanishes if the conditionals are interpreted as subjunctive and not as material conditionals: contraposition is not a valid form of inference for subjunctive conditionals. 15 [Hempel, 1945/1965, p. 20].
248
Jan Sprenger
If the given evidence E is a non-raven [¬Ra], then E may reasonably be said to even confirm that all objects are non-ravens [∀x : ¬Rx], and a fortiori, E supports the weaker assertion that all non-black objects are non-ravens [∀x : ¬Bx → ¬Rx], i.e. that all ravens are black [H = ∀x : Rx → Bx].16 Thus we obtain another ostensibly decisive argument against (CI). But as remarked by Fitelson [2006], the argument requires additional assumptions. Above all, the step from 'a non-raven confirms H' to 'a black non-raven confirms H' is far from trivial — it rests on the principle of monotonicity that extending the evidence cannot destroy the confirmation relation. Without this additional claim, the above argument would not bear on the observation of a black non-raven. Moreover, Hempel's adequacy condition (SCC) is employed, namely in the transition from 'E confirms ∀x : ¬Rx' to 'E confirms ∀x : ¬Bx → ¬Rx'. We may suspend judgment on monotonicity, but (SCC) is, as seen in the previous section, a controversial condition on relative confirmation. So the above reasoning does not remove the paradox convincingly.17 Hempel suggests that we should learn to live with the paradoxical conclusion. His argument can be paraphrased thus:18 Assume that we observe a grey, formerly unknown bird that is in most relevant external aspects very similar to a raven. That observation puts the raven hypothesis in jeopardy. It might just be the case that we have seen a non-black raven and falsified our hypothesis. But a complex genetic analysis reveals that the bird is not a raven. Indeed, it is more closely related to crows than to ravens. Hence, it seems reasonable to say that the results of the genetic analysis corroborate the raven hypothesis — it was at risk and it has survived a possible falsification. In other words, a potential counterexample has been eliminated.
Thus there is no paradox in saying that an observation report of the form ¬Ra.¬Ba confirms H, in the sense that a satisfies the constraint given by H that nothing can be both a raven and have a color different from black.19 Hempel elaborates the crucial point in more detail, too: Compare two possible observation reports. First, we observe a crow which we know to be a crow and notice that it is grey (E1 = ¬Ba, K1 = ¬Ra). This seems to be a fake experiment 16 I borrow the idea to paraphrase Hempel’s argument in this way from Maher [1999] and Fitelson [2006]. 17 Quine [1969], by contrast, defends (CI) and finds the paradox unacceptable. Since he maintains (EC), too, he is forced to reject the Nicod Condition. Nonetheless, he defends a modified Nicod Condition whose content is restricted to natural kinds. Only instances of natural kinds confirm universal conditionals, and clearly, neither non-ravens nor non-black things count as natural kinds. However, this line of reasoning is subject to the Hempelian criticism explained in the text. 18 [Hempel, 1945/1965] makes the argument for quite a different example (‘all sodium salts burn yellow’) but I would like to stick to the original raven example in order not to confuse the reader. 19 It might now be objected that the observation of a black raven seems to lend stronger support to the raven hypothesis than the observation of a grey crow-like bird since such an observation is more relevant to the raven hypothesis. But this is a problem for a quantitative account of confirmation (we will get back to this in section 5) and not for a qualitative one.
if evaluated with regard to the raven hypothesis — we knew beforehand that a crow could not have been a non-black raven. There was no risk involved in the experimentation, so neither confirmation nor Popperian corroboration could result. In the second case we observe an object about which we do not know anything beforehand and discover that the bird is a grey crow (E2 = ¬Ra.¬Ba, K2 = ∅). That counts as a sound case of confirmation, as argued above. Hempel describes the difference thus: When we are told beforehand that the bird is a crow [...] “this has the consequence that the outcome of the [...] color test becomes entirely irrelevant for the confirmation of the hypothesis and thus can yield no new evidence for us.”20 In other words, the available background knowledge in the two cases makes a crucial difference. Neglecting this difference is responsible for the fallacious belief (CI) that non-black non-ravens cannot confirm the hypothesis that all ravens are black. (CI) is plausible only if we tacitly introduce the additional background knowledge that the test object is no raven. Thus, in the above example, H should be confirmed if we do not know beforehand that the bird under scrutiny is a crow (K2 = ∅) and it should not be confirmed if we know beforehand that the bird is a crow (K1 = ¬Ra). In Hempel’s own words, “If we assume this additional information as given, then, of course, the outcome of the experiment can add no strength to the hypothesis under consideration. But if we are careful to avoid this tacit reference to additional knowledge (which entirely changes the character of the problem) [...] we have to ask: Given some object a [that is neither a raven nor black, but we do not happen to know this, J.S.]: does a constitute confirming evidence for the hypothesis? And now [...] 
it is clear that the answer has to be in the affirmative, and the paradoxes vanish."21 Thus, the paradox is a psychological illusion, created by the tacit introduction of background knowledge into the confirmation relation. From a logical point of view, (CI) reveals itself as plainly false. One of the three premises of the paradox has been discarded. A problem that remains, though, is that the Hempelian resolution does not make clear why ornithologists should go into the forest to check their hypothesis and not randomly note the properties of whatever object they encounter. This might be called the problem of armchair ornithology. In fact, this criticism is raised by Watkins [1957]. Watkins insinuates that Hempel may cheaply confirm H = ∀x : Rx → Bx by summing up observations of non-black non-ravens while sitting in the armchair.22 In a similar vein, Watkins [1957] objects that cases of confirmation such as the observation of a white shoe do not 20 [Hempel,
1945/1965, p. 19]. 21 [Hempel, 1945/1965, pp. 19-20]. 22 A reply to Watkins is [Vincent, 1964].
put the hypothesis to a real test and thus contradict the falsificationist methodology for scientific hypotheses. On the Popperian, falsificationist account, hypotheses can only be corroborated by the survival of severe tests, and observing shoes does not count as a real test of a hypothesis. Second, the 'negation' of observing a white shoe, namely observing a black shoe, would equally confirm the raven hypothesis on Hempel's account. This trivializes the notion of instance confirmation on which Hempel's satisfaction criterion is based. Every universal conditional is automatically confirmed by lots of irrelevant evidence. Watkins concludes that inductivist reasoning about confirmation (such as Hempel's instance confirmation) had better be replaced by a truly falsificationist account. Alexander [1958; 1959] replies to Watkins that falsificationist corroboration also presupposes some kind of inductive reasoning if it is supposed to affect our expectations about the future: If a hypothesis survives several tests, we expect that it will survive future tests, too — otherwise it would not make sense to say that the hypothesis has been corroborated. So Watkins's dismissal of inductive, instance-based reasoning goes too far. Moreover, Hempel makes an important proviso, namely that there be no substantial background assumptions when evaluating the evidential relevance of ¬Ra.¬Ba. If we do not know where to find and where not to find ravens, i.e. if we randomly sample from the class of all objects, the observation of a white shoe does count as a genuine test of the raven hypothesis. One may object that Hempel's proviso is unrealistic for actual cases of (dis)confirmation (cf. [Watkins, 1960]), but conditional on this proviso, Hempel's conclusion — everything that is not a non-black raven supports H = ∀x : Rx → Bx — seems to be correct. So the first objection vanishes.
Second, it is misleading to say that the raven hypothesis is confirmed by conflicting evidence — rather, different kinds of evidence (namely, shoes of different color) equally confirm the hypothesis. Similarly, observing male as well as female black ravens confirms the raven hypothesis. Here, nobody would object that those pieces of evidence are conflicting and therefore inadmissible for confirmation. However, as pointed out by Agassi [1959], Hempel's conclusion is stronger than that a non-black non-raven may confirm the raven hypothesis — it is claimed that this piece of evidence always confirms the raven hypothesis, independent of the background knowledge. Good [1960; 1961] has suggested the following (slightly modified) example to refute that conjecture: The only black middle-sized objects which a child sees are black crows and black ravens. No other black objects occur, and all ravens and crows the child sees are black. Suddenly she discovers a white crow. Then she says: "How surprising! Apparently objects that are supposed to be black can sometimes be white instead."23 And what is good for the goose (crows) is equally good for the gander (ravens). So the child concludes that ravens may be white, too. On Hempel's account, the observation of a non-black crow would support rather than undermine the hypothesis that all ravens are black. Isn't that behavior insensitive to the peculiarities of the specific case? I believe Agassi and Good are on the right track, but they do not fully pin down 23 [Good,
1961, p. 64]. Cf. [Swinburne, 1971].
Hempel’s problem. We may admit that Hempel succeeds in explaining away the paradoxical nature of the problem. But his own satisfaction criterion fails to resolve the paradox. Remember Hempel’s diagnosis that tacitly introduced or deliberately suppressed background information is the source of the paradox. While perfectly agreeing with Hempel on this point, Fitelson and Hawthorne [2009] point out that Hempel is unable to make that difference in his own theory of confirmation. The reason is that his account is in general monotone with regard to the background knowledge: As long as the domain of the evidence is not extended (i.e. no individual constants are added), additional background knowledge cannot destroy the confirmation relation. Hempel inherits this property from deductive logic, because E.K |= H|dom(E) is the crucial condition for direct Hempel-confirmation, and thus also for Hempel-confirmation. Evidently, logical entailment is preserved under adding additional conditions to the antecedens. Therefore Hempel’s own account yields confirmation even if the background knowledge is far too strong. In the first case (we do not know beforehand that a is no raven) confirmation follows from E1 .K1 = ¬Ra.¬Ba |= (Ra → Ba) = H|dom(E) and in the second case, we have precisely the same implication E2 .K2 = ¬Ra.¬Ba |= (Ra → Ba) = H|dom(E) . Hence, adding the background knowledge that the test object is no raven does not destroy the (Hempel-)confirmation of H2 . Certainly Hempel spots two points correctly: First, the paradoxical conclusion of the raven example should be embraced, contra (CI). Second, background knowledge plays a crucial role when it comes to explaining the source of the paradox. But while pointing into the right direction, Hempel fails to set up an account of confirmation that conforms to his own diagnosis of the paradox. In particular, the adequacy criteria outlined in section 2 fail to be sufficient for a satisfactory concept of confirmation. 
The raven paradox drastically shows how valuable it is to distinguish between evidence and background knowledge. The distinction has to be formalized in a way that avoids Hempel's problem. It further exhibits the problem of monotonicity with regard to evidence and background knowledge: When we happen to know more, confirmation might get lost. Therefore monotonicity is not a desirable property for accounts of confirmation, and I take this to be the third important moral from the paradoxes of confirmation. On the other hand, the arguments for resolving the paradox by giving up (CI) were on the whole convincing, and Hempel's sixty-three-year-old judgment that part of the paradoxical appearance rests on a psychological illusion has some plausibility. The next section examines the paradoxes of confirmation from a probabilistic perspective.24
24 In a recent paper, Branden Fitelson [2009] elaborates the similarity of the raven paradox to a famous logical puzzle: the Wason Selection Task [Wason and Shapiro, 1971]. In the Wason Selection Task, four cards lie on the table. On the front side of each card there is a letter; on the back side there is a number. The hypothesis H is: All cards with an even number on one side have a vowel printed on the other side. Which of the cards (A, 2, F, 7) should you turn over to test the truth of H? Of course you have to turn over the card with the '2' since this can be an obvious instance or counterexample to H. This line of reasoning is captured in the Nicod Condition, too. It is less obvious that you also have to turn over the 'F' in order to test the contrapositive: All cards with a consonant on one side have an odd number on the other side. People regularly fail to recognize that the 'F' has to be turned over, too. The kind of confirmation which this action yields is structurally identical to confirming the raven hypothesis by observing that a grey bird is not a raven, but a crow. Both the results in the Wason Selection Task and the debate around the raven paradox highlight the same kind of reluctance to accept instances of the contrapositive as instances of the hypothesis itself.
5 THE BAYESIAN'S RAVEN PARADOX
5.1 Bayesian confirmation theory and the Nicod Condition
So far, we have discussed the paradox in a qualitative way — does observing a non-black non-raven confirm the hypothesis that all ravens are black? The Hempelian resolution does not, however, clarify why we would recommend that an ornithologist go into the forest in order to confirm the raven hypothesis. A natural reply would contend that black ravens confirm the raven hypothesis to a much stronger degree than white shoes. That thesis motivates a quantitative treatment of the paradox and will be the main subject of this section. Actually, the 'confirmation intuition' (CI) about the missing confirmatory value of non-ravens has three versions — a qualitative, a comparative and a quantitative one:
Qualitative Intuition: The observation of a non-black non-raven does not confirm the hypothesis that all ravens are black.
Comparative Intuition: The observation of a non-black non-raven confirms the hypothesis that all ravens are black to a lower degree than the observation of a black raven.
Quantitative Intuition: The observation of a non-black non-raven confirms the hypothesis that all ravens are black only to a minute degree.
Part of the confusion in the existing literature is due to the fact that these three intuitions are not clearly set apart from each other. Hempel criticized exclusively the qualitative version. The quantitative and the comparative versions save the part of (CI) that concerns the extent of confirmation, and here our intuitions seem to be more stable. They form the resilient kernel of (CI) which makes the raven paradox so intriguing for modern confirmation theory. A further source of confusion is the question of which background knowledge should be assumed when evaluating these intuitions. Are they meant to hold for some, for empty or for all conceivable background assumptions? Or are those intuitions relative to the actual background assumptions?25
25 I borrow these distinctions from [Fitelson, 2006].
Hence, twelve (= 3 × 4) different confirmation intuitions about the paradox could in principle be distinguished. But I believe intuitions with respect to actual background knowledge to be most interesting. First, most people seem to have that in mind when being
confronted with the paradox, so it is arguably the most accurate reconstruction of the paradox. Second, we will later argue that the Nicod Condition is best understood as referring to actual background knowledge. Indeed, Good's [1961] raven/crow example suggests that the above confirmation intuitions will trivially hold for some background knowledge and trivially fail when required to hold for all conceivable background knowledge. Finally, what empty background knowledge means stands in need of explication (though see [Carnap, 1950; Maher, 2004]). Thus we are well advised to focus on actual background knowledge. We have seen that the qualitative version of (CI) is under pressure, but on the other hand, the comparative and the quantitative versions enjoy some plausibility. This section tries to reinforce the arguments against the qualitative intuition and to vindicate the comparative and quantitative intuitions from the point of view of Bayesian confirmation theory. The problem with the raven paradox is not the alleged truth of (CI), but the truth of the weaker comparative and quantitative versions. Qualitatively, Bayesian confirmation amounts to an increase in rational degree of belief upon learning new evidence. Degrees of belief are symbolized by subjective probabilities. In other words, evidence E confirms H if and only if P(H|E) > P(H). But we have to remember a lesson from the very first chapter of the book — confirmation is a three-place predicate, relative to background knowledge. As both the raven paradox and the Duhem problem teach us, background assumptions are a crucial part of relating theory to evidence and of inductive reasoning in science. The natural way to integrate them consists in taking background information for granted and conditionalizing an agent's degrees of belief on it.26 That said, we can write down a first, qualitative definition of Bayesian confirmation: DEFINITION 9.
A piece of evidence E confirms a hypothesis H relative to background assumptions K if and only if P(H|E.K) > P(H|K). This definition gives a probabilistic explication of relative confirmation, not of absolute confirmation: Definition 9 describes the relevance of evidence for a hypothesis, not the high credibility of a hypothesis. However, the definition remains qualitative. To be able to tackle the comparative and quantitative versions of the paradox, we have to introduce a measure of confirmation. The following three candidates have been especially popular in the literature (see [Fitelson, 2001] for a discussion of their virtues and vices):
Difference Measure: d(H, E, K) := P(H|E.K) − P(H|K)
Log-Ratio Measure: r(H, E, K) := log [ P(H|E.K) / P(H|K) ]
26 Nonetheless, for reasons of convenience, we will often speak (but not write) as if the background knowledge were empty.
Log-Likelihood Measure: l(H, E, K) := log [ P(E|H.K) / P(E|¬H.K) ]
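As a minimal illustration of Definition 9 and the three measures, consider a toy distribution over four coarse-grained worlds. The numbers are hypothetical, chosen only so that the evidence raises the probability of the hypothesis; the background K is left implicit:

```python
import math

# Toy joint distribution: world -> probability (hypothetical numbers).
# H: "all ravens are black"; E: "a black raven is observed".
P = {
    ("H_true",  "black_raven"): 0.32,
    ("H_true",  "other"):       0.48,
    ("H_false", "black_raven"): 0.06,
    ("H_false", "other"):       0.14,
}

def prob(pred):
    return sum(p for w, p in P.items() if pred(w))

H = lambda w: w[0] == "H_true"
E = lambda w: w[1] == "black_raven"

pH  = prob(H)                                   # prior P(H)
pHE = prob(lambda w: H(w) and E(w)) / prob(E)   # posterior P(H|E)

# Definition 9: E confirms H iff P(H|E) > P(H)
assert pHE > pH

# The three measures of confirmation from the text:
d = pHE - pH                                    # difference measure
r = math.log(pHE / pH)                          # log-ratio measure
l = math.log((prob(lambda w: E(w) and H(w)) / pH) /
             (prob(lambda w: E(w) and not H(w)) / (1 - pH)))  # log-likelihood
assert d > 0 and r > 0 and l > 0                # all agree qualitatively
```

All three measures agree on the qualitative question of whether E confirms H; they differ, as the text notes, in how they rank degrees of confirmation.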
For reasons of simplicity, I restrict myself in the following to d and l, which suffice to illustrate the substantial points. In the 1950s and early 1960s, the discussion of the confirmation paradoxes focussed on discussing, defending and rebutting (CI). In particular, Hempel himself rejected (CI) and argued that the tacit introduction of background knowledge may be responsible for the paradoxical appearance. In the light of Bayesian confirmation theory, one could, however, not only reject (CI), but also question (NC). Again, four versions of (NC) have to be distinguished. Nicod Condition (NC): For a hypothesis of the form H = ∀x : Rx → Bx and any individual constant a, an observation report of the form Ra.Ba confirms H, relative to some/actual/tautological/all background knowledge. Certainly, the Nicod Condition (every black raven confirms the raven hypothesis) is true relative to some background knowledge. But that claim is very weak and practically not helpful. It is somewhat more surprising that it is not true under all circumstances. I. J. Good [1967] constructed a simple counterexample in a note for the British Journal for the Philosophy of Science: There are only two possible worlds. In one of them, W1, there are a hundred black ravens, no non-black ravens and one million other birds. In the other world, W2, there are a thousand black ravens, one white raven and one million other birds. Thus, H is true whenever W1 is the case, and false whenever W2 is the case. For all suggested measures of confirmation, the observation of a black raven is evidence that W2 is the case and therefore evidence that not all ravens are black: P(Ra.Ba|W1) = 100/1000100 < 1000/1001001 = P(Ra.Ba|W2).
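Good's counterexample can be checked numerically. The sketch below assumes, as is implicit in the example, that the observed bird is drawn uniformly at random from the birds of the actual world, and (for the posterior) that both worlds are equally probable a priori:

```python
# I. J. Good's two-world counterexample to the unrestricted Nicod Condition.
W1 = {"black_ravens": 100,  "white_ravens": 0, "other_birds": 1_000_000}  # H true
W2 = {"black_ravens": 1000, "white_ravens": 1, "other_birds": 1_000_000}  # H false

def p_black_raven(world):
    """Probability of observing a black raven, sampling a bird uniformly."""
    return world["black_ravens"] / sum(world.values())

# Observing a black raven is MORE likely in W2, the world where H is false:
assert p_black_raven(W1) < p_black_raven(W2)

# With equal priors P(W1) = P(W2) = 1/2 (an assumption of this sketch),
# the observation therefore lowers the probability of H, which holds in W1:
p1, p2 = p_black_raven(W1), p_black_raven(W2)
posterior_H = p1 / (p1 + p2)
assert posterior_H < 0.5
```

So relative to this (admittedly contrived) background knowledge, a black raven disconfirms H on every measure that respects Definition 9.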
Fitelson and Hawthorne [2009] prove a theorem which vindicates the comparative intuition under weak assumptions. Suppose that
• 0 < P(H|Ba.Ra.K) < 1 and 0 < P(H|¬Ba.¬Ra.K) < 1,
• P(¬Ba.Ra|K) > 0, P(¬Ba.¬Ra|K) > 0, and P(Ba.Ra|K) > 0,
• P(¬Ba|H.K) > P(Ra|H.K), and
• (2) [P(H|Ra.K) / (1 − P(H|Ra.K))] · [(1 − P(H|¬Ba.K)) / P(H|¬Ba.K)] > P(Ba|Ra.¬H.K) + [1 − P(Ba|Ra.¬H.K)] · [P(Ra|H.K) / P(¬Ba|H.K)].
Then l(H, Ba.Ra, K) > l(H, ¬Ba.¬Ra, K), i.e.
(3) log [P(Ba.Ra|H.K) / P(Ba.Ra|¬H.K)] > log [P(¬Ba.¬Ra|H.K) / P(¬Ba.¬Ra|¬H.K)]
and in particular
(4) P(H|Ba.Ra.K) > P(H|¬Ba.¬Ra.K).
(The proof can be found in [Fitelson and Hawthorne, 2009].) The theorem asserts that the degree of support which Ba.Ra lends to H, as measured by the log-likelihood ratio l, exceeds the degree of support ¬Ba.¬Ra lends to H (see (3)). In other words, Fitelson and Hawthorne vindicate the comparative version of (CI): black ravens confirm the raven hypothesis better than white shoes. It follows easily that the posterior probability of H is higher if a black raven is observed than if a white shoe (or any other non-black non-raven) is observed. To evaluate their result, we have to look at the assumptions of the theorem. The first set of assumptions is fully unproblematic: It is demanded that neither the observation of a black raven nor the observation of a non-black non-raven will determine the truth or falsity of H. Moreover, the rational degree of belief that a non-black raven, a black raven or a non-black non-raven will be observed has to be higher than zero (though it may be arbitrarily small). These are just assumptions that reflect the openness of our probability assignments to empirical evidence. The second assumption is a little bit richer in content, but still extremely plausible: If H is true then we are more likely to observe a non-black object than a raven. That reflects the belief that there are many non-black objects (grey birds, for example), but comparably few ravens. Thus the last inequality (2) carries the main burden of the theorem. Is it a plausible assumption? Let us have a look at the right hand side first. Even if H is wrong, we expect the number of black ravens to vastly exceed the number of non-black ravens. (Note that we have already observed many black ravens!) Thus, x := P(Ba|Ra.¬H.K) is quite close to 1. Moreover, in any case there are many more non-black things than ravens. So the ratio P(Ra|H.K)/P(¬Ba|H.K) will be very small, and the second addend on the right hand side of (2) can be
neglected (since 1 − x is close to zero). Now we come to the left hand side. Regardless of whether we observe black ravens or white shoes, a single observation of either object will not impose major changes on the posterior probability of H. This transfers to the posterior odds of H after observing Ba.Ra or ¬Ba.¬Ra, respectively. Thus, the quotient of those posterior odds will be close to 1 — even closer to 1 than x = P(Ba|Ra.¬H.K). And by this line of reasoning, we have established (2), the last of Fitelson and Hawthorne's premises. Thus, their argument is not only valid, but also conclusive. Of course, it is still possible to doubt one of the plausibility arguments in the previous paragraphs. But I think they are cogent enough to put the burden of proof on those who doubt Fitelson and Hawthorne's comparative solution. Moreover, the elegance of their proof deserves high praise, and since they use clear-cut assumptions, their analysis directly points out the disagreements between defenders and critics of their solution. Furthermore, they do not rely on the independence claim (IA) or variants thereof.
6 SUMMARY
The first part of this article has described and reviewed Hempel's theory of confirmation and his analysis of the paradoxes of confirmation. Hempel's approach to modeling confirmation departs from Carnap's probabilistic approach: he decides to lay the qualitative foundations first by formulating general adequacy constraints that any account of confirmation has to satisfy. Hempel's qualitative account of confirmation breaks with the classical hypothetico-deductive approach and proposes the satisfaction criterion: the restriction of a hypothesis to a specified object domain has to be entailed by the evidence. The criterion, however, has several shortcomings, some of them of a technical nature, others connected to the failure to account for confirmation by successful prediction.
One of the most severe objections contends, however, that the satisfaction criterion is often monotone with respect to the background knowledge and thus unable to deal with the paradoxes of confirmation. On the one hand, Hempel has convincingly argued that the paradoxes rest on a psychological illusion, due to the tacit introduction of additional background knowledge. But on the other hand, his own criterion of confirmation neglects that insight and therefore fails to remove the paradox. The second part of the article focuses on recent attempts to solve the paradoxes in the framework of Bayesian confirmation theory. While Hempel was probably right that the qualitative version of the paradox was a mere pseudo-problem (Scheinproblem), there are comparative and quantitative versions of the paradox, too. To vindicate these intuitions in a probabilistic framework has proved to be a tough task. With their [2009] result, Fitelson and Hawthorne solve the comparative problem and give some reasons for optimism. But so far, the quantitative problem remains unsolved. More troubling still, I have argued that there are principled problems that impair a sufficiently general resolution of the paradoxes of confirmation. The conclusion which I draw — scepticism towards quantitative solutions of the paradox — is
somewhat atypical because most contributions to the literature either propose a solution or suggest replacing a previous attempt by a novel and better one.34 But the longstanding history of the paradox indicates that it will be hard to overcome.
ACKNOWLEDGEMENTS
I would like to thank Andreas Bartels, Branden Fitelson, Stephan Hartmann, James Hawthorne, Franz Huber, Kevin Korb, and Jacob Rosenthal for their incredibly helpful advice and criticism.
BIBLIOGRAPHY
[Agassi, 1959] J. Agassi. Corroboration versus Induction, British Journal for the Philosophy of Science, 9:311-317, 1959.
[Alexander, 1958] H. G. Alexander. The Paradoxes of Confirmation, British Journal for the Philosophy of Science, 9:227-233, 1958.
[Alexander, 1959] H. G. Alexander. The Paradoxes of Confirmation — A Reply to Dr Agassi, British Journal for the Philosophy of Science, 10:229-234, 1959.
[Black, 1966] M. Black. Notes on the 'paradoxes of confirmation'. In Jaakko Hintikka and Patrick Suppes, eds., Aspects of Inductive Logic, pp. 175-197, North-Holland, Amsterdam, 1966.
[Carnap, 1950] R. Carnap. Logical Foundations of Probability, The University of Chicago Press, Chicago, 1950.
[Carnap, 1952] R. Carnap. The Continuum of Inductive Methods, The University of Chicago Press, Chicago, 1952.
[Dietrich and Moretti, 2005] F. Dietrich and L. Moretti. On Coherent Sets and the Transmission of Confirmation, Philosophy of Science, 72:403-424, 2005.
[Duhem, 1914] P. Duhem. La Théorie Physique: Son Objet, Sa Structure. 1914. Second edition, reprinted in 1981 by J. Vrin, Paris.
[Earman, 1992] J. Earman. Bayes or Bust? The MIT Press, Cambridge/MA, 1992.
[Earman and Salmon, 1992] J. Earman and W. Salmon. The Confirmation of Scientific Hypotheses. In Merrilee H. Salmon, ed., Introduction to the Philosophy of Science, pp. 42-103. Hackett, Indianapolis, 1992.
[Fitelson, 2001] B. Fitelson. A Bayesian Account of Independent Evidence with Applications, Philosophy of Science, 68:S123-S140, 2001.
[Fitelson, 2006] B.
Fitelson. The Paradox of Confirmation, Philosophy Compass, 1:95-113, 2006.
[Fitelson, 2009] B. Fitelson. The Wason Task(s) and the Paradox of Confirmation, Synthese, 2009.
[Fitelson and Hawthorne, 2009] B. Fitelson and J. Hawthorne. How Bayesian Confirmation Theory Handles the Paradox of the Ravens. In Ellery Eells and James Fetzer, eds., Probability in Science, Open Court, Chicago, 2009.
[Friedman, 1999] M. Friedman. Reconsidering Logical Positivism, Cambridge University Press, Cambridge, 1999.
[Gaifman, 1979] H. Gaifman. Subjective Probability, Natural Predicates and Hempel's Ravens, Erkenntnis, 21:105-147, 1979.
[Gemes, 1993] K. Gemes. Hypothetico-Deductivism, Content and the Natural Axiomatisation of Theories, Philosophy of Science, 60:477-487, 1993.
[Gemes, 2006] K. Gemes. Content and Watkins' Account of Natural Axiomatizations, dialectica, 60:85-92, 2006.
[Glymour, 1980] C. Glymour. Theory and Evidence, Princeton University Press, Princeton, 1980.
[Good, 1960] I. J. Good. The Paradox of Confirmation, British Journal for the Philosophy of Science, 11:145-149, 1960.
34 [Korb,
1994] and [Vranas, 2004] are notable exceptions.
[Fitelson, 2006] B. Fitelson. The Paradox of Confirmation, Philosophy Compass, 1:95-113, 2006.
[Fitelson, 2009] B. Fitelson. The Wason Task(s) and the Paradox of Confirmation, Synthese, 2009.
[Fitelson and Hawthorne, 2009] B. Fitelson and J. Hawthorne. How Bayesian Confirmation Theory Handles the Paradox of the Ravens. In Ellery Eells and James Fetzer, eds., Probability in Science, Open Court, Chicago, 2009.
[Friedman, 1999] M. Friedman. Reconsidering Logical Positivism, Cambridge University Press, Cambridge, 1999.
[Gaifman, 1979] H. Gaifman. Subjective Probability, Natural Predicates and Hempel’s Ravens, Erkenntnis, 21:105-147, 1979.
[Gemes, 1993] K. Gemes. Hypothetico-Deductivism, Content and the Natural Axiomatisation of Theories, Philosophy of Science, 60:477-487, 1993.
[Gemes, 2006] K. Gemes. Content and Watkins’ Account of Natural Axiomatizations, dialectica, 60:85-92, 2006.
[Glymour, 1980] C. Glymour. Theory and Evidence, Princeton University Press, Princeton, 1980.
[Good, 1960] I. J. Good. The Paradox of Confirmation, British Journal for the Philosophy of Science, 11:145-149, 1960.
[Good, 1961] I. J. Good. The Paradox of Confirmation (II), British Journal for the Philosophy of Science, 12:63-64, 1961.
[Good, 1967] I. J. Good. The White Shoe is a Red Herring, British Journal for the Philosophy of Science, 17:322, 1967.
[Good, 1968] I. J. Good. The White Shoe qua Herring is Pink, British Journal for the Philosophy of Science, 19:156-157, 1968.
[Goodman, 1983] N. Goodman. Fact, Fiction and Forecast, Fourth Edition. Harvard University Press, Oxford, 1983.
[Hempel, 1943] C. G. Hempel. A Purely Syntactical Definition of Confirmation, Journal of Symbolic Logic, 8:122-143, 1943.
[Hempel, 1965] C. G. Hempel. Studies in the Logic of Confirmation. In Aspects of Scientific Explanation, pp. 3-51. The Free Press, New York, 1965. Reprinted from Mind, 54, 1945.
[Hempel, 1967] C. G. Hempel. The White Shoe: No Red Herring, British Journal for the Philosophy of Science, 18:239-240, 1967.
[Hosiasson, 1940] J. Hosiasson-Lindenbaum. On Confirmation, Journal of Symbolic Logic, 5:133-148, 1940.
[Horwich, 1982] P. Horwich. Probability and Evidence, Cambridge University Press, Cambridge, 1982.
[Howson and Urbach, 1993] C. Howson and P. Urbach. Scientific Reasoning: The Bayesian Approach. Second Edition. Open Court, La Salle, 1993.
[Huber, 2008] F. Huber. Hempel’s Logic of Confirmation, Philosophical Studies, 139:181-189, 2008.
[Humburg, 1986] J. Humburg. The Solution of Hempel’s Raven Paradox in Rudolf Carnap’s System of Inductive Logic, Erkenntnis, 24:57-72, 1986.
[Korb, 1994] K. B. Korb. Infinitely Many Solutions of Hempel’s Paradox. In Theoretical Aspects of Rationality and Knowledge: Proceedings of the 5th Conference on Theoretical Aspects of Reasoning about Knowledge, pp. 138-149. Morgan Kaufmann, San Francisco, 1994.
[Mackie, 1963] J. L. Mackie. The Paradox of Confirmation, British Journal for the Philosophy of Science, 13:265-276, 1963.
[Maher, 1999] P. Maher. Inductive Logic and the Ravens Paradox, Philosophy of Science, 66:50-70, 1999.
[Maher, 2004] P. Maher. Probability Captures the Logic of Confirmation. In Christopher Hitchcock, ed., Contemporary Debates in the Philosophy of Science, pp. 69-93. Blackwell, Oxford, 2004.
[Musgrave, 2009] A. Musgrave. Popper and Hypothetico-Deductivism. In this volume, 2009.
[Nicod, 1961] J. Nicod. Le Problème Logique de l’Induction. Presses Universitaires de France, Paris, 1961. Originally published in 1925 (Paris: Alcan).
[Popper, 1963] K. R. Popper. Conjectures and Refutations: The Growth of Scientific Knowledge, Routledge, London, 1963.
[Royall, 1997] R. Royall. Statistical Evidence: A Likelihood Paradigm, Chapman & Hall, London, 1997.
[Schurz, 1991] G. Schurz. Relevant Deduction, Erkenntnis, 35:391-437, 1991.
[Suppes, 1969] P. Suppes. Models of Data. In P. Suppes, ed., Studies in the Methodology and Foundations of Science. Selected Papers from 1951 to 1969, pp. 24-35. Reidel, Dordrecht, 1969. Originally published in Ernest Nagel, Patrick Suppes and Alfred Tarski, eds., Logic, Methodology and Philosophy of Science: Proceedings of the 1960 International Congress, pp. 252-261. Stanford University Press, Stanford, 1962.
[Swinburne, 1971] R. Swinburne. The Paradoxes of Confirmation — A Survey, American Philosophical Quarterly, 8:318-330, 1971.
[Uebel, 2006] T. Uebel. The Vienna Circle, 2006.
[Vincent, 1964] D. H. Vincent. The Paradoxes of Confirmation, Mind, 73:273-279, 1964.
[von Wright, 1966] G. H. von Wright. The Paradoxes of Confirmation. In Jaakko Hintikka and Patrick Suppes, eds., Aspects of Inductive Logic, pp. 208-218. North-Holland, Amsterdam, 1966.
[Vranas, 2004] P. Vranas. Hempel’s Raven Paradox: A Lacuna in the Standard Bayesian Solution, British Journal for the Philosophy of Science, 55:545-560, 2004.
[Wason and Shapiro, 1971] P. C. Wason and D. Shapiro. Natural and contrived evidence in a reasoning problem, Quarterly Journal of Experimental Psychology, 23:63-71, 1971.
[Watkins, 1957] J. W. N. Watkins. Between Analytical and Empirical, Philosophy, 33:112-131, 1957.
[Watkins, 1960] J. W. N. Watkins. Confirmation without Background Knowledge, British Journal for the Philosophy of Science, 10:318-320, 1960.
[Weisberg, 2009] J. Weisberg. Varieties of Bayesianism. In this volume, 2009.
[Woodward, 1985] J. Woodward. Critical Review: Horwich on the Ravens, Projectability and Induction, Philosophical Studies, 47:409-428, 1985.
[Zabell, 2009] S. Zabell. Carnap and the Logic of Induction. In this volume, 2009.
CARNAP AND THE LOGIC OF INDUCTIVE INFERENCE

S. L. Zabell

1 INTRODUCTION
This chapter discusses Carnap’s work on probability and induction, using the notation and terminology of modern mathematical probability, viewed from the perspective of the modern Bayesian or subjective school of probability. (It is a much expanded and more mathematical version of [Zabell, 2007].) Carnap initially used a logical notation and terminology that made his work accessible and interesting to a generation of philosophers, but it also limited its impact in other areas such as statistics, mathematics, and the sciences. Using the notation of modern mathematical probability is not only more natural, but also makes it far easier to place Carnap’s work alongside the contributions of such other pioneers of epistemic probability as Frank Ramsey, Bruno de Finetti, I. J. Good, L. J. Savage, and Richard Jeffrey. Carnap’s interest in logical probability was primarily as a tool, a tool to be used in understanding the quantitative confirmation of an hypothesis based on evidence and, more generally, in rational decision making. The resulting analysis of induction involved a two-step process: one first identified a broad class of possible confirmation functions (the regular c-functions), and then identified either a unique function in that class (early Carnap) or a parametric family (later Carnap) of specific confirmation functions. The first step in the process put Carnap in substantial agreement with subjectivists such as Ramsey and de Finetti; it is the second step, the attempt to limit the class of probabilities still further, that distinguishes Carnap from his subjectivist brethren. So: precisely what are the limitations that Carnap saw as natural to impose? In order to discuss these, we must begin with his concepts of probability.

2 PROBABILITY
The word ‘probability’ has always had a multiplicity of meanings. In the beginning mathematical probability had a meaning that was largely epistemic (as opposed to aleatory); thus for Laplace probability relates in part to our knowledge and in part to our ignorance. During the 19th century, however, empirical alternatives
Handbook of the History of Logic. Volume 10: Inductive Logic. Volume editors: Dov M. Gabbay, Stephan Hartmann and John Woods. General editors: Dov M. Gabbay and John Woods. © 2011 Elsevier BV. All rights reserved.
arose. In the years 1842 and 1843, no fewer than four independent proposals for an objective or frequentist interpretation were first advanced: those of Jakob Friedrich Fries in Germany, Antoine Augustin Cournot in France, and John Stuart Mill and Robert Leslie Ellis in England. Less than a quarter of a century later, John Venn’s Logic of Chance [Venn, 1866], the first book in English devoted exclusively to the philosophical foundations of probability, took a purely frequentist view of the subject. Ramsey, in advancing his view of a quantitative subjective probability based on a consistent system of preferences [Ramsey, 1926], deftly side-stepped the debate by conceding that the frequency interpretation of probability was a perfectly reasonable one, one which might have considerable value in science, but argued that this did not preclude a subjective interpretation as well. During the 20th century the debate became increasingly complex, with von Mises, Reichenbach, and Neyman advancing frequentist views, and Keynes, Ramsey, and Jeffreys advancing competing logical or subjective theories. Carnap sought to bring order into this chaos by introducing the concepts of explicandum and explicatum. Sometimes philosophical debates arise unnecessarily due to the use of ill-defined (or even undefined) concepts. For example, an argument about whether or not viruses constitute a form of life can only really arise from a failure to define just what one means by life; define the term and the status of viruses (whose structure and function are in many cases very well understood) will become clear one way or the other. This is essentially an operationalist or logical positivist perspective, a legacy of Carnap’s days in the Vienna Circle. For Carnap the explicandum was the ill-defined concept; the explicatum the clarification of it that someone advanced. But probability did not involve just a dispute over the explication of a term.
The term itself did double duty, being used by some in an epistemic fashion (the degree of belief in a proposition or event), and by others in an aleatory fashion (a frequency in a class or series). To unravel the Gordian knot of probability, one had to sever the two concepts and recognize that there are two distinct explicanda, each requiring separate exegesis.
2.1 Early views

In his paper “The two concepts of probability” [1945b], Carnap introduced the terms probability1 and probability2, the first referring to probability in its guise as a measure of confirmation, the second as a measure of frequency. This had twin advantages: with the issue put so clearly, debates about the one true meaning of probability became less credible; and the more neutral terminology helped shift the argument from issues of linguistic usage (which, after all, vary from one language to another), to conceptual explication. These ideas were developed at great length in Carnap’s magisterial Logical Foundations of Probability [1950], probabilities being assigned to sentences in a formal language. In his later work Carnap discarded sentences (which he viewed as insufficiently expressive for his purposes)
in favor of events or propositions, which he regarded as essentially equivalent, and we shall adopt this viewpoint. (The main technical complication in working at the level of sentences is that more than one sentence can assert the same proposition; for example, α ∧ β and ¬(¬α ∨ ¬β).) Carnap’s approach was a direct descendant of Wittgenstein’s relatively brief remarks on probability in the Tractatus, later developed at some length by Waismann [1930]. Carnap, following Waismann, assumed the existence of a regular measure function m(x) on sentences, defining these by first assuming a normalized nonnegative function on molecular sentences and then extending it to all sentences. Carnap then defined in the usual way c(h, e), the conditional probability of a proposition h given the proposition e, as the ratio m(h ∧ e)/m(e). Carnap interpreted the conditional probabilities c(h, e) as a measure of the extent to which evidence e confirms hypothesis h. Such functions had already been studied by Janina Hosiasson-Lindenbaum [1940] a decade earlier. Unlike Carnap, Hosiasson-Lindenbaum took a purely axiomatic approach: she studied the general properties of confirmation functions c(h, e), assuming only that they satisfied a basic set of axioms. There are several equivalent versions of this set appearing in the literature; here is one particularly natural formulation:

The axioms of confirmation

1. 0 ≤ c(h, e) ≤ 1.
2. If h ↔ h′ and e ↔ e′, then c(h, e) = c(h′, e′).
3. If e → h, then c(h, e) = 1.
4. If e → ¬(h ∧ h′), then c(h ∨ h′, e) = c(h, e) + c(h′, e).
5. c(h ∧ h′, e) = c(h, e) · c(h′, h ∧ e).

Carnap’s conditional probabilities c(h, e) satisfied these axioms (and so were plausible candidates for confirmation functions).
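The ratio definition is easy to make concrete. The following Python sketch (an illustration of mine; the four "worlds" and their weights are arbitrary toy values) builds c(h, e) = m(h ∧ e)/m(e) over a tiny finite space, identifying propositions with sets of worlds, and spot-checks the axioms:

```python
from fractions import Fraction

# Toy finite model: four "possible worlds" with a regular measure m
# (every world gets positive weight, weights sum to 1).
weight = {'w1': Fraction(1, 2), 'w2': Fraction(1, 4),
          'w3': Fraction(1, 8), 'w4': Fraction(1, 8)}
worlds = set(weight)

def m(prop):
    """Measure of a proposition, identified with a set of worlds."""
    return sum(weight[w] for w in prop)

def c(h, e):
    """Conditional probability c(h, e) = m(h & e)/m(e), defined when m(e) > 0."""
    return m(h & e) / m(e)

h, hp, e = {'w1', 'w2'}, {'w1', 'w3'}, {'w1', 'w2', 'w3'}

assert 0 <= c(h, e) <= 1                           # axiom 1
assert c(worlds, e) == 1                           # axiom 3: e entails the tautology
assert c(h | {'w3'}, e) == c(h, e) + c({'w3'}, e)  # axiom 4: h and {'w3'} are incompatible
assert c(h & hp, e) == c(h, e) * c(hp, h & e)      # axiom 5: the multiplication rule
```

Axiom 2 holds trivially here, since logically equivalent propositions are the very same set of worlds.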
2.2 Betting odds and Dutch books
But just what do the numbers m(e) or c(h, e) represent? It was one of the great contributions of Ramsey and de Finetti to advance operational definitions of subjective probability; for Ramsey, primarily as arising from preferences, for de Finetti as fair odds in a bet. By then imposing rationality criteria on such quantities, both were able to derive the standard axioms for finitely additive probability. Ramsey, in a remarkable tour-de-force, was able to demonstrate the simultaneous existence of utility and probability functions u(x) and p(x). He did this by imposing natural consistency constraints on a (sufficiently rich) set of preferences, introducing the device of the ethically neutral proposition (the philosophical equivalent of tossing a fair coin) as a means of interpolating between competing alternatives. The
functions u(x) and p(x) track one’s preferences in the sense that one action is preferred to another if and only if its expected utility is greater than the other. (Jeffrey [1983] discusses Ramsey’s system and presents an extremely interesting variant of it.) De Finetti, in contrast, initially gave primacy to probabilities interpreted as betting odds. (If p is a probability, then the corresponding odds are p/(1 − p).) The odds represent a bet either side of which one is willing to take. (Thus, the odds of 2 : 1 in favor of an event means that one would accept either a bet of 2 : 1 for, or a bet of 1 : 2 against. This is somewhat akin to the algorithm for two children dividing a cake: one divides the cake into two pieces, the other chooses one of the two pieces.) De Finetti imposed as his rationality constraint the requirement that these odds be coherent; that is, that it be impossible to construct a Dutch book out of them. (In a Dutch book, an opponent can choose a portfolio of bets such that he is assured of winning money. The existence of a Dutch book is analogous to the existence of arbitrage opportunities in the derivatives market.) A conditional probability P (A | B) in de Finetti’s system is interpreted as a conditional bet on A, available only if B is determined to have happened. De Finetti was able to show that the probabilities corresponding to a coherent set of betting odds must satisfy the standard axioms of finitely additive probability. For example, if one takes the axioms for confirmation listed in the previous subsection, all are direct consequences of coherence. John Kemeny, one of Carnap’s collaborators in the 1950s, proved a beautiful converse to this result [Kemeny, 1955]. He showed that the above five properties of a confirmation function are at once both necessary and sufficient for coherence.
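A minimal numerical illustration of the Dutch-book idea (my own toy example; the numbers and variable names are not from the text): an agent whose betting quotients on an event and on its negation sum to more than one can be sold a portfolio of bets that loses either way.

```python
# Hypothetical incoherent agent: betting quotients on an event A and on
# its negation that sum to 1.2 rather than 1.
q_A, q_not_A = 0.6, 0.6

# The bookie sells the agent, at the agent's own prices, a $1 ticket that
# pays if A occurs and a $1 ticket that pays if A does not occur.
stake_collected = q_A + q_not_A   # bookie takes in 1.2
payout = 1.0                      # exactly one ticket pays out, either way

profit_if_A = stake_collected - payout
profit_if_not_A = stake_collected - payout
assert profit_if_A > 0 and profit_if_not_A > 0   # a sure gain: a Dutch book
```

Had the quotients summed to exactly 1, the bookie's profit would be zero in both cases; coherence is precisely the absence of such a guaranteed-loss portfolio.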
That is, although de Finetti had in effect shown that coherence implies the five axioms, in principle there might be other, incoherent confirmation functions also satisfying the five axioms. If one did not begin by accepting (coherent) betting odds as the operational interpretation of c(h, e), this left open the possibility of other confirmation functions, ones not falling into the Ramsey and de Finetti framework. The power of Kemeny’s result is that if one accepts the five axioms above as necessary desiderata for any confirmation function c(h, e), then such functions necessarily assign coherent betting odds to the universe of events. This was a powerful argument in favor of the betting odds interpretation, and it persuaded Carnap, who adopted it. Thus, while in The Logical Foundations of Probability Carnap had advanced no fewer than three possible interpretations for probability1 — evidential support, fair betting quotients, and estimates of statistical frequencies — in his later work he explicitly abandoned the first of these, and wrote almost exclusively in terms of the second. (The “normative” force of Dutch book arguments has of course been the subject of considerable debate. Armendt [1993] contains a balanced discussion of the issues and provides a useful entry into the literature.) Nevertheless, even accepting the subjective viewpoint, the issue remains: can the inductive confirmation of hypotheses be understood in quantitative terms? It was this latter question that was of primary interest to Carnap, and the one to
which he turned in a second paper, “On inductive logic” [1945a].

3 CONFIRMATION

In order to better appreciate Carnap’s analysis of the inductive process, let us briefly review the background against which he wrote. First some basic mathematical probability. Suppose we have an uncertain event that can have one of two possible outcomes, arbitrarily termed “success” and “failure”, and let Sn denote the number of successes in n instances (“trials”). If the trials are independent, and have a constant probability p of success, then the probability of k successes in the n trials is given by the binomial distribution:

$$P(S_n = k) = \binom{n}{k} p^k (1-p)^{n-k}, \quad 0 \le k \le n.$$

Here

$$\binom{n}{k} = \frac{n!}{k!\,(n-k)!}$$

is the binomial coefficient, and n! = n · (n − 1) · (n − 2) · · · 3 · 2 · 1. Suppose next that the probability p is itself random, with some probability distribution dμ(p) on the unit interval. For example, success and failure might correspond to getting a head or tail when tossing a ducat, and the ducat is chosen from a bag of ducats having variable probability p of coming up heads (reflecting the composition of coins in the bag). In this case the probability P(Sn = k) is obtained by averaging the binomial probabilities over the different possible values of p. This average is standardly given by an integral, namely

$$P(S_n = k) = \int_0^1 \binom{n}{k} p^k (1-p)^{n-k}\, d\mu(p), \quad 0 \le k \le n.$$
In our example dμ(p) is aleatory in nature, tied to the composition of the bag. But it could just as well be taken to be epistemic, reflecting our degree of belief regarding the different possible values of p.
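The averaging step can be sketched in a few lines of Python (an illustration of mine; the discrete list of (p, weight) pairs is a stand-in for the measure dμ, which in general may be continuous):

```python
from math import comb

def mixed_binomial(n, k, prior):
    """P(S_n = k) when the success probability p is itself uncertain.

    `prior` is a list of (p, weight) pairs whose weights sum to 1 --
    a discrete stand-in for the mixing measure dmu(p)."""
    return sum(w * comb(n, k) * p**k * (1 - p)**(n - k) for p, w in prior)

# A bag with two kinds of coins: heads-probability 0.3 or 0.7, equally likely.
prior = [(0.3, 0.5), (0.7, 0.5)]

# Because this prior is symmetric about 1/2, k and n - k successes are
# equally probable under the mixture.
assert abs(mixed_binomial(10, 3, prior) - mixed_binomial(10, 7, prior)) < 1e-12
```

With a one-point prior [(p, 1.0)] the function reduces to the ordinary binomial probability, recovering the i.i.d. case above.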
3.1 The rule of succession

In this analysis there are several important questions as yet unanswered. In particular, the nature of p (is it a physical probability or a degree of belief?) has not been specified, and no guidance has been given regarding the origin of the initial or prior distribution dμ(p). Even if the nature of p is specified, how does one determine the prior distribution dμ(p)? For Laplace and his school, one had resort to the principle of indifference: lacking any reason to favor one value of p over another, the distribution was taken to be uniform over the unit interval: dμ(p) = dp. In this case the integral simplifies to give:

$$P(S_n = k) = \frac{1}{n+1}, \quad 0 \le k \le n.$$
But in fact the Reverend Thomas Bayes, the eponymous founder of the subject of Bayesian statistics, employed a subtler argument that paralleled Carnap’s later approach. Bayes [1764] reasoned that in a case of complete ignorance (“an event concerning the probability of which we absolutely know nothing antecedently to any trials made concerning it”), one has P(Sn = k) = 1/(n + 1) for all n ≥ 1 and 0 ≤ k ≤ n (in effect Bayes takes the latter to be the definition of the former), and this in turn implies that the prior must be uniform. The argument can in fact be made rigorous. Let k = n; then Bayes’s postulate P(Sn = k) = 1/(n + 1) tells us that

$$\int_0^1 p^n \, d\mu(p) = \frac{1}{n+1} = \int_0^1 p^n \, dp, \quad n \ge 1.$$
Thus the as yet unknown probability dμ(p) has the same moments as the so-called “flat” prior dp. But the Hausdorff moment theorem tells us that a probability measure on a compact set (here [0, 1]) is characterized by its moments. Thus dμ(p) and dp, having the same moments, must coincide. Given the Bayes-Laplace formula P(Sn = k) = 1/(n + 1), it is a simple matter to derive the corresponding predictive probabilities. If, for example, Xj is a so-called indicator variable taking the values 1 or 0, depending on whether the outcome of the j-th trial is a success or failure, respectively (so that the number of successes Sn is X1 + ... + Xn), then P(Xn+1 = 1 | Sn = k) is the conditional probability of a success on the next trial, based on the experience of the past n trials. Since the formula for conditional probability is P(A | B) = P(A and B)/P(B), it follows after a little algebra that

$$P(X_{n+1} = 1 \mid S_n = k) = \frac{k+1}{n+2}.$$
This is the celebrated (or infamous) rule of succession. Both it and the controversial principle of indifference on which it was based were the subject of harsh criticism beginning in the middle of the 19th century; see Zabell [1989]. Stigler [1982] argues that Bayes’s form of the indifference postulate, applying as it does to the discrete outcome k, does not entail the same paradoxes as the principle of indifference applied to the continuous parameter p. But Bayes’s ingenious argument was forgotten, and Laplace’s approach became the focus of controversy. The Cambridge phenom Robert Leslie Ellis objected in the 1840s that one could not conjure something out of nothing: ex nihilo nihil ; the German Johann von Kries countered in 1886 that one could invoke instead the principle of cogent reason: alternatives are judged equipossible because our knowledge is distributed equally among them; the point is the equi-distribution of knowledge rather than nihilist ignorance. In pragmatic England the Oxford statistician and economist F. Y. Edgeworth argued the use of flat priors was justified on approximate empirical grounds; the Cambridge logician and antiquarian John Venn ridiculed the use of the rule of succession. In France the distinguished Joseph Bertrand challenged
the cogency of subjective probability; the even more distinguished Henri Poincaré championed it. This was the decidedly unsatisfactory state of affairs in 1921, the year when John Maynard Keynes’s Treatise on Probability appeared. Keynes’s Treatise contains a useful summary of much of this debate. The next several decades saw increasing clarification of the foundations of probability and its use in inductive inference. But the particular thread we are interested in here involves a curious development that took place in two independent stages.
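Before turning to that development, note that the Bayes-Laplace computations of Section 3.1 are easy to verify with exact arithmetic. The Python sketch below (my own illustration) uses the classical beta-integral identity ∫₀¹ pᵃ(1 − p)ᵇ dp = a! b!/(a + b + 1)! to check both Bayes's postulate and the rule of succession under the flat prior:

```python
from fractions import Fraction
from math import comb, factorial

def beta_integral(a, b):
    """Exact value of the integral of p^a (1-p)^b over [0, 1]."""
    return Fraction(factorial(a) * factorial(b), factorial(a + b + 1))

def p_successes(n, k):
    """P(S_n = k) under the uniform ("flat") prior dmu(p) = dp."""
    return comb(n, k) * beta_integral(k, n - k)

def succession(n, k):
    """P(X_{n+1} = 1 | S_n = k): the predictive probability under the flat prior."""
    return comb(n, k) * beta_integral(k + 1, n - k) / p_successes(n, k)

n = 10
# Bayes's postulate: every count k from 0 to n is equally likely.
assert all(p_successes(n, k) == Fraction(1, n + 1) for k in range(n + 1))
# The rule of succession: (k + 1)/(n + 2).
assert succession(n, 7) == Fraction(7 + 1, n + 2)
```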
4 EXCHANGEABILITY

In 1924 William Ernest Johnson, an English logician and philosopher at King’s College, Cambridge, published the third volume of his Logic. In an appendix at the end, Johnson suggested an alternative analysis to the one just discussed, one which represented a giant step forward. But despite the respect accorded him in Cambridge, Johnson had only limited influence outside it, and after his death in 1931, his work was little noted. It is one of the ironies of this subject that Carnap later followed essentially the same route as Johnson, but to much greater effect, in part because Carnap’s Logical Foundations of Probability embedded his analysis in a much more detailed setting, and in part because he continued to refine his treatment of the subject for nearly two decades (whereas Johnson died only a few years after the appearance of his book). Johnson’s analysis contained several elements of novelty. The first two of these were designed to meet the two basic objections that had been raised regarding the classical rule of succession: its appeal to the so-called “principle of indifference”, and its appeal by way of analogy to drawing balls from an urn.
4.1 Multinomial sampling
First, Johnson considered the case of t ≥ 2 equipossible cases (instead of just two). This was no mere technical generalization. In many of the most telling attacks on the principle of indifference, situations were considered where it was unnatural to think of the outcome of interest as being one of two equipossible competing alternatives. By encompassing the multinomial case (several possible categories rather than just two) Johnson’s analysis applied to situations in which the multiple competing outcomes are either naturally viewed as equipossible (for example, rolling a fair, six-sided die), or can be further broken down into equipossible subcases.
4.2 The permutation postulate
Second, Johnson presciently introduced the concept of exchangeability. Let us consider a sequence of random outcomes X1 , ..., Xn , each taking on one of t possible types c1 , ..., ct . (For example, you are on the Starship Enterprise, and each time
you encounter someone, they are either Klingon, Romulan, or Vulcan, so that t = 3.) Then a typical probability of interest is of the form

$$P(X_1 = e_1, X_2 = e_2, \ldots, X_n = e_n), \quad e_i \in \{c_1, \ldots, c_t\}, \quad 1 \le i \le n.$$
In the classical inductive setting, the order of these observations is irrelevant, the only thing that matters being the counts or frequencies observed for each of the t categories. (More complex situations will be discussed later.) Thus, if ni is the number of Xj falling into the i-th category, it is natural to assume that all sequences X1 = e1, X2 = e2, ..., Xn = en having the same frequency counts n1, n2, ..., nt have the same probability. Johnson termed this assumption the permutation postulate. (Carnap called the sequences e1, ..., en state descriptions, the frequency counts n1, ..., nt structure descriptions, and made the identical symmetry assumption.) The valid application of the rule of succession presupposes, as Boole notes, the aptness of the analogy between drawing balls from an urn — the urn of nature, as it was later called — and observing an event [Boole 1854, p. 369]. As Jevons [1874, p. 150] put it, “nature is to us like an infinite ballot-box, the contents of which are being continually drawn, ball after ball, and exhibited to us. Science is but the careful observation of the succession in which balls of various character present themselves . . . ”. The importance of Johnson’s “permutation postulate” is that it is no longer necessary to refer to the urn of nature. To what extent is observing instances like drawing balls from an urn? Answer: to the extent that the instances are judged exchangeable. Venn and others, having attacked the rote use of the rule of succession, rightly argued that some additional assumption, other than mere repetition of instances, was necessary for valid inductive inference. From time to time various names for such a principle have been advanced: Mill’s “Uniformity of Nature”; Keynes’s “Principle of Limited Variety”; Goodman’s “projectibility”.
It was Johnson’s achievement to have realized both that ‘the calculus of probability does not enable us to infer any probability-value unless we have some probabilities or probability relations given’ [Johnson, 1924, p. 182]; and that the vague, verbal formulations of his predecessors could be captured in the mathematically precise formulation of exchangeability. The permutation postulate (the assumption of exchangeability in modern parlance) was later independently introduced by the Italian Bruno de Finetti (see, for example, [de Finetti, 1937]), and became a centerpiece of his theory. For our purposes here, the basic point is that if the sequence is assumed to be exchangeable, then an assignment of probabilities to sequences of outcomes e1, e2, ..., en reduces to assigning probabilities P(n1, n2, ..., nt) to sequences of frequency counts n1, n2, ..., nt. This is because there are (using the standard notation for the multinomial coefficient)

$$\binom{n}{n_1\, n_2\, \cdots\, n_t} = \frac{n!}{n_1!\, n_2!\, \cdots\, n_t!}$$
different possible sequences e1, e2, ..., en having the same set of frequency counts n1, n2, ..., nt, and each of these is assumed to be equally likely, so by exchangeability and the additivity of probability

$$P(n_1, n_2, \ldots, n_t) = \frac{n!}{n_1!\, n_2!\, \cdots\, n_t!}\, P(e_1, e_2, \ldots, e_n).$$

(That is, the probability of a state description e1, ..., en, times the number of state descriptions having the same corresponding structure description n1, ..., nt, gives the probability of that structure description.) It is a simple but nevertheless instructive exercise to verify that the predictive probabilities in this case take on a simple form:

$$P(X_{n+1} = c_i \mid X_1 = e_1, X_2 = e_2, \ldots, X_n = e_n) = P(X_{n+1} = c_i \mid n_1, n_2, \ldots, n_t).$$

(That is, although the conditional probability apparently depends on the entire state description e1, ..., en, in fact it only depends on the corresponding structure description n1, ..., nt.) In statistical parlance this last property is summarized by saying that the frequencies n1, ..., nt are sufficient statistics: no information is lost in summarizing the sequence e1, ..., en by the counts n1, ..., nt. Such statistics turn out to be a powerful tool in extensions of exchangeability discovered in recent decades; see, e.g., [Diaconis and Freedman, 1984].
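Both properties can be checked by direct computation. In the toy Python example below (my own illustration; the mixture components and weights are arbitrary), an exchangeable P is built as a mixture of two i.i.d. assignments over the three Star Trek types, and the permutation-invariance and sufficiency claims are verified:

```python
from fractions import Fraction
from itertools import permutations

# Three types, t = 3: Klingon, Romulan, Vulcan.
# An exchangeable P: an equal-weight mixture of two i.i.d. (multinomial) laws.
components = [
    ({'K': Fraction(1, 2), 'R': Fraction(1, 4), 'V': Fraction(1, 4)}, Fraction(1, 2)),
    ({'K': Fraction(1, 3), 'R': Fraction(1, 3), 'V': Fraction(1, 3)}, Fraction(1, 2)),
]

def P(seq):
    """Probability of a state description (a tuple of outcomes)."""
    total = Fraction(0)
    for probs, w in components:
        term = w
        for outcome in seq:
            term *= probs[outcome]
        total += term
    return total

def predictive(seq, nxt):
    """P(X_{n+1} = nxt | X_1, ..., X_n) = P(seq + nxt) / P(seq)."""
    return P(seq + (nxt,)) / P(seq)

# Permutation postulate: reordering a state description leaves P unchanged.
seq = ('K', 'K', 'R', 'V')
assert all(P(perm) == P(seq) for perm in permutations(seq))

# Sufficiency: sequences with the same counts yield the same predictive probability.
assert predictive(('K', 'R', 'K'), 'V') == predictive(('R', 'K', 'K'), 'V')
```

Any mixture of i.i.d. laws is exchangeable, which is why the checks succeed; the de Finetti theorem discussed in Section 5.2 says the converse also holds.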
4.3 The combination postulate
But what do we choose for P(n1, n2, ..., nt)? In the case t = 2, this reduces to assigning probabilities to the pairs (n1, n2). A little thought will show that Bayes’s postulate (that the different possible frequencies k are equally likely) is equivalent to assuming that the different pairs (n1, n2) are equally likely (since n1 = k, n2 = n − n1, and n is fixed). This in turn suggests the probability assignment that takes each of the possible structure descriptions to be equally likely, and this is in fact the path that both Johnson and Carnap initially took (Johnson termed this the combination postulate). Since there are

$$\binom{n+t-1}{t-1}$$

possible structure descriptions (also known as “ordered t-partitions of n”, a well-known combinatorial fact; see, e.g., [Feller, 1968, p. 38]), and each of these is assumed equally likely, one has

$$P(n_1, n_2, \ldots, n_t) = \frac{1}{\binom{n+t-1}{t-1}}.$$
Together, the combination and permutation postulates uniquely determine the probability of any specific finite sequence; if a state description e1, e2, ..., en has structure description n1, n2, ..., nt, then its probability is

$$P(e_1, e_2, \ldots, e_n) = \frac{1}{\binom{n+t-1}{t-1} \binom{n}{n_1\, n_2\, \cdots\, n_t}};$$

see Johnson [1924, appendix on eduction]. This is Carnap’s m* function. Having thus specified the probabilities of the “atomic” sequences, all other probabilities, including the rules of succession, are completely determined. Some simple algebra in fact yields

$$P(X_{n+1} = c_i \mid n_1, n_2, \ldots, n_t) = \frac{n_i + 1}{n + t};$$

see Johnson [1924]. This is Carnap’s c* function.

5 THE CONTINUUM OF INDUCTIVE METHODS
Although the mathematics of the derivation of the c* system is certainly attractive, its assumption that all structure descriptions are equally likely is hardly compelling, and Carnap soon turned to more general systems. It is ironic that here too his line of attack very closely paralleled that of Johnson. After criticisms from C. D. Broad [1924] and others, Johnson devised a more general postulate, later termed by I. J. Good [1965] the sufficientness postulate. This assumes that the predictive probabilities for a particular type i are a function of how many observations of the type have been seen already (ni), and the total sample size n. It is a remarkable fact that this characterizes the predictive probabilities or rules of succession (and therefore the probability of any sequence).
5.1 The Johnson-Carnap continuum
Suppose X1, X2, ..., Xn, ... represent an infinite sequence of observations, each assuming one of (the same) t possible values, and that at each stage n the sequence satisfies the permutation postulate. (In modern parlance, one has an infinitely exchangeable, t-valued sequence of random variables.) Assume the sequence satisfies the following three conditions:

1. Any state description e1, ..., en is a priori possible: P(e1, ..., en) > 0.
2. The “sufficientness postulate” is satisfied: P(Xn+1 = ei | n1, ..., nt) = fi(ni, n).
3. There are at least three types of species: t ≥ 3.
Then (unless the outcomes are independent of each other, so that observing one or more provides no predictive power regarding the others) the predictive probabilities have a very special form: there exist positive constants α1, ..., αt such that if α = α1 + ... + αt, then for all n ≥ 1, states ei, and structure descriptions n1, ..., nt,

$$P(X_{n+1} = e_i \mid n_1, \ldots, n_t) = \frac{n_i + \alpha_i}{n + \alpha}.$$

This truly beautiful result characterizes the predictive probabilities up to a finite sequence of positive constants α1, α2, ..., αt. Note that Carnap’s c* measure of confirmation is a special case of the continuum, with αi = 1 for all i. The assumption that all state descriptions have positive probability is needed to insure that the requisite conditional probabilities are well-defined. (In Carnap’s terminology, the probability function is regular.) The restriction t ≥ 3 is necessary because otherwise the sufficientness postulate would be vacuous. (One can recover the result in the case t = 2 by replacing the sufficientness postulate by the assumption that the predictive probabilities are linear in ni; see, e.g., [Zabell, 1982].)
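As a sketch of the continuum in use (illustrative code of mine; the function name and test values are not from the text), the predictive rule is a one-liner, and setting every αi = 1 recovers the rule (ni + 1)/(n + t):

```python
from fractions import Fraction

def continuum_predictive(counts, alphas, i):
    """P(X_{n+1} = c_i | n_1, ..., n_t) in the Johnson-Carnap continuum.

    `counts` are the observed frequencies n_1, ..., n_t and `alphas` the
    positive constants alpha_1, ..., alpha_t of the continuum."""
    n, a = sum(counts), sum(alphas)
    return Fraction(counts[i] + alphas[i], n + a)

counts = [3, 1, 0]   # n = 4 observations over t = 3 types

# With alpha_i = 1 for every i this is the (n_i + 1)/(n + t) rule:
assert continuum_predictive(counts, [1, 1, 1], 0) == Fraction(4, 7)

# Larger alphas damp the influence of the data (greater inductive caution):
assert continuum_predictive(counts, [10, 10, 10], 0) == Fraction(13, 34)
```

As the comparison suggests, α controls how quickly the observed frequencies ni/n dominate the prior weights αi/α.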
5.2 The de Finetti representation theorem
The assumption that arbitrarily long sequences satisfy the permutation postulate means their probabilities admit an integral representation of the type mentioned earlier in Section 3; this is the content of the celebrated de Finetti representation theorem [de Finetti, 1937]. Specifically, let $\Delta_t$ denote the set of probabilities on t elements:
$$\Delta_t := \{(p_1, \dots, p_t) : p_j \geq 0, \ \sum_{j=1}^{t} p_j = 1\}.$$
De Finetti's theorem states that if $X_1, X_2, X_3, \dots$ is an infinitely exchangeable sequence on t elements, then there exists a probability measure $d\mu$ on $\Delta_t$ such that for every $n \geq 1$, if $n_1, \dots, n_t$ are the frequency counts of $X_1, \dots, X_n$, then
$$P(n_1, n_2, \dots, n_t) = \int_{\Delta_t} \frac{n!}{n_1! n_2! \cdots n_t!} \, p_1^{n_1} p_2^{n_2} \cdots p_t^{n_t} \, d\mu(p_1, \dots, p_t).$$
(Note that a single measure $d\mu$ simultaneously achieves this for all sample sizes n.) There are a number of interesting foundational issues arising from this result. The integrand
$$\frac{n!}{n_1! n_2! \cdots n_t!} \, p_1^{n_1} p_2^{n_2} \cdots p_t^{n_t}$$
is a multinomial probability, and the theorem asserts that an exchangeable probability P can be represented as an integral mixture of multinomial probabilities. It is obvious that a multinomial probability, and more generally any mixture of multinomials, is exchangeable; the force of the theorem is that the converse holds:
S. L. Zabell
every exchangeable probability is expressible as a mixture. There is no restriction placed on the mixing measure $d\mu$. Many results in the literature of inductive inference are often easier to state, prove, or interpret in terms of such representations. For example, Johnson's theorem can be interpreted as telling us that when the sufficientness postulate is satisfied the averaging measure in the representation is a member of the classical Dirichlet family of prior distributions:
$$d\mu(p_1, \dots, p_t) = \frac{\Gamma(\sum_{j=1}^{t} \alpha_j)}{\prod_{j=1}^{t} \Gamma(\alpha_j)} \prod_{j=1}^{t} p_j^{\alpha_j - 1} \, dp_1 \cdots dp_{t-1} \qquad (\alpha_j > 0).$$
(Here $\Gamma$ denotes the gamma function; if k is a positive integer, then $\Gamma(k) = (k-1)!$.) The ability to characterize the predictive probabilities using Johnson's sufficientness postulate, however, means that in principle one can entirely pass over this interesting but more mathematically complex fact. As Johnson himself observed,

I substitute, for the mathematician's use of Gamma Functions and α-multiple integrals, a comparatively simple piece of algebra, and thus deduce a formula similar to the mathematician's, except that, instead of for two, my theorem holds for α alternatives, primarily postulated as equiprobable. [Johnson, 1932, p. 418; Johnson's α corresponds to our t]

Why are rules of succession so important? Note that the joint probability of a sequence of events can be built up from the corresponding sequence of conditional probabilities. For example: the joint probability $P(X_1 = e_1, X_2 = e_2, X_3 = e_3)$ can be expressed as $P(X_1 = e_1) \cdot P(X_2 = e_2 \mid X_1 = e_1) \cdot P(X_3 = e_3 \mid X_1 = e_1, X_2 = e_2)$. Thus one can express joint probabilities in terms of initial probabilities and rules of succession.
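This chain-rule construction can be verified numerically. The sketch below (our code, with illustrative parameters) builds sequence probabilities from successive applications of the Johnson-Carnap rule and checks that the result is exchangeable: permuting the observations leaves the joint probability unchanged.

```python
from itertools import permutations

def predictive(counts, alphas):
    # Johnson-Carnap rule of succession: (n_i + alpha_i) / (n + alpha)
    n, a = sum(counts), sum(alphas)
    return [(ni + ai) / (n + a) for ni, ai in zip(counts, alphas)]

def sequence_probability(seq, alphas):
    """Joint probability of a sequence of type-indices, built via the chain rule."""
    counts = [0] * len(alphas)
    prob = 1.0
    for i in seq:
        prob *= predictive(counts, alphas)[i]
        counts[i] += 1
    return prob

alphas = [1.0, 1.0, 1.0]
seq = (0, 0, 1, 2, 0)
p = sequence_probability(seq, alphas)
# Exchangeability: every rearrangement of the observations is equally probable.
assert all(abs(sequence_probability(s, alphas) - p) < 1e-12
           for s in set(permutations(seq)))
```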
5.3 Interpretation of the Continuum

Let us consider a specific method in the continuum, say with parameters $\alpha_1, \dots, \alpha_t$. Then one can write the rule of succession as
$$P(X_{n+1} = c_i \mid n_1, \dots, n_t) = \frac{n_i + \alpha_i}{n + \alpha} = \left(\frac{n}{n+\alpha}\right)\left[\frac{n_i}{n}\right] + \left(\frac{\alpha}{n+\alpha}\right)\left[\frac{\alpha_i}{\alpha}\right].$$
The two expressions in square brackets have obvious interpretations: the first, $n_i/n$, is the empirical frequency, and represents the input of experience; the second,
$\alpha_i/\alpha$, is our initial or prior probability concerning the likelihood of seeing $c_i$ (set $n_i = n = 0$ in the formula). The two terms in rounded brackets, $n/(n+\alpha)$ and $\alpha/(n+\alpha)$, sum to one and express the relative weight accorded to our observations versus our prior information. If $\alpha$ is small, then $n/(n+\alpha)$ is close to one, and the empirical frequencies $n_i/n$ are accorded primacy; if $\alpha$ is large, then $n/(n+\alpha)$ is small, and the initial probabilities are accorded primacy. Of course, "if $\alpha$ is large" must be understood relative to a fixed value of n; for any fixed value of $\alpha$ it is evident that
$$\lim_{n \to \infty} \frac{n}{n+\alpha} = 1,$$
reflecting the fact that no matter how large the initial weight assigned to our initial probabilities, these prior opinions are ultimately swamped by the overwhelming weight of empirical evidence.
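The decomposition is easy to check numerically. In this sketch (our function and parameter choices), the predictive probability is recomputed as the weighted average of empirical frequency and prior probability:

```python
def decompose(ni, n, ai, a):
    """Predictive probability as a weighted average of the empirical frequency
    n_i/n and the prior probability alpha_i/alpha."""
    blended = (n / (n + a)) * (ni / n) + (a / (n + a)) * (ai / a)
    assert abs(blended - (ni + ai) / (n + a)) < 1e-12  # same as the rule itself
    return blended

# With little data the prior term carries weight; with much data the
# empirical frequencies dominate.
p_small = decompose(ni=2, n=4, ai=1.0, a=3.0)       # weight on data: 4/7
p_large = decompose(ni=500, n=1000, ai=1.0, a=3.0)  # weight on data: 1000/1003
```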
5.4 History
The result itself has an interesting history. Johnson considered the special case when the function $f_i(n_i, n) = f(n_i, n)$; that is, it does not depend on the category or type i. In this case there is just one parameter, $\alpha$, since $\alpha_i = \alpha/t$ for all i. Johnson did not publish his result in his own lifetime (shades of Bernoulli and Bayes!); he had planned a fourth volume of his Logic, but only completed drafts of three chapters of it at the time of his death. A (then very young) R. B. Braithwaite edited the chapters for publication, and they appeared as three separate articles in Mind in 1932 [Johnson, 1932]. (It is ironic that G. E. Moore, the editor of Mind, questioned the desirability of including a mathematical appendix giving the details of the proof in such a journal, but Braithwaite — fortunately — insisted.) Due to its posthumous character, the proof as published contained a few lacunae, and a desire to fill these led to [Zabell, 1982]. This paper shows that not only can the above-mentioned lacunae be filled, but that Johnson's method very naturally generalizes to cover the asymmetric case (when the predictive function $f_i(n_i, n)$ depends on i), the case $t = \infty$, and the case of finite exchangeable sequences that are not infinitely extendable. Carnap followed much the same path as Johnson, initially considering the symmetric, category-independent case, except that he assumed both the sufficientness postulate and the form of the predictive probabilities given in the theorem. It was only later that his collaborator John G. Kemeny was able to prove the equivalence of the two (assuming t > 2). Carnap subsequently extended these results, first to cover the case t = 2 [Carnap and Stegmüller, 1959]; and finally in [Jeffrey, 1980, Chapter 6] abandoned the assumption of symmetry between categories and derived the full result given above (see also [Kuipers, 1978]). The historical evolution is traced in [Schilpp, 1963, pp. 74–75 and 979–980; Carnap and Jeffrey, 1971, pp.
1–4 and 223; Jeffrey, 1980, pp. 1–5 and 103–104].
278
S. L. Zabell
6 CONFIRMATION OF UNIVERSAL GENERALIZATIONS
Suppose all n observations are of the same type; for example, that we are observing crows and thus far all have been black. In such situations, it is natural to view our experience as evidence not just that most crows are black, but as confirming the "universal generalization" that all crows are black. This apparently natural expectation, however, leads to unexpected complexities.
6.1 Paradox feigned

This is due to an interesting property of the Johnson-Carnap continuum: (infinite) universal generalizations have zero probability! For example, having observed n black crows, it follows from k successive applications of the rule of succession that the probability the next k crows are also black is
$$P(X_{n+1} = X_{n+2} = \dots = X_{n+k} = c_i \mid n_i = n) = \prod_{j=n}^{n+k-1} \frac{j + \alpha_i}{j + \alpha}.$$
It is not hard to see that this product tends to zero as k tends to infinity. It is a standard result that if $0 < a_n \leq 1$ $(n \geq 1)$, then the infinite product $\prod_{n \geq 1} a_n$ diverges to zero if and only if the corresponding infinite series $\sum_{n \geq 1} (1 - a_n)$ diverges to infinity (see, e.g., [Knopp, 1947, pp. 218–221]). Because
$$\sum_{j=n}^{\infty} \frac{\alpha - \alpha_i}{j + \alpha}$$
diverges (it is essentially the harmonic series), one has
$$\lim_{k \to \infty} \prod_{j=n}^{n+k-1} \frac{j + \alpha_i}{j + \alpha} = 0.$$
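The decay is easy to exhibit numerically; this sketch (our parameter values, chosen purely for illustration) computes the product for increasing k:

```python
def prob_next_k_same(n, k, alpha_i, alpha):
    """P(next k observations are of type i | n_i = n), via k successive
    applications of the Johnson-Carnap rule of succession."""
    p = 1.0
    for j in range(n, n + k):
        p *= (j + alpha_i) / (j + alpha)
    return p

n, alpha_i, alpha = 10, 1.0, 3.0   # illustrative values
ps = [prob_next_k_same(n, k, alpha_i, alpha) for k in (10, 100, 10000)]
assert ps[0] > ps[1] > ps[2] > 0   # decreasing toward zero as k grows
```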
This was viewed as a defect of Carnap's system by several critics, for example, [Barker, 1957, pp. 87–88; Ayer, 1972, pp. 37–38, 80–81]. But the phenomenon itself had been both noted and defended much earlier, by Augustus De Morgan [1838, p. 128] in the nineteenth century ("No finite experience whatsoever can justify us in saying that the future shall coincide with the past in all time to come, or that there is any probability for such a conclusion"); and by C. D. Broad [1918] in a similar situation (the "finite rule of succession") in the twentieth. The obvious Bayesian response was advanced by Wrinch and Jeffreys [1919] a year after Broad wrote: one assigns non-zero initial probability to the generalization. As Edgeworth noted shortly after in his review of Keynes's Treatise, "pure induction avails not without some finite initial probability in favour of the generalisation, obtained from some other source than the instances examined" [Edgeworth, 1922, p. 267]. But can one build such a "finite initial probability" into the Carnapian approach (that is, via axiomatic characterization)? In order to understand this, let us first consider the simplest case.
6.2 Paradox lost

It is possible to see what is going wrong in terms of the sufficientness postulate. Suppose there are three categories, 1, 2, and 3, and none of the observations thus far fall into the first. What can one say about $P(X_{2n+1} = c_1 \mid n_1, n_2, n_3)$? According to the sufficientness postulate, there is no difference between the three cases (a) $n_2 = 2n$, $n_3 = 0$, (b) $n_2 = 0$, $n_3 = 2n$, and (c) $n_2 = n_3 = n$. But from the point of view of universal generalizations there is an obvious difference: the first and second cases confirm different universal generalizations (which may have different initial probabilities), while the third case disconfirms both. Continua confirming universal generalizations must treat the cases differently. Thus it is necessary to relax the sufficientness postulate, at least in the case when $n_i = n$ for some i. This diagnosis suggests a simple remedy. Suppose one modifies the sufficientness postulate so that the "representative functions" $f_i(n_1, \dots, n_t)$ (to use yet another terminology sometimes employed) are assumed to be functions of $n_i$ and n unless $n_i = 0$ and $n_j = n$ for some $j \neq i$. Then it can be shown (see, e.g., [Zabell, 1996]) that as long as the observations are exclusively of one type, the representative function consists of two parts: a term corresponding to the posterior probability that future observations will continue to be of this type (the "universal generalization"), and a Johnson-Carnap term; and this continues to be the case as long as all observations are of a single type. If, however, at any stage a second type is observed, then the representative function reverts to a pure Johnson-Carnap form. So this was a tempest in a teapot: this criticism of the continuum was easily answered even at the time it was initially made.
In hindsight the reason Johnson's postulate gives rise to the problem is apparent; the minimal change to the postulate necessary to remedy the problem results in an expanded continuum confirming precisely the desired universal generalizations (and no others), and this can be demonstrated by a straightforward modification of Johnson's original proof (for further discussion and references, see [Zabell, 1996]). But in fact much more is true: such an extension of the original Carnap continuum is merely a special case of a much richer class of extensions due to Hintikka, Niiniluoto, and Kuipers.
6.3 Hintikka-Niiniluoto systems
In order to appreciate Hintikka’s contribution, consider first the category symmetric case. Let Tn (X1 , X2 , ..., Xn ) denote the number of distinct types or species observed in the sample. In the continuum discussed in the previous subsection the predictive probabilities now depend not just on ni and n, but also on Tn , the number of instantiated categories. Specifically: is Tn = 1 or is Tn > 1? Thus put, this suggests a natural generalization: let the predictive probabilities be any
function of ni , n, and Tn . The result is a very attractive extension of the Carnap continuum. In brief, if the predictive probabilities depend on Tn , then in general they arise from mixtures of Johnson-Carnap continua concentrated on subsets of the possible types. Thus, given three categories a, b, c, the probabilities can be concentrated on a or b or c (universal generalizations), or Johnson-Carnap continua corresponding to the three pairs (a, b), (a, c), (b, c), or a Johnson-Carnap continuum on all three. In retrospect, this is of course quite natural. If only two of the three possibilities are observed in a long sequence of observations (say a and b), then (in addition to giving us information about the relative frequency of a and b) this tentatively confirms the initial hypothesis that only a’s and b’s will occur. In the more general category asymmetric case, the initial probabilities for the six different generalizations (a, b, c, ab, ac, and bc) can differ, and the predictive probabilities are postulated to be functions of ni , n, and the observed constituent: that is, the specific set of categories observed. (Thus in our example it is not enough to tell one that Tn = 2, but which two categories or species have been observed.) This beautiful circle of results originates with Hintikka [1966], and was later extended by Hintikka and Niiniluoto [1979]. The monograph by Kuipers [1978] gives an outstanding survey and synthesis of this work, including discussion of Kuipers’s own contributions; for a recent summary and evaluation, see Niiniluoto [2009].
6.4 Attribute symmetry

Both the original Johnson-Carnap continuum and its Hintikka-Niiniluoto-Kuipers generalizations are of great interest, but share a common weakness. If what one is trying to do is to capture precisely the notion of a category-symmetric state of knowledge – no more and no less – then the one and only constraint is that the resulting probabilities be invariant under permutation of the categories. Carnap referred to such invariance as attribute symmetry. If one writes an n-long sequence in compact form as $X : \{1, \dots, n\} \to \{1, \dots, t\}$, and P is a probability on the possible sequences X, then exchangeability requires P to be invariant under permutations of $\{1, \dots, n\}$ and attribute symmetry requires P to be invariant under permutations of $\{1, \dots, t\}$. Suppose one adds attribute symmetry to exchangeability as a restriction on P. The resulting class of probability functions is still infinite dimensional; see Zabell [1982, p. 1097; 1992, pp. 216–217]. At first sight this seems surprising: if our knowledge is category symmetric, surely the sufficientness postulate should hold. But it is not hard to construct counterexamples. For example, suppose we have a die and know one face is twice as likely to come up as another, but not which face. Then there are six hypotheses $H_j$: for $1 \leq j \leq 6$,
$$H_j : p_j = 2/7, \quad p_k = 1/7 \ (k \neq j),$$
and the six $H_j$ are judged equiprobable. Consider the following two possible
frequency vectors that could occur in a sample of size n = 70:
$$\mathbf{n}_1 = (20, 10, 10, 10, 10, 10), \qquad \mathbf{n}_2 = (20, 30, 5, 5, 5, 5).$$
Obviously $\mathbf{n}_1$ supports $H_1$ over $H_2$; and $\mathbf{n}_2$ supports $H_2$ over $H_1$, even though, if the sufficientness postulate held, the predictive probabilities for seeing a one on the next trial should be the same in each case. So there exist natural category-symmetric epistemic states in which the sufficientness postulate fails. In general, if there is attribute symmetry the sufficient statistics are the frequencies of the frequencies (denoted $a_r$): for each r, $0 \leq r \leq n$, $a_r$ is the number of categories j such that $n_j = r$. The recognition that even in these cases the entire list of frequencies $n_i$ may contain relevant information concerning the individual categories via the $a_r$ appears to go back to Turing; see [Good, 1965, Chapter 8]. Thus even assuming both exchangeability and attribute symmetry admits a rich family of possible probabilities; and it might be thought this would limit their utility. But even exchangeability by itself has many interesting qualitative consequences. The next section illustrates one of these.
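The failure can be verified by direct computation. This sketch (ours, using exact rational arithmetic) computes the posterior over the six hypotheses and the resulting predictive probability for face 1 under each frequency vector:

```python
from fractions import Fraction

DIE_FACES = 6

def posterior(counts):
    """Posterior over H_1..H_6 (uniform prior), where under H_j face j has
    probability 2/7 and every other face 1/7."""
    likes = []
    for j in range(DIE_FACES):
        like = Fraction(1)
        for k, nk in enumerate(counts):
            like *= (Fraction(2, 7) if k == j else Fraction(1, 7)) ** nk
        likes.append(like)
    total = sum(likes)
    return [l / total for l in likes]

def predictive_face1(counts):
    """Probability that the next roll shows face 1, averaging over hypotheses."""
    return sum(pj * (Fraction(2, 7) if j == 0 else Fraction(1, 7))
               for j, pj in enumerate(posterior(counts)))

n1 = (20, 10, 10, 10, 10, 10)
n2 = (20, 30, 5, 5, 5, 5)
# Both samples show face 1 exactly 20 times in n = 70 rolls, yet the
# predictive probabilities differ: the sufficientness postulate fails.
assert predictive_face1(n1) > predictive_face1(n2)
```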
7 INSTANTIAL RELEVANCE One important desideratum of a candidate for confirmation is instantial relevance: if a particular type is observed, then it is more likely that such a type will be observed in the future. In its simplest form, this is the requirement that if i < j, then P (Xj = 1 | Xi = 1) ≥ P (Xj = 1) (the Xk denoting indicators that take on the values 0 or 1). It is not hard to see that exchangeability alone does not insure instantial relevance. Suppose, for example, one draws balls at random from an urn initially having three red balls and two black balls. If the sampling is without replacement, then the probability of selecting a red ball is initially 3/5, but the probability of selecting a second red ball, given the first is red, is 1/2. In the past there was a small cottage industry devoted to investigating the precise circumstances under which the principle of instantial relevance does or does not hold for a sequence of observations. If the observations in question can be imbedded in an infinitely exchangeable sequence (that is, into an infinite sequence X1 , X2 , ..., any finite segment X1 , ..., Xn of which is exchangeable), then instantial relevance does hold. After the power of the de Finetti representation theorem was appreciated, very simple proofs of this were discovered (see, e.g., [Carnap and Jeffrey, 1971, Chapters 4 and 5]). There are also simple ways of seeing this without using the representation theorem. For example, the principle of instantial relevance is equivalent to the assertion that the observations are nonnegatively correlated. If X1 , X2 , ..., Xn is an
exchangeable sequence of random variables, then an elementary argument shows that the correlation coefficient $\rho = \rho(X_i, X_j)$ satisfies the simple inequality
$$\rho \geq -\frac{1}{n-1}.$$
This is because (using both the formula for the variance of a sum and the exchangeability of the sequence) if $\sigma^2 = \mathrm{Var}[X_i]$, one has
$$0 \leq \mathrm{Var}[X_1 + \dots + X_n] = n\sigma^2 + n(n-1)\rho\sigma^2.$$
Thus, if the sequence can be indefinitely extended (so that one can pass to the limit $n \to \infty$), it follows that $\rho \geq 0$. The case $\rho = 0$ then corresponds to the case of independence (the past conveys no information about the future, inductive inference is impossible); and the case $\rho > 0$ corresponds to inductive inference and positive instantial relevance.

8 FINITE EXCHANGEABILITY

In the end, infinite sequences are really just fictions, so we would rather not incorporate them into our Weltanschauung in an essential way. In this section we take a closer look at this question.
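As a concrete example of such finite sequences (the urn and computation are ours): exhaustive sampling without replacement yields an exchangeable sequence of length N whose indicator correlation is exactly $-1/(N-1)$, attaining the lower bound of the previous section.

```python
from fractions import Fraction

def correlation(r, b):
    """Correlation of the indicators 'red on draw i' and 'red on draw j'
    (i != j) when sampling without replacement from r red, b black balls."""
    N = r + b
    p = Fraction(r, N)                            # E[X_i]
    e_joint = Fraction(r * (r - 1), N * (N - 1))  # E[X_i X_j]
    cov = e_joint - p * p
    var = p * (1 - p)
    return cov / var

# N = 5 draws: the bound -1/(N - 1) = -1/4 is attained exactly.
assert correlation(3, 2) == Fraction(-1, 4)
```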
8.1 Extendability

The de Finetti representation only holds for infinite sequences; it is easy to construct counterexamples otherwise. Consider, for example, the exchangeable assignment
$$P(RB) = P(BR) = \frac{1}{2}; \qquad P(RR) = P(BB) = 0.$$
This corresponds to sampling without replacement from an urn containing one red ball (R) and one black ball (B). This exchangeable probability assignment on ordered pairs cannot be extended to one on ordered triples. To see this, suppose otherwise. Then
$$P(RBR) + P(RBB) = P(RB) = \frac{1}{2},$$
so either P(RBR) > 0 or P(RBB) > 0 (or both). Suppose without loss of generality that P(RBR) > 0. Then
$$P(RR) \geq P(RRB) = P(RBR) > 0$$
(the first inequality follows because probability is monotone, that is, if $A \subseteq B$, then $P(A) \leq P(B)$; the equality because P is by assumption exchangeable). But this is impossible, since P(RR) = 0. (It is not hard to see this is typical: sampling without replacement from a finite population results in an exchangeable probability assignment that cannot be extended.)
In general, if $X_1, X_2, \dots, X_n$ is an exchangeable sequence, then it may or may not be possible to extend it to a longer exchangeable sequence $X_1, X_2, \dots, X_n, \dots, X_{n+r}$, $r \geq 1$. If it is possible to do so for every $r \geq 1$, then we can think of $X_1, X_2, \dots, X_n$ as the initial sequence of an infinitely exchangeable sequence $X_1, X_2, X_3, \dots$ (thanks to the Kolmogorov existence theorem). Thus the de Finetti representation theorem applies, the infinite sequence can be represented as a mixture of iid (independent and identically distributed) sequences, and hence a fortiori the initial segment of length n can be so represented. On the other hand, if a finite exchangeable sequence of length n has a representation as a mixture of iid sequences, it is immediate that it is infinitely extendable. Thus: A finite exchangeable sequence is infinitely extendable if and only if it is representable as a mixture of iid sequences. To summarize: in general a finite exchangeable sequence may or may not be extendable. Carnap alludes to this fact when he reports that while at the Institute for Advanced Study in 1952–1953, he and his collaborator John Kemeny had talks with L. J. Savage. Among other things, Savage showed them that the use of a language $L_N$ with a finite number of individuals is not advisable, because a symmetric M-function in $L_N$ cannot always be extended to an M-function in a language with a greater number of individuals. [Carnap and Jeffrey, 1971, p. 3] Note the curious phrase "not advisable". It is unclear why Savage thought this (if indeed he did): recall that sampling without replacement from a finite population results in a perfectly respectable exchangeable assignment even though it cannot be extended. More generally, think of any population which is naturally finite in extent, and to which we wish to extrapolate on the basis of a partial sample from it. (For example, think of a limited edition of a book, and whether or not such books are defective.)
The phenomenon of non-extendability is in no sense pathological. Of course there is a price to pay: the loss of the de Finetti representation. Or is there?
8.2 The finite representation theorem
Given a set of counts $\mathbf{n} = (n_1, \dots, n_t)$, imagine an urn containing $n_j$ balls of type j, and suppose one successively draws out "at random" without replacement each ball in the urn ("at random" meaning that all possible sequences are judged equally likely). There are a total of $(n_1 + \dots + n_t)!/(n_1! \cdots n_t!)$ such sequences; the exchangeable probability assignment $H_{\mathbf{n}}$ giving each of these equal probability is called the hypergeometric distribution. If, more generally, $X_1, \dots, X_n$ is any exchangeable sequence whatsoever, and $P(\mathbf{n})$ the corresponding probability assignment on the set of counts $\mathbf{n}$, then the overall probability assignment P on the set of sequences is a mixture of the hypergeometric probabilities $H_{\mathbf{n}}$ using the weights
$P(\mathbf{n})$; compactly this can be expressed as
$$P = \sum_{\mathbf{n}} P(\mathbf{n}) H_{\mathbf{n}}.$$
This result is the finite de Finetti representation theorem. It is basically just the so-called "theorem of total probability" in disguise. It tells us that the structure of the generic finite exchangeable sequence is really quite simple. If the sequence is N long, and the outcomes can be of t different types, then you can think of it as a sequence of draws from an urn with N balls, each of which can be one of the t types, but the distribution of types among the N balls (the $\mathbf{n}$) is unknown. If (as the Spartans would say) you knew the distribution of types, then your probability assignment would be the appropriate hypergeometric distribution. But since you don't, you assign a prior distribution to $\mathbf{n}$ and then average. Although the finite representation theorem is not quite as well known (or appreciated) as its big brother, the representation theorem for an infinite exchangeable sequence, it would be a serious mistake to underestimate it. To begin, thanks to the representation, there is a drastic reduction in the number of independent probabilities to be specified; in the case of tossing a coin 10 times, for example, from $2^{10} - 1 = 1023$ to 11. But there are also important conceptual and philosophical advantages to thinking in terms of the finite representation theorem.
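The finite representation can be exhibited concretely. This sketch (our notation, with an arbitrary choice of weights) builds a mixture of hypergeometric urn distributions for 0/1 sequences of length N = 4 and checks that the result is a genuine exchangeable probability:

```python
from itertools import product
from math import comb

def hypergeometric_seq_prob(seq, n_ones, N):
    """Probability of a particular 0/1 sequence of length N when all N balls
    of an urn (n_ones labeled 1, the rest 0) are drawn in random order."""
    if sum(seq) != n_ones:
        return 0.0
    return 1.0 / comb(N, n_ones)   # all orderings equally likely

N = 4
weights = [0.1, 0.2, 0.4, 0.2, 0.1]  # an arbitrary prior P(n) over n_ones = 0..N

def P(seq):
    # finite de Finetti mixture: P = sum_n P(n) H_n
    return sum(w * hypergeometric_seq_prob(seq, k, N)
               for k, w in enumerate(weights))

total = sum(P(s) for s in product((0, 1), repeat=N))
assert abs(total - 1.0) < 1e-12                         # P is a probability
assert abs(P((1, 1, 0, 0)) - P((0, 1, 0, 1))) < 1e-12   # and exchangeable
```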
8.3 The finite rule of succession
The classical rule of succession, that if in n trials there are k successes, then the probability of a success on the next trial is (k + 1)/(n + 2), assumes you are sampling from an infinite population (see [Laplace, 1774]). (Strictly speaking the last makes no sense, but it can be viewed as a shorthand for either sampling with replacement (so that the population remains unaltered by the sampling) or as passing to the limit in the case of sampling from a finite population.) In particular, if all n are of the same type, then the probability that the next is also of this type is (n + 1)/(n + 2). But it is clear that the basic relevant question is a different one: the probability if you are sampling without replacement from a finite population. This question was first asked and answered by Prevost and L'Huilier [1799]. To answer the question, of course, one must make some assumption regarding the composition of the urn (that is, adopt some set of prior probabilities regarding the different possible urn compositions). The natural assumption, parallel to the Bayes-Laplace analysis, is to assume all possible vectors of counts are equally likely. Doing this, Prevost and L'Huilier were able first to derive the posterior probabilities for the different constitutions of the urn; and then from this derive the rule of succession as a consequence, the final result being that (given p successes out of m to date) the probability of a success on the next trial is (p + 1)/(m + 2), exactly the same answer as the classical Laplace rule of succession!
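The Prevost-L'Huilier computation can be replayed by brute force. In this sketch (our code; the result itself is as stated in the text), a uniform prior over urn compositions yields exactly Laplace's (p + 1)/(m + 2):

```python
from fractions import Fraction
from math import comb

def finite_rule_of_succession(N, m, p):
    """P(success on draw m+1 | p successes in first m draws), where the urn
    holds N balls and its number of success balls s is uniform on 0..N."""
    num = den = Fraction(0)
    for s in range(N + 1):
        if s < p or N - s < m - p:
            continue  # composition incompatible with the observed sample
        # hypergeometric likelihood of p successes in m draws, given s
        like = Fraction(comb(s, p) * comb(N - s, m - p), comb(N, m))
        den += like                            # uniform prior over s
        num += like * Fraction(s - p, N - m)   # chance the next ball is a success
    return num / den

# Exactly Laplace's (p + 1)/(m + 2), despite the finite urn:
assert finite_rule_of_succession(N=20, m=7, p=3) == Fraction(4, 9)
```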
This result was subsequently independently rediscovered several times over the next century and a quarter, the last being by C. D. Broad in 1918, when it finally gained some traction in philosophical circles (see generally [Zabell, 1988]). The brute force mathematical derivation of this particular rule of succession requires the evaluation of a tricky combinatorial sum; and its history of successive rediscovery is a phenomenon that is sometimes seen in the mathematical literature when a result is interesting enough (so that it repeatedly attracts attention), hard enough (so that it is deemed worthy of publication), and obscure or technical enough (so that it is then subsequently easily forgotten or overlooked). But our point here is that this striking coincidence between the finite and infinite rules of succession, which, when viewed through the prism of the combinatorial legerdemain required to evaluate the necessary sum, appears to be a minor miracle, is in fact obvious when thought of in terms of the finite representation theorem. For consider. Suppose $X_1, X_2, \dots$ is an infinite exchangeable sequence of 0s and 1s having mixing measure $dQ(p) = dp$ in the de Finetti representation (that is, the Bayes-Laplace process). If $S_n = X_1 + \dots + X_n$ denotes the number of 1s in n trials, then, as noted earlier,
$$P(S_n = k) = \int_0^1 \binom{n}{k} p^k (1-p)^{n-k} \, dp = \frac{1}{n+1}.$$
Now consider the initial segment $X_1, X_2, \dots, X_n$ by itself. This is a finite exchangeable sequence, and so has a finite representation in terms of some mixture of hypergeometric probabilities. But the mixing measure for the finite representation in the dichotomous case is $P(S_n = k)$, which is, as just noted, 1/(n + 1), the Prevost-L'Huilier prior (or, as Jack Good might put it, the Prevost-L'Huilier-Terrot-Todhunter-Ostrogradskii-Broad prior).
But the finite representation uniquely determines the stochastic structure of a finite exchangeable sequence; thus an n-long Prevost-L'Huilier sequence is stochastically identical to the initial, n-long segment of the Bayes-Laplace process, and therefore the two coincide in all respects, including (but not limited to) their rules of succession. No tricky sums! Viewed from the perspective of the philosophical foundations of inductive inference the finite rule of succession is important for two reasons vis-a-vis the classical Laplacean analysis:

1. It eliminates a variety of possible concerns about the occurrence of the infinite in the Laplacean analysis (e.g., [Kneale, 1949, p. 205]): that is, attention is focused on a finite segment of trials, rather than a hypothetical infinite sequence or population.

2. The frequency, propensity, or objective chance p that appears in the integral is replaced by the fraction of successes in a finite population; thus a purely personalist or subjective analysis becomes possible and objections to "probabilities of probabilities" or "unknown probabilities" (e.g., [Keynes, 1921, pp. 372–375]) are eliminated.
8.4 The finite continuum of inductive methods As one final example of both the utility and interest of considering finite exchangeable sequences, we note in passing that Johnson’s derivation of the continuum of inductive methods carries over immediately to the finite case, the chief element of novelty being that now the α parameters in the rule of succession can be negative (since, for example, when sampling without replacement from an urn, the more balls of a given color one sees, the less likely it becomes to see other balls of the same color); see [Zabell, 1982].
8.5 The proper role of the infinite

Aristotle (Physics 3.6; see, e.g., [Heath, 1949, pp. 102–113]) distinguishes between the actual infinite and the potential infinite, a useful distinction to keep in mind when thinking about the use of the infinite in probability. One might summarize Aristotle as saying that the use of the infinite is only appropriate in its potential rather than actual sense. Let us apply this to the case of probability: theories that depend in an essential way on the actual infinite are fatally flawed. Consider von Mises's frequency theory. In any theory of physical probability, if 0 < p < 1 is the probability of an outcome in a sequence of independent trials, then any finite frequency k in n trials has a positive probability. Thus any observed value of k is consistent with any possible value of p. In von Mises's theory, in order to achieve this consistency of any p with any k, it is essential that p be an infinite limiting frequency. But, being infinite in nature, p is unobservable, hence metaphysical (in the pejorative sense); see, e.g., [Jeffrey, 1977]. But, one might object, doesn't the infinite representation theorem also suffer from this defect, since it holds just for infinitely exchangeable sequences (rather than finitely exchangeable sequences, the only things we really see)? The answer is no, if one correctly understands it from both a mathematical and a philosophical standpoint.

Mathematical interpretation of the representation theorem

In applied mathematics one frequently uses infinite limit theorems as approximations to the large but finite. That is, the sequence, although of course necessarily finite, is viewed as effectively unlimited in length. (So, for example, in tossing a coin, there is no practical limit to how many times we can toss it, although it will certainly wear down after many googols of tosses.) But the applied mathematician must also have some idea of when to use a limit theorem as an approximation and when not.
This is the reason the central limit theorem (CLT) is of practical use, but the law of the iterated logarithm (LIL) is not: the CLT provides an excellent approximation to sums of random variables for surprisingly small sample sizes; the LIL only for surprisingly large ones. What this ultimately means is that what the applied mathematician needs is either a generous fund of experience or a more informative mathematical result:
not just the limiting value but the rate of convergence to that limit. Happily such a result is available for the de Finetti representation theorem, thanks to Persi Diaconis and David Freedman [1980a]. First some notation: if S is a set, let $S^n$ denote its n-fold Cartesian product ($n \leq \infty$). If p is a probability on S, let $p^n$ denote the corresponding n-fold product probability on $S^n$ (corresponding to an n-long, p-iid sequence). If P is a probability on $S^n$, then $P_k$ denotes its restriction to $S^k$, $k \leq n$. If $\Theta$ parametrizes the set of probabilities on S and $\mu$ is a probability on $\Theta$ (to be thought of as a mixing measure), let $P_\mu^n$ denote the resulting exchangeable probability on $S^n$; that is,
$$P_\mu^n = \int_\Theta p_\theta^n \, d\mu(\theta).$$
Finally, if P and Q are probabilities on S^n, let
\[
\|P - Q\| = \max_{A \subset S^n} |P(A) - Q(A)|
\]
denote the variation distance between P and Q. Then one has the following result: Suppose S is a finite set of cardinality t and P is an exchangeable probability on S^n. Then there exists a probability μ on the Borel sets of Θ such that
\[
\|P_k - P_\mu^k\| = \Big\| P_k - \int_\Theta p_\theta^k \, d\mu(\theta) \Big\| \le \frac{2tk}{n} \quad \text{for all } k \le n.
\]
This beautiful result has a number of interesting consequences. First, it makes precise the interrelationship between extendability and the existence of an integral representation. Given an exchangeable sequence of length k, if the sequence is extendable to a longer sequence of length n, then it can be approximated by an integral mixture to order k/n in variation distance. The more the sequence can be extended, the more it looks like an integral mixture. Thus it is not surprising (and Diaconis and Freedman in fact use the above theorem to prove) that a sequence which can be extended indefinitely (equivalently, is the initial segment of an infinitely exchangeable sequence) has an integral representation. But the theorem also tells us how to think about the application of the representation theorem. Given a sequence that is the initial segment of a “potentially infinite” sequence (that is, unbounded in any practical sense), thinking of it as an integral mixture is a reasonable approximate procedure (in just the same way as summarizing a population of heights in terms of a normal distribution is a reasonable approximation to an ultimately discrete underlying reality). For a very readable discussion of this topic, see [Diaconis, 1977].

Philosophical interpretation of the representation theorem

From this perspective the representation is a tool used for mathematical approximation. The “parameter” p is a purely mathematical object, not a physical quantity. This was in fact de Finetti’s view: “it is possible... and to my mind preferable,
288
S. L. Zabell
to stick to the firm and unexceptionable interpretation that the limit distribution is merely the asymptotic expression of frequencies in a large, but finite, number of trials” [de Finetti, 1972, p. 216]. De Finetti was a finitist who rejected the use of countable additivity in probability as lacking a philosophical justification. (It is not a consequence of the usual Dutch book argument.) In particular, de Finetti’s statement and proof of the representation theorem uses only finitely additive probability. See Cifarelli and Regazzini [1996] for an outstanding discussion of the role of the infinite in de Finetti’s papers.

9 THE FIRST INDUCTION THEOREM

There is a very interesting result, which Good [1975, p. 62] terms the first induction theorem. Its interest is that it makes no reference at all to exchangeability, and yet it provides an account of enumerative induction, in that it tells us that confirming instances (in a sense to be made precise in a moment) increase the probability of other potential instances. To be precise, if P(H) > 0 and P(E_j | H) = 1, j ≥ 1 (the E_j are “implications” of H), then (E_1 E_2 denoting the conjunction of E_1 and E_2, and so on),
\[
\lim_{n \to \infty} P(E_{n+1} E_{n+2} \cdots E_{n+m} \mid E_1 E_2 \cdots E_n) = 1
\]
uniformly in m. The proof (due to [Huzurbazar, 1955]) is at once simple and elegant. Just note that for any n ≥ 1, one has P(E_1 ⋯ E_n | H) = 1, hence
\[
P(E_1 \cdots E_n) \ge P(E_1 \cdots E_n E_{n+1}) \ge P(E_1 \cdots E_n E_{n+1} H) = P(H) > 0.
\]
It follows that u_n = P(E_1 ⋯ E_n) is a decreasing sequence bounded from below by a positive number, and therefore has a positive limit. Thus
\[
\lim_{n \to \infty} P(E_{n+1} E_{n+2} \cdots E_{n+m} \mid E_1 E_2 \cdots E_n) = \lim_{n \to \infty} \frac{u_{n+m}}{u_n} = 1;
\]
and it is apparent that the convergence is uniform in m. The result is not so surprising for sampling from a finite population, but for a potentially infinite sequence it is at first startling. It tells us that observing a sufficiently long sequence of confirming instances makes the probability of any further finite sequence of confirming instances, no matter how long, as close to one as desired. Good [1975, p. 62] says “the kudology is difficult”, but cites both Keynes [1921, Chapter 20] and Wrinch and Jeffreys [1921]; see also [Jeffreys, 1961, pp. 43–44].

10 ANALOGY
Simple enumeration is an important form of inductive inference but there are also others, based on analogy. Carnap distinguished between two forms of analogy:
analogy by proximity and analogy by similarity; that is, proximity in time (or sequence number) and similarity of attribute. In the case of inductive analogy, Carnap wished to generalize his results, allowing for the possibility that the inductive strength of P varies depending on some measure of “closeness” of either time or attribute. In the case of attributes this required the specification of a “distance” on the attribute set; in the case of time such a metric is of course already present. But Carnap obtained only partial results in this case (see [Carnap and Jeffrey, 1971, p. 1; Jeffrey, 1980, Chapter 6, Sections 16–18]). De Finetti and his successors were more successful. De Finetti formulated early on a concept of partial exchangeability [de Finetti, 1938], differing forms of partial exchangeability corresponding to differing forms of analogy. He viewed matters in effect as a spectrum of possibilities; exchangeability representing one extreme, a limiting case of “absolute” analogy. At the other extreme all one has is Bayes’s theorem, P(E|A) = P(AE)/P(A); absent “particular hypotheses concerning the influence of A on E”, nothing further can be said, “no determinate conclusion can be deduced”. The challenge was to find “other cases ... more general but still tractable”. For an English translation of de Finetti’s paper, see [Jeffrey, 1980, Chapter 9]. Diaconis and Freedman [Jeffrey, 2004, pp. 82–97] provide a very readable introduction to de Finetti’s ideas here.
10.1 Markov exchangeability
One example of building analogy by proximity into a probability function is the concept of Markov exchangeability (describing a form of analogy in time). Suppose X0 , X1 , ... is an infinite sequence of random outcomes, each taking values in the set S = {c1 , ..., ct }. For each n ≥ 1, consider the statistics X0 (the initial state of the chain) and the transition counts nij recording the number of transitions from ci to cj in the sequence up to Xn . (That is, the number of times k, 0 ≤ k ≤ n − 1, such that Xk = ci and Xk+1 = cj .) If for all n ≥ 1, all sequences X0 , ..., Xn starting out in the same initial state x0 and having the same transition counts nij have the same probability, then the sequence is said to be Markov exchangeable. Suppose further that the sequence is recurrent: the probability is 1 that Xn = X0 for infinitely many n. (That is, the sequence returns to the initial state infinitely often.) There is, it turns out, a de Finetti type representation theorem for the stochastic structure (probability law) of such sequences: they are precisely the mixtures of Markov chains, just as ordinary exchangeable sequences are mixtures of binomial or multinomial outcomes [Diaconis and Freedman, 1980b]. Furthermore there is also a Johnson-Carnap type rule of succession [Zabell, 1995]. Of course one might ask why Markov exchangeability is a natural assumption to make. Diaconis and Freedman [Jeffrey, 2004, p. 97] put it well: “If someone ... had never heard of Markov chains it seems unlikely that they would hit on the appropriate notion of partial exchangeability. The notion of symmetry seems strange at first ... A feeling of naturalness only appears after experience and
reflection.” For further discussion of Markov exchangeability and its relation to inductive logic, see [Skyrms, 1991].
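The defining invariance is easy to check concretely. In the sketch below (the chain and the sequences are illustrative choices, not from the text), two binary sequences with the same initial state and the same transition counts receive the same probability under an arbitrary Markov chain, and hence under any mixture of Markov chains:

```python
from collections import Counter

def seq_prob(seq, init, P):
    """Probability of a sequence under a Markov chain with initial
    distribution init and transition matrix P."""
    p = init[seq[0]]
    for a, b in zip(seq, seq[1:]):
        p *= P[a][b]
    return p

def transition_counts(seq):
    """The counts n_ij of transitions from state i to state j."""
    return Counter(zip(seq, seq[1:]))

init = [0.3, 0.7]             # illustrative initial distribution
P = [[0.9, 0.1], [0.4, 0.6]]  # illustrative transition matrix

# Two sequences with the same initial state and the same transition counts:
s1 = (0, 0, 1, 0, 1)
s2 = (0, 1, 0, 0, 1)
assert s1[0] == s2[0]
assert transition_counts(s1) == transition_counts(s2)

# ...are assigned the same probability by any Markov chain, and hence by any
# mixture of Markov chains: the defining property of Markov exchangeability.
assert abs(seq_prob(s1, init, P) - seq_prob(s2, init, P)) < 1e-15
```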
10.2 Analogy by similarity

Given the tentative and limited nature of Carnap’s attempts to formulate an inductive logic that incorporated analogy by similarity, this stood as an obvious challenge, and since Carnap’s death there have been a number of attempts in this direction; see, e.g., [Romeijn, 2006] and the references there to earlier literature. Skyrms [1993; 1996] suggests using what he terms “hyperCarnapian” systems: finite mixtures of Dirichlet priors. He argues (p. 331): “In a certain sense, this is the only solution to Carnap’s problem. ... HyperCarnapian inductive methods are the general solution to Carnap’s problem of analogy by similarity”. But what if the outcomes are continuous in nature? In order to discuss this, it will be necessary to first revisit the definition of exchangeability.
10.3 The general definition of exchangeability

Consider first the general definition of exchangeability. A probability P on the space of sequences x_1, x_2, ..., x_n of real numbers (that is, on R^n) is said to be (finitely) exchangeable if it is invariant under all permutations σ of the index set {1, ..., n}; a probability P on the space of infinite sequences x_1, x_2, ... (that is, on R^∞) is said to be infinitely exchangeable if its restriction P_n to finite sequences x_1, x_2, ..., x_n is exchangeable for each n ≥ 1. There is a sweeping generalization of the de Finetti representation theorem that characterizes such probabilities. Some notation, briefly. Let {P_θ : θ ∈ Θ} denote the set of independent and identically distributed (iid) probabilities on infinite sequences. (That is, if p_θ is a probability measure on R, then P_θ = (p_θ)^∞ is the corresponding product measure on R^∞. Here θ is just an index for the probabilities on the real line. Certain measure-theoretic niceties are being swept under the carpet at this point to simplify the exposition.) Now suppose that P is an infinitely exchangeable probability on infinite sequences. Then there exists a unique probability μ on Θ such that
\[
P = \int_\Theta P_\theta \, d\mu(\theta).
\]
That is, every exchangeable P on infinite sequences can be represented as a mixture of independent and identically distributed probabilities. (It is clear that every mixture of iid sequences is exchangeable; it is the point of the representation theorem that conversely every infinitely exchangeable probability arises thus. Aldous [1986] contains an outstanding survey of this and other generalizations of the original de Finetti theorem.) Thus, in order to arrive at P , it suffices to specify μ. Unfortunately, Θ is an uncountably infinite set, and the representation usefully reduces the dimensionality
of the problem of determining P only if one is able to exploit a difference in infinite cardinals!
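The content of the representation, that mixing iid probabilities always yields an exchangeable law, can be illustrated in the simplest discrete case. The sketch below (the Beta(2, 3) mixing measure and the particular sequence are arbitrary illustrative choices) computes the Beta-Bernoulli sequence probability in closed form, checks it against direct numerical integration, and verifies permutation invariance:

```python
from itertools import permutations
from math import gamma

def beta(a, b):
    """The Beta function B(a, b) = Γ(a)Γ(b)/Γ(a+b)."""
    return gamma(a) * gamma(b) / gamma(a + b)

def seq_prob(seq, a=2.0, b=3.0):
    """P(x_1, ..., x_n) = ∫ p^k (1-p)^(n-k) dμ(p) for a Beta(a, b) mixing
    measure μ: the Beta-Bernoulli sequence probability, which depends on
    the sequence only through the number of successes k."""
    k, n = sum(seq), len(seq)
    return beta(a + k, b + n - k) / beta(a, b)

seq = (1, 0, 0, 1, 0)
p0 = seq_prob(seq)

# A mixture of iid laws is exchangeable: every rearrangement of the
# sequence receives the same probability.
for perm in set(permutations(seq)):
    assert abs(seq_prob(perm) - p0) < 1e-12

# Cross-check the closed form against direct numerical integration of
# ∫ p^3 (1-p)^5 dp / B(2, 3) by the midpoint rule.
N = 20000
integral = sum(((i + 0.5) / N) ** 3 * (1 - (i + 0.5) / N) ** 5 for i in range(N)) / N
assert abs(integral / beta(2, 3) - p0) < 1e-6
```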
10.4 The pragmatic Bayesian approach
In practical Bayesian statistics one sometimes proceeds as follows. Based on the background, training, and experience of the statistician, it is judged that the underlying but unknown distribution p_θ of a population of numbers is a member of some particular parametric family (for example, normal, exponential, geometric, or Poisson) and it is the task of the statistician to estimate the unknown parameter θ. The parameter space Θ is now finite dimensional, often one dimensional. The mathematical model for a sample from such a population is an iid sequence of random variables X_1, X_2, X_3, ..., each X_j having distribution p_θ, so that X_1, X_2, X_3, ... has distribution P_θ = (p_θ)^∞. Being a Bayesian, the statistician assigns a “prior” or initial probability to Θ; the average over Θ using dμ then specifies a probability P as in the displayed formula above. Given a “random sample” (iid sequence) X_1, ..., X_n from the population, the statistician then computes the “posterior” or final probability P(θ | X_1, ..., X_n) using Bayes’s theorem. In general, the larger the sample, the more concentrated the posterior distribution is about some value of the parameter. For example, if the density of p_θ is
\[
p_\theta(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(x - \theta)^2}{2}\right), \quad -\infty < x < \infty
\]
(that is, normal, standard deviation one, unknown mean θ), then (except for certain “over-opinionated” priors) the posterior distribution for θ will be concentrated about X̄_n, the sample mean for the random sample X_1, ..., X_n. It is apparent that this procedure in fact captures precisely the form of analogical reasoning that Carnap had in mind. That is, if the sample mean is X̄_n = x, then the resulting posterior distribution expresses support for the belief that the next observation will be in the vicinity of x, the strength of the evidence for different values y decreasing as the distance of y from x increases. “But”, the Carnapian may object, “this is an enterprise entirely different from the one Carnap envisaged!
There is no logical justification proffered for the choice of the parametric family p_θ, or the choice of the prior dμ!” True, but how might such a justification, if it existed, proceed? Consider the multinomial case in the continuum of inductive methods. There the de Finetti representation theorem tells us that the most general exchangeable sequence is a mixture of multinomial probabilities. The elegance of the Johnson-Carnap approach is that it replaces the essentially arbitrary, albeit mathematically convenient, quantitative assumption of the practicing Bayesian statistician that the prior is a member of a specific low-dimensional family (the Dirichlet priors
on Δ_{t−1}) by the purely qualitative sufficientness postulate. That is, based on information received one might well arrive at the purely qualitative judgment that the probability that the next observation will be of a certain type should depend only on the number of that type already observed and the total number of observations to date. This is certainly a more principled approach to the problem of assigning a prior, in stark contrast to assuming the prior is Dirichlet purely for reasons of mathematical convenience. Framed in this way, the form of a principled Bayesian approach to the more general problem (of deciding on priors for other parametric families) is also clear. Can one find, at least for the most common parametric families in statistics, a natural qualitative assumption on a sequence of observations in addition to exchangeability that implies the sequence is in fact not just an arbitrary mixture of iid probabilities, but a mixture of distributions strictly within the given parametric family? For example, what would be an analog of the sufficientness postulate ensuring that an exchangeable sequence is a mixture of normal, or exponential, or geometric, or Poisson distributions?
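The pragmatic procedure described above can be sketched in a few lines for the normal location family. The prior and the simulated data below are purely illustrative; the conjugate N(0, τ²) prior is one mathematically convenient choice, not the only one:

```python
import random

random.seed(0)

def posterior(xs, tau2=4.0):
    """Posterior mean and variance for the mean θ of a N(θ, 1) population
    under an (assumed, conjugate) N(0, tau2) prior: standard
    normal-normal updating."""
    n = len(xs)
    xbar = sum(xs) / n
    post_var = tau2 / (n * tau2 + 1.0)
    post_mean = n * tau2 * xbar / (n * tau2 + 1.0)
    return post_mean, post_var

theta = 1.7  # "true" population mean, used only to simulate the data
xs = [random.gauss(theta, 1.0) for _ in range(2000)]

m_small, v_small = posterior(xs[:20])
m_big, v_big = posterior(xs)
xbar = sum(xs) / len(xs)

# The larger the sample, the more the posterior concentrates about the
# sample mean, as the text describes.
assert v_big < v_small
assert abs(m_big - xbar) < 0.01
```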
10.5 Group invariance and sufficient statistics

Thanks to some very deep and hard mathematics on the part of David Freedman, Persi Diaconis, Phil Dawid, and others, one can in fact answer this question for many of the most common statistical families. Here are some examples, followed by a brief summary of the currently known state of the theory. Let φ_{μ,σ²}(x) denote the density of the normal distribution with mean μ and variance σ²; that is,
\[
\varphi_{\mu,\sigma^2}(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right).
\]
If a random variable X has such a distribution, then this is denoted X ∼ N(μ, σ²). The first example, characterizing exchangeable sequences that are a mixture of N(0, σ²), is admittedly not the most interesting from a statistical standpoint, but it provides a simple illustration of the type of results the theory provides.

EXAMPLE 1. An infinite sequence of random variables X_1, X_2, X_3, ... is said to be orthogonally invariant if for every n ≥ 1, the sequence X_1, ..., X_n is invariant under all orthogonal transformations of R^n. (An orthogonal transformation is a linear map that preserves distances. It can be thought of as an n-dimensional rotation.) Schoenberg’s theorem tells us that every orthogonally invariant infinite sequence of random variables is a mixture of N(0, σ²) iid random variables. (Note that a coordinate permutation is a very special kind of orthogonal transformation; thus orthogonal invariance entails exchangeability and is much more restrictive.) In terms of the de Finetti representation, if P is the distribution of the orthogonally
invariant sequence X_1, X_2, ..., and P_σ the distribution of an iid sequence of N(0, σ²) random variables, then there exists a probability measure Q on [0, ∞) such that
\[
P = \int_0^\infty P_\sigma \, dQ(\sigma).
\]
There is an equivalent formulation of Schoenberg’s theorem in terms of sufficient statistics. Consider the statistic T_n = X_1² + ... + X_n². Then the property of orthogonal invariance is equivalent to the property that, for each n ≥ 1, conditional on T_n the distribution of X_1, ..., X_n is uniform on the (n − 1)-sphere of radius √T_n. Furthermore, the limit T = lim_{n→∞} √(T_n/n) exists almost surely and P(T ≤ σ) = Q([0, σ]); that is, the mixing measure Q is the distribution of the limit T. This has (accepting for the moment that one is willing to talk about infinite sequences of random variables, about which more later) a striking consequence. The statistic √(T_n/n) is the standard sample estimate of the standard deviation σ. Thus one has a natural interpretation of both the Q and the σ appearing in the de Finetti representation. Far from being merely mathematical objects in the representation theorem, they acquire a significance of their own. The “parameter” (σ) emerges as the limit of the sample standard deviation (note one is certain of the existence of the limit but not its value); Q is our degree of belief regarding the unknown parameter (our uncertainty regarding the value of σ); and conditional on the limit being σ the sequence is iid N(0, σ²). Thus one has a complete explication of the role of parameters, parametric families, and priors used by the pragmatic Bayesian statistician in this case. The particular parametric family arises from the particular strengthening of exchangeability (here orthogonal invariance) reflecting the knowledge of the statistician in this case. (If he doesn’t subscribe to orthogonal invariance, he shouldn’t be using a mixture of mean zero normals!) The single parameter σ is interpreted as the large sample limit of the sample standard deviation; and the mixing measure Q reflects our degree of belief as to the value of this limit. Very neat!

EXAMPLE 2. Suppose P is a mixture of iid N(μ, σ²) normals.
Then it is easy to see that P is invariant under transformations that are orthogonal and preserve the line L_n: x_1 = x_2 = ... = x_n. Dawid’s theorem states that this is in fact the necessary and sufficient condition for P to be such a mixture. In this case there are two sufficient statistics:
\[
V_n = X_1^2 + \cdots + X_n^2; \qquad U_n = X_1 + \cdots + X_n,
\]
and the symmetry assumption is equivalent to the property that, conditional on U_n, V_n, the distribution of X_1, ..., X_n is uniform on the resulting (n − 2)-sphere. Furthermore, one has that the limits
\[
U = \lim_{n \to \infty} \frac{U_n}{n}, \qquad V = \lim_{n \to \infty} \sqrt{\frac{V_n}{n}}
\]
exist almost surely and generate the mixing measure on the two-dimensional parameter space R × [0, ∞). Characterizations of this kind are known for a number of standard statistical distributions. Many of these form “exponential families”; Diaconis and Ylvisaker [1980] characterize the conjugate priors for such families in terms of the linearity of their posterior expectations. In other cases the challenge remains to find such characterizations, preferably in terms of both symmetry condition and sufficient statistics. Diaconis and Freedman [1984] is an outstanding exposition, describing many such results and placing them into a unified theoretical superstructure. In sum: Carnap recognized the limited utility of the inductive inferences that the continuum of inductive methods provided, and sought to extend his analysis to the case of analogical inductive inference: an observation of a given type makes more probable not merely observations of the exact same type but also observations of a “similar” type. The challenge lies both in making precise the meaning of “similar”, and in being able to then derive the corresponding continua. Carnap sought to meet the first challenge by proposing that underlying judgements of similarity is some notion of “distance” between predicates; but then immediately hit the brick wall of how one could use a general notion of distance to derive plausible continua. Neither Carnap nor any of his successors were able to solve this problem (although not for want of trying). The Diaconis-Freedman theory enables us to see why. If one recognizes that the problem of analogical reasoning is essentially that of justifying parametric Bayesian inference, then it is indeed possible to derive attractive results that parallel those for the multinomial case. 
But these results are not trivial; they involve very hard mathematics, and although many special cases have been successfully tackled, it is possible to argue that no complete theoretical superstructure yet exists.
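Schoenberg’s characterization in Example 1 can at least be probed by simulation. The sketch below (the two-point mixing measure Q, the seed, and the sample size are arbitrary illustrative choices) draws one orthogonally invariant sequence exactly as the representation prescribes and checks that √(T_n/n) recovers the mixing parameter σ:

```python
import random
from math import sqrt

random.seed(1)

# One draw from an orthogonally invariant sequence, constructed as the
# representation says: first draw sigma from a mixing measure Q, then
# generate iid N(0, sigma^2) variables.
sigma = random.choice([1.0, 3.0])  # Q: an illustrative two-point mixing measure
n = 100_000
xs = [random.gauss(0.0, sigma) for _ in range(n)]

# T_n = X_1^2 + ... + X_n^2; sqrt(T_n / n) is the sample estimate of the
# standard deviation, whose large-sample limit is the "parameter" sigma.
t_n = sum(x * x for x in xs)
assert abs(sqrt(t_n / n) - sigma) < 0.05 * sigma
```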
11 THE SAMPLING OF SPECIES PROBLEM
Another important problem concerns the nature of inductive inference when the possible types or species are initially unknown (this is sometimes referred to in the statistical literature as the sampling of species problem). Carnap thought this could be done using the equivalence relation R: belongs to the same species as. (That is, one has a notion of equivalence or common membership in a species, without prior knowledge of that species.) Carnap did not pursue this idea further, however, thinking the attempt premature given the relatively primitive state of the subject at that time. Carnap’s intuition was entirely on the mark here. One can construct a theory for the sampling of species problem, one that parallels the classical continuum of inductive methods — but the attendant technical difficulties are considerable, exchangeable random sequences being replaced by exchangeable random partitions. (Two sequences generate the same partition if they have the same frequencies of frequencies a_r defined earlier.) Fortunately, the English mathematician J. F. C.
Kingman did the necessary technical spadework in a brilliant series of papers a quarter of a century ago. Kingman’s beautiful results enable one to establish a parallel inductive theory for this case, including a Johnson-type characterization of an analogous continuum of inductive methods; see [Zabell, 1992; 1997]. In brief, consider the following three axioms, that parallel (in two cases) or extend (in one case) those of Johnson.

1. All sequences of outcomes are possible (have positive probability).

2. The probability of seeing on the next trial the i-th species already seen is a function of the number of times that species has been observed, n_i, and the total sample size n: f(n_i, n).

3. The probability of observing a new species is a function only of the number of species already observed t and the sample size n: g(t, n).

It is a remarkable fact that if these three assumptions are satisfied, then one can prove that the functions f(n_i, n), g(t, n) are members of a three-dimensional continuum described by three parameters α, θ, γ.

The continuum of inductive methods for the sampling of species

Case 1: If n_i < n for some i, then
\[
f(n_i, n) = \frac{n_i - \alpha}{n + \theta}, \qquad g(t, n) = \frac{t\alpha + \theta}{n + \theta}.
\]
Note that if n_i < n, then t > 1, there are at least two species, and the universal generalization is disconfirmed.

Case 2: If n_i = n for some i, then
\[
f(n_i, n) = \frac{n_i - \alpha}{n + \theta} + c_n(\gamma), \qquad g(t, n) = \frac{t\alpha + \theta}{n + \theta} - c_n(\gamma);
\]
here
\[
c_n(\gamma) = \frac{\gamma(\alpha + \theta)}{(n + \theta)\left[\gamma + (\alpha + \theta - \gamma)\prod_{j=1}^{n-1} \frac{j - \alpha}{j + \theta}\right]}
\]
represents the increase in the probability of seeing the i-th species again due to the confirmation of the universal generalization. Not all parameter values are possible: one must have 0 ≤ α < 1; θ > −α; 0 ≤ γ < α + θ. There is a simple interpretation of the three parameters θ, α, γ. The first, θ, is related to the likelihood of new species being observed; the larger the value of θ, the more likely it is that the next observation is that of a new species.
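As a sanity check, the Case 1 predictive probabilities sum to one over the possible outcomes of the next trial, since the α-terms cancel. A minimal sketch, with purely illustrative parameter values:

```python
def f(ni, n, alpha, theta):
    # probability of seeing the i-th (already observed) species again
    return (ni - alpha) / (n + theta)

def g(t, n, alpha, theta):
    # probability of observing a new species
    return (t * alpha + theta) / (n + theta)

alpha, theta = 0.5, 1.3   # illustrative values satisfying the constraints
counts = [4, 2, 1, 1]     # n_i for the t = 4 species seen so far
n, t = sum(counts), len(counts)

total = sum(f(ni, n, alpha, theta) for ni in counts) + g(t, n, alpha, theta)
# (n - t*alpha + t*alpha + theta) / (n + theta) = 1
assert abs(total - 1.0) < 1e-12
```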
Observation of a new species has a double inductive import: it is a new species, and it is a particular species. Observing it contributes both to the likelihood that a new species will again be observed and, if a new species is not observed, that the species just observed will again be observed (as opposed to another species already observed); this is the role of α. Finally, the parameter γ is related to the likelihood that only one species will be observed: γ is (α + θ) times the initial probability that there will be only one species. The special case α = γ = 0 is of particular interest. In this case the probability of an “allelic partition” (set of frequencies of frequencies a_r) has a particularly simple form: given a sample of size n,
\[
P(a_1, a_2, \ldots, a_n) = \frac{n!}{\theta(\theta + 1) \cdots (\theta + n - 1)} \prod_{r=1}^{n} \frac{\theta^{a_r}}{r^{a_r} a_r!};
\]
this is the Ewens sampling formula. There is a simple urn model for such a process in this case, analogous to the Polya urn model [Hoppe, 1984]. Suppose we start out with an urn containing a single black ball: the mutator. The first time we select a ball, it is necessarily the black one. We replace it, together with a ball of some color. As time progresses, the urn contains the mutator and a number of colored balls. Each colored ball has a weight of one, the mutator has weight θ. The likelihood of selecting a ball is proportional to its weight. If a colored ball is selected, it is replaced together with a ball of the same color; this corresponds to observing a species that has already been observed before (hence balls of its color are already present). If the mutator is selected, it is replaced, together with a ball of a new color; this corresponds to observing a new species. It is not difficult to verify that the rules of succession for this process are
\[
f(n_i, n) = \frac{n_i}{n + \theta}; \qquad g(n) = \frac{\theta}{n + \theta}.
\]
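The consistency of these succession rules with the Ewens sampling formula can be verified by exact enumeration: follow every possible path of the urn process, multiplying the predictive probabilities along the way, and aggregate the resulting allelic partitions. A sketch (θ = 1.3 and n = 5 are arbitrary illustrative choices):

```python
from collections import Counter
from math import factorial

theta = 1.3  # illustrative parameter value
n = 5        # sample size, kept small so exact enumeration is cheap

def ewens(a, n, theta):
    """Ewens sampling formula for an allelic partition a = (a_1, ..., a_n)."""
    rising = 1.0
    for j in range(n):
        rising *= theta + j
    p = factorial(n) / rising
    for r, ar in enumerate(a, start=1):
        p *= theta**ar / (r**ar * factorial(ar))
    return p

# Exact enumeration of the Hoppe urn (the alpha = gamma = 0 rules of
# succession): recurse over which species the next observation belongs to.
partition_probs = Counter()

def recurse(counts, prob):
    m = sum(counts)
    if m == n:
        a = [0] * n
        for c in counts:            # frequencies of frequencies a_r
            a[c - 1] += 1
        partition_probs[tuple(a)] += prob
        return
    for i, c in enumerate(counts):  # an old species: weight n_i
        recurse(counts[:i] + [c + 1] + counts[i + 1:], prob * c / (m + theta))
    recurse(counts + [1], prob * theta / (m + theta))  # the mutator: weight theta

recurse([], 1.0)

# The path probabilities, aggregated by allelic partition, reproduce the
# Ewens sampling formula exactly.
for a, p in partition_probs.items():
    assert abs(p - ewens(a, n, theta)) < 1e-12
assert abs(sum(partition_probs.values()) - 1.0) < 1e-12
```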
Note that in this case the probability of a new species does not depend on the number observed. Such predictive probabilities arguably go back to De Morgan; see [Zabell, 1992].

12 A BUDGET OF PARADOXES

Strictly speaking, true paradox (in the sense of a basic contradiction in the theory itself) is no more possible in the Bayesian framework than it is in propositional logic: both are theories of consistency of input. The term “paradox” is often used instead to describe either some unexpected (but reasonable) consequence of the theory (so that we learn something from it); or an inconsistency arising from conflicting sets of inputs (which is what the theory is supposed to detect); or an apparent failure of the theory to explain what we regard as a valid intuition (which should be viewed as more of a challenge than a paradox). Nevertheless, analyzing and understanding such conundrums often gives us much greater insight
into a subject, and the theory of probability has certainly had its fair share of such “challenge problems”. In the following paragraphs a few of these paradoxes are briefly noticed, more by way of initial orientation and an entry into the literature, than any detailed analysis. Indeed the literature on all of these is considerable.
12.1 The paradoxes of conditional probability
There is an amusing and interesting literature concerning conditional probability paradoxes such as the paradox of the second ace [Shafer, 1985], the three prisoner paradox [Falk, 1992], and the two-envelope paradox [Katz and Olin, 2007]. The unnecessary controversies that sometimes arise over these (for example, in Philosophy of Science and The American Statistician, names omitted to protect the guilty) are object lessons in the pitfalls that can attend informal attempts to analyze problems based on vague intuitions without the rigor of first carefully defining the sample space of possibilities or modeling the way information is received. Properly understood these puzzles serve as examples of the utility of the theory, not its deficiencies.
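The three prisoner paradox illustrates the point: once the sample space and the warden’s announcement protocol are written down explicitly, the “paradoxical” answer drops out by routine conditioning. A minimal sketch, under the standard assumption that the warden chooses at random when he has a choice:

```python
from fractions import Fraction

# Enumerate the sample space: (pardoned prisoner, name the warden gives A).
# If A is pardoned the warden chooses between B and C at random; otherwise
# he must name the one of B, C who is not pardoned.
outcomes = []  # (pardoned, named, probability)
for pardoned in "ABC":
    prior = Fraction(1, 3)
    if pardoned == "A":
        for named in "BC":
            outcomes.append((pardoned, named, prior * Fraction(1, 2)))
    else:
        named = "C" if pardoned == "B" else "B"
        outcomes.append((pardoned, named, prior))

# Condition on the warden naming B.
evidence = [o for o in outcomes if o[1] == "B"]
total = sum(p for _, _, p in evidence)
posterior_A = sum(p for who, _, p in evidence if who == "A") / total
posterior_C = sum(p for who, _, p in evidence if who == "C") / total

# A's probability of pardon is unchanged at 1/3; C's rises to 2/3.
assert posterior_A == Fraction(1, 3)
assert posterior_C == Fraction(2, 3)
```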
12.2 Hempel’s paradox of the ravens
Nicod’s criterion states that an assertion “all A are B” is supported by an observation of an A that is also a B; Hempel’s equivalence condition states that two logically equivalent propositions are equally confirmed by the same evidence. Hempel’s paradox [Hempel, 1945], in its best-known (or most notorious) form, considers the assertion “all ravens are black”. This is equivalent to its contrapositive, “all nonblack objects are not ravens”. If one then observes a pink elephant, does this confirm the proposition “all ravens are black”? Strictly speaking this is not a paradox of logical or subjective probability, because it follows just from Nicod’s criterion and the equivalence condition. It is in any case easily accommodated within the Bayesian framework which, in brief, notes that pink elephants can indeed confirm black ravens, albeit to a very slight degree; see, e.g., [Hosiasson-Lindenbaum, 1940; Good, 1960]. Vranas [2004a], Howson and Urbach [2006, pp. 99–103], and Fitelson [2008] provide entries to the recent literature; Sprenger [2009] provides a general survey and assessment.
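The quantitative point can be made with a toy finite model in the spirit of this Bayesian analysis (all population numbers below are illustrative, not drawn from the cited papers):

```python
from fractions import Fraction

# A toy world of 100 objects, 10 of them ravens, 50 of them non-black in
# total. Under H all ravens are black, so all 50 non-black objects are
# non-ravens; under the alternative K exactly one raven is non-black, so
# only 49 non-black objects are non-ravens. One object is sampled at random.
prior_H = Fraction(1, 2)
lik_H = Fraction(50, 100)  # P(non-black non-raven | H)
lik_K = Fraction(49, 100)  # P(non-black non-raven | K)

posterior_H = prior_H * lik_H / (prior_H * lik_H + (1 - prior_H) * lik_K)

# Observing a non-black non-raven confirms "all ravens are black",
# but only to a very slight degree.
assert posterior_H > prior_H
assert posterior_H - prior_H < Fraction(1, 100)
```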
12.3 Goodman’s new riddle of induction
For Carnap, probability1 is analytic and syntactic; probability2 synthetic and semantic. Returning in 1941 to Keynes’s Treatise on Probability with increased appreciation, Carnap sought to provide a satisfactory technical and quantitative foundation for inductive inference, which he saw as absent in Keynes. But after his paper proposing a purely syntactic justification for inductive inference [Carnap, 1945b], Nelson Goodman [1946] immediately published a serious challenge to it. To use
the example later put forward by Goodman in Fact, Fiction, and Forecast (1954), under the striking heading of “the new riddle of induction”, Goodman defined a predicate grue: say an object is grue if, for some fixed time t, it is green before t and blue after. If all emeralds observed prior to time t are green, then this is equally consistent with their being either green or grue, and therefore apparently supports to an equal degree the expectation that emeralds observed after time t will be either green or blue. Goodman’s conclusion was that inductive inference is not purely syntactic in nature; that to varying degrees predicates are more or less projectible, projectability depending on the extent to which a predicate is entrenched in natural language. Although Goodman and Carnap soon agreed to disagree, there was no escape; and Goodman’s point is now generally accepted. (Carnap sought to meet this objection by invoking his requirement of total evidence, of which more in a moment.) Goodman’s “new riddle” has sparked a substantial literature (see, e.g., [Stalker, 1994]). For a recent survey, see Schwartz [2009]. From a Bayesian perspective, projectability is effectively a question of the presence of exchangeability (or partial exchangeability); and as such this literature may be viewed as a complement to, rather than rival of, the subjectivist position (see, e.g., [Horwich, 1982, pp. 67–72]). For Carnap’s final views on grue, see [Carnap and Jeffrey, 1971, pp. 73–76].
12.4 The principle of total evidence

Carnap’s initial defense to Goodman’s example was to invoke a requirement of total evidence, that in the application of inductive logic to a given knowledge situation, the total evidence available must be taken as basis for determining the degree of confirmation. [Carnap, 1950, p. 211] This closed one hole in the dike, only for another to arise. In 1957 Ayer raised a fundamental question: in any purely logical theory of probability, why are new observations important? This is an issue that, as Good [1967] observes, is both related to the principle of total evidence and relevant to subjective theories of probability. Good’s solution to the conundrum was a neat one: [I]n expectation, it pays to take into account further evidence, provided that the cost of collecting and using this evidence, although positive, can be ignored. In particular, we should use all the evidence already available, provided that the cost of doing so is negligible. With this proviso then, the principle of total evidence follows from the principle of rationality [that is, of maximizing expected utility]. For further discussion of the principle of total evidence, see [Skyrms, 1987]; for the value of knowledge, see [Horwich, 1982, pp. 122–129; Skyrms, 1990, Chapter 4].
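Good’s argument can be checked in a toy decision problem: with a cost-free observation, the expected utility of deciding after looking is never less than that of deciding now. A minimal sketch, with illustrative payoffs and an 80%-reliable binary signal:

```python
from fractions import Fraction as F

# Two states of nature, two acts, and a binary signal; all numbers are
# illustrative. The signal reports the true state with probability 4/5.
prior = {"s1": F(2, 5), "s2": F(3, 5)}
U = {("a1", "s1"): 10, ("a1", "s2"): 0,
     ("a2", "s1"): 0,  ("a2", "s2"): 6}
lik = {("e1", "s1"): F(4, 5), ("e2", "s1"): F(1, 5),
       ("e1", "s2"): F(1, 5), ("e2", "s2"): F(4, 5)}

def best_eu(p):
    """Expected utility of the best act against belief p."""
    return max(sum(p[s] * U[(a, s)] for s in p) for a in ("a1", "a2"))

# Decide now:
eu_now = best_eu(prior)

# Observe the (cost-free) signal first, then decide:
eu_after = F(0)
for e in ("e1", "e2"):
    pe = sum(prior[s] * lik[(e, s)] for s in prior)
    post = {s: prior[s] * lik[(e, s)] / pe for s in prior}
    eu_after += pe * best_eu(post)

# Good's result: in expectation, free evidence never hurts.
assert eu_after >= eu_now
```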
Carnap and the Logic of Inductive Inference
299
Related questions here are Glymour’s problem of old evidence (if a theory T entails an experimental outcome E, but E was observed before the entailment was discovered, does the discovery increase the probability of T?), see, e.g., [Garber, 1983; Jeffrey, 1992, Chapter 5; Earman, 1992, Chapter 5; Jeffrey, 2004, pp. 44–47; Howson and Urbach, 2006, pp. 197–20]; and I. J. Good’s concept of dynamic (or evolving) probability [Good, 1983, Chapter 10]. Central to both is the issue of the appropriateness of the principle of logical omniscience: if H logically entails E, then P(E | H) = 1. As Good notes [1983, p. 107], invoking a standard chestnut, it makes sense for purposes of betting to assign a probability of 1/10 to the millionth digit of π being a 7, even though one can, given sufficient time and resources, compute the actual digit (so that some would argue that the probability is really either 0 or 1). Discussion of this issue goes back at least to Polya [1941]; Hacking [1967] deals with the issue in terms of sentences that are “personally possible”. (Of course from a practical Bayesian perspective one simple solution is to work with probabilities defined on subsets of a sample space rather than on logical propositions or sentences. Thus in the case of π, take the sample space to be the set {0, 1, ..., 9}, and assign a coherent probability to the elements of the set. Whether or not it is profitable to expand the sample space to accommodate further events then goes to the issue of the value of further knowledge.)
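The sample-space maneuver, and Good’s “dynamic probability”, can be illustrated directly. In the sketch below (an illustration of mine, not anything in Good: Machin’s formula is used simply to make the computation self-contained, and the 10th decimal digit stands in for the millionth), a coherent uniform probability over {0, ..., 9} collapses to 0 or 1 once the computation is actually carried out.

```python
from fractions import Fraction

def arctan_recip(x, scale):
    # arctan(1/x) * scale via the alternating series, in integer arithmetic
    power = scale // x
    total, n, x2 = power, 1, x * x
    while power:
        power //= x2
        n += 2
        total += power // n if n % 4 == 1 else -(power // n)
    return total

def pi_decimals(n):
    # first n decimal digits of pi, by Machin's formula (with guard digits)
    scale = 10 ** (n + 10)
    pi_scaled = 16 * arctan_recip(5, scale) - 4 * arctan_recip(239, scale)
    return str(pi_scaled)[1:n + 1]  # drop the leading "3"

# Before computing: a coherent probability on the sample space {0,...,9}
credence = {d: Fraction(1, 10) for d in "0123456789"}
assert sum(credence.values()) == 1

# After computing, the "dynamic probability" collapses to 0 or 1:
digit = pi_decimals(10)[-1]  # the 10th decimal digit of pi
credence = {d: Fraction(int(d == digit)) for d in "0123456789"}
print(digit, credence[digit])  # -> 5 1
```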
12.5 The Popper-Carnap controversy and Miller’s paradox
Karl Popper was a lifelong and dogged opponent of Carnap’s inductivist views. In Appendix 7 of his Logic of Scientific Discovery [Popper, 1968] Popper made the claim that the logical probability of a universal generalization must be zero; today this can only be regarded as an historical curiosity. For two critiques (among many) of Popper’s claim, see [Howson, 1973; 1987]. For those interested in the more general debate between Popper and Carnap, their exchange in the Schilpp volume on Carnap [Schilpp, 1963] is a natural place to start. For a general overview, see [Niiniluoto, 1973]. One important thread in the debate was Miller’s paradox; Jeffrey [1975] is at once a useful reprise of the initial debate and a spirited rebuttal. Closely related to Miller’s paradox is Lewis’s “principal principle”; see [Vranas, 2004b] for a recent discussion and many earlier references. For a more sympathetic view of Popper than the one here, see [Miller, 1997].
13
CARNAP REDUX
Thus far we have discussed Carnap’s basic views regarding probability and inductive inference, some of his technical contributions to this area, and some of the extensions of Carnap’s approach that took place during his lifetime and after. In this final part of the chapter we return to the philosophical (rather than technical)
300
S. L. Zabell
underpinnings of Carnap’s approach, and attempt to place them in the context of both his predecessors and his successors.
13.1 “Two concepts of probability”
In his 1945 paper “The Two Concepts of Probability”, Carnap advanced his view of “the problem of probability”. Noting a “bewildering multiplicity” of theories that had been advanced over the course of more than two and a half centuries, Carnap suggested one had to steer carefully between the Scylla and Charybdis of assuming either too few or too many underlying explicanda, and settled on just two. These two underlying concepts Carnap called probability1 and probability2: degree of confirmation versus relative frequency in the long run. Carnap’s identification of these two basic kingdoms of probability was not however novel; it is clearly stated in Poisson’s 1837 treatise on probability (where Poisson uses the terms probability and chance to distinguish the two). Thus Poisson writes:
In this work, the word chance will refer to events in themselves, independent of our knowledge of them, and we will retain the word probability ... for the reason we have to believe. [Poisson, 1837, p. 31]
Much the same distinction was made shortly after by Cournot [1843] in his Exposition de la théorie des chances et des probabilités, where he notes its “double sense”, which he refers to as subjective and objective, a terminology also found later in [Bertrand, 1890] and [Poincaré, 1896]. Hacking [1975, p. 14] sees the distinction as going back even further, to Condorcet in 1785. For discussion of Poisson and Cournot, see [Good, 1986, pp. 157–160; Hacking, 1990, pp. 96–99]. In the 20th century, Frank Plumpton Ramsey, one of the great architects of the modern subjective theory, likewise noted the possible validity of both senses:
In this essay the Theory of Probability is taken as a branch of logic, the logic of partial belief and inconclusive argument; but there is no intention of implying that this is the only or even the most important aspect of the subject.
Probability is of fundamental importance not only in logic but also in statistical and physical science, and we cannot be sure beforehand that the most useful interpretation of it in logic will be appropriate in physics also. Indeed the general difference of opinion between statisticians who for the most part adopt the frequency theory of probability and logicians who mostly reject it renders it likely that the two schools are really discussing different things, and that the word “probability” is used by logicians in one sense and by statisticians in another.
This is as clear a statement of Carnap’s distinction as one might imagine. (It can also be found clearly stated in a number of other places, such as [Polya, 1941; Good, 1950].)
Thus, although the clear recognition of the fundamentally dual nature of probability did not originate with Carnap, the importance of his contribution is this: despite clear statements by Poisson in the 19th century, Ramsey in the 20th, and others both before and after, the lesson had not been learned; and even those who recognized the duality implicit in the usage of the word for the most part believed this to reflect a confusion of thought, only one of the two senses being truly legitimate. By carefully, forcefully, and in sustained fashion arguing for the legitimacy of both, Carnap enabled the distinction to at last become an entrenched philosophical commonplace. “The duality of probability has long been known to philosophers. The present generation may have learnt it from Carnap’s weighty Logical Foundations” [Hacking, 1975, p. 13].
13.2 The later Carnap
Just as there is an early and a later Wittgenstein, there is an early and a later Carnap in inductive logic. Some of these changes were technical, but others reflected substantial shifts in Carnap’s underlying views. The appearance of Carnap’s book generated considerable discussion and debate in the philosophical community. A second volume was promised, but never appeared. Like many before him who found themselves enmeshed in the intellectual quicksand of the problem of induction (such as Bernoulli and Bayes), Carnap continued to grapple with the problem, refining and extending his results; but new advances and insights (on the part of himself, his collaborators, and others) were coming so quickly that he eventually abandoned as impractical the project of a definitive and systematic book-length treatment, in favor of publishing compilations of progress reports from time to time. Two such installments eventually appeared [Carnap and Jeffrey, 1971; Jeffrey, 1980], although even these were delayed far past their initially anticipated dates of publication. Because no true successor to his Logical Foundations of Probability ever appeared, it is not always appreciated just how much of an evolution in Carnap’s views about probability took place over the last two decades of his life. This change reflected in part a changing environment: the increasing appreciation of the prewar contributions of Ramsey and de Finetti, and the publication of such books as [Good, 1950; Savage, 1954; Raiffa and Schlaifer, 1961]. Important materials documenting this shift include the introduction to the second [1962] edition of [Carnap, 1950], his paper “The aim of inductive logic” ([Carnap, 1962], reprinted in revised form in [Carnap and Jeffrey, 1971, Chapter 1]), Carnap’s contributions to the Schilpp [1963] volume, and his posthumous “Basic system of inductive logic” ([Carnap and Jeffrey, 1971, Chapter 2; Jeffrey, 1980, Chapter 6]).
Technical shifts
Some of these shifts, although technical in nature, were quite important. First, there was a shift from sentences in a formal language to (effectively) subsets of
a sample space. This reflected in part a desire to use the technical apparatus of modern mathematical probability, and in part a desire to formulate inductive logic
in terms that had come to be standard in mathematical probability theory and theoretical statistics, where probabilities are attributed to “events” (or “propositions”) which are construed as sets of entities which can handily be taken to be models, in the sense in which that term is used in logic. [Carnap and Jeffrey, 1971, p. 1]
Second, as discussed at the beginning of this chapter, Carnap accepted the Ramsey–de Finetti–Savage link of probability to utility and decision making, its betting odds interpretation, the use of coherence and the Dutch book to derive the basic axioms of probability, and the central role of Bayes’s theorem in belief revision. This placed Carnap squarely in the Bayesian camp, the differences coming down to ones of the existence or status of further epistemic constraints. This change came fairly quickly; it is already evident in Carnap’s 1955 lecture notes [Carnap, 1973]. It is carefully stated in Carnap [1962] and then systematically elaborated in his Basic System. Carnap also announced, in the preface to the second edition of Logical Foundations, the abandonment of his requirements of logical independence (replaced by Kemeny’s “meaning postulates”) and of completeness for primitive predicates (replaced by axioms relevant to language extensions). These are of less interest to us here.
The emerging Bayesian majority
Carnap’s shift to the subjective was certainly noted by others. I. J. Good, for example, remarks “Between 1950 and 1961 Carnap moved close to my position in that he showed a much increased respect for the practical use of subjective probabilities” [Good, 1975, p. 41; see also p. 40, Figure 1]. But for the best evidence of this convergence of view between Carnap and the subjectivists, one can summon Carnap himself as a witness.
In his Basic System (his last, posthumously published work on inductive inference), Carnap tells us I think there need not be a controversy between the objectivist point of view and the subjectivist or personalist point of view. Both have a legitimate place in the context of our work, that is, the construction of a set of rules for determining probability values with respect to possible evidence. At each step in the construction, a choice is to be made; the choice is not completely free but is restricted by certain boundaries. Basically, there is merely a difference in attitude or emphasis between the subjectivist tendency to emphasize the existing freedom of choice, and the objectivist tendency to stress the existence of limitations. [Jeffrey, 1980, p. 119]
The ultimate difference between Carnap and subjectivists of the de Finetti–Savage–Good stripe, then, appears to be how they view the logical status of these additional constraints. Carnap seems to have thought of them as forming in some sense a sequence or hierarchy (thus his “at each step in the construction”); modern Bayesians, in contrast, view them more as auxiliary tools. They do not deny the utility of the symmetry arguments that underlie much of the Carnapian approach but, as Savage remarks, they “typically do not find the contexts in which such agreement obtains sufficiently definable to admit of expression in a postulate” [Savage, 1954, p. 66]. Such arguments fall instead under the rubric of what I. J. Good terms “suggestions for using the theory, these suggestions belonging to the technique rather than the theory” itself [Good, 1952, p. 107]. Let us take this a little further. Is what is at stake really just a “difference in attitude or emphasis” between choice and limitation? Here is how W. E. Johnson saw the enterprise (as he notes in the paper deriving his continuum of inductive methods):
the postulate adopted in a controversial kind of theorem cannot be generalized to cover all sorts of working problems; so it is the logician’s business, having once formulated a specific postulate, to indicate very carefully the factual and epistemic conditions under which it has practical value. [Johnson, 1932, pp. 418–419]
This is surely right. There are no universally applicable postulates: different symmetry assumptions are appropriate under different circumstances, and none is logically compulsory. The best one can do is identify symmetry assumptions that seem natural, have identifiable consequences, and may be a natural reflection of one’s beliefs under some reasonable set of circumstances.
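Carnap’s continuum of inductive methods is the canonical example of such a symmetry-based rule: with k species, a sample of size n containing n_i individuals of species i, and parameter λ, the probability that the next individual is of species i is (n_i + λ/k)/(n + λ). A minimal sketch (the function and parameter names are mine):

```python
def carnap_lambda(counts, i, lam):
    # Carnap's lambda-continuum: P(next is species i) = (n_i + lam/k)/(n + lam)
    k, n = len(counts), sum(counts)
    return (counts[i] + lam / k) / (n + lam)

# With k = 2 and lam = k this reduces to Laplace's rule of succession:
assert carnap_lambda([3, 0], 0, 2.0) == (3 + 1) / (3 + 2)

# Exchangeability: the rule depends on the sample only through its counts,
# so permuting the observed sequence cannot change the prediction; and
# lam -> infinity washes out the data entirely, leaving the "logical" 1/k:
print(carnap_lambda([3, 0], 0, 1e12))  # close to 0.5
```

Different choices of λ trade off the observed frequencies against the a priori symmetric value 1/k, which is exactly the kind of circumstance-relative choice Johnson’s remark envisages.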
In judging the appropriate use of the sufficientness postulate, for example, the issue is not one of favoring “limitation” versus “choice”; it is one of whether or not you think the postulate accurately captures the epistemic situation at hand. This is the mission of partial exchangeability: to find different possible qualitative descriptions of “the factual and epistemic conditions” that obtain in actual situations, descriptions that then turn out to have useful and satisfying quantitative implications.
From credence to credibility
Nevertheless Carnap did argue for additional symmetry requirements such as exchangeability; his explanation of this is perhaps most clearly presented in his 1962 paper “The aim of inductive logic”. It will be apparent that Carnap and the subjectivists part company at this point because they had radically different goals. Let Cr_t denote the subjective probability function of an individual at time t, termed by Carnap credence. Using Bayes’s rule, Carnap imagines a sequence of steps in which one obtains discrete quanta of data E_j, j = 1, 2, ..., giving rise in turn to a sequence of credences Cr_{t+j}, j = 1, 2, ....
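The sequence of credences Carnap describes is simply iterated conditioning. A schematic rendering (the hypotheses, likelihoods, and data stream here are invented for illustration):

```python
def condition(credence, likelihood, datum):
    # One step of Bayes's rule: Cr_{t+1}(h) proportional to Cr_t(h) * P(datum | h)
    posterior = {h: p * likelihood[h](datum) for h, p in credence.items()}
    total = sum(posterior.values())
    return {h: p / total for h, p in posterior.items()}

# A credence function Cr_t over two hypotheses about a coin:
cr = {"fair": 0.5, "biased": 0.5}
likelihood = {"fair": lambda d: 0.5,
              "biased": lambda d: 0.9 if d == "heads" else 0.1}

# Receiving the quanta of data E_1, E_2, ... one at a time:
for datum in ["heads", "heads", "heads"]:
    cr = condition(cr, likelihood, datum)

print(round(cr["biased"], 3))  # -> 0.854
```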
In the case of a human being we would hesitate to ascribe to him a credence function at a very early time point, before his abilities of reason and deliberate action are sufficiently developed. But again we disregard this difficulty by thinking either of an idealized human baby or of a robot. ... [L]et us ascribe to him an initial credence function Cr_0 for the time point T_0 before he obtains his first datum E_1.
(This curiously echoes Price’s analysis of inductive inference in his appendix to Bayes’s essay; see [Zabell, 1997, Section 3].) The subsequent conditional credences based on this initial credence Cr_0 Carnap terms a credibility, and contrasts these with the “adult credence functions” of Ramsey, Savage, and de Finetti:
When I propose to take as a basic concept, not adult credence, but either initial credence or credibility, I must admit that these concepts are less realistic and remoter from overt behavior and may therefore appear as elusive and dubious. On the other hand, when we are interested in rational decision theory, these concepts have great methodological advantages. Only for these concepts, not for credence, can we find a sufficient number of requirements of rationality as a basis for the construction of a system of inductive logic.
Thus Carnap asserts there are additional rationality requirements for Cr_0, ones having “no analogue for credence functions”; for example, symmetry of individuals (i.e., exchangeability). The assertion is that absent identifiable differences between individuals at the initial time T_0 (and since we are at the initial time T_0 we have not yet learned of any), the probability of any proposition involving two or more individuals should remain unchanged if the individuals are permuted (see [Carnap 1962, pp. 313–314; 1971, p. 118]). Carnap regards this as “the valid core of the old principle of indifference ... the basic idea of the principle is sound. Our task is to restate it by specific restricted axioms” [Carnap, 1962, p.
316; 1973, p. 277]. No wonder this part of Carnap’s program never gained traction! It focuses on the credences of an “idealized human baby” rather than an adult; appeals to a state of complete ignorance; and presents itself as a rehabilitated version of the principle of indifference. And what does it mean to talk about individuals about which we know nothing except that they are different? In the end one exchanges one problem for another, replacing the task of finding a probability function by the (in fact much more daunting and questionable) task of establishing the existence of an underlying ideal language, one in which the description of sense experiences can be broken down into atomic interchangeable elements. Such ideal languages are a seductive dream that in one form or another goes back centuries, as in John Wilkins’s philosophical language, or Leibniz’s “characteristica universalis”, which Leibniz thought could be used as the basis of a logical probability [Hacking, 1975, Chapter 15]. If Wittgenstein’s early program of logical atomism had been successful, then logical probability might be possible; but the failure of the former dooms the latter. Lacking an ultimate language in one-to-one
correspondence with reality, Carnapian programs retain an irreducible element of subjectivism. Despite the ultimate futility of Carnap’s program to justify induction in quantitative terms, the subjective Bayesian does provide a number of qualitative explicata. Inductive rationality in a single individual is not so much a matter of present opinion as of the ability to be persuaded by further facts; and in two or more individuals, of their ultimate arrival at consensus. To this end a number of results regarding convergence and merging of opinion have been discovered. For convergence of opinion see Skyrms [2006], and the earlier literature cited there; for merging of opinion see the classic paper of Blackwell and Dubins [1962] and the discussion in [Earman, 1992], as well as [Kalai and Lehrer, 1994] and [Miller and Sanchirico, 1999]. For further discussion of Carnap’s program for inductive logic in its final form, see [Jeffrey, 1973].
14
CONCLUSION
Like his distinguished predecessors Bernoulli and Bayes, Rudolph Carnap continued to grapple with the elusive riddle of induction for the rest of his life. Throughout he was an effective spokesman for his point of view. But although the technical contributions of Carnap and his invisible college (such as Kemeny, Bar-Hillel, Jeffrey, Gaifman, Hintikka, Niiniluoto, Kuipers, Costantini, di Maio, and others) remain of considerable interest even today, Carnap’s most lasting influence was more subtle but also more important: he largely shaped the way current philosophy views the nature and role of probability, in particular its widespread acceptance of the Bayesian paradigm (as, for example, in [Horwich, 1982; Earman, 1992; Maher, 1993; Jaynes, 2003; Bovens and Hartmann, 2004; Jeffrey, 2004; Howson and Urbach, 2006]).
BIBLIOGRAPHY
[Armendt, 1993] Brad Armendt. Dutch books, additivity, and utility theory. Philosophical Topics, 21:1–20, 1993.
[Ayer, 1972] A. J. Ayer. Probability and Evidence. Macmillan, London, 1972.
[Barker, 1957] S. F. Barker. Induction and Hypothesis. Cornell University Press, Ithaca, 1957.
[Bayes, 1764] Thomas Bayes. An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53:370–418, 1764.
[Blackwell and Dubins, 1962] David Blackwell and Lester Dubins. Merging of opinions with increasing information. Annals of Mathematical Statistics, 33:882–886, 1962.
[Boole, 1854] George Boole. An Investigation of the Laws of Thought. Macmillan, London, 1854. Reprinted 1958, Dover Publications, New York.
[Bovens and Hartmann, 2004] Luc Bovens and Stephan Hartmann. Bayesian Epistemology. Oxford University Press, Oxford, 2004.
[Broad, 1918] C. D. Broad. The relation between induction and probability. Mind, 27:389–404; 29:11–45, 1918.
[Broad, 1924] C. D. Broad. Mr. Johnson on the logical foundations of science. Mind, 33:242–261, 369–384, 1924.
[Carnap, 1945a] Rudolph Carnap. On inductive logic. Philosophy of Science, 12:72–97, 1945.
[Carnap, 1945b] Rudolph Carnap. The two concepts of probability. Philosophy and Phenomenological Research, 5:513–532, 1945.
[Carnap, 1950] Rudolph Carnap. Logical Foundations of Probability. University of Chicago Press, Chicago, 1950. Second edition, 1962.
[Carnap, 1952] Rudolph Carnap. The Continuum of Inductive Methods. University of Chicago Press, Chicago, 1952.
[Carnap, 1962] Rudolph Carnap. The aim of inductive logic. In E. Nagel, P. Suppes, and A. Tarski, editors, Logic, Methodology and Philosophy of Science, pages 303–318. Stanford University Press, Stanford, 1962.
[Carnap, 1973] Rudolph Carnap. Notes on probability and induction. Synthese, 25:269–298, 1973.
[Carnap and Jeffrey, 1971] Rudolph Carnap and Richard C. Jeffrey, editors. Studies in Inductive Logic and Probability, volume I. University of California Press, Berkeley and Los Angeles, 1971.
[Carnap and Stegmüller, 1959] Rudolph Carnap and W. Stegmüller. Induktive Logik und Wahrscheinlichkeit. Springer-Verlag, Vienna, 1959.
[Cifarelli and Regazzini, 1996] Donato Michele Cifarelli and Eugenio Regazzini. De Finetti’s contribution to probability and statistics. Statistical Science, 11:253–282, 1996.
[de Finetti, 1937] Bruno de Finetti. La prévision: ses lois logiques, ses sources subjectives. Annales de l’Institut Henri Poincaré, 7:1–68, 1937. Translated in H. E. Kyburg, Jr. and H. E. Smokler (eds.), Studies in Subjective Probability, Wiley, New York, 1964, pp. 93–158.
[de Finetti, 1938] Bruno de Finetti. Sur la condition de “équivalence partielle”. In Actualités Scientifiques et Industrielles, volume 739, pages 5–18. Hermann, Paris, 1938.
[de Finetti, 1972] Bruno de Finetti. Probability, Induction, and Statistics. Wiley, New York, 1972.
[De Morgan, 1838] Augustus De Morgan. An Essay on Probabilities, and their Application to Life Contingencies and Insurance Offices. Longman, Orme, Brown, Green, and Longmans, London, 1838.
[Diaconis and Freedman, 1980a] Persi Diaconis and David Freedman. De Finetti’s theorem for Markov chains. Annals of Probability, 8:115–130, 1980.
[Diaconis and Freedman, 1980b] Persi Diaconis and David Freedman. Finite exchangeable sequences. Annals of Probability, 8:745–764, 1980.
[Diaconis and Freedman, 1984] Persi Diaconis and David Freedman. Partial exchangeability and sufficiency. In J. K. Ghosh and J. Roy, editors, Statistics: Applications and New Directions. Proceedings of the Indian Statistical Institute Golden Jubilee International Conference, pages 205–236. Indian Statistical Institute, Calcutta, 1984.
[Diaconis and Ylvisaker, 1979] Persi Diaconis and Donald Ylvisaker. Conjugate priors for exponential families. The Annals of Statistics, 7:269–281, 1979.
[Earman, 1992] John Earman. Bayes or Bust? A Critical Examination of Bayesian Confirmation Theory. M. I. T. Press, 1992.
[Edgeworth, 1922] Francis Ysidro Edgeworth. The philosophy of chance. Mind, 31:257–283, 1922.
[Ellis, 1854] Robert Leslie Ellis. Remarks on the fundamental principle of the theory of probabilities. Transactions of the Cambridge Philosophical Society, 9:605–607, 1854.
[Falk, 1992] Ruma Falk. A closer look at the probabilities of the notorious three prisoners. Cognition, 43:197–223, 1992.
[Feller, 1968] William Feller. An Introduction to Probability Theory and Its Applications, volume I. Wiley, New York, 3rd edition, 1968.
[Fitelson, 2008] Brandon Fitelson. Goodman’s “new riddle”. Journal of Philosophical Logic, 37:613–643, 2008.
[Garber, 1983] Daniel Garber. Old evidence and logical omniscience in Bayesian confirmation theory. In J. Earman, editor, Testing Scientific Theories, volume 10, pages 99–131. University of Minnesota Press, Minneapolis, 1983.
[Good, 1950] I. J. Good. Probability and the Weighing of Evidence. Hafner Press, New York, 1950.
[Good, 1952] I. J. Good. Rational decisions. Journal of the Royal Statistical Society B, 14:107–114, 1952.
[Good, 1959] I. J. Good. Kinds of probability. Science, 129:443–447, 1959.
[Good, 1960] I. J. Good. The paradoxes of confirmation. British Journal for the Philosophy of Science, 11:145–149, 1960; 12:63–64, 1961.
[Good, 1965] I. J. Good. The Estimation of Probabilities: An Essay on Modern Bayesian Methods. M. I. T. Press, Cambridge, Mass., 1965.
[Good, 1967] I. J. Good. On the principle of total evidence. British Journal for the Philosophy of Science, 17:319–321, 1967.
[Good, 1971] I. J. Good. 46656 varieties of Bayesians. American Statistician, 25:62–63, 1971.
[Good, 1975] I. J. Good. Explicativity, corroboration, and the relative odds of hypotheses. Synthese, 30:39–73, 1975.
[Good, 1983] I. J. Good. Good Thinking. University of Minnesota Press, Minneapolis, 1983.
[Good, 1986] I. J. Good. Some statistical applications of Poisson’s work. Statistical Science, 1:157–170, 1986.
[Goodman, 1946] Nelson Goodman. A query on confirmation. Journal of Philosophy, 43:383–385, 1946.
[Goodman, 1954] Nelson Goodman. Fact, Fiction, and Forecast. Hackett, Indianapolis, 1954.
[Hacking, 1967] Ian Hacking. Slightly more realistic personal probability. Philosophy of Science, 34:311–325, 1967.
[Hacking, 1975] Ian Hacking. The Emergence of Probability. Cambridge University Press, Cambridge, 1975.
[Heath, 1949] Sir Thomas Heath. Mathematics in Aristotle. Clarendon Press, Oxford, 1949.
[Hempel, 1945] C. G. Hempel. Studies in the logic of confirmation. Mind, 54:1–26, 97–121, 1945.
[Hintikka and Niiniluoto, 1980] J. Hintikka and I. Niiniluoto. An axiomatic foundation for the logic of inductive generalization. In R. C. Jeffrey, editor, Studies in Inductive Logic and Probability, volume 2, pages 157–181. University of California Press, Berkeley, 1980.
[Hintikka, 1966] J. Hintikka. A two-dimensional continuum of inductive methods. In J. Hintikka and P. Suppes, editors, Aspects of Inductive Logic, pages 113–132. North-Holland, Amsterdam, 1966.
[Hoppe, 1984] Fred Hoppe. Polya-like urns and the Ewens sampling formula. Journal of Mathematical Biology, 20:91–94, 1984.
[Horwich, 1982] Paul Horwich. Probability and Evidence. Cambridge University Press, Cambridge, 1982.
[Hosiasson-Lindenbaum, 1940] Janina Hosiasson-Lindenbaum. On confirmation. Journal of Symbolic Logic, 5:133–148, 1940.
[Howson and Urbach, 2006] Colin Howson and Peter Urbach. Scientific Reasoning: The Bayesian Approach. Open Court Press, Chicago and La Salle, IL, 3rd edition, 2006.
[Howson, 1973] Colin Howson. Must the logical probability of laws be zero? British Journal for the Philosophy of Science, 24:153–163, 1973.
[Howson, 1987] Colin Howson. Popper, prior probabilities, and inductive inference. British Journal for the Philosophy of Science, 38:207–224, 1987.
[Huzurbazar, 1955] V. S. Huzurbazar. On the certainty of an inductive inference. Proceedings of the Cambridge Philosophical Society, 51:761–762, 1955.
[Jaynes, 2003] E. T. Jaynes. Probability Theory: The Logic of Science. Cambridge University Press, Cambridge, 2003.
[Jeffrey, 1973] Richard C. Jeffrey. Carnap’s inductive logic. Synthese, 25:299–306, 1973.
[Jeffrey, 1975] Richard C. Jeffrey. Probability and falsification: critique of the Popper program. Synthese, 30:95–117, 1975.
[Jeffrey, 1977] Richard C. Jeffrey. Mises redux. In R. E. Butts and J. Hintikka, editors, Basic Problems in Methodology and Linguistics, pages 213–222. Reidel, Dordrecht, 1977.
[Jeffrey, 1980] Richard C. Jeffrey, editor. Studies in Inductive Logic and Probability, volume II. University of California Press, Berkeley and Los Angeles, 1980.
[Jeffrey, 1983] Richard C. Jeffrey. The Logic of Decision. University of Chicago Press, Chicago, 2nd edition, 1983.
[Jeffrey, 1988] Richard C. Jeffrey. Conditioning, kinematics, and exchangeability. In W. L. Harper and B. Skyrms, editors, Causation, Chance, and Credence, volume 1, pages 221–255. Kluwer, Dordrecht, 1988.
[Jeffrey, 1992] Richard C. Jeffrey. Probability and the Art of Judgment. Cambridge University Press, Cambridge, 1992.
[Jeffrey, 2004] Richard C. Jeffrey. Subjective Probability: The Real Thing. Cambridge University Press, Cambridge, 2004.
[Jeffreys, 1961] Harold Jeffreys. Theory of Probability. Clarendon Press, Oxford, 3rd edition, 1961.
[Jevons, 1874] William Stanley Jevons. The Principles of Science: A Treatise on Logic and Scientific Method. Macmillan, London, 1st edition, 1874. (2nd ed. 1877; reprinted 1958, Dover, New York).
[Johnson, 1924] William Ernest Johnson. Logic, Part III: The Logical Foundations of Science. Cambridge University Press, 1924.
[Johnson, 1932] William Ernest Johnson. Probability: The deductive and inductive problems. Mind, 41:409–423, 1932.
[Kalai and Lehrer, 1994] Ehud Kalai and Ehud Lehrer. Weak and strong merging of opinions. Journal of Mathematical Economics, 23:73–86, 1994.
[Katz and Olin, 2007] Bernard D. Katz and Doris Olin. A tale of two envelopes. Mind, 116:903–926, 2007.
[Kemeny, 1955] John Kemeny. Fair bets and inductive probabilities. Journal of Symbolic Logic, 20:263–273, 1955.
[Keynes, 1921] John Maynard Keynes. A Treatise on Probability. Macmillan, London, 1921.
[Kneale, 1949] William Kneale. Probability and Induction. The Clarendon Press, Oxford, 1949.
[Knopp, 1947] Konrad Knopp. Theory and Application of Infinite Series. Hafner Press, New York, 1947.
[Kuipers, 1978] Theo A. F. Kuipers. Studies in Inductive Probability and Rational Expectation. D. Reidel, Dordrecht, 1978.
[Maher, 1993] Patrick Maher. Betting on Theories. Cambridge Studies in Probability, Induction and Decision Theory. Cambridge University Press, Cambridge, 1993.
[Miller, 1997] David Miller. Sir Karl Raimund Popper, CH, FBA. Biographical Memoirs of Fellows of The Royal Society of London, 43:367–409, 1997.
[Miller and Sanchirico, 1999] Ronald I. Miller and Chris William Sanchirico. The role of absolute continuity in “merging of opinions” and “rational learning”. Games and Economic Behavior, 29:170–190, 1999.
[Niiniluoto, 1973] Ilkka Niiniluoto. Review: Alex C. Michalos’ The Popper-Carnap Controversy. Synthese, 25:417–436, 1973.
[Niiniluoto, 2009] Ilkka Niiniluoto. The development of the Hintikka program. In Dov Gabbay, Stephan Hartmann, and John Woods, editors, Handbook of the History of Logic, volume 10. Elsevier, London, 2009.
[Poincaré, 1896] Henri Poincaré. Calcul des probabilités. Gauthier-Villars, Paris, 1896. (2nd ed. 1912).
[Polya, 1941] George Polya. Heuristic reasoning and the theory of probability. American Mathematical Monthly, 48:450–465, 1941.
[Popper, 1959] Karl Popper. The Logic of Scientific Discovery. Basic Books, New York, 1959. (2nd ed. 1968, New York: Harper and Row).
[Prevost and L’Huilier, 1799a] Pierre Prevost and S. A. L. L’Huilier. Sur les probabilités. Mémoires de l’Académie Royale de Berlin 1796, pages 117–142, 1799.
[Prevost and L’Huilier, 1799b] Pierre Prevost and S. A. L. L’Huilier. Mémoire sur l’art d’estimer la probabilité des causes par les effets. Mémoires de l’Académie Royale de Berlin, 1796:3–24, 1799.
[Ramsey, 1931] Frank Plumpton Ramsey. Truth and probability. In R. B. Braithwaite, editor, The Foundations of Mathematics and Other Logical Essays, pages 156–198. Routledge and Kegan Paul, London, 1931. Read before the Cambridge Moral Sciences Club, 1926.
[Romeijn, 2006] Jan Willem Romeijn. Analogical predictions for explicit similarity. Erkenntnis, 64:253–280, 2006.
[Savage, 1954] Leonard J. Savage. The Foundations of Statistics. John Wiley, New York, 1954. Reprinted 1972, New York: Dover.
[Schilpp, 1963] P. A. Schilpp, editor. The Philosophy of Rudolph Carnap. Open Court, La Salle, IL, 1963.
[Schwartz, 2009] Robert Schwartz. Goodman and the demise of the syntactic model. In Dov Gabbay, Stephan Hartmann, and John Woods, editors, Handbook of the History of Logic, volume 10. Elsevier, London, 2009.
[Shafer, 1985] Glenn Shafer. Conditional probability. International Statistical Review, 53:261–275, 1985. [Skyrms, 1987] Brian Skyrms. On the principle of total evidence with and without observation sentences. In Logic, Philosophy of Science and Epistemology: Proceedings of the 11th International Wittgenstein Symposium, pages 187–195. Hölder–Pichler–Tempsky, 1987. [Skyrms, 1990] Brian Skyrms, editor. The Dynamics of Rational Deliberation. Harvard University Press, Cambridge, MA, 1990. [Skyrms, 1991] Brian Skyrms. Carnapian inductive logic for Markov chains. Erkenntnis, 35:439–460, 1991. [Skyrms, 1993] Brian Skyrms. Analogy by similarity in hypercarnapian inductive logic. In J. Earman, A. I. Janis, G. J. Massey, and N. Rescher, editors, Philosophical Problems of the Internal and External Worlds: Essays Concerning the Philosophy of Adolf Grünbaum, pages 273–282. Pittsburgh University Press, Pittsburgh, 1993. [Skyrms, 1996] Brian Skyrms. Inductive logic and Bayesian statistics. In Statistics, Probability, and Game Theory: Papers in Honor of David Blackwell, volume 30 of IMS Lecture Notes — Monograph Series, pages 321–336. Institute of Mathematical Statistics, 1996. [Skyrms, 2006] Brian Skyrms. Diachronic coherence and radical probabilism. Philosophy of Science, 73:959–968, 2006. [Sprenger, 2009] Jan Sprenger. Hempel and the paradoxes of confirmation. In Dov Gabbay, Stephan Hartmann, and John Woods, editors, Handbook of the History of Logic, volume 10. Elsevier, London, 2009. [Stalker, 1994] Douglas Stalker. Grue! The New Riddle of Induction. Open Court, Chicago, 1994. [Stigler, 1982] Stephen M. Stigler. Thomas Bayes’s Bayesian inference. Journal of the Royal Statistical Society Series A, 145:250–258, 1982. [Venn, 1866] John Venn. The Logic of Chance. Macmillan, London, 1866. (2nd ed. 1876, 3rd ed. 1888; reprinted 1962, Chelsea, New York). [Vranas, 2004a] P. B. M. Vranas. Hempel’s raven paradox: A lacuna in the standard Bayesian solution. 
The British Journal for the Philosophy of Science, 55:545–560, 2004. [Vranas, 2004b] P. B. M. Vranas. Have your cake and eat it too: the old principal principle reconciled with the new. Philosophy and Phenomenological Research, 69:368–382, 2004. [Waismann, 1930] Friedrich Waismann. Logische Analyse des Wahrscheinlichkeitsbegriffs. Erkenntnis, 1:228–248, 1930. [Wrinch and Jeffreys, 1919] Dorothy Wrinch and Harold Jeffreys. On certain aspects of the theory of probability. Philosophical Magazine, 38:715–731, 1919. [Wrinch and Jeffreys, 1921] Dorothy Wrinch and Harold Jeffreys. On certain fundamental principles of scientific enquiry. Philosophical Magazine, Series 6, 42:369–390, 1921. [Zabell, 1982] S. L. Zabell. W. E. Johnson’s “sufficientness postulate”. Annals of Statistics, 10:1091–1099, 1982. [Zabell, 1988] S. L. Zabell. Symmetry and its discontents. In W. L. Harper and B. Skyrms, editors, Causation, Chance, and Credence, volume 1, pages 155–190. Kluwer, Dordrecht, 1988. [Zabell, 1989] S. L. Zabell. The rule of succession. Erkenntnis, 31:283–321, 1989. [Zabell, 1992] S. L. Zabell. Predicting the unpredictable. Synthese, 90:205–232, 1992. [Zabell, 1995] S. L. Zabell. Characterizing Markov exchangeable sequences. Journal of Theoretical Probability, 8:175–178, 1995. [Zabell, 1996] S. L. Zabell. Confirming universal generalizations. Erkenntnis, 45:267–283, 1996. [Zabell, 1997] S. L. Zabell. The continuum of inductive methods revisited. In John Earman and John D. Norton, editors, The Cosmos of Science: Essays of Exploration, Pittsburgh-Konstanz Series in the Philosophy and History of Science, pages 351–385. University of Pittsburgh Press/Universitätsverlag Konstanz, 1997. [Zabell, 2007] S. L. Zabell. Carnap on probability and induction. In Michael Friedman and Richard Creath, editors, The Cambridge Companion to Carnap, pages 273–294. Cambridge University Press, 2007.
THE DEVELOPMENT OF THE HINTIKKA PROGRAM

Ilkka Niiniluoto

One of the highlights of the Second International Congress for Logic, Methodology, and Philosophy of Science, held in Jerusalem in 1964, was Jaakko Hintikka’s lecture “Towards a Theory of Inductive Generalization” (see [Hintikka, 1965a]). Two years later Hintikka published a two-dimensional continuum of inductive probability measures (see [Hintikka, 1966]), and ten years later he announced an axiomatic system with K ≥ 2 parameters (see [Hintikka and Niiniluoto, 1976]). These new original results showed once and for all the possibility of systems of inductive logic where genuine universal generalizations have non-zero probabilities in an infinite universe. Hintikka not only disproved Karl Popper’s thesis that inductive logic is inconsistent (see [Popper, 1959; 1963]), but he also gave a decisive improvement of the attempts of Rudolf Carnap to develop inductive logic as the theory of partial logical implication (see [Carnap, 1945; 1950; 1952]). Hintikka’s measures have later found rich applications in semantic information theory, theories of confirmation and acceptance, cognitive decision theory, analogical inference, theory of truthlikeness, and machine learning. The extensions and applications have reconfirmed — pace the early evaluation of Imre Lakatos [1974] — the progressive nature of this research program in formal methodology and philosophy of science.
1 INDUCTIVE LOGIC AS A METHODOLOGICAL RESEARCH PROGRAM
Imre Lakatos [1968a] proposed that Carnap’s inductive logic should be viewed as a methodological “research programme”. Such programs, both in science and methodology, are characterized by a “hard core” of basic assumptions and a “positive heuristics” for constructing a refutable “protective belt” around the irrefutable core. Their progress depends on the problems that they originally set out to solve and later “problem shifts” in their dynamic development. In a paper written in 1969, Lakatos claimed that Popper had achieved “a complete victory” in his attack against “the programme of an a priori probabilistic inductive logic or confirmation theory”, although, he added, “inductive logic, displaying all the characteristics of a degenerating research programme, is still a booming industry” [Lakatos, 1974, p. 259].1 A similar position is still advocated by David Miller [1994], one of the leading Popperian critical rationalists. 1 If the emphasis is on the term “a priori”, Hintikka agrees with Lakatos. However, we shall see that Hintikka’s “victory” over Carnap is completely different from Popper’s.
Handbook of the History of Logic. Volume 10: Inductive Logic. Volume editors: Dov M. Gabbay, Stephan Hartmann and John Woods. General editors: Dov M. Gabbay and John Woods. © 2011 Elsevier B.V. All rights reserved.
In this paper, I argue for a different view about inductive logic (see also [Niiniluoto, 1973; 1983]). Hintikka’s account of inductive generalization was a meeting point of several research traditions, and as a progressive turn it opened new important paths in Bayesian methodology, epistemology, and philosophy of science. Its potential in artificial intelligence is still largely unexplored. The roots of inductive logic go back to the birth of probability calculus in the middle of the seventeenth century. Mathematical probabilities were interpreted as relative frequencies of repeatable events, as objective degrees of possibility, and as degrees of certainty [Hacking, 1975]. For the classical Bayesians, like the determinist P. S. Laplace in the late eighteenth century, probability was relative to our ignorance of the true causes of events. The idea of probabilities as rational degrees of belief was defended in the 19th century by Stanley Jevons. This Bayesian interpretation was reborn in the early 20th century in two different forms. The Cambridge school, represented by W. E. Johnson, J. M. Keynes, C. D. Broad, and Harold Jeffreys, treated inductive probability P (h/e) as a logical relation between two sentences, a hypothesis h and evidence e. Related logical views, anticipated already by Bernard Bolzano in the 1830s, were expressed by Ludwig Wittgenstein and Friedrich Waismann. The school of subjective or personal probability interpreted degrees of belief as coherent betting ratios (Frank Ramsey, Bruno de Finetti) (see [Skyrms, 1986]). In Finland, pioneering work on “probability logic” in the spirit of logical empiricism was published by Eino Kaila in the 1920s (see [Kaila, 1926]). Kaila’s student Georg Henrik von Wright wrote his doctoral dissertation on “the logical problem of induction” in 1941 (see [von Wright, 1951; 1957]). 
Von Wright, with influences from Keynes and Broad, tried to solve the problem of inductive generalization by defining the probability of a universal statement in terms of relative frequencies of properties (see [Hilpinen, 1989; Festa, 2003; Niiniluoto, 2005c]). Von Wright was the most important teacher of Jaakko Hintikka, who in turn, with his own students, continued the Finnish school of induction. Rudolf Carnap came relatively late to the debates about probability and induction. Karl Popper had rejected induction in Logik der Forschung in 1934 (see [Popper, 1959]), and he was also sharply critical of the frequentist probability logic of Hans Reichenbach. Kaila had sympathies with Reichenbach’s empiricist approach. In a letter to Kaila on January 28, 1929, Carnap explained that he would rather seek a positivist solution where “probability inferences are equally analytic (tautologous) as other (syllogistic) inferences” (see [Niiniluoto, 1985/1986]). Carnap — influenced by the ideas of Keynes, Jeffreys, and Waismann on objective inductive probabilities, and to the disappointment of Popper — started to develop his views about probability in 1942–44 (see [Carnap, 1945]). Carnap’s Logical Foundations of Probability (LFP, 1950) gave a detailed and precise account of the inductive probability measure c∗, which is a generalization of Laplace’s famous “rule of succession”. In 1952 Carnap published A Continuum of Inductive Methods. Its class of measures, defined relative to one real-valued parameter λ, contained c∗ only as a special case. Another special case c†, proposed
by Bolzano and Wittgenstein, was rejected by Carnap since it does not make learning from experience possible. The same point had been expressed in the 19th century by George Boole and Charles Peirce in their criticism of Bayesian probabilities. With John Kemeny, Carnap further showed how the λ-continuum can be justified on an axiomatic basis (see [Kemeny, 1963]). Part of Carnap’s counterattack against Popper’s praise of improbability was based on the new exact theory of semantic information that he developed with Yehoshua Bar-Hillel (see [Carnap and Bar-Hillel, 1952]). Ian Hacking [1971] has argued that the typical assumptions of Carnap’s inductive logic can be found already in the works of G. W. F. Leibniz in the late 17th century:

(L1) There is such a thing as non-deductive evidence.

(L2) ‘Being a good reason for’ is a relation between propositions.

(L3) There is an objective and formal measure of the degree to which one proposition is evidence for another.
These assumptions can be found also in Keynes’ A Treatise on Probability (1921). Carnap formulated them in the Preface to LFP as follows:

(C1) All inductive reasoning is reasoning in terms of probability.

(C2) Inductive logic is the same as probability logic.

(C3) The concept of inductive probability or degree of confirmation is a logical relation between two statements or propositions, a hypothesis and evidence.

(C4) The frequency concept of probability is used in statistical investigations, but it is not suitable for inductive logic.

(C5) All principles and theorems of inductive logic are analytic.

(C6) The validity of induction is not dependent upon any synthetic presuppositions.
The treatment of probabilities of the form P(h/e), where h is a hypothesis and e is evidence, connects Carnap to the Bayesian tradition. Against the subjectivist school, Carnap’s intention was to eliminate all “psychologism” from inductive logic — just as Gottlob Frege had done in the case of deductive logic. Carnap’s C4 accepts a probabilistic dualism with both physical and epistemic probabilities. By C5 and C6, probability as partial entailment is independent of all factual assumptions. In practical applications of inductive logic, degrees of confirmation P(h/e) have to be calculated relative to the total evidence e available to scientists. Carnap’s commitment to probabilistic induction (C1 and C2) leaves open the question whether the basic notion of induction is support (e confirms h) or acceptance (h is rationally acceptable on e). According to Carnap, David Hume was
right in denying the validity of inductive inferences, so that the proper task of inductive logic is to evaluate probabilities of the form P(h/e). Such probabilities can then be used in practical decision making by applying the rule of Maximizing Expected Utility (cf. [Stegmüller, 1973]). In normative decision theory, there are “quasi-psychological” counterparts to “purely logical” inductive probabilities (see [Carnap, 1971]). Carnap was followed by Richard Jeffrey in the denial of inductive acceptance rules (cf. the debate in [Lakatos, 1968b]). In the second 1962 edition of LFP, Carnap defended himself against the bitter attacks of Popper by distinguishing two senses of “degree of confirmation”: posterior probability P(h/e) and increase of probability P(h/e) − P(h). This was a clarification of the core assumption C3. In LFP, Carnap demanded that inductive logic should give an account of the following types of cases:

(a) Direct inference: from a population to a sample
(b) Predictive inference: from a sample to another sample
(c) Inference by analogy: from one individual to another by their known similarity
(d) Inverse inference: from a sample to a population
(e) Universal inference: from a sample to a universal hypothesis.
He showed how the measure c∗ helps to solve these problems. But Carnap did not wish to claim that c∗ is “perfectly adequate” or the “only adequate” explicatum of inductive probability (LFP, p. 563). So his method can be characterized by the following heuristic principle:

(C7) Use logic to distinguish alternative states of affairs that can be expressed in a given formal language L. Then define inductive probabilities for sentences of L by taking advantage of symmetry assumptions concerning such states of affairs.
The systematic applications of C7 distinguish the Carnapian program of inductive logic from the more general Bayesian school which admits all kinds of prior probability measures. As we shall see in the next section, Hintikka’s work on inductive logic relies on the heuristic principle C7 in a novel way, so that the problem (e) of universal inference gets a new solution. Hintikka’s system satisfies the core assumptions C1, C2, and C4. But, in his reply to Mondadori [1987], Hintikka himself urges that his studies did not just amount to “tinkering with Carnap’s inductive logic” or removing some “anomalies” from it, but rather “it means to all practical purposes a refutation of Carnap’s philosophical program in developing his inductive logic” [Hintikka, 1987b]. What Hintikka has in mind is the “logicism” involved in the Carnapian core assumptions C3, C5, and C6. Hintikka’s own move is to replace C3 and C6 with more liberal formulations:
(C3′) Inductive probability P(h/e) depends on the logical form of hypothesis h and evidence e.

(C6′) Inductive probabilities, and hence inductive probabilistic inferences, may depend on extra-logical factors.

Here C6′ allows that inductive inferences may have contextual or “local” presuppositions (cf. [Bogdan, 1976]). Inductive probability is thus not a purely syntactical or semantical notion, but its explication involves pragmatic factors. However, in the spirit of what Hintikka calls “logical pragmatics”, C3′ and C6′ should be combined with C7 so that the dependence and interplay of logical and extra-logical factors is expressed in an explicit and precise way. Then it turns out that C5 is ambiguous: some principles of induction may depend on pragmatic boundary conditions (like the extra-logical parameters), while some mathematical theorems of inductive logic turn out to be analytically true.
2 FROM CARNAP TO HINTIKKA’S TWO-DIMENSIONAL CONTINUUM
In inductive logic, probabilities are at least partly determined by symmetry assumptions concerning the underlying language [Carnap, 1962; Hintikka and Suppes, 1966; Niiniluoto and Tuomela, 1973]. In Carnap’s λ-continuum the probabilities depend also on a free parameter λ which indicates the weight given to logical or language-dependent factors over and above purely empirical factors (observed frequencies) (see [Carnap, 1952]). Carnap’s λ thus serves as an index of caution in singular inductive inference. In Hintikka’s 1966 system a further parameter α is added to regulate the speed at which positive instances increase the probability of a generalization. More precisely, let Q1, ..., QK be a K-fold classification system with mutually exclusive predicates, so that every individual in the universe U has to satisfy one and only one Q-predicate. A typical way of creating such a classification system is to assume that a finite monadic first-order language L contains k basic predicates M1, ..., Mk, and each Q-predicate is defined by a k-fold conjunction of positive or negative occurrences of the M-predicates: (±)M1x&...&(±)Mkx. Then K = 2^k. Each predicate expressible in language L is definable as a finite disjunction of Q-predicates. Carnap generalized this approach to the case where the dichotomies {Mj, ∼Mj} are replaced by families of mutually exclusive predicates Mj = {Mj1, ..., Mjm}, and a Q-predicate is defined by choosing one element from each family Mj (see [Jeffrey, 1980]). For example, one family could be defined by colour predicates, another by a quantity taking discrete values (e.g., age). Assume that language L contains N individual names a1, ..., aN. Let L be interpreted in universe U with size N, so that each object in U has a unique name in L. A state description relative to individuals a1, ..., aN tells for each ai which Q-predicate it satisfies in universe U. 
A structure description tells how many individuals in U satisfy each Q-predicate. Every sentence within this first-order
monadic framework L can be expressed as a disjunction of state descriptions; in particular, a structure description is a disjunction of state descriptions that can be obtained from each other just by permuting individual constants. The state descriptions in L that entail sentence h constitute the range R(h) of h. Regular probability measures m for L define a non-zero probability m(s) for each state description s of L. For each sentence h in L, m(h) is the sum of all measures m(s), s ∈ R(h). A regular confirmation function c is then defined as conditional probability:

(1) c(h/e) = m(h&e)/m(e).
Let now en describe a sample of n individuals in terms of the Q-predicates, and let ni ≥ 0 be the observed number of individuals in cell Qi (so that n1 + ... + nK = n). Carnap’s λ-continuum takes the posterior probability c(Qi(an+1)/en) that the next individual an+1 will be of kind Qi to be

(2) c(Qi(an+1)/en) = (ni + λ/K)/(n + λ).
This value is known as the representative function of an inductive probability measure. The probability (2) is a weighted average of ni/n (observed relative frequency of individuals in Qi) and 1/K (the relative width of predicate Qi). The choice λ = 0 gives Reichenbach’s Straight Rule, which allows only the empirical factor ni/n to determine posterior probability. The choice λ = ∞ would give the range measure proposed in Wittgenstein’s Tractatus, which divides probability evenly among state descriptions, but it makes the inductive probability (2) equal to 1/K, which is a priori independent of the evidence e and, hence, does not allow for learning from experience. When λ < ∞, predictive probability is asymptotically determined by the empirical factor:

(3) [c(Qi(an+1)/en) − ni/n] → 0, when n → ∞.

Principle (3) is known as Reichenbach’s Axiom [Carnap and Jeffrey, 1971; Kuipers, 1978b]. It is known that (3) implies the principle of Positive Instantial Relevance:

(4) c(Qi(an+2)/en&Qi(an+1)) > c(Qi(an+1)/en).

The choice λ = K in (2) gives Carnap’s measure c∗, which allocates probability evenly to all structure descriptions. The formula

(5) c∗(Qi(an+1)/en) = (ni + 1)/(n + K)

includes as a special case (ni = n, K = 2) Laplace’s Rule of Succession

(6) (n + 1)/(n + 2).
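The special cases of the representative function (2) are easy to check numerically. The following sketch (the function name is my own, not from the text) computes (2) in exact rational arithmetic:

```python
from fractions import Fraction

def carnap_pred(n_i, n, K, lam):
    # Carnap's representative function (2): probability that the next
    # individual falls in cell Q_i, given n_i of n observed individuals there.
    # (n_i + lam/K)/(n + lam), rewritten to stay in integer arithmetic.
    return Fraction(n_i * K + lam, K * (n + lam))

# lam = K recovers Carnap's c* of (5): (n_i + 1)/(n + K); with n_i = n and
# K = 2 this is Laplace's Rule of Succession (6): (n + 1)/(n + 2).
print(carnap_pred(5, 5, 2, 2))   # Laplace: 6/7
# lam = 0 is Reichenbach's Straight Rule n_i/n.
print(carnap_pred(3, 10, 4, 0))  # 3/10
```

For any 0 < λ < ∞ the value interpolates between the empirical factor ni/n and the logical factor 1/K, as the weighted-average reading of (2) suggests.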
Laplace derived this probability of the next favourable instance after n positive ones by assuming that all structural compositions of an urn with white and black balls are equally probable. If the Q-predicates are defined so that they have different relative widths qi, such that q1 + ... + qK = 1, then (2) is replaced by

(2′) (ni + qiλ)/(n + λ)
[Carnap and Stegmüller, 1959; Carnap, 1980]. (2) is obtained from (2′) by choosing qi = 1/K for all i = 1, ..., K.2 If the universe U is potentially infinite, so that its size N may grow without limit, the probability c(h/e) is defined as the limit of the values (1) in a universe of size N (when N → ∞). Then it turns out that all measures of Carnap’s λ-continuum assign the probability zero to universal generalizations h on singular evidence e. Carnap admitted that such a result “may seem astonishing at first sight”, since in science it has been traditional to speak of “well-confirmed laws” (see [Carnap, 1945]). But he immediately concluded that “the role of universal sentences in the inductive procedures of science has generally been overestimated”, and proposed to measure the instance confirmation of a law h by the probability that a new individual not mentioned in evidence e fulfills the law h. Carnap’s attempted reduction of universal inference to predictive singular inference did not convince all his colleagues. In Lakatosian terms, this move was a regressive problem-shift in the Carnapian program. One of those who criticized Carnap’s proposal was G. H. von Wright [1951a]. Von Wright knew, on the basis of Keynes, that universal generalizations h can be confirmed by positive singular evidence en = i1&...&in entailed by h if two conditions are satisfied: (i) the prior probability P(h) is not minimal, and (ii) new confirmations of h are not maximally probable relative to previous confirmations [von Wright, 1951b]. The Principal Theorem of Confirmation thus states the following:
It also fails to solve the dispute of Keynes and Nicod about the conditions for the convergence of posterior probability to its maximum value one with increasing positive evidence: (8) P (h/en ) → 1, when n → ∞. It is remarkable that Popper, the chief opponent of inductive logic, also argued for the zero logical probability of universal laws, i.e., the same result that shattered Carnap’s system (see [Popper, 1959, appendices vii and viii]). Lakatos [1968a] 2 Carnap’s probabilities (2) and (2 ) are known to statisticians as symmetric and nonsymmetric Dirichlet distributions (see [Festa, 1993]). Skyrms [1993a] has observed that statisticians have extended such distributions to a “value continuum” (i.e., the discrete set of Qpredicates is replaced by a subclass of a continuous space).
called the assumption that P(h) > 0 for genuinely universal statements h “the Jeffreys-Keynes postulate”, and Carnap’s thesis about the dispensability of laws in inductive logic “the weak atheoretical thesis”. If indeed P(h) = 0 for laws h, then P(h/e) = 0 for any evidence e; and equally well all other measures of confirmation (like Carnap’s difference measure) or corroboration (cf. [Popper, 1959]) are trivialized (see [Niiniluoto and Tuomela, 1973, pp. 212–216, 242–243]). Carnap’s notion of instance confirmation restricts the applications of inductive logic to singular sentences. A related proposal is to accept the Carnapian framework for universal generalization in finite universes. This move has been defended by Mary Hesse [1974]. However, the applications of inductive logic would then depend on synthetic assumptions about the size of the universe — against the principle C6. Moreover, the Carnapian probabilities of finite generalizations behave qualitatively in a wrong way: the strongest confirmation is given to those universal statements that allow many Q-predicates, even when evidence seems to concentrate only on a few Q-predicates (see [Hintikka, 1965a; 1975]). In his “Replies” in the Schilpp volume (see [Schilpp, 1963, p. 977]), Carnap said that he had constructed confirmation functions which do not give zero probabilities to universal generalizations, but “they are considerably more complicated than those of the λ-system”. He never published any of these results. In this problem situation, Hintikka’s presentation of his “Jerusalem system” in the 1964 Congress was a striking novelty. Hintikka solves the problem of universal generalization by dividing probability among constituents. He had learned this logical tool during the lectures of von Wright in Helsinki in 1947–1948. 
Von Wright characterized logical truth by means of “distributive normal forms”: a tautology of monadic predicate logic allows all constituents, which are mutually exclusive descriptions of the constitution of the universe. Hintikka’s early insight in 1948, at the age of 21, was the way of extending such distributive normal forms to the entire first-order logic with relations (see [Bogdan, 1987; Hintikka, 2006, p. 9]). This idea resulted in 1953 in a doctoral dissertation on distributive normal forms. Hintikka was thus well equipped to improve Carnap’s system of induction. Let L again be a monadic language with Q-predicates Q1, ..., QK. A constituent C^w tells which Q-predicates are non-empty and which empty in universe U. The logical form of a constituent is thus

(9) (±)(∃x)Q1(x)&...&(±)(∃x)QK(x).

If Qi, i ∈ CT, are precisely those Q-predicates claimed to be non-empty by (9), then (9) can be rewritten in the form

(10) ⋀_{i∈CT} (∃x)Qi(x) & (x)[⋁_{i∈CT} Qi(x)].
The cardinality of CT is called the width of constituent (10). Often a constituent with width w is denoted by C^w. Then C^K is the maximally wide constituent which claims that all Q-predicates (i.e., all kinds of individuals which can be described
by the resources of language L) are exemplified in the universe. Note that if C^K is true in universe U, then there are no true universal generalizations in L. Such a universe U is atomistic with respect to L, and C^K is often referred to as the atomistic constituent of L. The number of different constituents of L is 2^K. Among them we have the empty constituent of width zero; it corresponds to a contradiction. Other constituents are maximally consistent and complete theories in L: each of them specifies a “possible world” by means of primitive monadic predicates, sentential connectives and quantifiers. Thus, constituents are mutually exclusive, and the disjunction of all constituents is a tautology. Note that in a language with finitely many individual constants each constituent can be expressed by a disjunction of state descriptions or by a disjunction of structure descriptions. Each consistent generalization h in L (i.e., a quantificational statement without individual constants) can be expressed as a finite disjunction of constituents:

(11) h = ⋁_{i∈Ih} Ci
(11) is the distributive normal form of h. Constituents are strong generalizations in L, and other generalizations in L are weak. By (11), the probability of generalizations reduces to the probabilities of constituents. As above, let evidence e be a description of a finite sample of n individuals, and let c be the number of different kinds of individuals observed in e. Sometimes we denote this evidence by ecn. Then a constituent C^w of width w is compatible with ecn only if c ≤ w ≤ K. By Bayes’s formula,

(12) P(C^w/e) = P(C^w)P(e/C^w) / Σ_{i=0}^{K−c} (K−c choose i) P(C^{c+i})P(e/C^{c+i}).
Hence, to determine the posterior probability P(C^w/e), we have to specify the prior probabilities P(C^w) and the likelihoods P(e/C^w). In his first papers, Hintikka followed the heuristic principle C7. His Jerusalem system is obtained by first dividing probability evenly to all constituents and then dividing the probability-mass of each constituent evenly to all state descriptions belonging to it [Hintikka, 1965a]. His combined system is obtained by first dividing probability evenly to all constituents, then evenly to all structure descriptions satisfying a constituent, and finally evenly to state descriptions belonging to a structure description [Hintikka, 1965b]. In both cases, the prior probabilities P(C^w) of all constituents are equal to 1/2^K. It turns out that there is one and only one constituent which has asymptotically the probability one when the size n of the sample e grows without limit. This is the “minimal” constituent C^c which states that the universe U instantiates precisely those c Q-predicates which are exemplified in the sample e:

(13) P(C^c/ecn) → 1, if n → ∞ and c is fixed;
     P(C^w/ecn) → 0, if n → ∞, c is fixed, and w > c.
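The combinatorics of constituents described above (2^K constituents for K Q-predicates, with widths from 0 to K) can be checked by direct enumeration. In this sketch (my own illustration, not from the text) a constituent is identified with the set CT of its non-empty cells:

```python
from itertools import combinations

def constituents(K):
    # Every subset CT of {1,...,K} determines one constituent (9)/(10);
    # the empty set gives the contradictory constituent of width zero.
    cells = range(1, K + 1)
    return [frozenset(ct) for w in range(0, K + 1)
            for ct in combinations(cells, w)]

# With k = 2 basic predicates there are K = 2**2 = 4 Q-predicates
# and 2**K = 16 constituents, one per subset of cells.
cs = constituents(4)
print(len(cs))                    # 16
print(max(len(ct) for ct in cs))  # width of the atomistic constituent C^K: 4
```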
It follows from (13) that a constituent which claims some uninstantiated Q-predicates to be exemplified in U will asymptotically receive the probability zero. A weak generalization h in L will receive asymptotically the probability one if and only if its normal form (11) includes the constituent C^c:

(14) Assuming that n → ∞ and c is fixed, P(h/ecn) → 1 iff C^c ⊢ h.

The Keynes-Nicod debate thus receives an answer by Hintikka’s probability assignment. In his two-dimensional continuum of inductive methods, Hintikka [1966] was able to formulate a system which contains as special cases his earlier measures as well as the whole of Carnap’s λ-continuum. Hintikka proposes that likelihoods relative to C^w are calculated in the same way as in Carnap’s λ-continuum (cf. (2)), but by restricting the universe to the w Q-predicates that are instantiated by C^w. Thus, if e is compatible with C^w, we have

(15) P(Qi(an+1)/e&C^w) = (ni + λ/w)/(n + λ).
By (15), we can calculate that

(16) P(e/C^w) = [Γ(λ)/Γ(n + λ)] Π_{j=1}^{c} [Γ(nj + λ/w)/Γ(λ/w)],

where Γ is the Gamma-function. Note that Γ(n + 1) = n!. For prior probabilities Hintikka proposes that P(C^w) should be chosen as proportional to the Carnapian probability that a set of α individuals is compatible with C^w. This leads to the assignment

(17) P(C^w) = [Γ(α + wλ/K)/Γ(wλ/K)] / Σ_{i=0}^{K} (K choose i) [Γ(α + iλ/K)/Γ(iλ/K)].
The posterior probability P(C^w/e) can then be calculated by (12), (16), and (17). If α = 0, then (17) gives equal priors to all constituents:

(18) P(C^w) = 1/2^K for all C^w.

The Jerusalem system is then obtained by letting λ → ∞. A small value of α is thus an indication of the strength of a priori considerations in inductive generalization — just as a small λ indicates strong weight to a priori considerations in singular inference. But if 0 < α < ∞, then we have

(19) P(C^w) < P(C^{w′}) iff w < w′.
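Formulas (12) and (16)–(18) can be combined into a short numerical sketch. The code below is my own illustration, not Hintikka's: the normalizer common to all priors in (17) is dropped, since it cancels in Bayes's formula (12), and the computation is done in log space for stability with large samples:

```python
from math import comb, lgamma, exp

def log_likelihood(ns, lam, w):
    # log P(e/C^w) by (16); ns lists the counts n_1,...,n_c of the
    # c observed cells, and C^w allows w >= c cells.
    n = sum(ns)
    val = lgamma(lam) - lgamma(n + lam)
    for nj in ns:
        val += lgamma(nj + lam / w) - lgamma(lam / w)
    return val

def log_prior(w, K, lam, alpha):
    # log P(C^w) by (17), up to the normalizer common to all constituents.
    return lgamma(alpha + w * lam / K) - lgamma(w * lam / K)

def posterior(w, ns, K, lam, alpha):
    # P(C^w/e) by Bayes's formula (12): comb(K - c, i) counts the
    # constituents of width c + i compatible with the evidence.
    c = len(ns)

    def score(v):
        return log_prior(v, K, lam, alpha) + log_likelihood(ns, lam, v)

    terms = [(comb(K - c, i), score(c + i)) for i in range(K - c + 1)]
    m = max(s for _, s in terms)
    return exp(score(w) - m) / sum(b * exp(s - m) for b, s in terms)

# With growing but similar evidence the minimal constituent C^c
# comes to dominate, in line with (13):
print(posterior(2, (1000, 1000), K=4, lam=2.0, alpha=1.0))
```

With alpha = 0 the two lgamma terms of `log_prior` coincide, so all priors are equal, as (18) states.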
Given evidence ecn which has realized c Q-predicates, the minimal constituent C^c compatible with ecn claims that the universe is similar to the sample ecn. This constituent is the simplest of the non-refuted constituents in the sense of ontological parsimony (cf. [Niiniluoto, 1994]). By (19) it is initially the least probable, but by (13) it is the only constituent that receives asymptotically the probability one with increasing but similar evidence. If λ is chosen to be a function of w, so that λ(w) = w, then Hintikka’s generalized combined system is obtained; the original combined system of [Hintikka, 1965b] is a special case with α = 0. The formulas of the two-dimensional system are reduced to simpler equations:

(15′) P(Qi(an+1)/e&C^w) = (ni + 1)/(n + w).

(16′) P(e/C^w) = [(w − 1)!/(n + w − 1)!] Π_{j=1}^{c} nj!

(17′) P(C^w) = [(α + w − 1)!/(w − 1)!] / Σ_{i=0}^{K} (K choose i) [(α + i − 1)!/(i − 1)!].
Hence, by (12), (20) P (C w /ecn ) =
(α+w−1)! (n+w−1)!
. K−c K−c (α+c+i−1)! i (n+c+i−1)! i=0
In particular, when n = α, (20) reduces to K−c
1
=
K−c
i
1 2K−c
.
i=0
If n and α are sufficiently large in relation to K, then using the approximation (m + n)! m!mn , where m is sufficiently large in relation to n2 (see [Carnap, 1950, p. 150]), we get from (20) an approximate form of P (C w /e): (21) P (C w /ecn )
(α/n)w−c . (1 + α/n)K−c
(See [Niiniluoto, 1987, p. 88].) Formula (21) shows clearly the asymptotic behaviour (13) of the posterior probabilities when n increases without limit. The representative function of the generalized combined system is K−c K−c (α+c+i−1)! i (n+c+i)! i=0 . (22) P (Qi (an+1 /ecn ) = (ni + 1) K−c K−c (α+c+i−1)! i (n+c+i−1)! i=0
322
Ilkka Niiniluoto
If h is a universal generalization in L which claims that certain b Q-predicates are empty, and if h is compatible with e, then

(23) P(h/e_n^c) = [Σ_{i=0}^{K−b−c} (K−b−c choose i)·(α + c + i − 1)!/(n + c + i − 1)!] / [Σ_{i=0}^{K−c} (K−c choose i)·(α + c + i − 1)!/(n + c + i − 1)!].

Approximately, for sufficiently large α and n, (23) gives

(24) P(h/e) ≈ 1/(1 + α/n)^b.
In agreement with (14), the value of (24) approaches one when n increases without limit. On the other hand, if α → ∞, we can see by (20) that P(C^w/e) → 1 if and only if w = K. In fact, the same result holds for the prior probabilities of constituents:

(25) If α → ∞, then P(C^K) → 1 and P(C^w) → 0 for w < K.

More generally, we have the result that the probabilities of Hintikka's λ–α-continuum approach the probabilities of Carnap's λ-continuum when α → ∞. The result (25) explains why the probabilities of all universal generalizations are zero for all of Carnap's measures: his probabilities of universal generalizations are fixed purely a priori in the sceptical fashion that the prior probability of the atomistic constituent C^K is one. Carnap's λ-continuum is thus the only special case (α = ∞) of Hintikka's two-dimensional continuum in which the asymptotic behaviour (13) of posterior probabilities does not hold.

3 AXIOMATIC INDUCTIVE LOGIC
The aim of axiomatic inductive logic is to find general rationality principles which narrow down the class of acceptable probability measures. The first axiomatic treatment of this kind was presented by W. E. Johnson [1932] (cf. [Pietarinen, 1972]). His main results were independently, and without reference to him, rediscovered by Kemeny and Carnap in 1952–54 (see [Schilpp, 1963]). Let P be a real-valued function defined for pairs of sentences (h, e), where e is consistent, of a finite monadic language L. Assume that P satisfies the following:

(A1) Probability axioms

(A2) Finite regularity: For singular sentences h and e, P(h/e) = 1 only if e ⊃ h.

(A3) Symmetry with respect to individuals: The value of P(h/e) is invariant with respect to any permutation of individual constants.

(A4) Symmetry with respect to predicates: The value of P(h/e) is invariant with respect to any permutation of the Q-predicates.
(A5) λ-principle: There is a function f such that P(Q_i(a_{n+1})/e) = f(n_i, n).

For the advocates of personal probability, A1 is the only general constraint on rational degrees of belief. It guarantees that probabilities serve as coherent betting ratios. A2 excludes that some contingent singular sentence has the prior probability one. A3 is equivalent to de Finetti's condition of exchangeability (cf. [Carnap and Jeffrey, 1971; Hintikka, 1971]). It entails that the probability P(Q_i(a_{n+1})/e) depends upon evidence e only through the numbers n_1, ..., n_K, so that it is independent of the order of observing the individuals in e. A4 states that the Q-predicates are symmetrical: P(Q_i(a_j)) = 1/K for all i = 1, ..., K. A5 is Johnson's [1932] "sufficientness postulate", or Carnap's "axiom of predictive irrelevance". It states that the representative function P(Q_i(a_{n+1})/e) is independent of the numbers n_j, j ≠ i, of observed individuals in cells other than Q_i (as long as the sum n_1 + ... + n_K = n). The Kemeny–Carnap theorem states that axioms A1–A5 characterize precisely Carnap's λ-continuum with λ > 0: if A1–A5 hold for P, then

f(n_i, n) = (n_i + λ/K)/(n + λ), where λ = K·f(0, 1)/(1 − K·f(0, 1)).
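The λ-recovery in the theorem can be verified mechanically: starting from a Carnapian representative function, the expression K·f(0,1)/(1 − K·f(0,1)) returns the original λ. A small sketch (parameter values are arbitrary):

```python
def carnap_f(ni, n, K, lam):
    # Representative function of Carnap's lambda-continuum.
    return (ni + lam / K) / (n + lam)

# Recover lambda from the single value f(0, 1), as the theorem states.
for K in (2, 4, 8):
    for lam in (0.5, 1.0, 4.0):
        f01 = carnap_f(0, 1, K, lam)
        assert abs(K * f01 / (1 - K * f01) - lam) < 1e-9
```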
If K = 2, the proof requires the additional assumption that f is a linear function of n_i. The case λ = 0 is excluded by A2. By dropping A4, the function f(n_i, n) will have the form (2′). Hence, we see that a regular and exchangeable inductive probability measure is Carnapian if and only if it satisfies the sufficientness postulate A5. In particular, the traditional Bayesian approach of Laplace with probability c* satisfies A5. Axiom A5 is very strong, since it excludes that predictive singular probabilities P(Q_i(a_{n+1})/e_n^c) about the next instance depend upon the variety of the evidence e_n^c, i.e., upon the number c of cells Q_i such that n_i > 0. As the number of universal generalizations in L which evidence e falsifies is also a simple function of c, axiom A5 makes induction purely enumerative and excludes the eliminative aspects of induction (see [Hintikka, 1968b]). We have already seen that the representative function (22) of Hintikka's generalized combined system depends on c. The inability of Carnap's λ-continuum to deal with inductive generalization is thus an unhappy consequence of the background assumption A5. The Carnap–Kemeny axiomatization of Carnap's λ-continuum was generalized by Hintikka and Niiniluoto in 1974, who allowed that the inductive probability (2) of the next case being of type Q_i depends on the observed relative frequency n_i of kind Q_i and on the number c of different kinds of individuals in the sample e (see [Hintikka and Niiniluoto, 1976]):
(A6) c-principle: There is a function f such that P(Q_i(a_{n+1})/e_n^c) = f(n_i, n, c).

The number c expresses the variety of evidence e, and it also indicates how many universal generalizations e has already falsified. Hintikka and Niiniluoto proved that measures satisfying axioms A1–A4 and A6 constitute a K-dimensional system determined by the K parameters

λ = [K·f(1, K+1, K)/(1 − K·f(1, K+1, K))] − 1

γ_c = f(0, c, c), for c = 1, ..., K − 1.

Here λ > −K and

(26) 0 < γ_c ≤ (λ/K)/(c + λ).

This K-dimensional system is called the NH-system by Kuipers [1978b]. (See also [Niiniluoto, 1977].) The upper bound in (26) is equal to the value of the probability f(0, c, c) in Carnap's λ-continuum; let us denote it by δ_c. It turns out that, for infinite universes, the probability of the atomistic constituent C^K is

P(C^K) = (γ_1 · ... · γ_{K−1}) / (δ_1 · ... · δ_{K−1}).
Hence, P(C^K) = 1 iff γ_i = δ_i for all i = 1, ..., K − 1. In other words, Carnap's λ-continuum is the only special case of the K-dimensional system which does not attribute non-zero probabilities to some universal generalizations. Again, Carnap's system turns out to be biased in the sense that it assigns a priori the probability one to the atomistic constituent C^K, which claims all Q-predicates to be instantiated in the universe U. The reduction of all inductive probabilities to K parameters, which concern probabilities of very simple singular predictions, gives a counter-argument to Wolfgang Stegmüller's [1973] claim that it does not "make sense" to bet on universal generalizations (cf. [Hintikka, 1971]). In the K-dimensional system, a bet on a universal law is equivalent to a system of K bets on singular sentences on finite evidence. The parameter γ_c = f(0, c, c) expresses the predictive probability of finding a new kind of individual after c different successes. For such evidence e, the posterior probability of C^c approaches one when γ_c approaches zero. Further, P(C^c) decreases when γ_c increases. Parameter γ_w thereby serves as an index of caution for constituents of width w. While Hintikka's two-dimensional system has one index α of overall pessimism about the truth of constituents C^w, w < K, in the K-dimensional system there is a separate index of pessimism for each width w < K. The K-dimensional system allows more flexible distributions of prior probabilities of constituents than Hintikka's α–λ-continuum. For example, principle
(19) may be violated. One may divide prior probability equally first to sentences S^w (w = 0, ..., K) which state that there are w kinds of individuals in the universe. Such "constituent-structures" S^w are disjunctions of the (K choose w) constituents C^w of width w. This proposal was made by Carnap in his comment on Hintikka's system (see [Carnap, 1968]; cf. [Kuipers, 1978a]). Assuming that the parameters γ_c do not have their Carnapian values, one can show

(27) P(Q_i(a_{n+1})/e & C^w) = (n_i + λ/K)/(n + wλ/K).
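That (27) is a special case of the likelihood (15) can be checked by direct substitution: taking λ in (15) to be the function λ(w) = aw with a = λ/K reproduces (27). A quick numerical sketch (values arbitrary):

```python
def f15(ni, n, w, lam_tilde):
    # Likelihood (15) in Hintikka's two-dimensional system: (n_i + lam/w)/(n + lam).
    return (ni + lam_tilde / w) / (n + lam_tilde)

def f27(ni, n, w, K, lam):
    # Likelihood (27) in the K-dimensional system: (n_i + lam/K)/(n + w*lam/K).
    return (ni + lam / K) / (n + w * lam / K)

K, lam = 4, 3.0
a = lam / K
# Taking lambda(w) = a*w in (15) makes the two likelihoods coincide.
for ni, n, w in [(0, 1, 2), (3, 7, 3), (5, 12, 4)]:
    assert abs(f15(ni, n, w, a * w) - f27(ni, n, w, K, lam)) < 1e-12
```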
Comparison with formula (15) shows that, in addition to the λ-continuum, the intersection of the K-dimensional system and Hintikka's α–λ-system contains those members of the latter which satisfy the condition that λ as a function of w equals aw for some constant a > 0.³ The case with a = 1 is Hintikka's generalized combined system (cf. (15′)). This new way of motivating this system shows its naturalness. The relations of different inductive systems are studied in detail by Theo Kuipers [1978b].⁴ It follows from (27) that the K-dimensional system satisfies Reichenbach's Axiom (3) and Instantial Positive Relevance (4). The fundamental adequacy condition (13) of inductive generalization is satisfied whenever the parameters γ_i are chosen more optimistically than their Carnapian values:

(28) If γ_i < δ_i for i = c, ..., K − 1, then P(C^c/e) → 1 when n → ∞ and c is fixed.

This result shows again that the much discussed result of Carnap's λ-continuum, viz. the zero confirmation of universal laws, is really an accidental feature of a system of inductive logic. We get rid of this feature by weakening the λ-principle A5 to the c-principle A6.

4 EXTENSIONS OF HINTIKKA'S SYSTEM
Hintikka’s two-dimensional continuum was published in Aspects of Inductive Logic [1966], edited by Hintikka and Patrick Suppes. This volume is based on an International Symposium on Confirmation and Induction, held in Helsinki in late September 1965 as a continuation of an earlier seminar at Stanford University in the spring of the same year. Hintikka had at that time a joint appointment at Helsinki and Stanford. Besides essays on the paradoxes of confirmation (Max Black, Suppes, von Wright), the volume includes several essays on induction by 3 This class of inductive methods is mentioned by Hilpinen [1968, p. 65]. Kuipers [1978b] calls them SH-systems. 4 Zabell [1997] has developed Johnson’s axiomatization so that constituents of width one receive non-zero probabilities. This result is a special case of the K-dimensional system with γ 1 < δ1 .
Hintikka's Finnish students: Risto Hilpinen, Raimo Tuomela, and Juhani Pietarinen. Like Carnap's continuum, Hintikka's two-dimensional system is formulated for a monadic first-order language with finitely many predicates. As a minor technical improvement, the Q-predicates may have different widths (cf. (2′)). The Q-predicates may be defined by families of predicates in Carnap's style, so that they allow discrete quantitative descriptions. Moreover, the number of predicates may be allowed to be countably infinite (see [Kuipers, 1978b]). However, more challenging questions concern extensions of Hintikka's framework to languages which are essentially more powerful than monadic predicate logic. Hilpinen [1966] considers monadic languages with identity. In such languages it is possible to record that we have picked out different individuals in our evidence ("sampling without replacement"). Numerical quantifiers "there are at least d individuals such that" and "there are precisely d − 1 individuals such that" can be expressed by sentences involving a layer of d interrelated quantifiers. The maximum number of nested quantifiers in a formula is called its quantificational depth. Hence, by replacing existential quantifiers in formula (9) by numerical quantifiers, constituents of depth d can specify that each Q-predicate is satisfied by either 0, 1, ..., d − 1, or at least d individuals (see [Niiniluoto, 1987, p. 59]). A constituent of depth d splits into a disjunction of "subordinate" constituents at depth d + 1: the claim that there are at least d individuals in Q_i means that there are precisely d or at least d + 1 individuals in Q_i. For finite universes, monadic constituents with identity are equivalent to Carnap's structure descriptions, but expressed without individual constants.
Hilpinen extends Hintikka's Jerusalem system to the monadic language with identity by dividing the probability mass evenly to all constituents at depth d and then evenly to the state descriptions entailing a constituent. Hilpinen shows that, on the basis of this probability assignment, it is not reasonable to project that cells not instantiated in our evidence are occupied by more than d individuals. Also, it is not rational to project that observed singularities (i.e., cells with only one observed individual) are real singularities in the whole universe. However, all constituents according to which there are unobserved singularities in our universe have an equal degree of posterior probability on any evidence, and these constituents are as probable as the constituent which denies the existence of unobserved singularities. The last result is not intuitive. Hilpinen shows that it can be changed by an alternative probability assignment: distribute probability first evenly to constituents of depth 1, then evenly to all subordinate constituents of depth 2, etc. (cf. [Hintikka, 1965a]). Then the highest probability is given to the constituent which denies the existence of unobserved singularities. Tuomela [1966] shows that Hintikka's main result (13) about inductive generalization can be achieved in an ordered universe. The decision problem for a first-order language containing the relation Rxy = "y is an immediate successor of x" is effectively solvable. The Q-predicates for such a language specify triples: predecessor, object, successor. The constituents state which kinds of triples there are
in the universe. If all constituents are given equal prior probabilities, the simplest constituent compatible with evidence will have the greatest posterior probability. Inductive logic for full first-order logic is investigated by Hilpinen [1971].⁵ In principle, Hintikka's approach for monadic languages can be generalized to this situation, since Hintikka himself showed in 1953 how distributive normal forms can be defined for first-order languages L with a finite class of polyadic relations (cf. [Niiniluoto, 1987, pp. 61–80]). For each quantificational depth d > 0, i.e., the number of layers of quantifiers, a formula of L can be expressed as a disjunction of constituents of depth d. This normal form can be expanded to greater depths. A new feature of this method results from the undecidability of full first-order logic: some constituents are non-trivially inconsistent and there is no effective method of locating them. The logical form of a constituent of depth d is still (10), but now the Q-predicates or "attributive constituents" are trees with branches of length d. A constituent of depth 1 tells what kinds of individuals there are in the universe, now described by their properties and their relations to themselves. A constituent C^(2) of depth 2 is a systematic description of all the different kinds of pairs of individuals that one can find in the universe. A constituent C^(d) of depth d is a finite set of finite trees with maximal branches of length d. Each such branch corresponds to a sequence of individuals that can be drawn with replacement from the universe. Such constituents C^(d) are thus the strongest generalizations of depth ≤ d expressible in language L. Each complete theory in L can be axiomatized by a monotone sequence of subordinate constituents ⟨C^(d) | d < ∞⟩, where ... ⊢ C^(d+1) ⊢ C^(d) ⊢ ... ⊢ C^(1).
Given the general theory of distributive normal forms, the axiomatic approach can in principle be applied to the case of constituents of depth d. Whenever assumptions corresponding to A1, A2, A3, and A6 can be made, there will be one constituent C^(d) which asymptotically receives the probability one on the basis of evidence consisting of ramified sequences of d interrelated individuals. The general case has not yet been studied. Hilpinen's 1971 paper is still the most detailed analysis of inductive logic with relations.⁶

5 Assignment of mathematical probabilities to formulas of a first-order language L is studied in probability logic (see [Scott and Krauss, 1966; Fenstad, 1968]). The main representation theorem, due to Jerzy Łoś in 1963, tells us that the probability of a quantificational sentence can be expressed as a weighted average involving two kinds of probability measures: one defined over the class of L-structures, and others defined over sets of individuals within each L-structure. Again exchangeability (cf. A3) guarantees some instantial relevance principles (cf. [Nix and Paris, 2007]), but otherwise probability logic has not yet led to new fruitful applications in the theory of induction.
6 Nix and Paris [2007] have recently investigated binary inductive logic, but they fail to refer to Hintikka's program in general and to Hilpinen [1971] in particular. The basic proposal of Nix and Paris is to reduce binary relations to unary ones: for example, 'x pollinates y' is treated as equivalent to a long sentence involving only unary predicates of x and y. As a general move this proposal is entirely implausible. The reduction of relations to monadic properties was a dogma of classical logic, until De Morgan and Peirce in the mid-nineteenth century started the serious study of the logic of relations (see [Kneale and Kneale, 1962]).
The crucial importance of the distinction between monadic and polyadic first-order logic was highlighted by the metalogical results of the 1930s: the former is decidable, the latter is undecidable. This difference has dramatic consequences to Hintikka's theory of distributive normal forms as well.
Hilpinen studies constituents of depth 2. Evidence e includes n observed individuals and a complete description of the relations of each pair of individuals in e. Now constituents C^w of depth 2 describe what kinds of individuals there are in the universe U. A statement D_v which specifies, for each individual a_i in e, which attributive constituent a_i satisfies, gives an answer to the question of how observed individuals are related to unobserved individuals. Hilpinen distributes inductive probability P(C^w) to constituents C^w evenly. Probabilities of the form P(D_v/C^w) are defined by a Carnap–Hintikka style representative function of the form (15). Likelihoods P(e/D_v & C^w) are also determined by the same type of representative function, but now applied to pairs of individuals. Again, corresponding to Hintikka's basic asymptotic result (13), the highest posterior probability on large evidence is given to the simplest conjunction D_e & C^e, where D_e states that the individuals in e are related to unobserved individuals in the same ways as to observed individuals, and C^e states that there are in the universe only those kinds of individuals that are, according to D_e, already exemplified in e. Hilpinen observes that there is another kind of "simplicity": the number of kinds of individuals in D_v may be reduced by assuming that all individuals in e bear the same relations to observed and some yet unobserved individuals. If this statement is denoted by D_1, and the corresponding constituent by C^1, then P(D_v/C^v) is maximized by D_1 & C^1. Hence, inductive methods dealing with polyadic languages need at least two separate parameters which regulate the weights given to the two kinds of simplicity.

An extension of Hintikka's system to modal logic has been proposed by Soshichi Uchii [1972; 1973; 1977] (cf. [Niiniluoto, 1987, pp. 91–102]). Such an account is interesting if the formulation of lawlike generalizations requires intensional notions like necessity and possibility (see [Pietarinen, 1972]). Hintikka himself is one of the founders of the possible worlds semantics for modal logic (see [Bogdan, 1987; Hintikka, 2006]). Uchii is interested in a monadic language L(□) with the operators of nomic or causal necessity □ and nomic possibility ◇. Here ◇ = ∼□∼. It is assumed that necessity satisfies the conditions of the Lewis system S5. The nomic constituents of L(□) can now be defined in analogy with (10):

(29) ∧_{i∈CT} ◇(∃x)Q_i(x) & □(x)[∨_{i∈CT} Q_i(x)].

Uchii calls (29) "a non-paradoxical causal law". (29) specifies which kinds of individuals are physically possible and which kinds are physically impossible. Even stronger modal statements can be defined by

(30) ∧_{i∈H} ◇C_i & □[∨_{i∈H} C_i],

where C_i are the ordinary constituents of the language L without □. The laws expressible in L(□) are typically what John Stuart Mill called "laws of coexistence".
To express Mill's "laws of succession", some temporal notions have to be added to L(□) (see [Uchii, 1977]). Let us denote by B_i the nomic constituent (29) which has the same positive Q-predicates as the ordinary constituent C_i. As actuality entails possibility, there are K − w nomic constituents compatible with an ordinary constituent C^w of width w. Uchii's treatment assumes that P(C_i) = P(B_i) for all i. Further, the probability of evidence e, given knowledge about the actual constitution C_i of the universe, is not changed if the corresponding nomic constituent B_i is added to the evidence: P(e/C_i) = P(e/C_i & B_i). It follows from (13) that

(31) P(B^c/e_n^c) → 1, when n → ∞ and c is fixed, iff P(B^c/C^c) = 1.

Thus, if we have asymptotically become certain that C^c is the true description of the actual constitution of the universe, the same certainty holds for the nomic constituent B^c if and only if P(C^c/B^c) = P(B^c/C^c) = 1. Uchii makes this very strong assumption, which simply eliminates all the K − c nomic constituents compatible with C^c and undermined by the asymptotic evidence. In fact, he postulates that P((∃x)φ(x)/◇(∃x)φ(x)) = 1 for all formulas φ. This questionable metaphysical doctrine, which says that all genuine possibilities are realized in the actual history, is known as the Principle of Plenitude. An alternative interpretation is proposed by Niiniluoto [1987, pp. 101–102]. Perhaps the actual constitution of the universe is not so interesting, since evidence e obtained by active experimentation will realize new possibilities. As laws of nature have counterfactual force, experimentation can be claimed to be the key to their confirmation (see [von Wright, 1971]). So instead of the fluctuating true actual constituent C^c, we should be more interested in the permanent features of the universe expressed by the true nomic constituent.
This suggests that the inductive approach of Sections 2 and 3 should be directly formulated with nomic constituents, so that the axiomatic assumptions imply a convergence result for the constituent B^c on the basis of experimental evidence e_n^c.

5 SEMANTIC INFORMATION
Hintikka was quick to note that his inductive probability measures make sense of some of Popper's ideas. Hintikka [1968b] observed that his treatment of induction is not purely enumerative, since the inductive probability of a generalization depends also on the ability of evidence to refute universal statements. This eliminative aspect of induction is related to the Popperian method of falsification. Popper [1959] argued that preferable theories should have a low absolute logical probability: good theories should be falsifiable, bold, informative, and hence improbable. In Hintikka's two-dimensional system with the prior probability assignment (17), the initially least probable of the constituents compatible with evidence e, i.e., the constituent C^c (see (19)), will eventually have the highest posterior probability. The smaller the finite value of α, the faster we switch our degrees of confirmation from initially more probable constituents to initially less probable constituents.
Hence, the choice of a small value of the parameter α is "an indication of one aspect of that intellectual boldness Sir Karl has persuasively advocated" [Hintikka, 1966, p. 131]. A systematic argument defending essentially the same conclusion comes from the theory of semantic information [Hintikka and Pietarinen, 1966]. It can be shown that the degree of information content of a hypothesis is inversely proportional to its prior probability. A strong generalization is the more informative the fewer kinds of individuals it admits of. Therefore, C^c is the most informative of the constituents compatible with the evidence. For constituents, a high degree of information and a low prior probability — Popper's basic requirements — but also a high degree of posterior probability go together. The relevant notion of semantic information was made precise by Carnap and Bar-Hillel [1952] (cf. [Niiniluoto, 1987, pp. 147–155]). They defined the information content of a sentence h in a monadic language L by the class of the content elements entailed by h, where content elements are negations of state descriptions. Equally, information content could be defined as the range R(∼h) of the negation ∼h of h, i.e., the class of state descriptions which entail ∼h. If Popper's "basic sentences" correspond to state descriptions, this is equivalent to Popper's 1934 definition of empirical content. As the surprise value of h, Carnap and Bar-Hillel used the logarithmic measure

(32) inf(h) = − log P(h),

which is formally similar to Shannon's measure in statistical information theory. For the degree of substantial information of h, Carnap and Bar-Hillel proposed

(33) cont(h) = P(∼h) = 1 − P(h).

Substantial information is thus inversely related to probability, just as Popper [1959] also required. As both Carnap and Popper thought that P(h) = 0 for all universal generalizations, they could not really use the cont-measure to serve any comparison between rival laws or theories.
Hintikka's account of inductive generalization opened the way for interesting applications of the theory of semantic information. He developed these ideas further in "The Varieties of Information and Scientific Explanation" [Hintikka, 1968a] and in the volume Information and Inference (1970), edited by Hintikka and Suppes. Hintikka [1968a] defined measures of incremental information, which tell how much information h adds to the information already contained in e:

(34) inf_add(h/e) = inf(h&e) − inf(e)
     cont_add(h/e) = cont(h&e) − cont(e).

Measures of conditional information tell how informative h is in a situation where e is already known:
(35) inf_cond(h/e) = − log P(h/e)
     cont_cond(h/e) = 1 − P(h/e).

Hence, inf_add turns out to be the same as inf_cond. Measures of transmitted information tell how much the uncertainty of h is reduced when e is learned, or how much substantial information e carries about the subject matter of h:

(36) transinf(h/e) = inf(h) − inf(h/e) = log P(h/e) − log P(h)
     transcont_add(h/e) = cont(h) − cont_add(h/e) = 1 − P(h ∨ e)
     transcont_cond(h/e) = cont(h) − cont_cond(h/e) = P(h/e) − P(h).

Thus, e transmits some positive information about h, in the sense of transinf and transcont_cond, just in case P(h/e) > P(h), i.e., e is positively relevant to h. In the case of transcont_add, the corresponding condition is that h ∨ e is not a tautology, i.e., h and e have some common information content. Hilpinen [1970] used these measures to give an account of the information provided by observations. His results provide an information-theoretic justification of the principle of total evidence. Measures of transmitted information also have an interesting application to measures of explanatory power or systematic power (see [Hintikka, 1968a; Pietarinen, 1970; Niiniluoto and Tuomela, 1973]). In explanation, the explanans h is required to give information about the explanandum e. With suitable normalizations, we have three interesting alternatives for the explanatory power of h with respect to e:

(37) expl_1(h, e) = transinf(e/h)/inf(e) = [log P(e) − log P(e/h)]/log P(e)
     expl_2(h, e) = transcont_add(e/h)/cont(e) = (1 − P(h ∨ e))/(1 − P(e)) = P(∼h/∼e)
     expl_3(h, e) = transcont_cond(e/h)/cont(e) = (P(e/h) − P(e))/(1 − P(e)).
Here expl_2(h, e) is the measure of systematic power proposed by Hempel and Oppenheim in 1948 (see [Hempel, 1965]). Note that all of these measures receive their maximum value one if h entails e, so that they cannot distinguish between alternative deductive explanations of e. On the other hand, if inductive explanation is explicated by the positive relevance condition (cf. [Niiniluoto and Tuomela, 1973; Festa, 1999]), then they can be used for comparing rival inductive explanations h of data e.⁷

7 For measures of systematic power relative to sets of competing hypotheses, see Niiniluoto and Tuomela [1973].
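The identities behind (34)–(37) can be verified on a toy probability model. In the sketch below the model (two atomic predicates with hypothetical state-description probabilities) is my own illustration:

```python
from math import log
from itertools import product

# Toy model: state descriptions are truth assignments to two atoms (A, B),
# with hypothetical probabilities attached to each.
worlds = list(product([0, 1], repeat=2))
prob = dict(zip(worlds, [0.4, 0.3, 0.2, 0.1]))

def P(event):                              # event = set of state descriptions
    return sum(prob[w] for w in event)

def P_cond(x, y):
    return P(x & y) / P(y)

h = {w for w in worlds if w[0] == 1}                 # "A"
e = {w for w in worlds if w[0] == 1 or w[1] == 1}    # "A or B"; h entails e

inf = lambda x: -log(P(x))
cont = lambda x: 1 - P(x)

# (34)/(35): incremental and conditional information coincide for inf.
assert abs((inf(h & e) - inf(e)) - (-log(P_cond(h, e)))) < 1e-12
# (36): transmitted content transcont_cond(h/e) equals P(h/e) - P(h).
assert abs((cont(h) - (1 - P_cond(h, e))) - (P_cond(h, e) - P(h))) < 1e-12
# (37): since h entails e, expl_3(h, e) reaches its maximum value one.
expl3 = (P_cond(e, h) - P(e)) / (1 - P(e))
assert abs(expl3 - 1) < 1e-12
```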
6 CONFIRMATION AND ACCEPTANCE
Hintikka's basic result shows that universal hypotheses h can be confirmed by finite observational evidence. This notion of confirmation can be understood in two different senses (cf. [Carnap, 1962; Niiniluoto, 1972]): h may have a high posterior probability P(h/e) on e, or e may increase the probability of h.

(HP) High Probability: e confirms h iff P(h/e) > q ≥ 1/2.

(PR) Positive Relevance: e confirms h iff P(h/e) > P(h).

PR is equivalent to the conditions P(h&e) > P(h)P(e), P(h/e) > P(h/∼e), and P(e/h) > P(e). The basic difference between these definitions is that HP satisfies the principle of Special Consequence:
(SC) If e confirms h and h ⊢ g, then e confirms g,
while PR satisfies the principle of Converse Entailment:

(CE) If a consistent h entails a non-tautological e, then e confirms h.

In Peirce's terminology, hypothetical inference to an explanation is called abduction, so that by "abductive confirmation" one may refer to the support that a theory receives from its explanatory successes. By Bayes's Theorem, PR satisfies the abductive criterion, when P(h) > 0 and P(e) < 1:

(38) If h deductively or inductively explains or predicts e, then e confirms h

(see [Niiniluoto, 1999]). It is known that no reasonable notion of confirmation can satisfy SC and CE at the same time. In the spirit of PR, Carnap [1962] proposed that quantitative degrees of confirmation are defined by the difference measure:

(39) conf(h, e) = P(h/e) − P(h).

We have seen in (36) that (39) measures the transmitted information that e carries on h. As Hintikka [1968b] points out, many other measures of confirmation, evidential support, and factual support are variants of (39) (see also [Kyburg, 1970]). This is the case also with Popper's proposals for the degree of corroboration of h by e (see [Popper, 1959, p. 400]). Popper was right in arguing that degrees of corroboration should not be identified with prior or posterior probability. But Hintikka's system has the interesting result that, in terms of measure (39) and its variants, with sufficiently large evidence e the minimal constituent C^c at the same time maximizes the posterior probability P(C^w/e) and the information content cont(C^w). Hence, it also maximizes the difference (39), which can be written in the form P(h/e) + cont(h) − 1 (see [Hintikka and Pietarinen, 1966]). Hintikka [1968a] proposed a new measure of corroboration which gives an interesting treatment of weak generalizations (see also [Hintikka and Pietarinen, 1966; Niiniluoto and Tuomela, 1973]).
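The equivalent formulations of PR and the rewriting of (39) as P(h/e) + cont(h) − 1 can be checked with arbitrary (hypothetical) numbers:

```python
# Illustrative check of the PR equivalences and of the identity
# conf(h, e) = P(h/e) + cont(h) - 1 for the difference measure (39).
P_h, P_e, P_he = 0.3, 0.5, 0.2          # P(h), P(e), P(h & e): arbitrary values

P_h_given_e = P_he / P_e                # = 0.4
P_e_given_h = P_he / P_h
P_h_given_note = (P_h - P_he) / (1 - P_e)

# PR holds in all of its equivalent formulations at once:
assert P_h_given_e > P_h
assert P_he > P_h * P_e
assert P_e_given_h > P_e
assert P_h_given_e > P_h_given_note
# Difference measure (39), rewritten with cont(h) = 1 - P(h):
conf = P_h_given_e - P_h
assert abs(conf - (P_h_given_e + (1 - P_h) - 1)) < 1e-12
```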
Assume that h is equivalent to the disjunction of constituents C_1, ..., C_m, and define corr(h, e) as the minimum of the posterior probabilities P(C_i/e):
The Development of the Hintikka Program
333
(40) corr(h, e) = min {P (Ci /e) | i = 1, ..., m}. Measure (40) guarantees that, unlike probability, corroboration covaries with logical strength: (41) If e ⊢ h1 ⊃ h2 , then P (h1 /e) ≤ P (h2 /e). (42) If e ⊢ h1 ⊃ h2 , then corr(h1 , e) ≥ corr(h2 , e). Further, (40) favours C c among all (weak and strong) generalizations in language L: (43) With sufficiently large evidence ecn with fixed c, corr(h, ecn ) has its maximum value when h is the constituent C c . Hintikka’s inductive probability measures can be applied to the famous and much debated paradoxes of confirmation. In Hempel’s paradox of ravens, the universal generalization “All ravens are black” is confirmed by three kinds of instances: black ravens, black non-ravens, and non-black non-ravens. The standard Bayesian solution, due to Janina Hosiasson-Lindenbaum in 1940 (see [Hintikka and Suppes, 1966; Niiniluoto, 1998]), is that these three instances give different incremental confirmation to the hypothesis, since in a finite universe these cells are occupied by different numbers of objects. Instead of such an empirical assumption, one could also make a conceptual stipulation to the effect that the predicates “black” and “non-black”, and “raven” and “non-raven”, have different widths, and then apply the formula (2′). Hintikka [1969a] proposes that the “inductive asymmetry” of the relevant Q-predicates could be motivated by assuming an ordering of the primitive predicates (cf. [Pietarinen, 1972]). Another famous puzzle is Nelson Goodman’s paradox of grue. Here Hintikka’s solution appeals to the idea that parameter α regulates the confirmability of universal laws. More lawlike generalizations can be more easily confirmed than less lawlike ones.
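The contrast between (41) and (42) can be verified mechanically. In this Python sketch the posterior probabilities of four constituents are invented for illustration only; corr is the minimum-posterior measure (40):

```python
# Hypothetical posteriors P(Ci/e) of four constituents given some evidence e.
posterior = {"C1": 0.6, "C2": 0.25, "C3": 0.1, "C4": 0.05}

def P(h):
    # posterior probability of a generalization h, given as its set of constituents
    return sum(posterior[c] for c in h)

def corr(h):
    # Hintikka's measure (40): the minimum posterior over the disjuncts of h
    return min(posterior[c] for c in h)

h1 = {"C1", "C2"}        # stronger: h1 entails h2
h2 = {"C1", "C2", "C3"}  # weaker

assert P(h1) <= P(h2)        # (41): probability favours the weaker hypothesis
assert corr(h1) >= corr(h2)  # (42): corroboration favours the stronger one
```

The example makes the covariation with logical strength concrete: dropping the weakest disjunct C3 raises corr from 0.1 to 0.25 while lowering the posterior probability.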
If we associate a smaller α with the conceptual scheme involving the predicate “green” than with the scheme involving the odd predicate “grue”, then differences in degrees of confirmation of the generalizations “All emeralds are green” and “All emeralds are grue” can be explained (see [Hintikka, 1969b; Pietarinen, 1972; Niiniluoto and Tuomela, 1973]). Carnap’s system of induction does not include rules of acceptance. Rather, the task of inductive logic is to evaluate the epistemic probabilities of various hypotheses. These probabilities can be used in decision making (see [Carnap, 1980; Stegmüller, 1973]). Carnap agrees here with many statisticians — both frequentists (Jerzy Neyman, E. S. Pearson) and Bayesians (L. J. Savage) — who recommend that inductive inferences be replaced by inductive behaviour or probability-based actions. In this view, the main role of scientists is to serve as advisors of practical decision makers rather than as seekers of new truths. On the other hand, according to the cognitivist model of inquiry, the tentative results of scientific research constitute a body of accepted hypotheses, the so-called “scientific
334
Ilkka Niiniluoto
knowledge” at a given time. In the spirit of Peirce’s fallibilism, they may be at any time questioned and revised by new evidence or novel theoretical insights. But, on some conditions, it is rational to tentatively accept a hypothesis on the basis of evidence. One of the tasks of inductive logic is to define such rules of acceptance for corrigible factual statements. Hintikka, together with Isaac Levi [1967], belongs to the camp of the cognitivists. The set of accepted hypotheses is assumed to be consistent and closed under logical consequence (cf. [Hempel, 1965]). Henry E. Kyburg’s lottery paradox then shows that high posterior probability alone is not sufficient to make a generalization h acceptable. But in Hintikka’s system one may calculate for the size n of the sample e a threshold value n0 which guarantees that the informative constituent C c has a probability exceeding a fixed value 1 − ε: (44) Let n0 be the value such that P (C c /e) ≥ 1 − ε if and only if n ≥ n0 . Then, given evidence e, accept C c on e iff n ≥ n0 . (See [Hintikka and Hilpinen, 1966; Hilpinen, 1968].) Assuming logical closure, all generalizations entailed by C c are then likewise acceptable on e.8 In Hintikka’s two-dimensional continuum, n0 can be defined as the largest integer n for which
ε′ ≤ max Σ_{i=1}^{K−c} C(K − c, i)(c/(c + i))^{n−α} ,

where the maximum is taken over values of c, 0 ≤ c ≤ K − 1, C(K − c, i) is a binomial coefficient, and ε′ = ε/(1 − ε). Hintikka and Hilpinen [1966] argue further that a singular hypothesis is inductively acceptable if and only if it is a substitution instance of an acceptable generalization: (45) A singular hypothesis of the form φ(ai ) is acceptable on e iff the generalization (x)φ(x) is acceptable on e. This principle reduces singular inductive inferences (Mill’s “eduction”) to universal inferences.
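The acceptance threshold of (44) can be computed directly from the inequality above. The Python sketch below searches for the largest n satisfying it; the parameter values K = 4, α = 0, ε = 0.1 are our own illustrative choices:

```python
from math import comb

def rhs(n, K, alpha):
    # max over c in {0,...,K-1} of sum_{i=1}^{K-c} C(K-c, i) * (c/(c+i))**(n - alpha)
    return max(
        sum(comb(K - c, i) * (c / (c + i)) ** (n - alpha)
            for i in range(1, K - c + 1))
        for c in range(K)
    )

def n_threshold(eps, K, alpha, n_max=500):
    # largest n with eps/(1-eps) <= rhs(n); for larger samples P(Cc/e) >= 1 - eps
    eps_prime = eps / (1 - eps)
    candidates = [n for n in range(alpha + 1, n_max)
                  if eps_prime <= rhs(n, K, alpha)]
    return max(candidates) if candidates else None

assert n_threshold(0.1, K=4, alpha=0) == 7
```

As one would expect, demanding a smaller ε (a higher confidence 1 − ε) pushes the threshold sample size upward.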
7 COGNITIVE DECISION THEORY
According to Bayesian decision theory, it is rational for a person X to accept the action which maximizes X’s subjective expected utility. Here the relevant utility

8 Note that (44) is a factual detachment rule in the sense that it concludes a factual statement from factual and probabilistic premises. This kind of inductive rule should be distinguished from probabilistic detachment rules which can be formulated as deductive arguments within the probability calculus (see [Suppes, 1966]). An example of the latter kind of rules is the following:

P (h/e) = r
P (e) = 1
-----------
P (h) = r.
function expresses quantitatively X’s subjective preferences concerning the outcomes of alternative actions, usually in terms of some practical goals. The probabilities needed to calculate expected utility are X’s personal probabilities, degrees of belief concerning the state of nature. Cognitive decision theory adopts the same Bayesian decision principle with a new interpretation: the relevant actions concern the acceptance of rival hypotheses, and the utilities express some cognitively important values of inquiry. Such epistemic utilities may include truth, information, explanatory and predictive power, and simplicity. With anticipation by Bolzano, the basic ideas of cognitive decision theory were suggested in the early 1960s independently by Hempel and Levi (cf. [Hempel, 1965; Levi, 1967]). Inductive logic is relevant to this project, since it may provide the relevant epistemic probabilities [Hilpinen, 1968; Niiniluoto and Tuomela, 1983; Niiniluoto, 1987]. Let us denote by B = {h1 , ..., hn } a set of mutually exclusive and jointly exhaustive hypotheses. Here the hypotheses in B may be the most informative descriptions of alternative states of affairs or possible worlds within a conceptual framework L. For example, they may be state descriptions, structure descriptions or constituents of a monadic language, or complete theories expressible in a finite first-order language.9 If L is interpreted on a domain U , so that each sentence of L has a truth value (true or false), it follows that there is one and only one true hypothesis (say h∗ ) in B. Our cognitive problem is to identify the target h∗ in B. The elements hi of B are the potential complete answers to the cognitive problem. The set D(B) of partial answers consists of all non-empty disjunctions of complete answers. The trivial partial answer in D(B), corresponding to ‘I don’t know’, is represented by a tautology, i.e., the disjunction of all complete answers.
For any g ∈ D(B) and hj ∈ B, we let u(g, hj ) be the epistemic utility of accepting g if hj is true. We also assume that a rational probability measure P is associated with language L, so that each hj can be assigned its epistemic probability P (hj /e) given the available evidence e. Then the best hypothesis in D(B) is the one g which maximizes the expected epistemic utility:

(46) U (g/e) = Σ_{j=1}^{n} P (hj /e)u(g, hj ).
Expected utility gives us a new possibility of defining inductive acceptance rules: (EU) Accept on evidence e the answer g ∈ D(B) which maximizes the value U (g/e). Another application is to use expected utility as a criterion of epistemic preferences and cognitive progress: (CP) The step from answer g ∈ D(B) to another answer g′ ∈ D(B) is cognitively progressive on evidence e iff U (g/e) < U (g′ /e). 9 The framework also includes situations where B is a subset of some quantitative space like the real numbers R, but then sums are replaced by integrals.
(See [Niiniluoto, 1995].) Assume now that g is a partial answer in D(B) with

(47) g ≡ ∨_{i∈Ig} hi , where Ig ⊆ I.
If truth is the only relevant epistemic utility, then we may take u(g, hj ) simply to be the truth value of g relative to hj :

u(g, hj ) = 1 if j ∈ Ig (i.e., hj is a disjunct of g),
u(g, hj ) = 0 otherwise.

Hence, u(g, h∗ ) is the real (and normally unknown) truth value tv(g) of g relative to the domain U . It follows from (46) that the expected utility U (g/e) equals the posterior probability P (g/e) of g on e:

(48) U (g/e) = Σ_{j∈Ig} P (hj /e) = P (g/e).
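Equation (48) is easy to confirm numerically. In this Python sketch (the posteriors are invented for illustration), the expected utility of a partial answer under the pure truth-value utility coincides with its posterior probability:

```python
def expected_utility(g, posterior, u):
    # (46): U(g/e) = sum over complete answers h_j of P(h_j/e) * u(g, h_j)
    return sum(p * u(g, h) for h, p in posterior.items())

posterior = {"h1": 0.5, "h2": 0.3, "h3": 0.2}  # illustrative P(h_j/e)

def truth_value(g, h):
    # u(g, h_j) = 1 iff h_j is a disjunct of g, else 0
    return 1.0 if h in g else 0.0

g = {"h1", "h3"}  # the partial answer h1 v h3
U = expected_utility(g, posterior, truth_value)
assert U == posterior["h1"] + posterior["h3"]  # (48): U(g/e) = P(g/e) = 0.7
```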
In this sense, we may say that posterior probability equals expected truth value. The rule of maximizing expected utility now leads to an extremely conservative policy: the best hypotheses g on e are those that satisfy P (g/e) = 1, i.e., are completely certain on e. For example, e itself and a tautology are such statements. If we are not certain of the truth, then it is always progressive to change an uncertain answer to a logically weaker one. The argument against using probability as a criterion of theory choice was made already by Popper in 1934 (see [Popper, 1959]). He proposed that good theories should be bold, improbable, and informative (cf. Section 6). However, it is likewise evident that information cannot be the only relevant epistemic utility. Assume that the information content of g is measured by cont(g) = 1 − P (g) (see (33)). If we now choose u(g, hj ) = cont(g), then the expected utility U (g/e) equals 1 − P (g), which is maximized by a contradiction with probability zero. Further, any false theory could be cognitively improved by adding new falsities to it. Similar remarks can be made about explanatory and systematic power, if they were chosen as the only relevant utility. Levi [1967] measures the information content I(g) of a partial answer g in D(B) by the number of complete answers it excludes. With a suitable normalization, I(g) = 1 if and only if g is one of the complete answers hj in B, and I(g) = 0 for a tautology. Levi’s proposal for epistemic utility is the weighted combination of the truth value tv(g) of g and the information content I(g) of g: (49) tv(g) + qI(g), where 0 < q ≤ 1 is an “index of boldness”, indicating how much the scientist is willing to risk error, or to “gamble with truth”, in her attempt to obtain relief from agnosticism. The expected epistemic utility of g is then
(50) P (g/e) + qI(g). By using the weight q, formula (50) expresses a balance between two mutually conflicting goals of inquiry. It has the virtue that all partial answers g in D(B) are comparable with each other. If epistemic utility is defined by information content cont(g) in a truth-dependent way, so that

(51) u(g, e) = cont(g) if g is true,
     u(g, e) = −cont(∼g) if g is false,
(i.e., in accepting hypothesis g, we gain the content of g if g is true, but we lose the content of the true hypothesis ∼g if g is false), then the expected utility U (g/e) is equal to (52) P (g/e) − P (g). This proposal, originally made by Levi in 1963 but rejected in Levi [1967], was defended by Hintikka and Pietarinen [1966]. This measure combines the criteria of boldness (small prior probability P (g)) and high posterior probability P (g/e). Similar results can be obtained when cont(g) is replaced by Hempel’s [1965] measure of systematic power (37): if

(53) u(g, e) = P (∼g/∼e) if g is true,
     u(g, e) = −P (g/∼e) if g is false,
then the expected utility of g is [P (g/e) − P (g)]/P (∼e), which again is a variant of the difference measure (52) (see [Pietarinen, 1970]). Hilpinen [1968] proposed a modification of (49) and (51):

(54) u(g, e) = 1 − P (g) if g is true,
     u(g, e) = −qP (g) if g is false.
Then the expected utility is (55) U (g/e) = P (g/e) − qP (g), where again q serves as an index of boldness. In Hintikka’s system, with sufficiently large evidence e, the answer which maximizes the values (50), (52), and (55) is the minimal constituent C c . As C c is also the simplest of the constituents compatible with e, Hintikka’s posterior probabilities favour simplicity — so that simplicity need not be added to the framework as an extra condition (cf. [Niiniluoto, 1994]).

8 INDUCTIVE LOGIC AND THEORIES

Inductive logic has the reputation of being a formal tool of narrowly empiricist methodology. Although induction was discussed already by Aristotle, his account
was intimately connected to concept formation (see [Hintikka, 1980; Niiniluoto, 1994/95]). The role of induction in science was emphasized especially by the British philosophers from Francis Bacon to William Whewell, John Stuart Mill, and Stanley Jevons in the 19th century and the Cambridge school in the 20th century. Many empiricists were also inductivists in the sense that they restricted science to empirical observations and generalizations that could be discovered and justified by enumerative induction. The hypothetico-deductive (HD) method of science allows scientists to freely invent hypothetical theories to explain observed data, but requires that such hypotheses are indirectly tested by their empirical consequences. Bayes’s theorem provides a method of evaluating the performance of hypotheses in observational tests. In the basic result about indirect confirmation (38), the hypothesis h may be a theory which makes postulates about unobservable entities and processes. In this sense, the theory of inductive probabilities is not committed to narrow empiricism and inductivism [Hempel, 1965; Niiniluoto and Tuomela, 1973]. For Carnap, inductive logic was part of his program of logical empiricism. The typical assumption of Carnap’s system is that evidence is given by singular observational sentences. His negative results about the zero confirmation of laws seemed further to strengthen the “atheoretical thesis” [Lakatos, 1968a] which makes laws and theories dispensable in the theory of induction. In his replies to Hilary Putnam, Carnap acknowledged that it would be desirable to construct inductive logic for “the total language of science, consisting of the observational language and the theoretical language” [Schilpp, 1963, p. 988]. This would allow inductive predictions to take into account “also the class of the actually proposed laws”. In particular, Carnap admitted that the meaning postulates of theories should be given the m-value one.
Carnap was one of the architects of the view which claims that scientific theories include theoretical terms that are not explicitly definable by observational terms, but still theories formulated in the total language L of science should be testable by the consequences that they have in the observational sublanguage Lo of L. Hempel agreed, but in 1958, in “The Theoretician’s Dilemma”, raised the following puzzle for scientific realists: if a theory T in L achieves deductive systematization between observational sentences e and e′ in Lo (i.e., (T &e) ⊢ e′ , but not e ⊢ e′ ), then the elimination methods of Ramsey and Craig show that the same deductive systematization is achieved by an observational subtheory of T in Lo . Therefore, theoretical terms are not after all logically indispensable for observational deductive systematization (see [Hempel, 1965]). Hempel suggested that theoretical terms may nevertheless be logically indispensable for inductive systematization. Consider a simple theory T :

T = (x)(M x ⊃ O1 x)&(x)(M x ⊃ O2 x),

where O1 and O2 are observational terms and M is a theoretical predicate. Then, by the first law, from O1 a one may inductively infer M a, and, by the second law, from M a one may infer O2 a. However, this attempt to establish an inductive link
between O1 a and O2 a via T is not justified, since it relies at the same time on the incompatible principles CE and SC (see [Niiniluoto, 1972]). The theoretician’s dilemma was the starting point for Raimo Tuomela in his studies on the deductive gains of the introduction of theoretical terms (see [Hintikka and Suppes, 1970]). The issue of inductive gains of theoretical terms was the topic of the doctoral dissertation of Ilkka Niiniluoto in 1973. Let eIh state that h is “inducible” from e. Then theory T in L achieves inductive systematization between e and e′ in Lo if (T &e)Ie′ but not eIe′ . By using PR as the explication of I, these conditions can be written:

(IS) P (e′ /e&T ) > P (e′ )
     P (e′ /e) = P (e′ ).

The first condition has an alternative interpretation:

(IS′) P (e′ /e&T ) > P (e′ /T ).
Both conditions presuppose that inductive probabilities can take genuine theories as their conditions. The relevant calculations are given in [Niiniluoto and Tuomela, 1973] by using Hintikka’s generalized combined system. For example, if h is a universal generalization in Lo and T is a theory in L using theoretical terms, then the probability P (h/e&T ) depends on the number b of Q-predicates of L which are empty by h but not by T . For large values of n and α, we have approximately

(56) P (h/e&T ) ≈ 1/(1 + α/n)^b .
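The approximation (56) behaves as the text describes: the fewer cells the hypothesis itself must empty over and above the theory, the higher its probability. A small Python sketch with illustrative values of n and α:

```python
def p_h_given_e_and_T(n, alpha, b):
    # (56): P(h/e&T) is approximately 1/(1 + alpha/n)**b, where b is the number
    # of Q-predicates of L left empty by h but not already by the theory T
    return (1 + alpha / n) ** (-b)

n, alpha = 100, 10  # illustrative sample size and index parameter
# A theory that already empties more of h's cells (smaller b) raises P(h/e&T):
assert p_h_given_e_and_T(n, alpha, b=2) > p_h_given_e_and_T(n, alpha, b=4)
# With b = 0 (T covers all of h's emptiness claims), the probability is 1:
assert p_h_given_e_and_T(n, alpha, b=0) == 1.0
```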
Comparison with (24) shows that P (h/e) < P (h/e&T ) iff b′ > b, where b′ is the corresponding exponent in (24) (op. cit., p. 38). In particular, if T is an explicit definition of M in terms of Lo , then b = b′ . Hence, (57) If T is an explicit definition of the theoretical terms in L by Lo , then P (h/e&T ) = P (h/e). The main result proved in [Niiniluoto and Tuomela, 1973, Ch. 9] is that theoretical terms can be logically indispensable for observational inductive systematization. This result shows that Hintikka’s inductive logic helps to solve the theoretician’s dilemma and thereby to give support to scientific realism. Probabilities of the form P (h/e&T ) can be used for hypothetico-inductive inferences, when T is a tentative theory. When T &e is accepted as evidence, e gives observational support and T gives theoretical support for h. Many formulas discussed in earlier sections can be reformulated by taking into account the background theory T . It is also possible to calculate directly probabilities of the form P (C w /e), where C w is a constituent in the full language L with theoretical terms and e is observational evidence in Lo (see [Niiniluoto, 1976]). These applications prove that inductive logic can be developed as a non-inductivist methodological program [Niiniluoto and Tuomela, 1973, Ch. 12]. Comparison of probabilities of the form P (h/e) and P (h/e&T ) also illuminates the inductive effects of conceptual change. Changes in the values of inductive
parameters λ and α cannot be modelled in terms of Bayesian conditionalization (cf. [Hintikka, 1966; 1987b; 1997]). The strategy of Niiniluoto and Tuomela [1973] is to keep the parameters fixed, and to study the changes due to conditionalization on conceptual or theoretical information. Result (57) shows that inductive logic satisfies a reasonable condition of linguistic invariance: if the new predicate M in L is explicitly definable by predicates of Lo , and T is the meaning postulate expressing this definition, then the probabilities in L conditional on T are equal to probabilities in Lo . Similarly, if L1 and L2 are intertranslatable, their inductive logics are equal [Niiniluoto and Tuomela, 1973, p. 175].

9 ANALOGY AND OBSERVATIONAL ERRORS
Inference by analogy is a traditional form of non-demonstrative reasoning. It can be regarded as a generalization of the deductive rule for identity:

(RI) F (a)
     b = a
     -----
     F (b).

If identity = is replaced by the weaker condition of similarity, RI is replaced by

(RS) F (a)
     b is similar to a
     -----------------
     F (b).
The classical idea of explicating similarity is by partial identity: objects a and b share some of their properties. Using the terms of Keynes, the attributes on which a and b agree belong to their positive analogy, and the attributes on which a and b disagree belong to their negative analogy. If objects a and b are known to agree on k attributes and disagree on m attributes, then J. S. Mill in his A System of Logic proposed to measure the strength of the analogical inference by its probability k/(k + m). This suggestion means that analogical inference is simple enumerative induction with respect to properties. Carnap required that a system of inductive logic should be able to handle inference by analogy. It turns out that Carnap’s and Hintikka’s systems give a satisfactory treatment of simple positive analogy, but fail if there is some negative analogy (cf. [Hesse, 1964]). Assume that, in a monadic language with k primitive predicates,

F1 (x) = M1 (x)&...&Mm (x)
F2 (x) = M1 (x)&...&Mm (x)&Mm+1 (x)&...&Mn (x),

where 1 ≤ m < n. Then the width of F1 is w1 = 2^(k−m) and the width of F2 is w2 = 2^(k−n) . Hence, in the K-dimensional system of inductive logic, P (F2 (b)/F1 (b)) = w2 /w1 , and

(58) P (F2 (b)/F1 (b)&F2 (a)) = [1 − (K − w2 )f (0, 1, 1)]/[1 − (K − w1 )f (0, 1, 1)]
     > w2 /w1 = P (F2 (b)/F1 (b)).
(See [Niiniluoto, 1981, p. 7].) By choosing f (0, 1, 1) as equal to its value in Carnap’s λ-continuum, (λ/K)/(1 + λ), the probability (58) becomes

(59) (1 + w2 λ/K)/(1 + w1 λ/K).

By putting λ = K in (59), we obtain Carnap’s analogy formula for his measure c∗ :

(60) (1 + w2 )/(1 + w1 )

(see [Carnap, 1950, pp. 569-570]). However, as soon as the evidence includes some known difference between objects a and b, the analogy influence disappears in (58). Carnap’s first attempt to handle the difficulty was a new “analogy parameter” η (see [Carnap and Stegmüller, 1959]). This device was applied in Hintikka’s system by Pietarinen [1972] (see also [Festa, 2003]). In his posthumously published “Basic System”, Carnap developed the idea that analogy can be accounted for by distances between predicates (see [Carnap, 1980]). Natural measures of such distances can be defined by means of Q-predicates. For two Q-predicates Qu and Qv of the form (±)M1 x&...&(±)Mk x, the distance d(Qu , Qv ) = duv is m/k if they disagree on m of the k primitive predicates M1 , ..., Mk . For example, in a monadic language L with two primitive predicates, we have d(M1 x&M2 x, M1 x&∼M2 x) = 1/2, and d(M1 x&M2 x, ∼M1 x&∼M2 x) = 1. For languages with families of predicates, each primitive dichotomy {Mj , ∼Mj } is replaced by a family Mj of predicates, j = 1, ..., k, and each family is assumed to have its internal distance dj (e.g., distance between colours, distance between discrete values of age). Then the earlier definition of duv can be generalized by defining the distance between two Q-predicates by the Euclidean metric relative to the k dimensions. With this distance function, the set of Q-predicates becomes a metric space of concepts (see [Niiniluoto, 1987]). If duv is normalized so that 0 ≤ duv ≤ 1, the resemblance ruv between two Q-predicates Qu and Qv can be defined by ruv = 1 − duv . Alternatively, we can use ruv = 1/(1 + duv ). Two individuals a and b, which satisfy the Q-predicates Qu and Qv , respectively, are then completely similar (relative to the expressive power of language L) if and only if ruv = 1.
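The distance duv = m/k and the resemblance ruv = 1 − duv are straightforward to compute. The following Python sketch encodes Q-predicates as sign vectors and reproduces the two worked distances from the text:

```python
from itertools import product

def q_predicates(k):
    # the 2**k Q-predicates of a monadic language with k primitive predicates,
    # encoded as +/-1 sign vectors over M1, ..., Mk
    return list(product((1, -1), repeat=k))

def d(qu, qv):
    # d_uv = m/k: the fraction of primitive predicates on which Qu and Qv disagree
    return sum(su != sv for su, sv in zip(qu, qv)) / len(qu)

def r(qu, qv):
    # resemblance r_uv = 1 - d_uv
    return 1 - d(qu, qv)

# k = 2: the examples from the text
M1M2, M1notM2, notM1notM2 = (1, 1), (1, -1), (-1, -1)
assert d(M1M2, M1notM2) == 0.5     # disagree on M2 only
assert d(M1M2, notM1notM2) == 1.0  # disagree on both predicates
assert r(M1M2, M1M2) == 1.0        # complete similarity
assert len(q_predicates(2)) == 4
```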
The principle of Positive Instantial Relevance (4) states that the observation of individuals of kind Qi increases our expectation to find completely similar individuals in the universe. A natural modification of this principle is to allow similarity influence: the observation of individuals of kind Qi increases the expectation to find individuals in cells Qj that are close to Qi . In terms of singular probabilities, similarity influence should allow that P (Q2 (b)/Q1 (a)) > P (Q3 (b)/Q1 (a)) iff d12 < d13 iff r12 > r13 .
(See [Carnap, 1980, p. 46].) Note that this principle already helps to solve the problem of negative analogy. Similarly, in the context of Hintikka’s system or the K-dimensional system, the probability P (C w /e) of the constituent C w should reflect the distances between the cells CT w claimed to be non-empty by C w and the cells CTe already exemplified in evidence e. Proposals to this effect were given in [Niiniluoto, 1980; 1981]: the probability P (Qi (an+1 )/e) can be modified by multiplying it with a factor which expresses the minimum distance of Qi from the cells CTe or the weighted influence of the observed individuals in the cells CTe . It turned out that these specific proposals did not always satisfy the principle of Positive Instantial Relevance (4) and Reichenbach’s Axiom (3). An alternative, proposed by Kuipers [1984; 1988], is to allow the analogy influence to vanish gradually as the evidence increases. Kuipers replaces Carnap’s representative function (2) by (61)
(ni + αi (en ) + λ/K)/(n + α(n) + λ),

where αi (en ) is the analogy profit of cell Qi from sample en and α(n) = Σ_{i=1}^{K} αi (en ) is the total analogy profit in the first n trials. To guarantee the validity of Reichenbach’s axiom, the marginal analogy of the nth trial, α(n) − α(n − 1), is assumed to decrease from its positive initial value to zero when n grows without limit. To obtain a theory of inductive generalization with analogy influence, probabilities of the form P (Qi (an+1 )/en &C w ) in the K-dimensional system can be modified in a similar way by allowing initial and evidence-based analogy profits (see [Niiniluoto, 1988]).10 Distances between Q-predicates have been applied also in a proposal concerning the treatment of observational errors in inductive logic (see [Niiniluoto, 1997]). In Hintikka’s system, the likelihoods P (Qi (an+1 )/en &C w ) are defined by restricting the Q-predicates Qi to those allowed by the constituent C w (see (15)). Thereby the possibility of observational errors is excluded. The so-called Jeffrey conditionalization handles uncertain evidence by directly modifying probabilities, without conditionalization relative to a possibly mistaken evidence statement. Instead, standard applications of Bayesian statistical inference include an error distribution, which allows the observed data to deviate from the true value of a parameter. To build this idea into inductive logic, let Qj (a) state that object a is Qj , and Sj (a) state that object a seems to be Qj . Then it is natural to assume that errors of observation depend on distances between predicates: (62) The probability pij = P (Si (a)/Qj (a)) is a decreasing function of the distance dij = d(Qi , Qj ). 10 Other attempts to handle inductive analogy include Skyrms [1993b], di Maio [1996], Festa [1996], Maher [2000], and Romeyn [2005].
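Kuipers’s representative function (61) and the vanishing-analogy requirement can be sketched as follows. The numbers and the particular analogy-profit values are our own illustrations, not Kuipers’s:

```python
def kuipers(n_i, n, K, lam, alpha_i, alpha_n):
    # (61): (n_i + alpha_i(e_n) + lam/K) / (n + alpha(n) + lam)
    return (n_i + alpha_i + lam / K) / (n + alpha_n + lam)

K, lam = 4, 2.0

# Early on, a cell that profits from analogy gets a boost over one that does not:
with_profit = kuipers(n_i=0, n=5, K=K, lam=lam, alpha_i=0.5, alpha_n=0.9)
no_profit = kuipers(n_i=0, n=5, K=K, lam=lam, alpha_i=0.0, alpha_n=0.9)
assert with_profit > no_profit

# Reichenbach's axiom: if the total analogy alpha(n) stays bounded, the value
# approaches the observed relative frequency n_i/n as n grows.
late = kuipers(n_i=600, n=1000, K=K, lam=lam, alpha_i=0.2, alpha_n=0.9)
assert abs(late - 0.6) < 0.01
```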
The probability of correct observation, or observational reliability, can be assumed to be a constant pii = P (Si (a)/Qi (a)) = β for all i = 1, ..., K. In standard inductive logic without observational errors, pij = 1 if i = j and 0 otherwise. By Bayes’s Theorem, probabilities of the form P (Qi (a)/Sj (a)) can be calculated. Further, probabilities of the form P (Si (an+1 )/en (S)&C w ) have to be defined, where en (S) is like an ordinary sample description but with the Qj ’s replaced by Sj ’s.
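Condition (62) and the Bayesian inversion P(Qi/Sj) can be illustrated with a concrete error distribution. The exponential decay rate tau below is an assumption of ours, not part of the text; any likelihood decreasing in dij would serve equally well:

```python
from math import exp

def likelihoods(D, tau=0.3):
    # p[i][j] = P(S_i(a)/Q_j(a)) proportional to exp(-d_ij/tau), normalized so
    # that for each true cell Q_j the "seeming" verdicts S_i sum to 1 -- a
    # concrete instance of (62), since exp(-d/tau) decreases in the distance d.
    K = len(D)
    p = [[0.0] * K for _ in range(K)]
    for j in range(K):
        w = [exp(-D[i][j] / tau) for i in range(K)]
        z = sum(w)
        for i in range(K):
            p[i][j] = w[i] / z
    return p

def posterior(i, prior, p):
    # Bayes's Theorem: P(Q_j/S_i) for each j, from likelihoods and a prior on Q_j
    unnorm = [p[i][j] * prior[j] for j in range(len(prior))]
    z = sum(unnorm)
    return [x / z for x in unnorm]

# Illustrative distances between three cells arranged on a line:
D = [[0.0, 0.5, 1.0],
     [0.5, 0.0, 0.5],
     [1.0, 0.5, 0.0]]
p = likelihoods(D)
assert p[1][0] > p[2][0]             # misclassification favours nearby cells
post = posterior(0, [1 / 3, 1 / 3, 1 / 3], p)
assert post[0] == max(post)          # seeming to be Q_1 makes Q_1 most probable
```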
10 TRUTHLIKENESS
In his campaign against induction and inductive logic, Karl Popper defined in 1960 comparative and quantitative notions of truthlikeness or verisimilitude (see [Popper, 1963]). Instead of assessing the inductive probability that a hypothesis is true, truthlikeness is an objective concept expressing how “close” to the truth a scientific theory is. The maximum degree of truthlikeness is achieved by a theory which is completely and comprehensively true, so that verisimilitude combines the ideas of truth and information content, while probability combines truth with lack of content. According to Popper, in some situations we may have strong arguments for claiming that we have made progress toward the truth, and he proposed his own measure of corroboration as an epistemic indicator of verisimilitude. Popper’s attempted definition failed, since it did not allow the comparison of two false theories. From 1974 onwards, a new program for defining degrees of truthlikeness was developed by Risto Hilpinen, Pavel Tichy, Ilkka Niiniluoto, Raimo Tuomela, and Graham Oddie. All of them employed the notion of similarity or likeness between possible worlds or between Hintikkian constituents describing such possible worlds. Niiniluoto [1977] further proposed that inductive probabilities can be applied in the estimation of degrees of verisimilitude. In the likeness approach (see [Niiniluoto, 1987]), truthlikeness is defined relative to a cognitive problem B = {hi | i ∈ I}, where the elements of B are mutually exclusive and jointly exhaustive (see Section 7). The unknown true element h∗ of B is the target of the problem B. The basic step is the introduction of a real-valued function Δ : B × B → R which expresses the distance Δ(hi , hj ) = Δij between the elements of B (i.e., complete answers). Here 0 ≤ Δij ≤ 1, and Δij = 0 iff i = j. This distance function Δ has to be specified for each cognitive problem B separately, but there are canonical ways of doing this for special types of problems.
First, if B is the set of state descriptions, the set of structure descriptions, or the set of constituents of a first-order language L, the distance Δ can be defined by counting the differences in the standard syntactical form of the elements of B. For example, a monadic constituent tells that certain kinds of individuals (given by Q-predicates) exist and others do not exist; the simplest distance between monadic constituents is the relative number of their diverging claims about the Q-predicates. If a monadic constituent Ci is characterized by the class CTi of Q-predicates that are non-empty by Ci , then the Clifford distance between Ci and Cj is the size of the symmetric difference between CTi and CTj :
(63) |CTi ΔCTj |/K. Secondly, Δ may be directly definable by a natural metric underlying the structure of B. For example, if the elements of B are point estimates of an unknown real-valued parameter, their distance can be given simply by the geometrical or Euclidean metric on R (or RK ). If the elements of B are quantitative laws, then their distance is given by the Minkowski metrics between functions. The next step is the extension of Δ to a function B × D(B) → R, so that Δ(hi , g) expresses the distance of a partial answer g ∈ D(B) from hi ∈ B. Let g ∈ D(B) be a potential answer with

g = ∨_{i∈Ig} hi ,
where Ig ⊆ I. Define the minimum distance of g from hi by

(64) Δmin (hi , g) = min_{j∈Ig} Δij .
Then g is approximately true if Δmin (h∗ , g) is sufficiently small. Degrees of approximate truth can be defined by (65) AT (g, h∗ ) = 1 − Δmin (h∗ , g). Hence, AT (g, h∗ ) = 1 if and only if g is true. Instead, the notion of truthlikeness T r should have its maximum when g is identical with the complete truth h∗ : (66) T r(g, h∗ ) = 1 iff g ≡ h∗ . To introduce a concept which satisfies the condition (66), truthlikeness should include a factor which tells how effectively a statement is able to exclude falsities. This can be expressed by the relativized sum-measure, which includes a penalty for each mistake that g allows, and weights this mistake by its distance from the target:

(67) Δsum (hi , g) = Σ_{j∈Ig} Δij / Σ_{j∈I} Δij .
At the same time, a truthlike statement should preserve truth as closely as possible. As a sufficient condition, one might suggest that a partial answer g is more truthlike than another partial answer g′ if g is closer to the target h∗ than g′ with respect to both the minimum distance and the sum distance, but only few answers in D(B) would be comparable by this criterion. Full comparability is achieved by the min-sum measure Δms , where the weights γ and γ′ indicate our cognitive desire of finding truth and avoiding error, respectively:

(68) Δms (hi , g) = γΔmin (hi , g) + γ′ Δsum (hi , g) (γ > 0, γ′ > 0).
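The min-sum machinery of (64)-(68) is short to implement. In the Python sketch below, the distance matrix and the weights gamma = 0.5, gamma' = 0.25 are illustrative choices of ours, picked to satisfy the rule of thumb gamma'/gamma ≈ 1/2 mentioned later in the text:

```python
def d_min(delta, i, Ig):
    # (64): minimum distance of the partial answer g (disjunct set Ig) from h_i
    return min(delta[i][j] for j in Ig)

def d_sum(delta, i, Ig):
    # (67): normalized sum of the distances of g's disjuncts from h_i
    return sum(delta[i][j] for j in Ig) / sum(delta[i])

def tr(delta, i, Ig, gamma=0.5, gamma_p=0.25):
    # (68)-(69): Tr(g, h_i) = 1 - (gamma*d_min + gamma'*d_sum)
    return 1 - (gamma * d_min(delta, i, Ig) + gamma_p * d_sum(delta, i, Ig))

# Four complete answers ordered on a line; the target is h1 (index 0):
delta = [[0, 1/3, 2/3, 1],
         [1/3, 0, 1/3, 2/3],
         [2/3, 1/3, 0, 1/3],
         [1, 2/3, 1/3, 0]]

assert tr(delta, 0, [0]) == 1.0                       # (66): the complete truth
assert abs(tr(delta, 0, [0, 1, 2, 3]) - 0.75) < 1e-9  # tautology: 1 - gamma'
assert tr(delta, 0, [1]) > tr(delta, 0, [3])          # nearer falsehood is more truthlike
```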
Then a partial answer g is truthlike if its min-sum distance from the target h∗ is sufficiently small. One partial answer g′ is more truthlike than another partial answer g if Δms(h∗, g′) < Δms(h∗, g). The degree of truthlikeness Tr(g, h∗) of g ∈ D(B) (relative to the target h∗ in B) is defined by

(69) Tr(g, h∗) = 1 − Δms(h∗, g).
This measure Tr has many nice features. For a tautology t, we have Tr(t, h∗) = 1 − γ′. For a complete answer hi, we get the expected result that Tr(hi, h∗) decreases with the distance Δ(hi, h∗). This gives a practical rule for the choice of the parameters: γ and γ′ should be chosen so that the complete answers closest to h∗ have a degree of truthlikeness larger than 1 − γ′, while the most misleading ones should be worse than ignorance. (For example, we may choose γ′/γ ≈ 1/2.) If the distance function Δ on B is trivial, i.e., Δij = 1 for all i ≠ j, then Tr(g, h∗) reduces to a special case of Levi’s [1967] definition of epistemic utility. As the target h∗ is unknown, the value of Tr(g, h∗) cannot be directly calculated by our formulas (68) and (69). However, there is a method of making rational comparative judgments about verisimilitude, if we have, instead of certain knowledge about the truth, rational degrees of belief about the location of truth. Thus, to estimate the degree Tr(g, h∗), where h∗ is unknown, assume that there is an epistemic probability measure P defined on B, so that P(hi/e) is the rational degree of belief in the truth of hi given evidence e. The expected degree of verisimilitude of g ∈ D(B) given evidence e is then defined by

(70) ver(g/e) = Σ_{i∈I} P(hi/e) Tr(g, hi).
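To make definitions (64)–(70) concrete, here is a minimal numerical sketch. The distance matrix, the weights, and the posterior distribution are hypothetical values chosen purely for illustration; they do not come from the text.

```python
# Toy cognitive problem B = {h0, h1, h2} with a hypothetical, symmetric
# distance matrix Delta[i][j] in [0, 1] (illustrative numbers only).
Delta = [
    [0.0, 0.4, 1.0],
    [0.4, 0.0, 0.6],
    [1.0, 0.6, 0.0],
]
K = len(Delta)
gamma, gamma_p = 1.0, 0.5        # the weights gamma and gamma' of (68)

def d_min(i, Ig):                # (64): min distance of g = V_{j in Ig} h_j from h_i
    return min(Delta[i][j] for j in Ig)

def d_sum(i, Ig):                # (67): relativized sum of the mistakes g allows
    return sum(Delta[i][j] for j in Ig) / sum(Delta[i][j] for j in range(K))

def Tr(Ig, i_true):              # (68)-(69): truthlikeness relative to target h_i_true
    return 1.0 - (gamma * d_min(i_true, Ig) + gamma_p * d_sum(i_true, Ig))

def ver(Ig, P):                  # (70): expected verisimilitude given P(h_i/e)
    return sum(P[i] * Tr(Ig, i) for i in range(K))

# The complete true answer is maximally truthlike; a tautology gets 1 - gamma'.
assert Tr({0}, 0) == 1.0
assert abs(Tr({0, 1, 2}, 0) - (1.0 - gamma_p)) < 1e-12
# ver(g/e) can be positive even when e refutes g, i.e. when P(g/e) = 0:
P = [0.0, 0.7, 0.3]
assert sum(P[i] for i in {0}) == 0.0 and ver({0}, P) > 0
```

The last assertion previews a point made below: unlike posterior probability, expected verisimilitude does not collapse to zero for refuted answers.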
If B includes constituents of a monadic language and P is chosen to be Hintikka’s inductive probability measure, explicit formulas for calculating the value of ver(g/e) can be provided (see [Niiniluoto, 1987, Ch. 9.5; Niiniluoto, 2005a]). If the probability distribution P on B given e is even, so that we are completely ignorant of the true answer to the cognitive problem, the values of ver(hj/e) are also equal to each other. On the other hand, the difference between expected verisimilitude and posterior probability is highlighted by the fact that ver(g/e) may be high even when e refutes g and thus P(g/e) = 0. This is also a crucial difference between ver and most probabilistic measures of confirmation and corroboration (e.g., (39) and (40)). Equation (70) gives us a comparative notion of estimated verisimilitude: g seems more truthlike than g′ on evidence e if and only if ver(g/e) > ver(g′/e). The function ver also defines an acceptance rule: given evidence e, accept as the most truthlike the theory g which maximizes ver(g/e). This rule is comparable to Bayesian decision theory when the loss function is proportional to distances from the truth. Besides comparisons to point estimation, a theory of Bayesian interval estimation can be based upon the rule of accepting the most truthlike hypothesis (see [Niiniluoto, 1986; 1987; Festa, 1986]).
346
Ilkka Niiniluoto
The relation between the functions Tr and ver is analogous to the relation between truth value tv (1 for true, 0 for false) and probability P, i.e.,

(71) Tr : ver = tv : P.

This can be seen from the fact that the posterior probability P(g/e) equals the expected truth value of g on e:

Σ_{i∈I} P(hi/e) tv(g, hi) = Σ_{i∈Ig} P(hi/e) = P(g/e).

(Cf. (48).) By (71), expected verisimilitude ver(g/e) is an estimate of real truthlikeness Tr(g, h∗) in the same sense in which posterior epistemic probability is an estimate of truth value. The standard form of Bayesianism, which evaluates hypotheses on the basis of their posterior probability, can thus be understood as an attempt to maximize expected truth values, while ver replaces this goal by the maximization of expected truthlikeness. Definition (70) guarantees that ver(g/e) = Tr(g, hj) if evidence e entails one of the strong answers hj. In Hintikka’s system of inductive logic, the result (13) guarantees that asymptotically it is precisely the boldest constituent compatible with the evidence that will have the largest degree of estimated verisimilitude:

(72) ver(g/e) → Tr(g, C^c), when n → ∞ and c is fixed.
(73) ver(g/e) → 1 iff g ≡ C^c, when n → ∞ and c is fixed.
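The identity behind the analogy (71), that posterior probability is the expectation of truth value, can be checked on a toy distribution. The numbers are hypothetical, chosen only for illustration.

```python
# Toy check: the posterior P(g/e) of a disjunction g of complete answers
# equals the expected truth value of g, mirroring how ver is the expected
# truthlikeness. Illustrative (hypothetical) posterior over three answers.
P = [0.2, 0.5, 0.3]          # posterior P(h_i/e) for h0, h1, h2
Ig = {0, 2}                  # the partial answer g = h0 v h2

def tv(i, Ig):               # truth value of g in case h_i is the truth
    return 1 if i in Ig else 0

expected_tv = sum(P[i] * tv(i, Ig) for i in range(3))
posterior = sum(P[i] for i in Ig)
assert abs(expected_tv - posterior) < 1e-12   # both equal 0.5 here
```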
11 MACHINE LEARNING
Besides expected verisimilitude ver, other ways of combining closeness to the truth and epistemic probability include the concepts of probable verisimilitude (i.e., the probability given e that g is truthlike at least to a given degree) and probable approximate truth (i.e., the probability given e that g is approximately true within a given degree). (See [Niiniluoto, 1987; 1989].) Let g in D(B) be a partial answer, and ε ≥ 0 a small real number. Define

(74) Vε(g) = {hi ∈ B | Δmin(hi, g) ≤ ε}.

Denote by g^ε the “blurred” version of g which contains as disjuncts all the members of the neighborhood Vε(g). Then g entails g^ε, and g is approximately true (within degree ε) if and only if g^ε is true. The probability that the minimum distance of g from the truth h∗ is not larger than ε, given evidence e, defines at the same time the posterior probability that the degree of approximate truth AT(g, h∗) of g is at least 1 − ε:

(75) PAT_{1−ε}(g/e) = P(h∗ ∈ Vε(g)/e) = Σ_{hi∈Vε(g)} P(hi/e).
PAT defined by (75) is thus a measure of probable approximate truth. Clearly we always have P(g/e) ≤ PAT_{1−ε}(g/e). When ε decreases toward zero, in the limit we have PAT_1(g/e) = P(g/e). Further, PAT_{1−ε}(g/e) > 0 if and only if P(g^ε) > 0. Unlike ver, PAT shares with P the property that logically weaker answers will have higher PAT-values than stronger ones. An important feature of both probable verisimilitude and probable approximate truth is that their values can be non-zero even for hypotheses with a zero probability on evidence: it is possible that PAT_{1−ε}(g/e) > 0 even though P(g/e) = 0. The notion of probable approximate truth is essentially the same as the notion of PAC (probably approximately correct) learning in the theory of machine learning (see [Niiniluoto, 2005b]). What the AI community calls “concepts” are directly comparable to monadic constituents, and thereby “concept learning” can be modelled by a Hintikka-style theory of inductive generalization. While AI accounts of concept learning are purely eliminative, as they recognize positive and negative instances, the treatment by Hintikka’s system allows one to include enumerative and analogical considerations as well. The convergence properties of Hintikka’s inductive probabilities are also comparable to the work on formal learning theory. Recall that in Hintikka’s system the posterior probability P(C^c/e) approaches one when c is fixed and n grows without limit. But this result (13) states only that our degrees of belief about C^c converge to certainty on the basis of inductive evidence. It does not yet guarantee that C^c is identical with the true constituent C∗. Similarly, by (73) we know that the expected verisimilitude ver(C^c/e) converges to one when c is fixed and n grows without limit. Again this does not guarantee that the “real” truthlikeness Tr(C^c, h∗) of C^c is maximal, i.e., that C^c is true.
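The basic properties of PAT just noted (P(g/e) ≤ PAT, and positivity for refuted hypotheses) can be sketched numerically. The distance matrix, ε, and the posteriors below are hypothetical illustration values.

```python
# Sketch of (74)-(75) for a toy problem with three complete answers and a
# hypothetical distance matrix; all numbers are chosen for illustration.
Delta = [
    [0.0, 0.4, 1.0],
    [0.4, 0.0, 0.6],
    [1.0, 0.6, 0.0],
]
P = [0.1, 0.7, 0.2]                       # posterior P(h_i/e)

def d_min(i, Ig):                         # (64)
    return min(Delta[i][j] for j in Ig)

def V(Ig, eps):                           # (74): eps-neighborhood of g
    return {i for i in range(len(Delta)) if d_min(i, Ig) <= eps}

def PAT(Ig, eps):                         # (75): P(h* in V_eps(g) / e)
    return sum(P[i] for i in V(Ig, eps))

g = {0}
# P(g/e) <= PAT_{1-eps}(g/e), with equality at eps = 0:
assert PAT(g, 0.0) == sum(P[i] for i in g)
assert sum(P[i] for i in g) <= PAT(g, 0.5)
# PAT can stay positive for a refuted hypothesis with P(g/e) = 0:
P = [0.0, 0.8, 0.2]
assert sum(P[i] for i in g) == 0.0 and PAT(g, 0.5) > 0
```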
For these stronger results an additional evidential success condition is needed:

(ESC) Evidence e is true and fully informative about the variety of the world w.

ESC means that e is exhaustive in the sense that it exhibits (relative to the expressive power of the given language L) all the kinds of individuals that exist in the world (see [Niiniluoto, 1987, p. 276]). With ESC we can reformulate our results so that they concern convergence to the truth:

(13′) If ESC holds, P(C∗/e) → 1, when c is fixed and n → ∞.
(72′) If ESC holds, ver(g/e) → Tr(g, C∗), when n → ∞ and c is fixed.
(73′) If ESC holds, ver(g/e) → 1 iff g ≡ C∗, when n → ∞ and c is fixed.
Similar modifications can be made in the convergence results about probable approximate truth. Now we can see that the convergence results of formal learning theory are not stronger than those of probabilistic approaches, even though they demand success with respect to all data streams, since they presuppose something like the success condition ESC: decidability in the limit assumes that the data streams are “complete in that they exhaust the relevant evidence” [Earman, 1992, p. 210]
or “perfect” in that “all true data are presented and no false datum is presented” and all objects are eventually described [Kelly, 1996, p. 270]. Without ESC even “global underdetermination” cannot be avoided (ibid., p. 17), since we cannot be certain that even an infinite sample of swans refutes the false generalization ‘All swans are white’: it is logically possible that an infinite stream of white swans is picked out from a world containing white and black swans. A more formal way of expressing these conclusions is to note that the learner of Hintikka’s system uses epistemic probabilities in her prior distribution P(C_w) and likelihoods P(e/C_w). The result (13) as such needs no assumption that behind these likelihoods there are some objective conditions concerning the sampling method.11 The same observation can be made about the famous results of de Finetti and L. J. Savage about the convergence of opinions in the long run, when the learning agents start from different non-dogmatic priors ([Howson and Urbach, 1989; Earman, 1992]; cf. [Niiniluoto, 1984, p. 102]). But it is possible to combine a system of inductive logic with the assumption that the evidence arises from a fair sampling procedure which gives each kind of individual an objective non-zero chance of appearing in the evidence e [Kuipers, 1977b], where such a chance is defined by a physical probability or propensity. As such propensities do not satisfy the notorious Principle of Plenitude, claiming that all possibilities will sometimes be realized, they do not exclude infinite sequences which violate ESC (see [Niiniluoto, 1988b]). But such sequences are extremely improbable by the convergence theorems of probability calculus (cf. [Festa, 1993, p. 76]). Suppose that we draw with replacement a fair sample of objects from an urn w. Let r be the proportion of objects of kind A in w. Then the objective probability of picking out an A is also r.
The Strong Law of Large Numbers now states that the observed relative frequency k/n converges with probability one to the unknown value of r. Such “almost sure” convergence is weaker than convergence in the ordinary sense. The reason for using this notion of convergence is that there are no logical reasons for excluding such non-typical sequences of observations that violate ESC, even though their measure among all possible sequences is zero. Formal learning theory and probabilistic theories of induction, as plausible attempts to describe scientific inquiry, are in the same boat with respect to the crucial success conditions: ESC is precisely the reason why inductive inference is always non-demonstrative or fallible even in the ideal limit, since there are no logical reasons for excluding the possibility that ESC might be incorrect. The conclusion to be drawn from these considerations can be stated as follows: the best results for a fallibilist “convergent realist” do not claim decidability in the limit or even gradual decidability, but rather convergence to the truth with probability one.

11 So we need not follow Kuipers [1977b] in the claim that the appropriate applications of the K-dimensional system are restricted to “multinomial contexts” which have underlying objective probability distributions associated with repeatable experiments (see [Niiniluoto, 1983]). But of course this condition may be a contextual presupposition of inductive logic.
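The urn model above can be simulated directly; the value of r and the sample sizes here are hypothetical choices for illustration.

```python
import random

# Illustrative simulation of the urn model: sampling with replacement from
# a population in which a (hypothetical) proportion r of objects are of
# kind A. The observed relative frequency k/n converges almost surely to r.
random.seed(0)
r = 0.3

def observed_frequency(n):
    k = sum(1 for _ in range(n) if random.random() < r)
    return k / n

for n in (100, 10_000, 1_000_000):
    print(n, observed_frequency(n))

# An ESC-violating data stream (e.g. one that never shows an A although
# r > 0) remains logically possible; it merely has probability zero.
assert abs(observed_frequency(1_000_000) - r) < 0.01
```

The deviation shrinks roughly like 1/sqrt(n), which is why “almost sure” convergence coexists with the logical possibility of deviant sequences.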
12 EVALUATION OF THE HINTIKKA PROGRAM
Inductive probability measures, or their systems, can be viewed as a sequence of successive “theories” about rational inferential practices. There is a remarkable continuity among the proposals of inductive logicians: the K-dimensional system is a generalization of Hintikka’s generalized combined system, which is a generalization and modification of Carnap’s c∗, which is a generalization of Laplace’s rule of succession. Moreover, Hacking suggests that “a Leibnizian ought to like Carnap’s preferred c∗” [Hacking, 1975, p. 141]. The successor “theories” in this sequence contain their predecessors as special or limiting cases. The sequence is also progressive by the Lakatosian standards: each member solves more problems than its predecessor. In particular, Hintikka’s system is able to handle inductive generalization, while the K-dimensional system shows why the systems from Laplace to Carnap failed in this task. Hintikka’s system can be modified so that it accounts also for analogical inference and observational errors, and it can be applied to more complex methodological situations than those listed by Carnap, in particular to cases involving theoretical premises and conclusions. Hintikka’s philosophical conclusion from his two-dimensional continuum was that inductive probabilities are always relative to extra-logical parameters. When α is small, the posterior probability of universal generalizations grows rapidly. In this sense, the choice of a small α is an index of boldness of the investigator, and the choice of a large α is an index of caution. Alternatively, the choice of α can be regarded as a regularity assumption about the lawlikeness of the relevant universe U.12 According to Hintikka, this abolishes the hopes for purely logical probabilities in the sense of Carnap’s core assumptions C3, C5, and C6 (see [Hintikka, 1987b; 1997, p. 317]). After the introduction of the λ-continuum, Carnap’s position was in fact quite similar.
He argued that, for certain kinds of processes with objective probabilities, there is an optimal value of λ [Carnap, 1952]; methods for estimating such an optimal value in different contexts are elaborated by Festa [1995]. In his later work, Carnap [1980] related the choice of λ to “objectivist” matters of attribute distance, but also admitted that the choice of λ may be a symptom of the investigator’s personality. He concluded that λ should be included in the interval between 1/2 and K, and finally gave the recommendation to take λ = 1. The contextual or extra-logical character of inductive probabilities brings inductive logic close to the mainstream of Bayesianism. From this perspective, inductive logic is a form of Bayesianism which investigates epistemic probabilities in relation to several kinds of structural assumptions, such as the expressive power of languages, the choice of inductive parameters, order and disorder in the universe, and similarity influence. Using terms from Bayesian decision theory, such “structural axioms” do not belong to the “pure theory of rationality” [Suppes, 1969, p. 95], but rather limit those situations where the theory is applicable [Niiniluoto, 1977]. In any case, inductive logic finds more structure (depending on linguistic frameworks and contextual assumptions) in the probabilities than the personalist Bayesians. Instead of trying to find one and only one system of inductive reasoning, or to fix a unique prior probability distribution for each problem as some “objective Bayesians” suggest, the study of probability measures in relation to different kinds of structural assumptions leads to a rich framework with potential applications to many types of methodological situations (cf. [Niiniluoto, 1977; 1983; 1988a]). In comparison with standard Bayesian statistics, the Hintikka program has a special character (cf. [Niiniluoto, 1983]). A statistician might suggest that constituents could be replaced by statistical hypotheses: for example, the claim that all individuals belong to cell Q1 could be expressed by the claim that the proportion of individuals in Q1 belongs to a short interval (q, 1]. The latter hypothesis can be studied by the techniques of Bayesian statistics, i.e., by introducing a prior probability distribution over the parameter space and applying Bayes’s Theorem relative to observed data, so that no separate inductive logic is needed. However, there is an important difference between the two cases: while the observation of one counterexample falsifies the constituent, it does not refute the statistical counterpart of a constituent. Further, the Hintikkian prior distributions which assign non-zero probabilities to constituents give positive weight to subsets of the parameter space with geometrical measure zero. To illustrate, in a classification system with two Q-predicates, the parameter space consists of pairs ⟨p1, p2⟩, where pi is the proportion of individuals in Qi, and p1 + p2 = 1.

12 This regularity is relative to the expressive power of universal generalizations. Similarly, Walk [1966] argues that Carnap’s λ is directly proportional to the statistical entropy of the universe.
There are three constituents in this case, and two of them correspond to singleton sets of measure zero:

C1 = ⟨1, 0⟩,
C2 = ⟨0, 1⟩,
C3 = {⟨p, 1 − p⟩ | 0 < p < 1}.

Laplace’s prior, i.e., a uniform distribution over the whole parameter space, gives probability one to the atomistic constituent C3. This argument illuminates how Hintikka’s improvement of Carnap’s approach constitutes a break away from standard Bayesian statistics. Hintikka [1997] further points out that inductive logic with parameters is not “purely Bayesian”, since changes of probabilities due to new choices of the values of λ and α cannot be described by conditionalization. In this sense, inductive logic can be a tool for studying language change (cf. [Niiniluoto and Tuomela, 1973]). Hintikka [1997] also suggests the move of introducing all extra-logical assumptions (such as the choice of parameters) as explicit premises, so that inductive logic would become a part of deductive logic. The same point is repeated in Hintikka’s “Replies” in [Auxier and Hahn, 2006]. It is interesting that, in his later work, Hintikka has turned out to be a staunch critic of traditional treatments of induction. He has developed an “interrogative model of inquiry”, based on a dialogue between two players, “the inquirer”
and “Nature” (see [Hintikka, 1981; 1987a; 1988; 1992]). Within this model, induction plays only a very modest role at the micro-level of experimental inquiry, and Hume’s problem seems to disappear from the “serious theory of the scientific method”. Hintikka argues that in typical controlled experiments Nature provides us with answers that are already general or have at least AE-complexity. Thus, we should reject the “super-dogma” of the “Atomistic Postulate”, which restricts Nature’s answers to negated or unnegated atomistic propositions. In my view, Hintikka’s study of inquiry on different levels of complexity is highly fruitful. However, one may doubt whether the step of inductive generalization can really be dispensed with in all important contexts by explicitly formulated strong regularity assumptions.13 Moreover, Hintikka’s own system of inductive logic is not tied to the Atomistic Postulate in the same way as Carnap’s, since one can study probabilities of the form P(g/e&T) and P(g/T), where T is a background theory (see [Niiniluoto, 1997]). Carnap’s original motivation for developing a system of inductive logic came from his logical empiricism. Many philosophers of science feel that this motivation is largely outdated, as it is too much oriented to naive and old-fashioned empiricism. However, the picture looks different when we take notice of Hintikka’s ability to deal with genuine inductive generalizations and scientific theories. To use the slogan of Niiniluoto and Tuomela [1973], inductive logic can be “noninductivist”, i.e., free of simple assumptions about the role of induction in scientific inference. With its applications to theoretical systematization, semantical information, explanation, abduction, and truthlikeness, inductive logic can be a tool for a critical scientific realist. Lakatos made fun of inductive logicians who, he argued, desperately attempt to calculate exact numerical degrees of confirmation for specific scientific theories.
But the introduction of quantitative degrees of probability (or systematic power, truthlikeness, etc.) with absolute values can be viewed primarily as a tool for making comparative judgements between rival theories. Moreover, inductive logic need not be understood to provide such manuals of calculation. Rather, it is a study of the general principles of probabilistic and uncertain reasoning (cf. [Niiniluoto, 1980, p. 226]). In this respect, even though Hintikka’s system has not yet been fully extended to first-order languages, the generalization of distributive normal forms and constituents indicates that the most important general features of induction in monadic cases extend to the richer languages as well. On the other hand, at least for the manageable cases of monadic languages, it is certainly possible to implement the formulas of Hintikka’s system as computer programs. The development of computer science, robotics, and artificial intelligence has opened the further perspective that inductive logic could be useful as a framework of machine learning.14 In their attempt to model uncertain inference by computer programs, the AI community is reinventing many logical approaches proposed earlier by philosophers. In addition to its applications in the philosophy of science, it may thus turn out that the future paradigm of systems of induction can be found in the field of machine learning.

13 In his “Replies” [Auxier and Hahn, 2006, p. 778], Hintikka states: “If these premises are spelled out explicitly, we no longer need any independent rules of inductive inference. Ordinary deductive logic with probability theory does the whole job.” Here Hintikka seems to ignore the distinction between factual and probabilistic rules of detachment (see note 8 above).

14 It is remarkable that already Carnap [1971] spoke about “inductive robots”.

BIBLIOGRAPHY

[Auxier and Hahn, 2006] R. E. Auxier and L. E. Hahn, eds. The Philosophy of Jaakko Hintikka. Chicago and La Salle: Open Court, 2006.
[Batens, 1975] D. Batens. Studies in the Logic of Induction and in the Logic of Explanation. Brugge: De Temple, 1975.
[Bogdan, 1976] R. J. Bogdan, ed. Local Induction. Dordrecht: Reidel, 1976.
[Bogdan, 1987] R. J. Bogdan, ed. Jaakko Hintikka. Dordrecht: Reidel, 1987.
[Carnap, 1945] R. Carnap. On Inductive Logic. Philosophy of Science 12, 72–97, 1945.
[Carnap, 1950] R. Carnap. The Logical Foundations of Probability. Chicago: University of Chicago Press, 1950. (2nd ed. 1962.)
[Carnap, 1952] R. Carnap. The Continuum of Inductive Methods. Chicago: The University of Chicago Press, 1952.
[Carnap, 1968] R. Carnap. The Concept of Constituent-Structure. In [Lakatos, 1968b, pp. 218–220], 1968.
[Carnap, 1971] R. Carnap. Inductive Logic and Rational Decisions. In [Carnap and Jeffrey, 1971, pp. 5–32].
[Carnap, 1980] R. Carnap. A Basic System of Inductive Logic, Part II. In [Jeffrey, 1980, pp. 7–155].
[Carnap and Bar-Hillel, 1952] R. Carnap and Y. Bar-Hillel. An Outline of the Theory of Semantic Information. In Y. Bar-Hillel, ed., Language and Information, pp. 221–274. Reading, Mass.: Addison-Wesley, 1952.
[Carnap and Jeffrey, 1971] R. Carnap and R. C. Jeffrey, eds. Studies in Inductive Logic and Probability, vol. I. Berkeley and Los Angeles: University of California Press, 1971.
[Carnap and Stegmüller, 1959] R. Carnap and W. Stegmüller. Induktive Logik und Wahrscheinlichkeit. Wien: Springer, 1959.
[Cohen and Hesse, 1980] L. J. Cohen and M. Hesse, eds. Applications of Inductive Logic. Oxford: Oxford University Press, 1980.
[Earman, 1992] J. Earman. Bayes or Bust? A Critical Examination of Bayesian Confirmation Theory. Cambridge, Mass.: The MIT Press, 1992.
[Fenstad, 1968] J. E. Fenstad. The Structure of Logical Probabilities. Synthese 18, 1–27, 1968.
[Festa, 1986] R. Festa. A Measure for the Distance Between an Interval Hypothesis and the Truth. Synthese 67, 273–320, 1986.
[Festa, 1993] R. Festa. Optimum Inductive Methods. A Study in Inductive Probabilities, Bayesian Statistics, and Verisimilitude. Dordrecht: Kluwer, 1993.
[Festa, 1995] R. Festa. Verisimilitude, Disorder, and Optimum Prior Probabilities. In T. A. F. Kuipers and A. R. Mackor, eds., Cognitive Patterns in Science and Common Sense. Amsterdam: Rodopi, pp. 299–320, 1995.
[Festa, 1996] R. Festa. Analogy and Exchangeability in Predictive Inferences. Erkenntnis 45, 229–252, 1996.
[Festa, 1999] R. Festa. Bayesian Confirmation. In M. C. Galavotti and A. Pagnini, eds., Experience, Reality, and Scientific Explanation. Dordrecht: Kluwer, pp. 55–87, 1999.
[Festa, 2003] R. Festa. Induction, Probability, and Bayesian Epistemology. In L. Haaparanta and I. Niiniluoto, eds., Analytic Philosophy in Finland. Amsterdam: Rodopi, pp. 251–284, 2003.
[Gavroglu et al., 1989] K. Gavroglu, Y. Goudaroulis, and P. Nicolacopulos, eds. Imre Lakatos and Theories of Scientific Change. Dordrecht: Kluwer, 1989.
[Hacking, 1971] I. Hacking. The Leibniz-Carnap Program for Inductive Logic. The Journal of Philosophy 68, 597–610, 1971.
[Hacking, 1975] I. Hacking. The Emergence of Probability. Cambridge: Cambridge University Press, 1975.
[Helman, 1988] D. H. Helman, ed. Analogical Reasoning. Dordrecht: Kluwer, 1988.
[Hempel, 1965] C. G. Hempel. Aspects of Scientific Explanation and Other Essays in the Philosophy of Science. New York: The Free Press, 1965.
[Hesse, 1964] M. Hesse. Analogy and Confirmation Theory. Philosophy of Science 31, 319–324, 1964.
[Hesse, 1974] M. Hesse. The Structure of Scientific Inference. London: Macmillan, 1974.
[Hilpinen, 1966] R. Hilpinen. On Inductive Generalization in Monadic First-Order Logic with Identity. In [Hintikka and Suppes, 1966, pp. 133–154], 1966.
[Hilpinen, 1968] R. Hilpinen. Rules of Acceptance and Inductive Logic, Acta Philosophica Fennica 21. Amsterdam: North-Holland, 1968.
[Hilpinen, 1970] R. Hilpinen. On the Information Provided by Observations. In [Hintikka and Suppes, 1970, pp. 97–122].
[Hilpinen, 1971] R. Hilpinen. Relational Hypotheses and Inductive Inference. Synthese 23, 266–286, 1971.
[Hilpinen, 1972] R. Hilpinen. Decision-Theoretic Approaches to Rules of Acceptance. In R. E. Olson and A. M. Paul, eds., Contemporary Philosophy in Scandinavia. Baltimore: The Johns Hopkins Press, pp. 147–168, 1972.
[Hilpinen, 1973] R. Hilpinen. Carnap’s New System of Inductive Logic. Synthese 25, 307–333, 1973.
[Hilpinen, 1989] R. Hilpinen. Von Wright on Confirmation Theory. In [Schilpp and Hahn, 1989, pp. 127–146].
[Hintikka, 1965a] J. Hintikka. Towards a Theory of Inductive Generalization. In Y. Bar-Hillel, ed., Proceedings of the 1964 International Congress for Logic, Methodology, and Philosophy of Science. Amsterdam: North-Holland, pp. 274–288, 1965.
[Hintikka, 1965b] J. Hintikka. On a Combined System of Inductive Logic. In Studia Logico-Mathematica et Philosophica in Honorem Rolf Nevanlinna, Acta Philosophica Fennica 18. Helsinki: Societas Philosophica Fennica, pp. 21–30, 1965.
[Hintikka, 1966] J. Hintikka. A Two-dimensional Continuum of Inductive Methods. In [Hintikka and Suppes, 1966, pp. 113–132].
[Hintikka, 1968a] J. Hintikka. The Varieties of Information and Scientific Explanation. In B. van Rootselaar and J. F. Staal, eds., Logic, Methodology and Philosophy of Science III, Proceedings of the 1967 International Congress. Amsterdam: North-Holland, pp. 151–171, 1968.
[Hintikka, 1968b] J. Hintikka. Induction by Enumeration and Induction by Elimination [with Reply]. In [Lakatos, 1968b, pp. 191–216 and 223–231], 1968.
[Hintikka, 1968c] J. Hintikka. The Possibility of Rules of Acceptance (Comments on Kyburg). In [Lakatos, 1968b, pp. 144–146].
[Hintikka, 1968d] J. Hintikka. Conditionalization and Information (Comments on Carnap). In [Lakatos, 1968b, pp. 303–306].
[Hintikka, 1969a] J. Hintikka. Inductive Independence and the Paradoxes of Confirmation. In N. Rescher, ed., Essays in Honor of Carl G. Hempel. Dordrecht: Reidel, pp. 24–46, 1969.
[Hintikka, 1969b] J. Hintikka. Statistics, Induction, and Lawlikeness: Comments on Dr. Vetter’s Paper. Synthese 20, 72–85, 1969.
[Hintikka, 1970] J. Hintikka. On Semantic Information. In [Hintikka and Suppes, 1970, pp. 3–27].
[Hintikka, 1971] J. Hintikka. Unknown Probability, Bayesianism, and de Finetti’s Representation Theorem. In R. C. Buck and R. S. Cohen, eds., In Memory of Rudolf Carnap, Boston Studies in the Philosophy of Science VIII. Dordrecht: Reidel, pp. 325–341, 1971.
[Hintikka, 1975] J. Hintikka. Carnap and Essler versus Inductive Generalization. Erkenntnis 9, 235–244, 1975.
[Hintikka, 1981] J. Hintikka. On the Logic of an Interrogative Model of Scientific Inquiry. Synthese 47, 60–84, 1981.
[Hintikka, 1987a] J. Hintikka. The Interrogative Approach to Inquiry and Probabilistic Inference. Erkenntnis 26, 429–442, 1987.
[Hintikka, 1987b] J. Hintikka. Replies and Comments. Mondadori and my Work on Induction and Probability. In [Bogdan, 1987, pp. 302–307], 1987.
[Hintikka, 1988] J. Hintikka. What is the Logic of Experimental Inquiry? Synthese 74, 173–190, 1988.
[Hintikka, 1992] J. Hintikka. The Concept of Induction in the Light of the Interrogative Approach to Inquiry. In J. Earman, ed., Inference, Explanation, and Other Frustrations. Berkeley: University of California Press, pp. 23–43, 1992.
[Hintikka, 1997] J. Hintikka. Replies. In [Sintonen, 1997, pp. 309–338].
[Hintikka, 1999] J. Hintikka. Inquiry as Inquiry: A Logic of Scientific Discovery. Dordrecht: Kluwer, 1999.
[Hintikka, 2006] J. Hintikka. Intellectual Autobiography. In [Auxier and Hahn, 2006, pp. 1–84].
[Hintikka and Hilpinen, 1966] J. Hintikka and R. Hilpinen. Knowledge, Acceptance and Inductive Logic. In [Hintikka and Suppes, 1966, pp. 1–20].
[Hintikka and Hilpinen, 1971] J. Hintikka and R. Hilpinen. Rules of Acceptance, Indices of Lawlikeness, and Singular Inductive Inferences: Reply to a Critical Discussion. Philosophy of Science 38, 303–307, 1971.
[Hintikka and Niiniluoto, 1976] J. Hintikka and I. Niiniluoto. An Axiomatic Foundation of Inductive Generalization. In [Przelecki et al., 1976, pp. 57–81], 1976. Reprinted with Postscript in [Jeffrey, 1980, pp. 157–181].
[Hintikka and Pietarinen, 1966] J. Hintikka and J. Pietarinen. Semantic Information and Inductive Logic. In [Hintikka and Suppes, 1966, pp. 96–112].
[Hintikka and Suppes, 1966] J. Hintikka and P. Suppes, eds. Aspects of Inductive Logic. Dordrecht: Reidel, 1966.
[Hintikka and Suppes, 1970] J. Hintikka and P. Suppes, eds. Information and Inference. Dordrecht: Reidel, 1970.
[Howson and Urbach, 1989] C. Howson and P. Urbach. Scientific Reasoning: The Bayesian Approach. La Salle, Ill.: Open Court, 1989.
[Jeffrey, 1980] R. C. Jeffrey, ed. Studies in Inductive Logic and Probability, vol. II. Berkeley and Los Angeles: University of California Press, 1980.
[Johnson, 1932] W. E. Johnson. Probability: The Deductive and Inductive Problems. Mind 41, 403–423, 1932.
[Kaila, 1926] E. Kaila. Die Prinzipien der Wahrscheinlichkeitslogik. Turku: Annales Universitatis Fennicae Aboensis, 1926.
[Kawalec, 2003] P. Kawalec. Structural Reliabilism: Inductive Logic as a Theory of Justification. Dordrecht: Kluwer, 2003.
[Kelly, 1996] K. Kelly. The Logic of Reliable Inquiry. New York: Oxford University Press, 1996.
[Kemeny, 1963] J. Kemeny. Carnap’s Theory of Probability and Induction. In [Schilpp, 1963, pp. 711–738].
[Keynes, 1921] J. M. Keynes. A Treatise on Probability. London: Macmillan, 1921.
[Kneale and Kneale, 1962] W. Kneale and M. Kneale. The Development of Logic. Oxford: Oxford University Press, 1962.
[Kuipers, 1978a] T. Kuipers. On the Generalization of the Continuum of Inductive Methods to Universal Hypotheses. Synthese 37, 255–284, 1978.
[Kuipers, 1978b] T. Kuipers. Studies in Inductive Probability and Rational Expectation. Dordrecht: Reidel, 1978.
[Kuipers, 1984] T. Kuipers. Two Types of Inductive Analogy by Similarity. Erkenntnis 21, 63–87, 1984.
[Kuipers, 1988] T. Kuipers. Inductive Analogy by Similarity and Proximity. In [Helman, 1988, pp. 299–313].
[Kuipers, 1997] T. Kuipers. The Carnap-Hintikka Programme in Inductive Logic. In [Sintonen, 1997, pp. 87–99].
[Kuipers, 2006] T. Kuipers. Inductive Aspects of Confirmation, Information, and Content. In [Auxier and Hahn, 2006, pp. 855–883].
[Kyburg, 1970] H. E. Kyburg, Jr. Probability and Inductive Logic. London: The Macmillan Company, 1970.
[Lakatos, 1968a] I. Lakatos. Changes in the Problem of Inductive Logic. In [Lakatos, 1968b, pp. 315–417].
[Lakatos, 1968b] I. Lakatos, ed. The Problem of Inductive Logic. Amsterdam: North-Holland, 1968.
[Lakatos, 1974] I. Lakatos. Popper on Demarcation and Induction. In P. A. Schilpp, ed., The Philosophy of Karl Popper. La Salle: Open Court, pp. 241–273, 1974.
[Levi, 1967] I. Levi. Gambling with Truth. New York: Alfred A. Knopf, 1967.
The Development of the Hintikka Program
355
[Maher, 2000] P. Maher. Probabilities for Two Properties. Erkenntnis 52, 63-91, 2000. [di Maio, 1996] M. C. di Maio. Predictive Probability and Analogy by Similarity in Inductive Logic. Erkenntnis 43, 369-394, 1996. [Miller, 1994] D. Miller. Critical Rationalism: A Restatement and Defence. Chicago: Open Court, 1994. [Mondadori, 1987] M. Mondadori. Hintikka’s Inductive Logic. In [Bogdan, 1987, pp. 157-180]. [Niiniluoto, 1972] I. Niiniluoto. Inductive Systematization: Definition and a Critical Survey. Synthese 25, 25-81, 1972. [Niiniluoto, 1973] I. Niiniluoto. Review of A. C. Michalos, The Popper–Carnap Controversy. Synthese 25, 417–436, 1973. [Niiniluoto, 1976] I. Niiniluoto. Inductive Logic and Theoretical Concepts. In [Przelecki et al., 1976, pp. 93–112]. [Niiniluoto, 1977] I. Niiniluoto. On a K-Dimensional System of Inductive Logic. In F. Suppe and P. D. Asquith, eds., PSA 1976, Vol. 2. East Lansing: Philosophy of Science Association, 425–477, 1977. [Niiniluoto, 1980] I. Niiniluoto. Analogy, Transitivity, and the Confirmation of Theories. In [Cohen and Hesse, 1980, pp. 218–234]. [Niiniluoto, 1981] I. Niiniluoto. Analogy and Inductive Logic. Erkenntnis 16, 1–34, 1981. [Niiniluoto, 1983] I. Niiniluoto. Inductive Logic as a Methodological Research Programme. Scientia: Logic in the 20th Century. Milano, 77–100, 1983. [Niiniluoto, 1984] I. Niiniluoto. Is Science Progressive? Dordrecht: D. Reidel, 1984. [Niiniluoto, 1985/1986] I. Niiniluoto. Eino Kaila und der Wiener Kreis. In G. Gimpl, ed., WederNoch. Tangenten zu den finnish-¨ osterreichischen Kulturbeziehungen. Jahrbuch f¨ ur deutschfinnische Literaturbeziehungen 19/20, pp. 223-241, 1985/1986. [Niiniluoto, 1987] I. Niiniluoto. Truthlikeness. Dordrecht: Reidel, 1987. [Niiniluoto, 1988a] I. Niiniluoto. Analogy by Similarity in Scientific Reasoning. In [Helman, 1988, pp. 271–298]. [Niiniluoto, 1988b] I. Niiniluoto. Probability, Possibility, and Plenitude. In J. H. Fetzer, ed., Probability and Causality. 
Dordrecht: Reidel, pp. 91–108, 1988. [Niiniluoto, 1994] I. Niiniluoto. Descriptive and Inductive Simplicity. In W. Salmon and G. Wolters, eds., Logic, Language, and the Structure of Scientific Theories: Proceedings of the Carnap–Reichenbach Centennial. Pittsburgh: University of Pittsburgh Press, pp. 147–170, 1994. [Niiniluoto, 1994/1995] I. Niiniluoto. Hintikka and Whewell on Aristotelian Induction. Grazer Philosophische Studien 49, 49–61, 1994/1995. [Niiniluoto, 1995] I. Niiniluoto. Is There Progress in Science?. In H. Stachowiak, ed., Pragmatik: Pragmatische Tendenzen in der Wissenschaftstheorie. Hamburg: Felix Meiner, pp. 30-58, 1995. [Niiniluoto, 1997] I. Niiniluoto. Inductive Logic, Atomism, and Observational Error. In [Sintonen, 1997, pp. 117–131]. [Niinuloto, 1998] I. Niiniluoto. Induction and Probability in the Lvow-Warsaw School. In K. Kijania-Placek and J. Wole˜ nski, eds., The Lvov-Warsaw School and Contemporary Philosophy. Dordrecht: Kluwer, pp. 323–335, 1998. [Niiniluoto, 1999] I. Niiniluoto. Defending Abduction. Philosophy of Science (Proceedings) 66: S436–S451, 1999. [Niiniluoto, 2005a] I. Niiniluoto. Abduction and Truthlikeness. In R. Festa, A. Aliseda, and J. Peijnenburg (Eds.), Confirmation, Empirical Progress, and Truth Approximation. Amsterdam: Rodopi, pp. 255-275, 2005. [Niiniluoto, 2005b] I. Niiniluoto. Inductive Logic, Verisimilitude, and Machine Learning. In P. H´ ajek, L. Vald´ es-Villanueva, and D. Westerstal, eds., Logic, Methodology, and Philosophy of Science: Proceedings of the Twelfth International Congress. London: King’s College Publications, pp. 295-314, 2005. [Niiniluoto, 2005c] I. Niiniluoto. G. H. von Wright on Probability and Induction. In I. Niiniluoto and R. Vilkko (Eds.), Philosophical Essays in Memoriam Georg Henrik von Wright. Acta Philosophica Fennica 77, Helsinki: The Philosophical Society of Finland, pp. 11-32, 2005. [Niiniluoto and Tuomela, 1973] I. Niiniluoto and R. Tuomela. 
Theoretical Concepts and Hypothetico-Inductive Inference. Dordrecht: Reidel, 1973. [Nix and Paris, 2007] C. J. Nix and J. B. Paris. A Note on Binary Inductive Logic. Journal of Philosophical Logic 36, 735-771, 2007.
356
Ilkka Niiniluoto
[Pietarinen, 1970] J. Pietarinen. Quantitative Tools for Evaluating Scientific Systematizations. In [Hintikka and Suppes, 1970, pp. 123–147]. [Pietarinen, 1972] J. Pietarinen. Lawlikeness, Analogy and Inductive Logic. Acta Philosophica Fennica 26. Amsterdam: North-Holland, 1972. [Pietarinen, 1974] J. Pietarinen. Inductive Immodesty and Lawlikeness. Philosophy of Science 41, 196–198, 1974. [Popper, 1959] K. R. Popper. The Logic of Scientific Discovery. London: Hutchinson, 1959. [Popper, 1963] K. R. Popper. Conjectures and Refutations. London: Hutchinson, 1963. [Przelecki et al., 1976] M. Przelecki, K. Szaniawski, and R. W´ ojcicki, eds. Formal Methods in the Methodology of Empirical Sciences. Dordrecht: Reidel, 1976. [Romeyn, 2005] J.-W. Romeyn. Bayesian Inductive Logic. Groningen: University of Groningen, 2005. [Schilpp, 1963] P. A. Schilpp, ed. The Philosophy of Rudolf Carnap. Lassalle, Ill.: Open Court, 1963. [Schilpp and Hahn, 1989] P. A. Schilpp and L. E. Hahn, eds. The Philosophy of Georg Henrik von Wright. La Salle, Illinois: Open Court, 1989. [Scott and Krauss, 1966] D. Scott and P. Krauss. Assigning Probabilities to Logical Formulas. In [Hintikka and Suppes, 1966, pp. 219-264]. [Sintonen, 1997] M. Sintonen, ed. Knowledge and Inquiry. Essays on Jaakko Hintikka’s Epistemology and Philosophy of Science. Amsterdam: Rodopi, 1997. [Skyrms, 1986] B. Skyrms. Choice and Chance. An Introduction to Inductive Logic. Third Edition. Belmont, California: Wadsworth Publishing Company, 1986. [Skyrms, 1993a] B. Skyrms. Carnapian Inductive Logic for a Value Continuum. In H. Wetterstein, ed., The Philosophy of Science, Midwest Studies in Philosophy 18. South Bend: University of Notre Dame Press, pp. 78-89, 1993. [Skyrms, 1993b] B. Skyrms. Analogy by Similarity in HyperCarnapian Inductive Logic. In J. Earman, ed., Philosophical Problems of the Internal and External Worlds. Essays in the Philosophy of Adolf Gr¨ unbaum. Pittsburgh: University of Pittsburgh Press, pp. 
273–282, 1993. [Stegm¨ uller, 1973] W. Stegm¨ uller. Carnap’s Normative Theory of Inductive Probability. In P. Suppes et al., eds., Logic, Methodology, and Philosophy of Science IV. Amsterdam: NorthHolland, pp. 501-513, 1973. [Suppes, 1966] P. Suppes. Probabilistic Inference and the Concept of Total Evidence. In [Hintikka and Suppes, 1966, pp. 49-65]. [Suppes, 1969] P. Suppes. Studies in the Methodology and Foundations of Science: Selected Papers from 1952 to 1969. Dordrecht: D. Reidel, 1969. [Tuomela, 1966] R. Tuomela. Inductive Generalization in an Ordered Universe. In [Hintikka and Suppes, 1966, pp. 155–174]. [Uchii, 1972] S. Uchii. Inductive Logic with Causal Modalities: A Probabilistic Approach. Philosophy of Science 39, 162-178, 1972. [Uchii, 1973] S. Uchii. Inductive Logic with Causal Modalities: A Deterministic Approach. Synthese 26, 264-303, 1973. [Uchii, 1977] S. Uchii. Induction and Causality in Cellular Space. In F. Suppe and P. D. Asquith, eds., PSA 1976, vol. 2, East Lansing: Philosophy of Science Association, pp. 448-461, 1977. [Walk, 1966] K. Walk. Simplicity, Entropy and Inductive Logic. In [Hintikka and Suppes, 1966, pp. 66-80]. [von Wright, 1951] G. H. von Wright. Carnap’s Theory of Probability. Philosophical Review 60, 362–374, 1951. [von Wright, 1951b] G. H. von Wright. A Treatise on Induction and Probability. London: Routledge and Kegan Paul, 1951. [von Wright, 1957] G. H. von Wright. The Logical Problem of Induction and Probability, 2nd revised edition. Oxford: Basil Blackwell, 1957. [First Edition: 1941. In Acta Philosophica Fennica 3, Helsinki.] [von Wright, 1971] G. H. von Wright. Explanation and Understanding. Ithaca: Cornell University Press, 1971. [Zabell, 1997] S. Zabell. Confirming Universal Generalizations. Erkenntnis 45: 267-283, 1997.
HANS REICHENBACH’S PROBABILITY LOGIC

Frederick Eberhardt and Clark Glymour
1 INTRODUCTION
Any attempt to characterize Reichenbach’s approach to inductive reasoning must take into account some of the core influences that set his work apart from more traditional or standard approaches to inductive reasoning. In the case of Reichenbach, these influences are particularly important, as otherwise Reichenbach’s views may be confused with others that are closely related but different in important ways. The particular influences on Reichenbach also shift the strengths and weaknesses of his views to areas different from the strengths and weaknesses of other approaches, and from the point of view of some other approaches Reichenbach’s views would seem quite unintelligible were it not for his particular perspective. Reichenbach’s account of inductive reasoning is fiercely empirical. More than perhaps any other account, it takes its lessons from the empirical sciences. In Reichenbach’s view, an inductive logic cannot be built up entirely from logical principles independent of experience, but must develop out of the reasoning practiced in, and useful to, the natural sciences. This might already seem like turning the whole project of an inductive logic on its head: we want an inductive inference system built on some solid principles (whatever they may be) to guide our scientific methodology. How could an inference procedure that draws on the methodologies of science supply in any way a normative foundation for an epistemology of the sciences? For Reichenbach there are two reasons for this “inverse” approach. We will briefly sketch them here, but return with more detail later in the text: First, Reichenbach was deeply influenced by Werner Heisenberg’s results, including the uncertainty principle, which called into question whether there is a fact of the matter about – and consequently whether there can be certain knowledge of – the truth of propositions specifying a particular location and velocity for an object in space and time.
If there necessarily always remains residual uncertainty for such propositions (which prior to Heisenberg seemed completely innocuous or at worst subject to epistemic limitations), then – according to Reichenbach – this is reason for more general caution about the goals of induction. Maybe the conclusions any inductive logic can aim for when applied to the sciences are significantly limited. Requiring soundness of an inference – preservation of truth with certainty – may not only
Handbook of the History of Logic. Volume 10: Inductive Logic. Volume editors: Dov M. Gabbay, Stephan Hartmann and John Woods. General editors: Dov M. Gabbay and John Woods. © 2011 Elsevier B.V. All rights reserved.
be unattainable, but impossible, if truth is not a viable concept for empirical propositions. Once uncertainty is built into the inference, deductive standards are inappropriate for inductive inference not only because the inference is ampliative (which is the standard view), but also because binary truth values no longer apply. Second, the evidence supporting Albert Einstein’s theory of relativity, and its impact on the understanding of the nature of space and time, revealed to Reichenbach the power of empirical evidence to overthrow truths that had been taken to be (even by Reichenbach himself in his early years) necessarily true. The fact that Euclidean space had been discovered not only to be not necessary, but quite possibly not true — despite Immanuel Kant’s transcendental proofs for its synthetic a priori status — called the status of a priori truths into question more generally. The foundations of any inference system could no longer be taken to be a priori, but had to be established independently as true of the world. Reichenbach refers to this confirmation of the correspondence between formal structures and the real world as “coordination” (although “alignment” might have been the more intuitive description of what he meant). Einstein’s and Heisenberg’s findings had their greatest impact on Reichenbach’s views on causality. Influenced by the Kantian tradition, Reichenbach took causal knowledge to be so profound that in his doctoral thesis in 1915 he regarded it as synthetic a priori knowledge [Reichenbach, 1915]. But with the collapse (in Reichenbach’s view) of a synthetic a priori view of space, due to Einstein, Reichenbach also abandoned the synthetic a priori foundation of causality. Consequently, Reichenbach believed that causal knowledge had to be established empirically, and so an inductive procedure was needed to give an account of how causal knowledge is acquired and comes to be taken for granted to such an extent that it is mistaken for a priori knowledge.
But empirical knowledge, in Reichenbach’s view, is fraught with uncertainty (due to, e.g., measurement error, illusions, etc.), and so this uncertainty had to be taken into account in an inductive logic that formalizes inferences from singular (uncertain) empirical propositions to general (and reasonably certain) empirical claims. Heisenberg’s results implied further problems for any general account of causal knowledge: while the results indicated that the uncertainty found in the micro-processes of quantum physics is there to stay, macro-physics clearly uses stable causal relations. The question was how this gap could be bridged. It is therefore unsurprising that throughout Reichenbach’s life, causal knowledge formed the paradigm example for considerations with regard to inductive reasoning, and that probability was placed at its foundation. The crumbling support for such central notions as space, time and causality also led Reichenbach to change his view on the foundations of deductive inference. Though he does not discuss the foundations of logic and mathematics in any detail, there are several points in Reichenbach’s work at which he indicates a switch away from an a prioristic view. The a prioristic view takes logic to represent necessary truths of the world, truths that are in some sense ontological. Reichenbach rejects this view by saying that there is no truth “inherent in things”, that necessity is a result of syntactic rules in a language, and that reality need not conform to the
syntactic rules of a language [Reichenbach, 1948]. Instead, Reichenbach endorsed a formalist view of logic in the tradition of David Hilbert. Inference systems should be represented axiomatically. Theorems of the inference system are the conclusions of valid deductions from the axioms. Whether the theorems are true of the world depends on how well the axioms can be “coordinated” with the real world. This coordination is an empirical process. Thus, the underlying view holds that the axioms of deductive logic can only be regarded as true (of the world), and the inference principles as truth preserving, if the coordination is successful – and that means, in Reichenbach’s case, empirically successful, or useful. In the light of quantum theory, Reichenbach rejected classical logic altogether [Reichenbach, 1944]. Instead of resting on an a priori foundation, Reichenbach’s approach to induction is axiomatic. His approach, exemplified schematically for the case of causal induction, works something like this: We have causal knowledge. In many cases we do not doubt the existence of a causal relation. In order to give an account of such knowledge we must look at how this knowledge is acquired, and so we have to look closely at the methodologies used in the natural sciences. According to Reichenbach, unless we deny the significance of the inductive gap David Hume dug (in the hole created by Plato and Sextus Empiricus), the only way we will be able to make any progress towards an inductive logic is to look at those areas of empirical knowledge where we feel reasonably confident that we have made some progress in bridging that gap, and then try to make explicit (in the form of axioms) the underlying assumptions and the justifications (or stories) we tell ourselves about why such assumptions are reliable. There are, of course, several other influences that left their marks on Reichenbach’s views. Perhaps most important (in this second tier) are the positivists.
Their influence is particularly tricky, since Reichenbach was closely associated with many members of the Vienna Circle, but his views are in many important ways distinctly “negativist”: Reichenbach denies that there can be any certainty even about primitive perception, but he does believe — contrary to Karl Popper — that once uncertainty is taken into account, we can make progress towards a positive probability for a scientific hypothesis. We return to the debate with Popper below. Second, it is probably fair to say that Richard von Mises, Reichenbach’s colleague during his time in Berlin and Istanbul, had the greatest influence with regard to the concept of probability. Since probabilistic inferences play such a crucial role in scientific induction, Reichenbach attempted to develop a non-circular foundation and a precise account of the meaning and assertability conditions of probability claims. Reichenbach’s account of probability in terms of the limits of relative frequencies, and his inductive rule, the so-called “straight rule”, for the assertability of probability claims — both to be discussed in detail below — are perhaps his best known and most controversial legacy with regard to inductive inferences. As with any attempt to describe a framework developed over a lifetime, we would inevitably run into some difficulty in piecing together what exactly Reichenbach meant even if he had at all times written with crystal clarity and piercing precision — which he did not. On certain aspects Reichenbach changed or revised his view, and it did not always become more intelligible. However, the areas of change in Reichenbach’s account are also of particular interest, since they give us a glimpse into those aspects that Reichenbach presumably deemed the most difficult to pin down. They will give us an idea of which features he considered particularly important, and which ones still needed work. As someone who in many senses sat between the thrones of high church philosophy of his time and (therefore?) anticipated many later and current ideas, Reichenbach’s views are of particular interest.

2 PROBABILITY LOGIC: THE BASIC SET-UP

Reichenbach distinguishes deductive and mathematical logic from inductive logic: the former deals with the relations between tautologies, whereas the latter deals with truth in the sense of truth in reality. Deductive and mathematical logic are built on an axiomatic system. Whether the axioms are true of the world is open to question, and only of secondary interest in the deduction of mathematical theorems. Reichenbach admits that we appear unable to think other than by adhering to certain logical inferences, but that does not make deductive logic necessarily true of the world. We similarly appear quite unable to think of real space in terms of anything but Euclidean space, even though we have known since Einstein (and the results of various crucial experiments) that real space is not Euclidean. In contrast to the formal relations that are of interest in deductive logic, inductive logic is concerned with the determination of whether various relations between quantities are true in the world; the aim is to represent, or, as Reichenbach says, “coordinate”, the real world with mathematical relations. A scientific law states a mathematical relation about certain quantities in the world.
The task of inductive logic is to establish whether the mathematical relations described by the law correspond to the relations between the real features in the world represented by symbols in the mathematical law. While the semantics of deductive logic are formal, and can be adjusted to fit syntactic constraints, we do not have such definitional freedom of interpretation when describing the real world. Thus, inductive logic must not only satisfy a formal semantics but enable a mapping between the syntactic representation (the mathematical law) and its interpretation (the real world quantities). Reichenbach is a realist about the external world. But the access we are granted to the external world is indirect. He compares our experience of the external world to standing in a cloth cube, and drawing inferences about the objects outside the cube based on the shadows we see from the inside on the cube’s surface [Reichenbach, 1938a]. The information we obtain about the external world is not only indirect but also inexact. No empirical procedure supplies perfectly “clean” data. The data is “unclean” for two reasons: First, in measuring a particular parameter, there is always an infinity of other small influences that make the measurement noisy. Second, if Heisenberg’s uncertainty principle is not only epistemic, but indicates a true metaphysical uncertainty (and Reichenbach appeared to take this view despite being a close friend and admirer of Einstein), then there is no exact measurement to be had in the first place. Consequently, all scientific laws are established by probabilistic inferences from empirical samples. If the epistemological support for scientific laws is only probabilistic, then an inductive logic cannot be two-valued, but must, like probability, be continuously valued. So Reichenbach’s inductive logic is continuously valued between 0 and 1, where the “truth value” corresponds to the probabilistic support. Reichenbach terms his system “probability logic”, and so shall we in what follows. To define probabilistic support, Reichenbach again turns to the methodology used in science: Scientific laws are universal claims based on a finite sequence of observations. The uncertainty in the truth of the scientific law derives (at least in part) from the fact that at any point in the sequence of observations we do not know whether future observations will follow the pattern we have seen so far. Consequently, the probability value describing the support we have for a scientific claim must be a property of the sequence of observations we see. For Reichenbach the probability of an event is the limit of the relative frequency of events of the same type — what that means becomes one of Reichenbach’s main problems — in an infinite sequence of events. (There are a few further constraints on the sequence, but we will leave them aside here.) This is what is referred to as Reichenbach’s limiting frequency interpretation of probability. But a logic, even a probability logic, is not about events; a logic is about inferences on propositions describing events.
Consequently, Reichenbach must provide an account of how a probability that is defined as a limiting frequency in a sequence of events is represented in a probability logic. Reichenbach claims that the structure of events is mapped directly onto propositions in his probability logic that are “coordinated” with the events. That is, for each event in a sequence there is a proposition describing that event and the resulting sequence of propositions reflects isomorphically the structure of the sequences of events. Since the structures are isomorphic, probability values can be used interchangeably for the events and the propositions describing the events. The probability associated with a proposition in his probability logic is the limiting frequency of that proposition in a sequence of propositions describing a sequence of coordinated events in the external world. For example, if a ball is rolled down an inclined plane several times, and the time for its descent is measured on each trial, then there is a sequence of events on the inclined plane, each of which is associated with a proposition, e.g. proposition 1 might state: “The ball took 4.2 seconds (±δ).” The probability of proposition 1 is the limiting frequency (of occurrence) of this proposition in the sequence of propositions describing the trials. (Since measurements of continuous quantities can only be stated within intervals — due to measurement errors — the limiting frequency of a proposition is non-zero. But, admittedly, Reichenbach fudges the details on this point.) The approach is very intuitive given scientific practice. It is basically a formal representation of the construction of histograms.
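The inclined-plane example above can be sketched in code. The following is a hypothetical illustration, not anything Reichenbach wrote: the descent time of 4.2 s, the interval half-width δ = 0.1 s, and the Gaussian noise model are all invented for the sketch. It shows how repeated noisy trials are "coordinated" with interval-valued propositions, whose relative frequencies in the finite sequence are exactly the entries of a histogram.

```python
import random
from collections import Counter

random.seed(0)

# Illustrative assumptions (not from the source): true descent time,
# interval half-width delta, and a Gaussian noise model for measurement.
TRUE_TIME = 4.2   # assumed true descent time in seconds
DELTA = 0.1       # half-width of the measurement interval

def measure():
    """One noisy trial: the true time plus a small measurement error."""
    return TRUE_TIME + random.gauss(0, 0.15)

def proposition(t):
    """Map a measurement to the interval-valued proposition it verifies."""
    # Snap the time to the nearest interval of width 2*delta.
    centre = round(t / (2 * DELTA)) * (2 * DELTA)
    return f"The ball took {centre:.1f} s (±{DELTA})"

# The sequence of propositions coordinated with the sequence of trials.
trials = [proposition(measure()) for _ in range(10_000)]
counts = Counter(trials)

# The relative frequency of each proposition in the finite sequence is
# Reichenbach's stand-in for its probability -- i.e. a histogram.
for prop, n in sorted(counts.items()):
    print(f"{prop}: {n / len(trials):.3f}")
```

Because propositions report intervals rather than exact real values, each proposition can recur in the sequence and so can have a non-zero relative frequency, which is the point Reichenbach needs.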
But the implication for a probability logic is significant: the probability logic must be a logic of sequences. So probability values are limits of the relative frequencies of propositions in the sequences. We still need an account of how these values are to be estimated, especially since empirical event sequences are necessarily finite, thereby leaving the limit of the infinite sequence undetermined. And whatever the method of estimation, we require a justification for its application in place of any other procedure. Reichenbach’s procedure for estimating the probability of event types is simple: one should use the frequencies in the available finite initial segment as if they were from the limiting distribution. As more of the sequence becomes available, the empirical distribution, and with it the probabilities, should be adjusted accordingly. This inductive inference rule is now known as the straight rule. The justification of the straight rule has three parts: First, Reichenbach argues that we have recourse to higher order probabilities (supposedly based on more general abstract knowledge) which provide reason to believe in the approximate accuracy of the empirical distribution. Second, he claims that a hierarchy of higher order probabilities, in which no higher order claim is certain, need not lead to an infinite regress of probabilities (which we would be unable to determine). The regress can be truncated by blind posits — wagers — that can be substituted for probability values. Third, the straight rule estimate converges to the limiting frequency as accurately as one wants, given a finite amount of data. Reichenbach recognizes that other inference rules have the same convergence guarantee, but claims that the straight rule is “simpler.” If one buys these claims, then one (supposedly) has a logic based on probabilistic inference. Unlike standard logical calculi it does not relate individual propositions, but sequences of propositions.
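The straight rule itself is simple enough to state as a one-line estimator. The sketch below uses an invented toy sequence purely for illustration; it shows the rule's two features described above: posit the relative frequency of the finite initial segment as the limiting frequency, and revise the posit as more of the sequence arrives.

```python
def straight_rule(observations, event):
    """Reichenbach's straight rule: posit the relative frequency of the
    event in the observed finite initial segment as its limiting frequency."""
    if not observations:
        raise ValueError("no observations yet: the rule posits nothing")
    return observations.count(event) / len(observations)

# A hypothetical toy sequence of outcomes, where 1 = "success".
sequence = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]

# The posit is revised as longer initial segments become available.
for n in (2, 5, 10):
    print(n, straight_rule(sequence[:n], 1))  # 0.5, then 0.6, then 0.7
```

If the sequence has a limiting relative frequency at all, these posits converge to it; the rule is silent on sequences with no limit, which is where Reichenbach's critics pressed him.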
It assigns to sequences of propositions a continuous value that is given by the limit of the relative frequencies in the sequences. For empirical claims, the value of this limit is estimated from the available scientific evidence by the straight rule, while the results of standard two-valued logic can be derived as a limiting case. Reichenbach considers this to be the best we can do in light of the limitations posed by Hume’s inductive gap. Certainty is no longer an achievable aim for induction; we can only speak of high probability. The continuous values of probability reflect a graded notion of truth (in the external world) of the conclusions reached by inductive inference. Reichenbach also sees his account as a vindication of the rationality of the procedures of scientific inference. It provides a justification for an increased confidence in the truth of some scientific claims rather than others, as a function of sample size, or as a result of other similar findings in closely related fields. The calculus of probability, as part of this probability logic, provides the inference machinery to transfer probabilistic support between scientific claims. Any hope for a sturdier bridge across the inductive gap is wishful thinking. Needless to say, not everyone took this approach to be as successful as Reichenbach considered it to be, and we will review some of the criticisms below. But first, we will flesh out the various aspects of this probability logic in more detail.
3 PROBABILITIES AS LIMITING FREQUENCIES

For Reichenbach, probability, as it is used in science, is an objective quantity, not a subjective degree of belief. The main difficulty for such an account is to state precisely what such an objective probability is supposed to be, while providing justified grounds for making probability judgments. In fact, Reichenbach worked on the foundations of probability throughout his life, and his views changed. In his doctoral thesis of 1915, Reichenbach argues that the probability of an event is the relative frequency of the event in an infinite sequence of causally independent and causally identical trials [Reichenbach, 1915]. Influenced by the neo-Kantians of his time (Ernst Cassirer, Paul Natorp, etc.), Reichenbach took causality to be a primitive concept, more fundamental than probability. On this view, causal knowledge is synthetic a priori knowledge, and the proof of this status was supposedly given by Kant’s transcendental deduction of the principle of causality. That is, according to (Reichenbach’s reading of) Kant, we have causal knowledge of individual events that enables us to determine causal independence and identical causal circumstances, and so, according to Reichenbach, we have a non-trivial, non-circular and objective basis on which to build the concept of probability. In particular, if one can show that causally independent and identical trials imply probabilistically independent and identically distributed trials, then the law of large numbers implies that the empirical distribution converges to the true distribution in probability.1 Reichenbach was aware of the (weak) law of large numbers (although it is not discussed in any detail in his thesis), but he considered convergence in probability too weak. Relying on convergence in probability would imply that the notion of probability features in the definiens of the definition of probability, which would render the definition of the concept of probability circular.
Reichenbach wanted to establish convergence with certainty. To resolve this dilemma, Reichenbach (in 1915) again reached into the Kantian toolbox and provided a transcendental argument that there is a synthetic a priori principle — the principle of lawful distribution — that guarantees with certainty that every empirical distribution converges. The essence of the transcendental argument is as follows: If there were no such principle, scientific knowledge as it is represented in the laws of nature would be impossible. But obviously science is replete with knowledge about lawful causal relations. Scientific laws state general causal regularities, but Kantian causal knowledge only supplies causal knowledge with regard to single events. Something is needed to aggregate the individual causal knowledge tokens into general causal laws. Hence, there must be such a principle. Given the principle, convergence is guaranteed with certainty, and even if we do not know when convergence will occur or at what rate, we are on the right track if we use the empirical distribution, since it must converge at some point. That was, in short, the argument of his doctoral thesis.

1 The law of large numbers states that in a sequence of independent identical trials, for every ε > 0 the probability that the frequency of success in the sequence differs from the true probability of success by more than ε converges to zero as the number of trials n goes to infinity.
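The convergence-in-probability claim of the weak law of large numbers, which Reichenbach considered too weak, can be watched directly in a small simulation. All numbers here (success probability p = 0.3, tolerance ε = 0.05, 2000 runs) are illustrative assumptions: across many repeated runs of n i.i.d. trials, the share of runs whose empirical frequency misses p by more than ε shrinks as n grows, but no single run is ever guaranteed to be within ε.

```python
import random

random.seed(1)

# Illustrative parameters (assumptions for this sketch, not from the source).
p = 0.3        # true success probability
EPSILON = 0.05 # tolerance around p
RUNS = 2000    # number of repeated experiments per sample size

def miss_rate(n):
    """Fraction of RUNS experiments of n i.i.d. trials whose empirical
    frequency differs from p by more than EPSILON."""
    misses = 0
    for _ in range(RUNS):
        successes = sum(random.random() < p for _ in range(n))
        if abs(successes / n - p) > EPSILON:
            misses += 1
    return misses / RUNS

# The miss rate falls towards zero as n grows: convergence *in probability*.
for n in (10, 100, 1000):
    print(n, miss_rate(n))
```

This is exactly the contrast drawn in the text: the miss rate goes to zero, yet certainty is never reached for any finite n, which is why the young Reichenbach thought a stronger, non-probabilistic guarantee was needed.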
The argument is not very satisfying even if the gaps were filled in (e.g. from causally independent trials to probabilistically independent trials): Granted the nowadays implausible view that causal knowledge of token events is synthetic a priori, it is simply false to claim that there is a synthetic a priori guarantee of convergence of sequences to a limiting distribution — we know many sequences whose limiting frequencies do not converge. Reichenbach must have at some point (if not all along) felt similarly uncomfortable with his account, since his argument changed significantly between his doctoral thesis in 1915 and the publication of the English edition of The Theory of Probability in 1949 [Reichenbach, 1949c].2 In 1927 Reichenbach indicated in notes3 (and referred to earlier discussions with Paul Hertz) that convergence with certainty is untenable, and that one can only guarantee convergence in probability. This is a transitional thought of Reichenbach’s. It is an observation on the law of large numbers, essentially, which posits a probability distribution from which initial segments of sequences are obtained by i.i.d sampling. The limiting frequency interpretation, by contrast, affords the certainty of convergence to the probability by the straight rule, provided there exists a limit value at all. In addition, he changed his mind on the order of the primitives: Once Einstein had shaken the synthetic a priori status of space and time, Reichenbach similarly reviewed the synthetic a priori status of causality and concluded that it was not causality, but rather probability that was the more fundamental notion, i.e. that causality is a relation that can only be inferred on the basis of probabilistic relations (plus some additional assumptions). 
2 For more details of the changes see the introduction to the translation of Reichenbach’s thesis [Reichenbach, 2008].

3 See reference HR 044-06-21 in [Reichenbach, 1891-1953], also discussed in the introduction in [Reichenbach, 2008].

Claims about single event causation — what is now referred to as actual causation — were considered elliptic, either implicitly referring to a sequence of events or fictitiously transferring the causal claim from the type level to the token level. However, if causal relations are no longer fundamental, and probabilities are not to be taken as primitive, then Reichenbach had to find a new foundation for the concept of probability. This effort coincided with similar concerns of Richard von Mises, who was trying to establish a foundation of probability in terms of random events [von Mises, 1919]. While randomness was well understood pretheoretically, all attempts to characterize it formally turned out to have undesired consequences. The aim that both Reichenbach and von Mises shared was a reduction of the concept of probability to a property of infinite sequences of events, thereby avoiding any kind of circularity in the foundation. Given scientific practice it seemed intuitive and appealing to think of objective probabilities in terms of the relative frequency in an infinite sequence of events — one only had to appropriately characterize the types of sequences that would be considered admissible as providing the foundation of probability. For example, one would not consider a sequence admissible if it simply alternated back and
forth between 1 and 0. While the limiting relative frequency of 1s is 1/2, one knows the next number with certainty given any initial segment long enough to exhibit the pattern. Von Mises therefore wanted to restrict his considerations to sequences of random events, since the notion of randomness captured the idea that one would not be able to make any money by betting on the next item in the sequence given the previous items. More formally, the lack of after-effect and the invariance to subsequence selection were considered necessary conditions for random sequences. The lack of after-effect captures the idea of being unable to make any money by betting on the next item of a sequence, given the previous items. In particular, the probability of any outcome should be the same, no matter what the previous outcomes were. Invariance under subsequence selection requires that the probability of an event is the same under any subsequence selection rule that depends only on the indices of the items. How these notions are spelled out formally differs among authors. Reichenbach rejected the idea of random sequences because he saw no hope of being able to adequately capture randomness formally.4 There were known theoretical difficulties in showing that all the conditions for randomness could be satisfied, and Reichenbach had pointed out some of them [Reichenbach, 1932a]. Reichenbach did not give up on the idea completely, but instead settled for a somewhat weaker constraint on sequences: normal sequences. Normal sequences form a strict superset of random sequences. A sequence of events is normal if the sequence is free of after-effect and the probabilities of event types are invariant under regular divisions. Reichenbach’s definition of after-effect is not entirely clear, but roughly, in a sequence with after-effect an event E at index i implies for events at indices subsequent to i probabilities that differ from the limiting relative frequency of those events.
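How invariance under index-based subsequence selection fails for the alternating sequence can be shown in a minimal sketch (ours), using every-k-th-element selections of the kind involved in Reichenbach's regular divisions:

```python
def rel_freq(seq, value=1):
    return seq.count(value) / len(seq)

def every_kth(seq, k, offset=0):
    # An index-based subsequence selection rule: take every k-th
    # element, starting at `offset`.
    return seq[offset::k]

# A long initial segment of the alternating sequence 0, 1, 0, 1, ...
alternating = [i % 2 for i in range(10_000)]

print(rel_freq(alternating))                   # 0.5
print(rel_freq(every_kth(alternating, 2, 1)))  # 1.0: all ones
print(rel_freq(every_kth(alternating, 2, 0)))  # 0.0: all zeros
```

The overall frequency is 1/2, but the two selections disagree with it maximally, so the sequence fails the invariance condition (and a bettor who knows the pattern wins every time).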
4 Ernest Nagel agreed with this latter point, but believed that von Mises’s weaker version of randomness could nevertheless be formalized [Nagel, 1936].

Regular divisions are subsequence selection rules that pick out every k-th element of the original sequence, for some fixed k. (In fact, the conditions are a little more intricate, but we leave that aside here.) The probability of an event E then is the limiting relative frequency of E in a normal sequence of events. This works as an abstract definition of probability, but it is not adequate to determine scientific probabilities. In empirical science the sequences of measurements are finite, and the finite initial segment of a sequence gives us no information about the limiting distribution. Nevertheless, Reichenbach claims that we should treat the empirical distribution given by the finite initial segment of measurements as if it were (roughly) the same as the limiting distribution. He believes we have recourse to a higher order probability that specifies the probability that the limiting relative frequency of the event (its true probability) lies within some (narrow) band of width δ around the empirical frequency. This higher order probability is also based on empirical data, but indirectly: it derives from a sequence of probability values of the first order, i.e. from a sequence of sequences of events. The idea is that it integrates data from sequences of different inductive inferences. Reichenbach gives one type of example in different forms that provides some idea of how this is supposed to work (e.g. see [Reichenbach, 1949c, pp. 438-440]): Suppose we have a finite sequence of measurements, M_1, M_2, M_3, . . . , M_n, and we classify them as 1 or 0 depending on whether they fall into some prespecified narrow range around a fixed value. For example, suppose we have measurements of the gravitational constant, and we classify a data point as 1 if it falls within the band γ ± δ for some small value of δ, and 0 otherwise. So we can define a variable X = I(|M − γ| ≤ δ), where I(·) is the indicator function, and obtain a sequence of values of X consisting of 1s and 0s, depending on the original measurements: 1, 1, 0, 1, 0, . . . , 0. Suppose further that, if we had infinite data, there would be a limiting distribution of the frequencies of 0s and 1s in the sequence, with P(X = 1) = p and P(X = 0) = 1 − p. The actual empirical distribution — determined by the relative frequency of 1s and 0s among the available measurements — is P̂(X = 1) = p̂ and P̂(X = 0) = 1 − p̂. Reichenbach claims that there is a higher-order probability q such that P(|p̂ − p| < ε) = q for small ε, i.e. the true distribution falls with probability q within distance ε of the empirical distribution. According to Reichenbach we obtain an estimate of such a higher-order probability q by considering several sequences of measurements, each with its own empirical distribution. So suppose we had three sequences of measurements (say, from different measurements of the gravitational constant (a) on the moon, (b) on some planet, and (c) using the Cavendish balance):

a: 1, 1, 1, 0, 0, . . . , 1
b: 0, 1, 1, 1, 1, . . . , 0
c: 0, 0, 1, 0, 0, . . . , 1
Each will have a certain empirical distribution, say P̂_a, P̂_b and P̂_c. These three empirical distributions form their own sequence of values of p̂, namely p̂_a, p̂_b, p̂_c, each specifying the relative frequency of 1s in the individual sequences. p̂_a, p̂_b, p̂_c again determine an empirical distribution, but now of higher-order probabilities. Suppose it is the case from the three initial distributions that p̂_a = 0.8, p̂_b = 0.7 and p̂_c = 0.79. Again we can classify these values according to some approximation bound, e.g. 0.8 ± ε, where ε = 0.05. In that case the (empirical estimate of the) higher-order probability q is q̂ = 2/3. The relative frequency of 1s in a single row indicates the probability of truth of the statement about the gravitational constant for the particular test object, e.g. the planet. The second-order probability q across the different sequences indicates the probability that the first order probability claim is true. It is this kind of mutual validation of convergence across different
measurement sequences that supplies, according to Reichenbach, a probability of convergence of any empirical distribution. In another analogous example involving measurements of the melting point of different metals, Reichenbach argues that the fact that many metals have a melting point gives us reason to believe that a metal which we have so far not seen melt, will nevertheless probably have a melting point. Despite the fact that for this apparently “solid” metal the empirical distribution of measurements appears to indicate that the probability of having a melting point is zero, the second order probability that determines how indicative the empirical distribution is of the limit, will be very low, because the second order probability integrates the findings from the other metals. Reichenbach refers to these “cross-inductions” as providing a “network of inductions”. Another way of thinking about Reichenbach’s approach is to consider a hierarchical Bayesian procedure: Measurement data is used to estimate certain distributional parameters of a quantity of interest. But one may describe these parameters by a higher order distribution with its own hyper-parameters. In that case several measurement sequences could be used to gain estimates of the hyper-parameters. Once these are estimated, one can then re-compute the lower level parameters given the estimated hyper-parameters. This enables a flow of information between different sequences of measurements via the hyper-parameters and therefore provides a broad integration of data from different sources. As in Reichenbach’s account, one can continue this approach to higher orders with hyper-hyper-parameters. At some point the question will arise whether one has enough data to estimate the high-order parameters. Reichenbach claims that at some higher order, blind posits replace the probability estimates to avoid an infinite regress of higher-order probabilities. 
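The cross-induction arithmetic of the example above, together with its hierarchical-Bayes reading, can be sketched in Python (our illustration: the three sequences are made up to match the frequencies 0.8, 0.7 and 0.79, the band 0.8 ± 0.05 follows the text, and a method-of-moments fit of Beta hyperparameters stands in for a full hierarchical analysis):

```python
def rel_freq(seq):
    return sum(seq) / len(seq)

# Three hypothetical measurement sequences coded as 1/0 (within or
# outside the band around the fixed value), as in the text's example.
sequences = {
    "a": [1] * 8 + [0] * 2,    # p_hat_a = 0.8
    "b": [1] * 7 + [0] * 3,    # p_hat_b = 0.7
    "c": [1] * 79 + [0] * 21,  # p_hat_c = 0.79
}
p_hats = {k: rel_freq(s) for k, s in sequences.items()}

# Second-order induction: classify each first-order frequency by
# whether it falls within the band 0.8 +/- eps, with eps = 0.05.
eps = 0.05
inside = [1 if abs(p - 0.8) <= eps else 0 for p in p_hats.values()]
q_hat = rel_freq(inside)  # 2/3, as in the text: 0.7 falls outside

# Hierarchical-Bayes reading: fit Beta(a, b) hyperparameters to the
# first-order frequencies by the method of moments, then shrink each
# sequence's estimate toward the pooled mean.
def beta_moments(ps):
    n = len(ps)
    m = sum(ps) / n
    v = sum((p - m) ** 2 for p in ps) / n  # assumed positive
    common = m * (1 - m) / v - 1
    return m * common, (1 - m) * common

a, b = beta_moments(list(p_hats.values()))
for k, s in sequences.items():
    shrunk = (a + sum(s)) / (a + b + len(s))
    print(k, rel_freq(s), round(shrunk, 3))
```

The printed estimates illustrate the flow of information between sequences: each lower-level frequency is pulled toward the mean implied by the fitted hyperparameters.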
Hierarchical Bayesian methods work well theoretically, but they hinge on being able to determine whether and in what sense events are similar such that they can be included in the same sample (that is then used to determine the probabilities). On Reichenbach’s account the corresponding question regards the determination of reference classes. Reference classes are tricky territory for Reichenbach, since he goes as far as to claim that we can determine the probability of a scientific theory. For example, to determine the probability that Newton’s law of gravitation holds universally (rather than just for a particular test-object, as in the example above) Reichenbach claims that all available measurements of the gravitational constant must be placed in one sequence, and that “...we must construct a reference class by filling out the other rows [sequences of measurements] with observations pertaining to other physical laws. For instance, for the second row we can use the law of conservation of energy; for the third, the law of entropy; and so on.” [Reichenbach, 1949c, p. 439f]
It seems obvious that the selection of the reference class is arbitrary here, but Reichenbach argues further that “...the reference class employed corresponds to the way in which a scientific theory is actually judged, since confidence in an individual law of physics is undoubtedly increased by the fact, that other laws, too, have proved reliable. Conversely, negative experiences with some physical laws are regarded as a reason for restricting the validity of other laws, that so far have not been invalidated. For instance, the fact that Maxwell’s equations do not apply to Bohr’s atom is regarded as a reason to question the applicability of Newton’s or Einstein’s law of gravitation to the quantum domain.” [Reichenbach, 1949c, p. 440] Just why the incompatibility of a set of equations, Maxwell’s, with a model of the atom should tend to invalidate another set of equations, Newton’s, that are themselves incompatible with Maxwell’s, we have no idea. We have no idea what the reference class here may be for such a probability transfer, nor what else the underlying reference class in this case might contain. It remains unclear what criteria Reichenbach had in mind to determine a reference class generally. Of course, the general idea is that events should somehow be of the same type, but not so similar that the variability of interest is precluded. Reichenbach claims that one should choose the narrowest reference class for which there are stable statistics (relative frequencies), and that the stability of statistics is determined at the level of advanced knowledge, i.e. at a high level of data integration.5 But this is obviously not an acceptable suggestion — it begs the question: The whole aim is to determine the limits of relative frequencies; requiring stable statistics in the first place is unhelpful. Trivially stable statistics are always available in the narrowest of all non-empty classes, the class containing a single event.
5 Using the example of measurements of the gravitational constant, Reichenbach points out that the sequence of measurements for the planet Mercury converges to a different limit than those of the other planets, which should therefore lead to a division of the reference class of planets into reference classes of planets near the sun and those further away [Reichenbach, 1949c, p. 439].

Obviously, this could not have been Reichenbach’s intention either. Reichenbach’s recourse to advanced knowledge for these determinations of reference classes for lower level frequencies may be understood as pointing to blind posits — that the best one can offer is an educated guess, or just a guess. But then why not just guess at the lower level frequencies? Reichenbach discusses the determination of reference classes at length, but it is far from a precise account. Maybe there is ultimately some intuitive reference class, even when broad cross-inductions are made in science, but doubts remain whether there is any hope of spelling out such inferences in a formal probabilistic framework, and whether the result would then provide the basis for objective probabilities. We summarize Reichenbach’s account of the foundations of probability as follows: Probability is defined as a property of infinite normal sequences of events.
Normal sequences capture many of the features of random sequences. Since we have no knowledge of the limit of infinite sequences of events, we build our inferences on finite initial segments of such sequences. We are sure to be on the right track as long as the higher order probabilities look promising. Higher order probabilities look promising when inductions from a broad variety of different measurement sequences provide similar results. Probabilities, including higher order probabilities, are therefore objective features of the world, since they are relative frequencies in sequences of events. Tidy as this sounds, it seems to founder on the issue of the choice of reference class. Reichenbach’s doctoral thesis would have provided a seemingly simple answer to these difficulties: There, the reference class of events was easily determined by the class of causally independent and causally identical trials, and synthetic a priori knowledge. Having abandoned the a priori shore, Reichenbach found himself at sea.
4 PROBABILITY LOGIC
Once the binary truth values of traditional logic are replaced with continuously valued probabilities, then, according to Reichenbach, all forms of uncertainty present in the inferences of empirical science can be represented in a formal inference framework. By providing an axiomatization, Reichenbach places his probability logic within the formalist tradition of Hilbert and avoids recourse to an a prioristic foundation. He argues that the formalist requirements are achieved by showing that his probability logic requires no more than the axioms of standard propositional logic together with an interpretation of probabilities as a property of infinite sequences. Inductive reasoning is thereby reduced to deductive reasoning plus induction by enumeration of the appropriate sequences. Tautologies of traditional two-valued logic supposedly follow as a special case of this continuously valued logic. The sequences relevant to deductive logic are constant, and therefore their properties and the resulting inferences can be determined with certainty.6 For empirical truths, on the other hand, the properties of the sequences cannot be determined with certainty — the sequences are only given extensionally — and therefore only weaker truth values (between 0 and 1, excluding the boundaries) are assigned. The probability logic provides a calculus for inferences given such weakly supported propositions. The inferences follow those of the standard probability calculus, and so the intended model of reasoning in the empirical sciences is achieved. The justification for the application of the probability logic is given by a convergence argument. We will discuss the details of the logic in three parts — the logical syntax and semantics, the interpretation, and the justification.

6 Nevertheless, it remains unclear what Reichenbach would have thought about mathematical statements whose truth or falsity we do not (yet) know. What probability, if any, would he have assigned to the Goldbach conjecture?
4.1 Logical Syntax and Logical Semantics

Reichenbach’s probability logic starts with the usual syntax of propositional logic: letters representing propositions are connected by the usual connectives “and”, “or”, “negation”, “implication” and “equivalence” (bi-conditional). However, unlike propositional logic, probability logic does not assign truth or falsehood to a proposition, but a degree of probability p. In principle there are innumerable ways to interpret this new form of truth value, since it is introduced simply as part of a formal system. But since the aim is to capture standard probabilistic reasoning in the logic, the probability associated with a proposition should reflect the use of probabilities in science, which Reichenbach takes to be — as we saw above — the limit of relative frequencies. Unlike propositional logic, probability logic cannot be compositional.7 While in propositional logic the truth values of individual propositions determine the truth value of any complex proposition, this is not the case for probability logic. Given the probability of proposition A and the probability of proposition B, the probability of the proposition A ∨ B is underdetermined. This is an obvious consequence of the underdetermination in the mathematical calculus of probability, in which the set of marginals does not determine the joint distribution. Consequently, the specification of “truth”-tables in probability logic depends on the specification of a third quantity fixing the joint probability of the two (or more) propositions, and thereby determining the probability value of composite formulas. Reichenbach uses the conditional probability of B given A, since it can easily be formulated as a subsequence selection procedure. Given the marginals and the conditional, the probability value of any composite formula involving the standard binary operators is defined by the standard rules of the mathematical calculus of probability, e.g.
P(A ∨ B) = P(A) + P(B) − P(A)P(B|A)
P(A ≡ B) = 1 − P(A) − P(B) + 2P(A)P(B|A).

The standard set of axioms for propositional logic is augmented by a set of axioms that Reichenbach had developed as the foundation of the mathematical probability calculus [Reichenbach, 1949c, pp. 53-65]. (We write A ⊃_p B for Reichenbach’s probability implication, read “if A then, with probability p, B”.)

UNIVOCALITY:
(p = q) ⊃ [(A ⊃_p B).¬(A ⊃_q B) ≡ (Ā)]

NORMALIZATION:
(A ⊃ B) ⊃ (∃p)(A ⊃_p B).(p = 1)
¬(Ā).(A ⊃_p B) ⊃ (p ≥ 0)

ADDITION:
(A ⊃_p B).(A ⊃_q C).(A.B ⊃ ¬C) ⊃ (∃r)(A ⊃_r B ∨ C).(r = p + q)

MULTIPLICATION:
(A ⊃_p B).(A.B ⊃_u C) ⊃ (∃w)(A ⊃_w B.C).(w = p · u)

7 In criticism of Reichenbach’s probability logic (see below) Russell, Tarski and Nagel refer to a logic with the feature of compositionality — that the truth (or probability-) value of a complex proposition is a function only of the truth (or probability-) values of its component propositions — as the logic being “extensional”. In contrast, in an “intensional” logic the truth value depends also on the content of the individual propositions. Their terminology is misleading, because it overloads the term “extensional”, also used for sequences that can only be defined by enumeration. Furthermore, as Reichenbach notes in response (and as we discuss below), his probability logic does not fit this dichotomy.
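The non-compositionality that motivates the extra conditional entry in the truth tables can be made concrete in a small sketch (ours): two joint distributions share the same marginals for A and B but assign different values to P(A ∨ B), and fixing P(B|A) as well restores determinacy via the formula above.

```python
# Joint distributions map (a, b) truth-value pairs to probabilities.
def p_or(joint):
    # P(A or B): total mass on outcomes where A or B is true.
    return sum(p for (a, b), p in joint.items() if a or b)

# Same marginals P(A) = P(B) = 0.5, different joints.
perfectly_correlated = {(1, 1): 0.5, (1, 0): 0.0, (0, 1): 0.0, (0, 0): 0.5}
independent = {(1, 1): 0.25, (1, 0): 0.25, (0, 1): 0.25, (0, 0): 0.25}

print(p_or(perfectly_correlated))  # 0.5
print(p_or(independent))           # 0.75

# Supplying the conditional P(B|A) pins the value down:
def p_or_from(pa, pb, pb_given_a):
    return pa + pb - pa * pb_given_a

print(p_or_from(0.5, 0.5, 1.0))  # 0.5  (perfect correlation)
print(p_or_from(0.5, 0.5, 0.5))  # 0.75 (independence)
```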
While the notation is cumbersome, the axioms are intended to express four simple notions: The first axiom ensures that the value of a probability is unique; the second ensures that any probability with a non-empty conditioning set has values between 0 and 1, inclusive. The third axiom is supposed to ensure that the probability of mutually exclusive events is the sum of the event probabilities, and the fourth axiom is the chain rule: P(C, B|A) = P(C|B, A)P(B|A). The first three axioms are similar to Andrey Kolmogorov’s axiomatization of probability; however, the third axiom ensures only finite additivity [Kolmogorov, 1933]. In Kolmogorov’s case, the chain rule follows from the previous three axioms, but Reichenbach requires the additional fourth axiom to switch between logical conjunction and mathematical multiplication. Unfortunately, these axioms are not sufficient to provide an axiomatization of probability, since they do not ensure that the space the probabilities are applied to is closed under complementation and countable union, i.e. that it forms a sigma-field. In fact, as van Fraassen shows, limiting relative frequencies in infinite sequences do not actually satisfy these constraints [van Fraassen, 1979]. The set of axioms of probability is extended by one additional rule — the rule of induction, or the so-called “straight rule” [Reichenbach, 1949c, p.
446]:

“Rule of Induction: If an initial section of n elements of a sequence x_i is given, resulting in the frequency f^n, and if, furthermore, nothing is known about the probability of the second level for the occurrence of a certain limit p, we posit that the frequency f^i (i > n) will approach a limit p within f^n ± δ when the sequence is continued.”

Reichenbach considers this rule8 to be the only necessary addition to the otherwise entirely formal logic to get inductive inferences off the ground: All inductive inferences on complex claims can be reduced, by application of the earlier axioms, to this simple induction by enumeration. The rule is part of the meta-language and cannot, unlike the rules of deductive inference, be reduced to the object language. It therefore does not follow the logical form of the previous axioms. Leaving the details aside, if the axioms did form a logic that built up from sequences of propositions and performed the inferences of the mathematical probability calculus as subsequence selection operations on infinite sequences, then Reichenbach would have constructed a purely syntactic inductive calculus that includes probabilistic and deductive inferences based entirely on the enumeration of sequences. But the situation is not quite so clear: Despite the appearance of a standard syntax on the surface, the proposed set of axioms together with the rule of induction contains a mixture of notations which is never fully spelled out: The probability axioms use mathematical relations and existential quantification over variables representing real numbers. This suggests that arithmetic must form part of the language. But neither the mathematical machinery nor even its first order component is extended to the variables representing sequences. These are open formulas, presumably universally quantified, following a propositional language extended by the probability implication. Reichenbach leaves the task of an explicit account of the syntax covering these two systems to the reader. The formal semantics is thoroughly non-standard: a continuous truth value is determined by the limit of the relative frequency in the sequence associated with each proposition, or — for a complex formula — as a function (using standard probabilistic inference) of the relative frequencies in each of the sequences corresponding to the individual propositions, and the subsequence corresponding to the conditional probability.

8 Reichenbach took this rule to instantiate C.S. Peirce’s self-correcting method. See footnote on same page as citation.
But this only hints at a formal semantics, and it is by no means clear whether the gaps can actually be filled in, especially since the account depends on the accepted syntax: A formula with iterated probability conditionals, such as (A ⊃_p B) ⊃_q C, must either be disallowed by the formal semantics, or it must be interpretable in terms of subsequence selection rules. The former seems unlikely given Reichenbach’s desire to cover all types of probabilistic inferences. In the latter case a formal semantics in terms of subsequence selection rules is ill-defined for iterated probabilistic conditionals, because the antecedent of the second probability implication, i.e. (A ⊃_p B), is not the type of object that lends itself to a sequence interpretation in any obvious way.
4.2 Interpretation

In the first few decades of the 20th century, when Reichenbach developed his probability logic, there were several other proposals for generalizing two-valued logic to multi-valued and continuously valued logics in order to formalize modal reasoning. Reichenbach considered these attempts to be largely misguided, since they ended up as formal constructs with little or no relation to the use of modality in natural language [Reichenbach, 1934].
In contrast, with the inclusion of standard probabilistic reasoning, Reichenbach saw the crucial advantage of his probability logic in being able to model — in the sense of a rational reconstruction — scientific reasoning. That is, Reichenbach considered the formal semantics of his logic to be well “coordinated” with real scientific inference, because he had provided a procedure to go from experimental evidence to complex propositions, and back. As far as he was concerned, all the “coordination” work had already been done in developing the foundation of probability. Scientific evidence comes in the form of a sequence of events, the data. Scientific probabilities correspond to the relative frequencies of events in such a sequence of events. The evidence is described by propositions; each proposition simply describes an event, one datum. As a result, we obtain a sequence of propositions whose structure is isomorphic to the sequence of events, and consequently the probabilities can be used interchangeably for the sequence of events and the sequence of propositions. The scientific situation is matched in the logic, and therefore the interpretation and application of the logic to science is obvious — essentially inbuilt. Almost. First, probability logic is a calculus of infinite sequences, but in science data is always finite. Second, in natural language we often assign probabilities to singular propositions for which there is no obvious corresponding sequence. It appears at least possible that there are similar situations in science. We start with the second: probabilities of singular propositions. Reichenbach claims that probabilities associated with single propositions are posits, or wagers (but not in any strict sense of Bruno de Finetti). These posits can be either blind or appraised. Posits are blind when no data is available to inform the probability.
Reichenbach does not give any explicit constraints on the form of a blind posit, but implicitly it is quite obvious that such posits should resemble a flat prior assigning equal probabilities to every possibility. A posit becomes appraised as soon as evidence becomes available, and should then correspond to the relative frequency of the relevant event in the data. For example (Reichenbach’s example [Reichenbach, 1949c, p. 366f]), consider the proposition that Caesar was in Britain at a particular time. This proposition can be associated with a probability p, which would be a guess (blind posit) if no relevant evidence is known. But one relevant sequence of data, Reichenbach suggests, is the sequence of reports by historians about Caesar’s activities. The probability of Caesar’s visit to Britain is then the relative frequency with which Caesar’s presence in Britain at the time in question is reported in these historical records, and the initial blind posit p then becomes appraised.9

9 We do not know how Reichenbach would distinguish the complete lack of mention of Caesar’s whereabouts at a particular time from the explicit mention of his absence or presence. It is possible that the selection of the appropriate reference class of events is supposed to address this problem, but there is no explicit procedure.

The suggestion is that probabilities of single propositions are ultimately fictive or elliptic, referring to implicit sequences of relevant events. Such a rendition seems unsatisfying. Perhaps Reichenbach’s own example is not an ideal illustration. The transfer of probabilities from sequences to single
propositions seems more plausible in properly scientific cases. For example, the claim that a particular individual atom will decay with some probability after time interval t may more reasonably be understood as saying that the probability refers to the relative frequency of decay after time t in a set of atoms. The reference to a larger population (even if perhaps not a sequence) seems more convincing here. Even if we are able to make sense of single event probabilities as elliptic references to some larger reference class, such a reference class will always be finite in the empirical sciences. This leads us back to the first concern. Reichenbach argues that his rule of induction solves the problem: Given a finite sequence, we have a determinate relative frequency for the event of interest. We posit this frequency as the limit of the relative frequency. This is a blind posit, since there is no reason to believe that the empirical distribution is indicative of the limit. But this posit can become appraised if we have several sequences of similar type. Consequently, the initial blind posit of the limit in an individual sequence is revised to the appraised posit. Reichenbach’s basic idea is that in the sciences, as in his logic, one pretends as if: The empirical distribution in the finite initial segment of a sequence should be treated as if it were infinite and therefore indicative of the properties of the infinite sequence. The properties of the infinite sequence should be posited (blindly) based on the initial segment. By aggregating the data in different ways into different sequences, these blind posits are supposed to become appraised, as if many blind eyes make vision.
Reichenbach argues that under the assumption of a flat prior, his rule of induction corresponds to a Bayesian update (at least for the first update), and that appraised posits simply correspond to informative priors.10 His hierarchy of higher order probabilities therefore reflects exactly the structure used in hierarchical Bayesian methods (though these were still unknown in his time). How exactly a posit becomes appraised, and why its appraised value is unique, remains unclear. If sequences of measurements can be arranged in different ways that change which events are regarded as first order events11, then the simple rule of induction reflecting the empirical frequencies will conflict, for higher order claims, with the results of a Bayesian update (even if the prior at the lowest level is flat). It appears that the rule of induction should only be applied at the most fundamental data level, thereby (presumably) also preserving the objectivity of probability statements. Higher order claims about convergence, or the integration of information from different domains (so-called “advanced knowledge”), are, however, supposed to be handled by a Bayesian update. Reichenbach does not discuss this mixture of updating techniques or its implications anywhere.
10 See [Reichenbach, 1949c, pp. 326-333 and p. 441].
11 Reichenbach’s example with the measurements of the gravitational constant, discussed in the section on probability above, appeared to involve exactly such different forms of representation of the data.
Hans Reichenbach’s Probability Logic
4.3 Justification
Assuming the method of application of the probability logic is now clear, the remaining task is to explain why it is the right method to use. Reichenbach’s justification of the inductive rule, a convergence argument, is presumably his most disputed legacy. It goes like this: If we indeed use the empirical distribution to determine our probabilities as if it were the limiting distribution, then, as long as we adjust the empirical distribution whenever more data comes in, our probability judgments will converge to the true probabilities, if the empirical distribution has a limit.12 The last condition is crucial. Of course, convergence can only occur if there is something to converge to, but not every infinite sequence has a convergence limit for the distribution of relative frequencies of its items. If there is a limit, then there is for every ε > 0 some N such that the empirical distribution is at most ε different from the limiting distribution. The catch is at what point the quantification over all distributions enters into the convergence statement. Since Reichenbach only assumes the existence of a limit (and even that only conditionally), he is only guaranteed pointwise convergence, i.e. that for every ε, and every limiting frequency, there is an N that ensures that the empirical frequency is within ε
of the limit. This is to be distinguished from uniform convergence, where for each ε there is an N such that the divergence is bounded by ε for all distributions. For uniform convergence, one can specify confidence intervals and convergence rates; for pointwise convergence one cannot. Consequently, Reichenbach’s convergence argument — he calls it the principle of finite attainability [Reichenbach, 1949c, p. 447] — is extremely weak: the existence of a limit alone provides little assurance that the empirical distribution is representative, no matter how large the sample size: Although for every positive ε, for some finite sequence the straight rule will be within ε of the limit ever after, the length of that sequence is unknown: at no point does one know whether one is in the vicinity of the true distribution. Reichenbach bolsters his justification with reference to his network of inductions: He argues that the network of higher order inductions ensures that convergence to the true distribution is faster than it would be just based on inductions on individual sequences. Reichenbach does not give a formal definition of the speed of convergence, but intuitively he thinks that convergence is faster because all the cross-inductions inform any inductive inference in a particular domain (via higher order inductions) by integrating findings from other domains. Given his use of higher order probability statements for the convergence statements, he even seems to suggest that uniform convergence may be obtained. Reichenbach gives no proof of such a result; it is not true without further assumptions; and if uniform convergence is not the appropriate characterization of faster convergence, then it remains unclear what the benefit of faster pointwise convergence is. These concerns only apply if there is a limit in the first place.
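In modern notation (ours, not Reichenbach’s), with f_n the empirical relative frequency after n observations and p the limiting frequency, the two notions read:

```latex
% Pointwise convergence: N may depend on the limiting frequency p.
\forall p \;\forall \varepsilon > 0 \;\exists N \;\forall n \geq N :\quad |f_n - p| < \varepsilon

% Uniform convergence: a single N works for every limiting frequency p.
\forall \varepsilon > 0 \;\exists N \;\forall p \;\forall n \geq N :\quad |f_n - p| < \varepsilon
```

Only the second, stronger form licenses confidence intervals and convergence rates, and nothing in Reichenbach’s assumptions delivers it.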
Frederick Eberhardt and Clark Glymour

12 Incidentally, this might be the reason why Reichenbach did not consider it important to distinguish when the rule of induction and when a Bayesian update is appropriate: Since they both converge to the same limit, the concern is irrelevant. Of course, one may worry what happens before convergence.

Reichenbach
argues that if there is no limit, then all bets are off anyway, i.e. then there is no alternative procedure that could generate inductive knowledge. Reichenbach compares his procedure to that of a clairvoyant, who claims to know the limiting distribution of the sequence [Reichenbach, 1949c, p. 476]. One cannot check — other than by induction — whether the person is in fact clairvoyant. The advantage of the proposed procedure is, according to Reichenbach, that it at least guarantees pointwise convergence, whereas the person claiming to be a clairvoyant could simply be wrong. None of his arguments are compelling; at the crucial junctures he is hand-waving. An alternative way of looking at the question of justification is not to ask whether the proposed procedure succeeds in what it claims to do, but rather to ask whether it is unique in what it claims to be doing. The answer is No, and Reichenbach discusses this.13 There are many other procedures that exhibit pointwise convergence if there is a limit: procedures that work like the straight rule but have an arbitrary function added that also converges pointwise, procedures that make arbitrary guesses up to some point in a sequence and use the straight rule thereafter, or procedures that add to the straight rule a function that converges to 0 as the sequence length increases. Reichenbach’s response is somewhat confusing: On the one hand he acknowledges that such manipulated distributions might even lead to faster convergence under some circumstances, and that therefore these procedures are similarly legitimate; on the other hand he argues that his straight rule is, in some not further specified sense, unbiased and functionally simplest. Neither argument survives more careful analysis. Thus we are left with an extremely weak justification. Reichenbach thinks this is the best we can hope for without fooling ourselves with regard to the width of the inductive gap.
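The non-uniqueness is easy to exhibit: any rule of the form "straight rule plus a correction that vanishes with the sequence length" has the same limit whenever the straight rule does. A sketch, where the correction constant is an arbitrary choice for illustration:

```python
def straight_rule(k, n):
    """Empirical relative frequency of k occurrences in n trials."""
    return k / n

def doctored_rule(k, n, c=5.0):
    """Straight rule plus an arbitrary correction c/n that converges
    to 0 as n grows; it shares the straight rule's limit, yet gives
    very different posits on short sequences."""
    return k / n + c / n

# Both rules converge to the same value as the sequence grows:
for n in (10, 1000, 100000):
    k = int(0.6 * n)
    print(n, straight_rule(k, n), doctored_rule(k, n))
```

Since pointwise convergence alone cannot distinguish between these rules, some further criterion (which Reichenbach never precisely supplies) is needed to single out the straight rule.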
13 See [Reichenbach, 1949c, p. 447 and section 88].

At various points, Reichenbach’s justifications have a much less formal and objective character and appear more pragmatic in nature (e.g. [Reichenbach, 1949c, p. 481]). Reichenbach regarded probability as providing a guide for action, and so to a certain extent — since he denied that there can be any certain empirical knowledge — he regarded his theory as a reasonably workable procedure for providing a good guess as to which scientific theories might turn out to be useful. In particular, he claimed that knowledge of the existence of the actual limit of a data sequence would not make any difference to the practice of science. However, if this is his view, it is not clear why he did not simply focus on probabilities in finite sequences. At one point, in a reply to Russell (see below), he does just that, suggesting that all real probabilities are finite frequencies, and that the discussion of limiting convergence is a fiction to justify our inductive procedures. Last, let us return to the initial intention of representing inductive reasoning in science. Do we find evidence in scientific practice of the type of justification Reichenbach gives? In many cases pointwise convergence is not considered adequate. Instead, many scientists work with much stronger assumptions — such as Gaussianity or some other parametric assumptions about the distributions
of events — which then allow them to derive rates of convergence and confidence intervals. Reichenbach may reply that such scientists are simply in denial about the limits of their knowledge, and that his inductive logic tells them why. Some of the more recent developments in statistics and computer science using non-parametric approaches work with weaker assumptions. The assumptions are stronger than those underlying Reichenbach’s probability logic, but they are still so weak that they only support asymptotic normality, i.e. confidence intervals can only be given for the limiting distribution. If convergence rates are not known, such methods are subject to similar concerns about what inferences one can draw given finite empirical data. In some sense, the large simulations that are nowadays used to get an idea of the convergence rates for these procedures can be seen as providing the basis for “appraised posits” regarding the convergence rate. Maybe Reichenbach would feel vindicated.

5 CRITICS: POPPER, NAGEL AND RUSSELL

Reichenbach’s proposal(s) for an inductive logic were widely read by the “scientific philosophers” of his time, and criticism came from all sides, most prominently from Karl Popper, Ernest Nagel and Bertrand Russell. Perhaps the most detailed and concise summary of the points of criticism was given shortly after the publication of the German edition of The Theory of Probability [Reichenbach, 1935c] in a review by Nagel in Mind [Nagel, 1936] (but see also [Nagel, 1938]). Some of the issues Nagel points to were not new, but had already been raised by Popper, Tarski, Hertz and others. Some of the criticisms can be found again in Russell’s Human Knowledge, its Scope and Limits [Russell, 1948]. Apportioning particular aspects of the criticism to particular authors would therefore be misleading.
Instead we focus primarily on Popper’s criticism with regard to Reichenbach’s assessment of probabilities for scientific theories, Nagel’s criticism of the straight rule, and Russell’s criticism of the logical foundation of the probability logic, fully acknowledging that in each regard other authors (also beyond these three) contributed to the relevant points. In The Logic of Scientific Discovery [Popper, 1934] Karl Popper strongly rejects Reichenbach’s probability logic, even before its most comprehensive exposition in The Theory of Probability was published in German in 1935. Popper had read many of Reichenbach’s papers outlining his approach in Erkenntnis. He reiterates his views in a review of Reichenbach’s Theory of Probability in Erkenntnis in 1935 [Popper, 1935]. He shares Reichenbach’s view that we cannot determine scientific theories to be true, but he further thinks that one cannot even assign a positive probability to them. He sees no way that Reichenbach can get around either an a prioristic foundation of probability or an infinite regress of higher order probabilities, the first of which Reichenbach himself would deny, and the second of which Popper regards as inadequate to determine numerical values. Reichenbach had pointed to two alternatives for determining numerical probabilities of scientific theories. The first is to determine a sequence of singular statements that are
experimentally testable consequences of the theory. The relative frequency of confirmed consequences in that sequence then determines the probability of the theory. The second alternative is to place the theory itself in a sequence of theories from the same reference class, and consider the relative frequency of true theories in that reference class to be indicative of the probability of the theory in question. Popper dismisses the second alternative as ludicrous. First, there is no unique reference class of theories (even if there were a sufficient number of theories) that would determine the relative frequency, and second, even if there were, the fact that theories in that sequence can be determined to be true or false (in order to determine the relative frequency) makes the determination of a probability of a theory redundant in the first place. Popper, of course, doubts that individual scientific theories can be determined to be true at all. To say that Newton’s law of gravity is true is just not the same as saying that the coin came up heads. If the theories themselves are only determined probabilistically, then that leads to an infinite regress of higher order probabilities. Since each of the probabilities in the infinite regress is smaller than 1, the probability of the statement at the first level must necessarily be zero. This point was pressed upon Reichenbach from many sides, including by Russell, who saw it as an indication that two-valued logic is more fundamental than probability logic. C.I. Lewis debated the same point with Reichenbach in regard to the foundations of epistemology in an exchange of papers in the 1950s [Lewis, 1946; Lewis, 1952].
Reichenbach’s response on this matter in a letter to Russell is opaque, but the idea seems to be that the higher order theories are not independent in probability, and so their joint probability is not their product, and hence does not go to 0: “Combining the probabilities of different levels into one probability is permissible only if special conditions are satisfied; but even then this combination cannot be done by mere multiplication. Let a be the statement: ‘the probability of the event is 3/4’, and let b be the statement ‘the probability of a is 1/2’. In order now to find out what is the probability of the event, you have to know what is the probability of the event if a is false. This probability might be greater than 3/4. These values do not go to 0 (see Wahrscheinlichkeitslehre [German edition of The Theory of Probability], pp. 316-317). The product which you calculate is the probability, not of the event, but of the total conjunction of the infinite number of propositions on all levels, which of course = 0.” [Reichenbach, 1978] Popper (and Nagel) regard the first alternative as similarly hopeless, because they doubt that a scientific theory can be represented as an infinite conjunction of singular statements that can be tested individually. Furthermore — an argument made by Popper, Nagel and several others — Reichenbach’s view would imply that a theory that was disconfirmed by 10% of its predictions would be considered true with probability 0.9. More likely though, the critics suggest, it would be
considered false. Reichenbach’s response (to Nagel’s version of this criticism) in [Reichenbach, 1939a] and [Reichenbach, 1949c, p. 436] does not address the points. He evades the criticism by claiming that his account provides a consistent interpretation of what scientists might mean by assigning a probability to hypotheses, and in the English edition of The Theory of Probability he claims: “...if the limits of exactness are narrowly drawn, there will always be exceptions to scientific all-statements; [...] it is true that for wide limits of exactness, [...] a case of one exception is regarded as incompatible with the all-statement [...] This attitude can be explained in two ways. First, the degrees of probability for such all-statements are usually so high that one exception, in fact, must be regarded as a noticeable diminution of the degree of probability. Second, one exception proves that an all-statement is false, and we dislike using an all-statement as a schematization if it is known that the all-statement is false.” [Reichenbach, 1949c, p. 436] It is a puzzling claim to make for someone who takes all empirical claims to be probabilistic. In the case of a highly confirmed theory a purported counterexample would be a black sheep among many positive instances. Given that in Nagel’s example the probability of the theory is determined by the relative frequency of positive instances (rather than by a Bayesian update), the impact of the counterexample on the probability of the theory should be relatively minor. Hence, we do not claim to understand how this statement improves Reichenbach’s situation in light of the criticism. For Popper, any third procedure to determine the probability of scientific hypotheses — such as using posits — is subjective.
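Reichenbach’s reply to Russell quoted above can be reconstructed (on our reading) with the law of total probability. With a the statement “the probability of the event E is 3/4” and P(a) = 1/2:

```latex
P(E) = P(E \mid a)\,P(a) + P(E \mid \neg a)\,P(\neg a)
     = \tfrac{3}{4} \cdot \tfrac{1}{2} + P(E \mid \neg a) \cdot \tfrac{1}{2}
```

So P(E) lies between 3/8 and 7/8 depending on P(E | ¬a); it is not the product 3/4 · 1/2, and adding further levels of the hierarchy need not drive it to 0. Only the conjunction of all the level statements has a product probability, and that, as Reichenbach says, does go to 0.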
Nagel [1936] commends Reichenbach’s efforts to give a precise presentation of the frequency interpretation of probability, but he does not share Reichenbach’s conviction that such an interpretation, together with the proposed formal machinery in the form of a probability logic, is sufficient to formally represent and justify inductive inferences, nor does he believe that it is an adequate description of inductive inferences in the sciences. The whole problem, as Nagel sees it, is that probability statements are not verifiable, because they are based on sequences whose limit we do not know. We do not even know if the sequence has a limit at all. Nagel is unconvinced by Reichenbach’s proposal of “inductive verifiability”. He points out the lack of mathematical proof that a sequence of higher-order probabilities or a network of cross-inductions leads to faster convergence. And even if they did, Nagel does not see any value in a convergence guarantee if the point of convergence is not known. In a response to a similar point made by Hertz [1936], Reichenbach only reiterates his view that his inductive procedure is the best one can hope for and that no other procedure will do better [Reichenbach, 1936].
Much later a detailed analysis of the possibility of proofs for the claims underlying Reichenbach’s justification of induction and his use of higher order inductions is given in [Creary, 1969]. Creary concludes his assessment with the remarks (ch. 5, p. 129): “...we have argued that: 1. MC [method of correction, using cross-inductions] provides no rationale for the choice of a lattice [e.g. as in the earlier example, generated from the sequences of measurements of the gravitational constant from different planets] into which to incorporate a given sequence. 2. Mere superiority of MC to RRI’ [Reichenbach’s Rule of Induction, i.e. the straight rule; see above14 ] is not sufficient to justify the choice of MC over simpler lattice-convergent alternatives other than RRI’. 3. The superiority (indeed, even the parity) of MC vis-a-vis RRI’ does not follow from theorems (1)-(4) (Reichenbach’s theorems of convergence underlying his justification, [Reichenbach, 1949c, pp. 466-467]; or [Creary, 1969, p. 119]). 4. The theorems (1)-(4) themselves depend upon assumptions which would prevent any results established with their help from having the sort of justificatory import intended by Reichenbach.” The assessment is negative in every respect and vindicates Nagel’s concern. More generally, Nagel does not think that Reichenbach’s probability logic provides an adequate description of scientific practice. According to Nagel, scientific statements are considered probable not on the basis of some formal account of probability, but instead because there are no alternative plausible candidate theories, the theory has a certain aesthetic appeal, or because it is (largely) consistent with the available evidence. Nagel emphasizes throughout his review that he believes that there are several more or less formal notions of probability, and that several other human aspects enter into the judgment of the probability of truth of a scientific theory. 
14 Creary includes some slight modification on p. 115 of [Creary, 1969].

Nagel also picks up on a point of criticism pressed upon Reichenbach by Tarski [Tarski, 1935]. The concern is whether the probabilities (which are supposed to replace truth values) in the probability logic constitute a syntactic relationship between statements only, or a semantic relation between statements and what is described by the statements. Nagel argues that Reichenbach’s probability implication appears to involve both a “semantic and syntactic characterization” of relations between statements. It is not entirely clear what Nagel means, but one source of contention appears to be that the assignment of probability values to composite statements in Reichenbach’s probability logic involves not only the
probabilities of the individual statements, but a third (conditional) probability, specifying the amount of “coupling” between the statements. This third probability, since it is based on a subsequence selection procedure, appears to introduce an undesired further syntactic constraint into the semantic relation of the statements and what the statements refer to. Nagel (and Tarski) refer to this problem as the probability logic not being “extensional”. Reichenbach accepts that his probability logic does not conform to such strict constraints of “extensionality”,15 but argues that the probability logic is also far from being the opposite, namely “intensional”, i.e. such that the probability values of statements depend on the content of what they describe. Reichenbach considers his introduction of the conditional probability to be a natural extension of the traditional dichotomy that is necessary when considering continuous valued logics [Reichenbach, 1939b]. It is doubtful whether this would have satisfied Nagel or Tarski, but it is also somewhat unclear in this discussion what the standards are, and how they can be applied to probabilistic logics. Russell [1948] shares the concern about the status of Reichenbach’s probability logic, but for a different reason. He uses a very intuitive mathematical example, the probability that an integer chosen at random will be a prime [Russell, 1948, pp. 366-368], to illustrate his point. He shows that depending on the order in which the integers are arranged, Reichenbach’s definition of the probability can be made to return any value between 0 and 1. This leads to the uncomfortable conclusion that Reichenbach’s definition of probability depends on the order within sequences. In fact, Reichenbach, in his response to Russell [Reichenbach, 1978], agrees with this conclusion, and emphasizes that this aspect is one of the important contributions of his theory over theories of probability based on classes, such as John Maynard Keynes’s.
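Russell’s observation is easy to reproduce numerically. In the natural ordering the relative frequency of primes in an initial segment tends to 0, but a rearrangement of the integers (here our own illustrative interleaving of primes with non-primes) gives the straight rule the limit 1/2:

```python
def is_prime(n):
    """Trial-division primality test, adequate for small n."""
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))

def running_frequency(seq):
    """Relative frequency of primes in a finite initial segment."""
    return sum(map(is_prime, seq)) / len(seq)

N = 10000
natural = list(range(1, N + 1))                       # natural ordering
primes = [n for n in natural if is_prime(n)]
non_primes = [n for n in natural if not is_prime(n)]
# Rearranged ordering: alternate one prime with one non-prime.
alternating = [x for pair in zip(primes, non_primes) for x in pair]

print(running_frequency(natural))      # small, and tends to 0 as N grows
print(running_frequency(alternating))  # exactly 1/2 by construction
```

Other interleavings (two primes per non-prime, and so on) yield any rational frequency one likes, which is the point of Russell’s example.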
15 We used the term “compositionality” above.

Russell, in contrast, takes this constraint to indicate that Reichenbach’s foundation of probability cannot be stated in abstract logical terms, because the order of events seems to introduce a contingent, non-logical aspect. Russell has little hope that the probability logic could be shown to be fundamental. He thinks that “...there is a great difficulty in combining a statistical view of probability with the view, which Reichenbach also holds, that all propositions are only probable in varying degrees that fall short of certainty. [...] Statistical probability can only be estimated on a basis of certainty, actual or postulated.” [Russell, 1948, p. 368f]; and he points to the fact that even in Reichenbach’s rendition, two-valued notions of truth still play the most fundamental role, which suggests that Reichenbach did not even achieve the goal he set himself. Russell’s view of Reichenbach’s theory can be summarized as follows: If the probability values of statements depend on the limit of an extensionally given
sequence, they are forever unattainable. If they depend on a hierarchy of higher-order probabilities, those higher-order probabilities cannot all be determined. If they are given by posits, then those posits are sensitive to the order of events. Russell did not share Reichenbach’s indifference with regard to the truth value of posits, and it is doubtful whether he accepted Reichenbach’s response that “blind posits are justified as a means to an end, and that no kind of belief in their truth is required” [Reichenbach, 1978, p. 407]. For Russell, infinite regresses and hypothetical posits or infinite (extensionally given) sequences cannot form the foundation of a logical calculus. Instead, Russell suggests that Reichenbach should build his probability logic on finite sequences. Probabilities, as they are used in the sciences, only refer to reasonably large, but not infinite, populations anyway. The finite aspect would then also ensure that probability values can be precisely estimated, and probabilistic statements can be determined to be true or false, thereby returning to a two-valued logic as a foundation. To a certain extent, Russell’s criticism repeats Nagel’s, who regarded Reichenbach’s rule of induction not as progress across the inductive gap, but just as a restatement of the original problem. Nagel had argued that since Reichenbach’s definition of probability is given in terms of two-valued logic, the entire probability logic should be re-statable in two-valued logic (since Reichenbach places no constraints on the complexity of statements). If such a translation is possible, then what use is the probability logic? If such a translation is not possible, then in what sense does the purported solution to the inductive problem in probability logic provide a solution to the traditional problem of induction in two-valued logic?
Reichenbach responds by distinguishing the formal probability logic, which he says can indeed be reduced to two-valued logic, and the applied probability logic, to which two-valued logic can at best be an approximation [Reichenbach, 1935d]. He compares the situation to a representation of non-Euclidean space in Euclidean space, but at least for us, the analogy does not make things clearer.

6 REICHENBACH ON THE ATTEMPTS OF OTHERS AND ON STANDARD PROBLEMS
Unsurprisingly, Popper’s falsificationist account of theory testing was most unpopular with Reichenbach. Reichenbach considered Popper to be in denial about scientific practice and the implications of Popper’s own theory [Reichenbach, 1935b]. Reichenbach thought one has a choice: Either one can take into account the actual non-definitive nature of scientific results, in which case a probabilistic account is necessary, or one can schematize the procedures and instead of probabilities, just use 0 and 1 as a discrete representation. If one does the latter, then one has to accept that scientific theories can both be verified and falsified. If one does the former, then it is impossible to verify or falsify with certainty. In either case, the asymmetry Popper tries to place between verification and falsification does not exist.
Furthermore, Reichenbach considered Popper’s measure of corroboration and severity of testing to be a probability in disguise, except that Popper’s account lacks the precision to actually make it a metric. Reichenbach was not gentle about the matter: He compares Popper to a fruit vendor who stacks all the good fruit at the front of the display, but only fills the bags from the back, and then denies that his method has anything to do with the differential quality of the fruit. Popper uses a method to differentiate unfalsified theories, but presents no justification of why what he is doing is reliable. Instead, he derides any attempts at justification of the methods as “metaphysical beliefs” (which neither Popper nor Reichenbach want their work to be accused of). Against the “deductive” approaches to probability logic, such as those of Rudolf Carnap and Carl Hempel ([Reichenbach, 1949c, pp. 456-461], see also [Reichenbach, 1935a]), Reichenbach argues that they do not provide an adequate space on which to define a probability metric. In retrospect this criticism is ironic, since Reichenbach himself did not specify a space that satisfies the conditions of a sigma field. Carnap introduces probabilities on an a priori foundation, which leads Reichenbach to claim that it is therefore difficult to distinguish the resulting theory from “a prioristic methods like the principle of indifference” [Reichenbach, 1949c, p. 456]. It is difficult to know what Reichenbach might have meant here. His usual criticism of the principle of indifference concerns its subjective component, whereas his criticism of a priori methods usually concerns the justification for such knowledge. Whatever the specific concern, the basis for the determination of the probability metric is at issue.
Against Hempel (and Helmer and Oppenheim), Reichenbach argues that they acknowledge that there are no a prioristic grounds to select a probability metric, but that their efforts to use some initial data and a maximum likelihood assumption to determine the probability space depend on the assumption that the data are independent [Reichenbach, 1949c, p. 456]. In contrast, his network of inductions, which integrates information from many sequences of measurements in “advanced knowledge”, does not make such an assumption, but tests it. He appears to regard the fact that assumptions are not set in stone as an important advantage of his system (see e.g. [Reichenbach, 1949c, p. 464]). Approaches based entirely on maximum likelihood Reichenbach sees as answering the wrong question, and therefore as inappropriate tools for science. In a criticism of Fisherian statistics [Reichenbach, 1949c, p. 454f] Reichenbach points out that the likelihood of a hypothesis is not of interest, but rather the inverse probability, i.e. the probability of the hypothesis given the data. Maximization of the likelihood is therefore misleading, since a consideration of prior probabilities may still reverse the ordering of the hypotheses based on the likelihood. Once the inverse probability is known, likelihoods can be disregarded. Reichenbach believed that his account of induction by enumeration would provide the prior probabilities that are required. More generally, Reichenbach thought that once the probabilistic aspect of inductive inference had been taken into account in the form of his probability logic, all
the so-called inductive paradoxes of his time also disappeared. He does not go into much detail, but it is obvious that he did not see reason for much debate. For example, he considered Hempel’s ravens paradox to be based simply on a misapplication of converse reasoning to probabilistic inference [Reichenbach, 1949c, p. 434f]: “If something is a raven, then it is black.” can be highly probable, but that does not imply that “If something is non-black then it is a non-raven.” is highly probable as well. It depends, as can easily be seen from a simple application of Bayes’ rule, on the baseline probability of ravens and of black things. There is, according to Reichenbach, no paradox to be had in the first place: he simply takes a Bayesian view of probabilistic confirmation relations, and the problem is solved. Reichenbach’s discussion of Goodman’s new riddle of induction is similarly brief [Reichenbach, 1949c, p. 448f]. He argues that the “grue”-predicate does not form a good basis for induction because it violates the principle that one should use the narrowest reference class. Reichenbach admits that the rule of induction does not ensure against false posits in the short run. That is, if grue is defined to mean green until time t and then blue, and we have a sequence of elements that are in fact all green, then “All emeralds are grue.” would be confirmed (and would therefore be a source of erroneous posits) up to time t, but then disconfirmed. As such, Reichenbach sees no problem with such an inference, since in the long run one converges to the truth.
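Reichenbach’s Bayesian reading of the ravens example above can be made concrete with invented base rates: holding P(black | raven) fixed, the probability of the contrapositive swings with the proportion of ravens in the population.

```python
def p_nonraven_given_nonblack(p_raven, p_black_given_raven, p_black_given_nonraven):
    """P(non-raven | non-black) computed via Bayes' rule from the base rates."""
    p_nonblack_raven = (1 - p_black_given_raven) * p_raven
    p_nonblack_nonraven = (1 - p_black_given_nonraven) * (1 - p_raven)
    return p_nonblack_nonraven / (p_nonblack_raven + p_nonblack_nonraven)

# All numbers below are hypothetical, chosen only to show the swing.
# Ravens rare: the contrapositive is almost certain.
print(p_nonraven_given_nonblack(0.01, 0.95, 0.1))   # ~0.999
# Ravens ubiquitous: the contrapositive becomes improbable.
print(p_nonraven_given_nonblack(0.99, 0.95, 0.1))   # ~0.154
```

So a high probability for “ravens are black” is compatible with either a high or a low probability for “non-black things are non-ravens”, which is Reichenbach’s point.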
But he argues that “...in advanced knowledge the inference can be shown to be inferior because it violates the rule, ‘Use the narrowest common reference class available.’ The property C [grue], by its definition, is identical with ‘not B’ [not green] from the n+1st element [time t]; since the reference class ‘not B’ [not green] is narrower than C [grue] it should be used as a basis for the inference (in other words, the property with respect to which the first n elements should be counted is ‘not B’ [not green]).” This argument makes no sense: Up to sequence item n, grue and green have exactly the same statistics. How is one supposed to know in advance which one to choose? It appears that, contrary to claims elsewhere, the “narrowness” of reference classes has nothing to do with stable statistics, but instead with other, independent, unspecified criteria for determining events. As we argued earlier, Reichenbach’s account of reference classes was never precise, but this response appears to misunderstand Goodman’s concern altogether. Goodman’s riddle raises the entire question of what should constitute a reference class (using Reichenbach’s terminology), i.e. which elements should be included in a sequence that is used as a basis for a probability judgment. As far as we can tell, Reichenbach’s response simply misses the point.

7 COMMENTARY

Reichenbach’s inductive logic is a strange mix of mathematical precision and dodging the details. The aim is quite clear: Reichenbach intends to provide a probability logic that (a) is objective — hence the frequency interpretation; (b) is not a prioristic — hence the empirical focus; (c) provides a rational justification of inductive reasoning in science — hence the straight rule; and (d) is sensitive to the uncertainties present in science — hence the many levels of probabilities. The problem is that beyond the formalization of probability in terms of limits of relative frequencies, there is no real need for all the logical machinery of his account. If anything, it makes the account more cumbersome and confusing. The main lacuna of the inductive logic, as pointed out by several others, is the justification in terms of the straight rule. Reichenbach does not provide any mathematical proofs for his justification of faster convergence in terms of higher order probabilities, and it is doubtful whether they can be supplied without adding further substantial assumptions at some level. But even with regard to representing the uncertainty present in scientific inferences, it is not clear whether Reichenbach’s probability logic really captures what is going on. The problem is that in science there are many different forms of uncertainty, which Reichenbach represents in terms of just one type of probability: First, there is uncertainty because data are noisy, because measurements are subject to many residual influences, even in a well controlled experiment. This is the kind of uncertainty the theory of error deals with. Second, even if the data were not noisy, a finite number of measurements always underdetermines the law-like relationship which exists between the physical quantities. Third, there may be true uncertainty in the physical quantity. Interpreting Heisenberg’s uncertainty principle as implying metaphysical uncertainty would yield uncertainty of this third type. Reichenbach at different points considers all three types of uncertainty.
While the first two sources of uncertainty can be regarded as epistemic, the last one is metaphysical. That is, if we had direct access to the truths of the real world, only the third type of uncertainty would remain. We do not have direct access to the truths of the world, so the scientific task is to reduce, as far as possible, the uncertainties of the first two types. Traditional accounts of inductive logic only attempted to account for the uncertainty of the second type, i.e. the uncertainty of which hypothesis is true given the data. Reichenbach considered these views to be wishful thinking, since they assumed that our scientific data was certain. Due to errors, scientific measurements are never certain, and so Reichenbach held the more general view that no empirical knowledge can be certain. The problem is that his probability logic does not separate the contributions of uncertainty from the different sources. Consider the following example. Suppose, for the moment, that we have a finite set of noise-free data for quantities (x, y) that are known not to be subject to any metaphysical uncertainty, i.e. we only have uncertainty of type two. Suppose further that these data points all happen to lie on a straight line described by y = ax + b for some real numbers a and b. Given that the data are noise-free, whatever the true functional form of the law, it must pass through each data point. But obviously that does not uniquely define the function, since anything
could happen in those parts of the space for which no data points are available. So the evidence does not imply the true law, and the question for Reichenbach is which functional form, if any, is better confirmed, more likely to be true, or best justified — and in what sense? Reichenbach does not resort to the naive answer that the data imply, or that it is uniquely rational to believe, that the simplest function (on some measure of simplicity) must be the true or best-confirmed function. Instead he argues that we know for each function f that passes through the data points an (objective) probability pf that f is the true function. This probability pf is derived from several other, say, k data sets. If in k · pf of those data sets the same function f describes the data, then pf is considered to be the probability of function f being true.16 So far this is not some naive Bayesian (or other) bias based on the original data set that prefers simpler theories. But Reichenbach cannot (and does not seem to think he can) avoid the simplicity bias entirely, since of course there are still several functions that are confirmed by all the data sets, but which differ from one another in parts of the space where none of the k data sets contains a data point. Here Reichenbach then does resort to a Bayesian account that prefers simpler theories — and no more detailed argument is given. Even if this were an acceptable account of this second type of uncertainty, we still have to integrate the two other types: noise and metaphysical uncertainty. One may, given the appropriate data sets, be able to use the differential likelihood to distinguish hypotheses that constrain parameters representing metaphysical uncertainty from those hypotheses that do not. However, the moment that uncertainty due to noise is considered, the problem of identifying and allocating the uncertainty is in principle massively underdetermined.
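On this reading, Reichenbach’s pf is itself a straight-rule estimate one level up: the relative frequency, among the other data sets, of those that the function f fits. A minimal sketch of this idea follows; the function names and the sample data are our own, purely for illustration, and we assume noise-free data as in the example above.

```python
def fits(f, data):
    """True if the function f passes exactly through every (x, y) point."""
    return all(f(x) == y for x, y in data)

def straight_rule_pf(f, datasets):
    """Reichenbach-style estimate of pf: the relative frequency of data
    sets described by f, among the k data sets seen so far."""
    k = len(datasets)
    return sum(1 for d in datasets if fits(f, d)) / k

# Three hypothetical noise-free data sets for quantities (x, y):
datasets = [
    [(0, 1), (1, 3), (2, 5)],   # lies on y = 2x + 1
    [(0, 1), (3, 7)],           # also lies on y = 2x + 1
    [(0, 0), (1, 1)],           # does not
]
line = lambda x: 2 * x + 1
print(straight_rule_pf(line, datasets))  # 2 of 3 data sets fit: 0.666...
```

Note that, as the text observes, even a high pf leaves f underdetermined wherever none of the k data sets has a data point; the estimate only counts agreement on the points actually observed.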
The hope of every scientist is to find circumstances in which the different sources of uncertainty can be teased apart. But Reichenbach makes no attempt in this direction with his formalism, although he certainly was aware of the problem: he was familiar with Heisenberg’s results, which he took to imply metaphysical uncertainty, and he was well-versed in the field of error statistics. The clean way to account for different sources of uncertainty is to represent them in the inference explicitly and separately, as is done in (for example) Jeffrey conditionalization [Jeffrey, 1983]. Jeffrey conditionalization explicitly represents the noise in the data as separate from the uncertainty with regard to functional form. It makes the confounding of these two types of uncertainty explicit, but also indicates how different assumptions can be used to tease them apart. Reichenbach, of course, never knew Jeffrey conditionalization, but we can guess that he would object to the proliferation of different probability distributions, which he would want to regard as objective probabilities, but for which an objective foundation seems far-fetched.

16 There is an issue of whether the data sets are independent samples, but we leave that aside here.
Our rendition of Reichenbach’s probability logic will surely not have done justice to every aspect of Reichenbach’s attempts to provide an account of inductive reasoning. We have our doubts whether a clean account of the probability of scientific theories can be given in terms of Reichenbach’s probability logic, or — for that matter — in terms of any purely formal system at all. We have also been critical of Reichenbach’s justification of the straight rule. However, we do think that Reichenbach’s approach to inductive inference, which started by looking at actual scientific reasoning, is valuable. It presents a “proper” opposition to Popper’s falsificationist accounts and avoids getting stuck in the philosophical quagmire of logical confirmation theory. Most of Reichenbach’s probability logic is now mainly of historical interest, but some of his ideas regarding the search for objective probabilities in a Bayesian framework are still present in philosophical circles, and some ideas similar to his mathematical theory have been developed, together with the appropriate precise mathematical formalism, in modern statistics.

Notes on Sources: Our reconstruction of Reichenbach’s probability logic is based primarily on his account in The Theory of Probability [Reichenbach, 1935c; Reichenbach, 1949c] and the various earlier papers in Erkenntnis and elsewhere [Reichenbach, 1931b; Reichenbach, 1932b; Reichenbach, 1934; Reichenbach, 1935e]. The development of Reichenbach’s foundations of probability can be found in publications throughout his life, but especially in the following: [Reichenbach, 1915; Reichenbach, 1920; Reichenbach, 1925; Reichenbach, 1929; Reichenbach, 1932a; Reichenbach, 1933]. For further reference see also [Reichenbach, 1930; Reichenbach, 1931a].
Experience and Prediction [Reichenbach, 1938a] helps to piece together the big picture, and his many comments and responses to criticism in various journals, primarily [Reichenbach, 1935d; Reichenbach, 1936; Reichenbach, 1939b; Reichenbach, 1940; Reichenbach, 1949a], but see also [Reichenbach, 1938b; Reichenbach, 1939a; Reichenbach, 1941; Reichenbach, 1948], fill in some of the issues that remain unclear in the more thorough presentations of his view. Needless to say, on certain aspects we remained at a loss with regard to what exactly Reichenbach had in mind.
BIBLIOGRAPHY

[Creary, 1969] L. Creary. A Pragmatic Justification of Induction: A Critical Examination. PhD thesis, Princeton University, 1969.
[Hertz, 1936] P. Hertz. Kritische Bemerkungen zu Reichenbachs Behandlung des Humeschen Problems. Erkenntnis, 6, no. 1:25–31, 1936.
[Jeffrey, 1983] R. Jeffrey. The Logic of Decision. 2nd edition, University of Chicago Press, 1983.
[Kamlah and Reichenbach, 1977] A. Kamlah and Maria Reichenbach, editors. Gesammelte Werke in 9 Bänden. Vieweg, Braunschweig-Wiesbaden, 1977. Reichenbach’s Collected Works in German in 9 volumes.
[Kolmogorov, 1933] A. N. Kolmogorov. Grundbegriffe der Wahrscheinlichkeitsrechnung. Springer, Berlin, 1933. Engl. transl. Foundations of the Theory of Probability, New York: Chelsea, 1950.
[Lewis, 1946] C. I. Lewis. An Analysis of Knowledge and Valuation. Open Court, 1946.
[Lewis, 1952] C. I. Lewis. The given element in empirical knowledge. The Philosophical Review, 61(2):168–172, 1952.
[Nagel, 1936] E. Nagel. Critical notices. Mind, XLV (180):501–514, 1936.
[Nagel, 1938] E. Nagel. Principles of the theory of probability. In R. Carnap, C. Morris, and O. Neurath, editors, Foundations of the Unity of Science. University of Chicago Press, Chicago, 1938.
[Popper, 1934] K. Popper. The Logic of Scientific Discovery. Routledge, 2002, 1934.
[Popper, 1935] K. Popper. “Induktionslogik” und “Hypothesen-wahrscheinlichkeit”. Erkenntnis, 5, no. 1:170–172, 1935.
[Reichenbach and Cohen, 1978] M. Reichenbach and R. S. Cohen, editors. Selected Writings: 1909–1953, in 2 volumes. Dordrecht-Boston: Reidel, 1978. Principal translations by E. Hughes, ? Schneewind, further translations by L. Beauregard, S. Gilman, M. Reichenbach, and G. Lincoln.
[Reichenbach, 1891–1953] H. Reichenbach. Unpublished notes. In Reichenbach Collection. University of Pittsburgh Library System, 1891–1953. All rights reserved.
[Reichenbach, 1915] H. Reichenbach. Der Begriff der Wahrscheinlichkeit für die mathematische Darstellung der Wirklichkeit. PhD thesis, University of Erlangen, Barth, Leipzig, 1915. Reprint in [Reichenbach, 1916], and [Kamlah and Reichenbach, 1977], vol. 5, and a summary in [Reichenbach, 1919]. Engl. transl. with German reprint in [Reichenbach, 2008].
[Reichenbach, 1916] H. Reichenbach. Der Begriff der Wahrscheinlichkeit für die mathematische Darstellung der Wirklichkeit. Zeitschrift für Philosophie und philosophische Kritik, 161: 210–239; 162: 9–112, 223–253, 1916. Reprint of [Reichenbach, 1915].
[Reichenbach, 1919] H. Reichenbach. Der Begriff der Wahrscheinlichkeit für die mathematische Darstellung der Wirklichkeit. Die Naturwissenschaften, 7, no. 27:482–483, 1919. Summary of [Reichenbach, 1915].
[Reichenbach, 1920] H. Reichenbach. Die physikalischen Voraussetzungen der Wahrscheinlichkeitsrechnung. Die Naturwissenschaften, 8:46–55, 1920. Reprinted in [Kamlah and Reichenbach, 1977], vol. 5. Engl. transl. ‘The Physical Presuppositions of the Probability Calculus’, in [Reichenbach and Cohen, 1978], vol. II: 293–309.
[Reichenbach, 1925] H. Reichenbach. Die Kausalstruktur der Welt und der Unterschied von Vergangenheit und Zukunft. Sitzungsberichte - Bayerische Akademie der Wissenschaften, mathematisch-naturwissenschaftliche Klasse, pages 133–175, 1925. Reprinted in [Kamlah and Reichenbach, 1977], vol. 8. Engl. transl. ‘The Causal Structure of the World and the Difference between Past and Future’, in [Reichenbach and Cohen, 1978], vol. II: 81–119.
[Reichenbach, 1929] H. Reichenbach. Stetige Wahrscheinlichkeitsfolgen. Zeitschrift für Physik, 53:274–307, 1929.
[Reichenbach, 1930] H. Reichenbach. Kausalität und Wahrscheinlichkeit. Erkenntnis, 1:158–188, 1930. Reprint of part III in [Kamlah and Reichenbach, 1977], vol. 8. Engl. transl. of part III ‘Causality and Probability’ in [Reichenbach, 1959], 67–78. Engl. transl. in [Reichenbach and Cohen, 1978], vol. II.
[Reichenbach, 1931a] H. Reichenbach. Bemerkungen zum Wahrscheinlichkeitsproblem. Erkenntnis, 2, nos. 5-6:365–368, 1931.
[Reichenbach, 1931b] H. Reichenbach. Der physikalische Wahrheitsbegriff. Erkenntnis, 2, nos. 2-3:156–171, 1931. Reprinted in [Kamlah and Reichenbach, 1977], vol. 9. Engl. transl. ‘The Physical Concept of Truth’, in [Reichenbach and Cohen, 1978], vol. I.
[Reichenbach, 1932a] H. Reichenbach. Axiomatik der Wahrscheinlichkeitsrechnung. Mathematische Zeitschrift, 34:568–619, 1932.
[Reichenbach, 1932b] H. Reichenbach. Wahrscheinlichkeitslogik. Sitzungsberichte, Preussische Akademie der Wissenschaften, Phys.-Math. Klasse, 29:476–490, 1932.
[Reichenbach, 1933] H. Reichenbach. Die logischen Grundlagen des Wahrscheinlichkeitsbegriffs. Erkenntnis, 3:410–425, 1933. Reprinted in [Kamlah and Reichenbach, 1977], vol. 5. Engl. transl. ‘The Logical Foundations of the Concept of Probability’ in [Reichenbach, 1949b].
[Reichenbach, 1934] H. Reichenbach. Wahrscheinlichkeitslogik. Erkenntnis, 5, nos. 1-3:37–43, 1934.
[Reichenbach, 1935a] H. Reichenbach. Bemerkungen zu Carl Hempels Versuch einer finitistischen Deutung des Wahrscheinlichkeitsbegriffs. Erkenntnis, pages 261–266, 1935.
[Reichenbach, 1935b] H. Reichenbach. Über Induktion und Wahrscheinlichkeit. Bemerkungen zu Karl Poppers Logik der Forschung. Erkenntnis, 5, no. 4:267–284, 1935. Engl. transl. ‘Induction and Probability: Remarks on Karl Popper’s The Logic of Scientific Discovery’, in [Reichenbach and Cohen, 1978], vol. II.
[Reichenbach, 1935c] H. Reichenbach. Wahrscheinlichkeitslehre. Eine Untersuchung über die logischen und mathematischen Grundlagen der Wahrscheinlichkeitsrechnung. Sijthoff, Leiden, 1935. Engl. transl. (with changes): [Reichenbach, 1949c]; back-transl. to German in [Kamlah and Reichenbach, 1977].
[Reichenbach, 1935d] H. Reichenbach. Wahrscheinlichkeitslogik und Alternativlogik. In Einheit der Wissenschaft (Prager Vorkonferenz der Internationalen Kongresse für Einheit der Wissenschaft). Leipzig, 1935.
[Reichenbach, 1935e] H. Reichenbach. Zur Induktionsmaschine. Erkenntnis, 5:172–173, 1935.
[Reichenbach, 1936] H. Reichenbach. Warum ist die Anwendung der Induktionsregel für uns notwendige Bedingung zur Gewinnung von Voraussagen? Erkenntnis, 6, no. 1:32–40, 1936.
[Reichenbach, 1938a] H. Reichenbach. Experience and Prediction. An Analysis of the Foundations and the Structure of Knowledge. Univ. of Chicago Press, Chicago, 1938.
[Reichenbach, 1938b] H. Reichenbach. On probability and induction. Philosophy of Science, 5:21–45, 1938. Reprinted in S. Sarkar, ed. Logic, Probability and Induction, Garland, New York, 1996.
[Reichenbach, 1939a] H. Reichenbach. Bemerkungen zur Hypothesenwahrscheinlichkeit. The Journal of Unified Science (Erkenntnis), 8, no. 4:256–260, 1939.
[Reichenbach, 1939b] H. Reichenbach. Über die semantische und die Objektauffassung von Wahrscheinlichkeitsausdrücken. The Journal of Unified Science (Erkenntnis), 8:50–68, 1939. Engl. transl. ‘On Semantic and the Object Conceptions of Probability Expressions’, in [Reichenbach and Cohen, 1978], vol. II.
[Reichenbach, 1940] H. Reichenbach. On the justification of induction. The Journal of Philosophy, 37:97–103, 1940. Reprinted in Readings in Philosophical Analysis, H. Feigl and W. Sellars (eds.), Appleton-Century-Crofts, New York, 1949, 324–329.
[Reichenbach, 1941] H. Reichenbach. Note on probability implication. Bulletin of the American Mathematical Society, 47, no. 4:265–267, 1941.
[Reichenbach, 1944] H. Reichenbach. Philosophic Foundations of Quantum Mechanics. Univ. of California Press, Berkeley and Los Angeles, 1944.
[Reichenbach, 1948] H. Reichenbach. Theory of series and Gödel’s theorems. Unpublished mimeographed manuscript; first published in [Reichenbach and Cohen, 1978], vol. I, 1948.
[Reichenbach, 1949a] H. Reichenbach. A conversation between Bertrand Russell and David Hume. The Journal of Philosophy, 46, no. 17:545–549, 1949.
[Reichenbach, 1949b] H. Reichenbach. The logical foundations of the concept of probability. In H. Feigl and W. Sellars, editors, Readings in Philosophical Analysis, pages 305–323. Appleton-Century-Crofts, New York, 1949. Also reprinted in H. Feigl and M. Brodbeck (eds.), Readings in Philosophy of Science, Appleton-Century-Crofts, New York, 1953, 456–474. Engl. transl. by M. Reichenbach of [Reichenbach, 1933].
[Reichenbach, 1949c] H. Reichenbach. The Theory of Probability. An Inquiry into the Logical and Mathematical Foundations of the Calculus of Probability. University of California Press, Berkeley-Los Angeles, 1949. Transl. by E. H. Hutten and M. Reichenbach. English version and second edition of [Reichenbach, 1935c]; German back-transl. in [Kamlah and Reichenbach, 1977], vol. 7.
[Reichenbach, 1959] M. Reichenbach, editor. Modern Philosophy of Science: Selected Essays. Routledge & Kegan Paul, London, 1959.
[Reichenbach, 1978] H. Reichenbach. A letter to Bertrand Russell. In M. Reichenbach and R. S. Cohen, editors, Selected Writings: 1909–1953, in 2 volumes. Dordrecht-Boston: Reidel, 1978.
[Reichenbach, 2008] H. Reichenbach. The Concept of Probability in the Mathematical Representation of Reality. Open Court, Chicago–La Salle, Ill., 2008. Transl. of [Reichenbach, 1915], with introduction by F. Eberhardt and C. Glymour.
[Russell, 1948] B. Russell. Human Knowledge, Its Scope and Limits. Simon & Schuster, New York, and Allen & Unwin, London, 1948.
[Tarski, 1935] A. Tarski. Wahrscheinlichkeitslehre und mehrwertige Logik. Erkenntnis, 5, no. 1:174–175, 1935.
[van Fraassen, 1979] B. van Fraassen. Relative frequencies. In W. Salmon, editor, Hans Reichenbach: Logical Empiricist. Reidel, London, 1979.
[von Mises, 1919] R. von Mises. Grundlagen der Wahrscheinlichkeitsrechnung. Mathematische Zeitschrift, 5:52–99, 1919.
GOODMAN AND THE DEMISE OF SYNTACTIC AND SEMANTIC MODELS

Robert Schwartz
1 HISTORICAL BACKGROUND
The “problem of induction” in its modern form is most often traced back to David Hume [2006] (see 2, 3).1 To put the issue in terms we will make use of later, consider these two arguments:

I.
1. All emeralds are green.
2. a is an emerald.
∴ a is green.
II.
1. Emeralds 1, 2, 3, . . . , 999 are green.
∴ All emeralds are green.
Both seem to be good arguments, although there are significant differences between them. In Argument I, the conclusion follows deductively from its premises. If the premises are true, so must be the conclusion. Rules of logic justify inferring one from the other. Not so with Argument II. No matter how many emeralds are examined and found to be green, it does not follow logically that all are. The question then is what sanctions or justifies concluding “All emeralds are green” in Argument II? One might claim that the conclusion is warranted on the grounds that it is an a priori truth; it states a relation of ideas. “Green” is part of the meaning of “emerald”, so of necessity all such gems are green. In that case, though, premise 1 in Argument II plays no real role. There is no need to appeal to any empirical evidence to underwrite the conclusion, and there is no problem of induction. If, however, “All emeralds are green.” is taken to state an empirical fact, then

1 Chapter numbers of articles appearing in this volume are in italics. Two anthologies, The Philosophy of Nelson Goodman: Nelson Goodman’s New Riddle of Induction [Elgin 1997] and Grue: The New Riddle of Induction [Stalker 1994], present both early and more contemporary papers on the topics to be discussed. Grue is particularly helpful in that it ends with a 316-entry annotated bibliography, many entries annotated in detail. A good number of the papers cited in this essay can be found in these anthologies, and most are abstracted in the bibliography of Grue. The bibliography for Part III of Scheffler [1963] is also useful.
Handbook of the History of Logic. Volume 10: Inductive Logic. Volume editors: Dov M. Gabbay, Stephan Hartmann and John Woods. General editors: Dov M. Gabbay and John Woods. © 2011 Elsevier BV. All rights reserved.
questions do arise about the status and validity of Argument II. What is the relationship between the premise and the conclusion that entitles inferring the latter from the former?2 Unlike in Argument I, the conclusion does not follow from its premises on the basis of logic alone. At the same time, the conclusion of Argument II does seem to be reasonable, given the evidence. The problem is to account for the soundness of this inference. One widely accepted solution was thought to lie in making explicit an assumed yet unstated premise, the principle of the Uniformity of Nature. Since Nature is uniform, not chaotic, it can be assumed that past regularities will continue into the future. By adding this uniformity principle as a premise to Argument II the argument can be turned into a valid deductive inference. Hume does not deny this. Rather he questions our right to assume the principle as a premise. Hume argues that the Uniformity of Nature principle is not an a priori truth. It is not a relation of ideas, and it is possible to doubt it. There is no way to establish the principle by thought or reason alone. Hume then argues that it is not possible to justify the principle on empirical grounds either. Any such justification will be circular. It will depend on assuming the uniformity principle or something equivalent to it. By their very nature inductive conclusions are under-determined by their premises. There is no guarantee that past regularities will continue to hold. With respect to empirical matters we can never be certain. Our best-supported hypotheses are always subject to refutation by new evidence. Hume concludes that ultimately induction is founded on habit, our habit of predicting future regularities on the basis of those experienced in the past. There is no further ground or firmer warrant for inductive practice. For Hume the problem of induction also raises issues concerning the justification of causal claims.
Hume maintains that it is a mistake to believe that we actually observe causes. We can, for example, observe one billiard ball striking another and the latter moving away, but we do not observe any causal connection per se. There is no such thing to observe. All we can perceive is A followed by B. Hume goes on to point out, however, that simply observing B to follow upon the occurrence of A is not enough to sanction the claim that A caused B. That it starts to rain immediately after Claire sneezes does not justify the claim that her sneeze caused the rain. The difference between these cases cannot lie in a failure to observe the causal connection between sneeze and rain. It is not possible to observe any such connection between the billiard balls either. Nevertheless, it seems correct to say that the one billiard ball striking the other causes it to move. According to Hume what distinguishes the cases are facts about previous observations. A record of the past observations indicates that whenever billiard balls or similar items collide the stationary item moves. This constant conjunction instills a habit, the expectation that the next collision observed will be followed by movement. We have not observed comparable sneezes followed by rain regularities. 2 Recently Harman and Kulkarni [2006] question whether the rules of induction and rules of deduction should both be characterized as rules of inference.
Hence no habit is established that engenders the expectation that rain will appear on the heels of local sneezes. The crucial feature of Hume’s analysis of causation is its reliance on past regularities to underwrite causal claims. The cause/effect relation is not a necessary connection. No amount of reason and thought about A entails that B will follow upon its occurrence. Nor can causal connections themselves be observed. Without the experience of past regularities there is no basis for expecting events to turn out one way or another. Hume’s account of causation is usually referred to as a “regularity” theory. The relation between Hume’s problem of induction and his analysis of causation lies on the surface. Valid judgments of causation depend on there being relevant regularities to back them up. Justification for believing there are such regularities depends on empirical evidence. But as Hume’s analysis of the problem of induction shows, trust that observed regularities will continue into the future cannot be further grounded. Thus judgments of causation, too, are never certain. They are only as hardy as the observed regularities that support them. The bedrock warrant for both inductive and causal claims is habit.3 Scholars debate whether Hume’s analysis shows that he is a skeptic. Many think Hume is. Since Hume maintains that inductive judgments are neither a priori nor empirically certain, they cannot “really” be known. Moreover, it is argued, on Hume’s account of induction the confidence we do have in such judgments is not objectively grounded. The confidence rests on habit, and habits are subjective. Habit might explain a practice, but habit cannot justify or provide norms for the practice. Other scholars argue that it is not correct to assume that Hume is a skeptic. Hume thinks that inductive judgments based on observed regularities are justified.
Objective inquiry requires there be sufficient evidence to back up our statements, not that the evidence makes them certain. If a belief is well supported by the empirical evidence, it has all the justification that is possible and all that is needed to warrant its acceptance. Hume’s analysis of inductive reasoning met resistance from the start, and efforts to explain his “real” position, explain away the problem, criticize his solution or propose new ones have continued. (See [Swinburne, 1974; Foster and Martin, 1966].) I think it fair to say that the majority opinion is that Hume’s core ideas about the problem of induction have survived these challenges. His account of causality, although still influential, has not remained as firm.
2 DEVELOPMENTS IN THE TWENTIETH CENTURY For the purposes of this entry it is possible to skip ahead and pick up the story in the early twentieth century. (See 4, 5, 6 to fill in the period between.) In light of exciting progress in symbolic logic and the related development of formal tools of analysis there was much interest in constructing a syntactic theory of inductive 3 Lewis [1973] claims that Hume actually offered a second account of causation in terms of counterfactuals.
logic, comparable to that developed for deductive inference. Much of this work sought to formulate a quantitative measure of confirmation in terms of probability. The most detailed and influential work on this task was that of R. Carnap. [See 9.] In his monumental Logical Foundations of Probability, Carnap [1950] laid out the basics for a formal system of inductive logic. Others joined Carnap in the project, criticizing, altering and expanding on what he accomplished. Carnap’s ideas and approach remain topics of discussion. By mid-century work had begun on a related, more modest program. The aim here was neither a quantitative nor a comparative measure of confirmation, but a qualitative theory. It was to spell out the relation between a body of evidence (E) and a hypothesis (H) such that E confirms H. The initial thought was that the relation could be defined syntactically. (See [Hempel, 1943; 1945; Achinstein, 1983].) The first step was to lay down a set of conditions of adequacy that any satisfactory qualitative theory of confirmation must meet. C. Hempel (op. cit.), with input especially from N. Goodman and P. Oppenheim, produced the most discussed work on this topic. The task turned out to be more complicated and elusive than supposed. Firm intuitions about obviously correct conditions of adequacy were put in doubt when they were shown to sanction counter-intuitive results. It became increasingly clear that it was not possible to design a system that adopts all the conditions of adequacy that intuition deems necessary. Such a system would be inconsistent.4 A few examples of particular relevance to the issues to be taken up later can give the flavor of the problems encountered. [See 8.] Presumably, evidence that boosts the credibility of a hypothesis should be taken to confirm it. If this principle is adopted, though, evidence that appears irrelevant to a hypothesis’s full content will count as confirming it.
For example, let H∗ be the conjunction of H1, “All emeralds are green,” and any arbitrary, independent H2, say, “Snow is white.” Take the observational evidence to be that emeralds 1– 999 are green. This evidence appears to count in favor of H1, while irrelevant to an evaluation of H2. Nevertheless, if this evidence makes H1 more credible it should also increase somewhat confidence in H*. Yet this result runs counter to another intuition, namely the intuition that confirming evidence for a hypothesis should spread its support to all of its instances. In the case being considered, although the 999 observed emeralds do favor H1, they do not spread their support to all items that fall within the scope of H∗. The evidence does not speak one way or the other about the color of snow. Adopting the intuitively sound condition of adequacy, “Evidence that confirms H confirms all consequences of H, however, would require that if the evidence of 999 green emeralds confirms H∗, then it must confirm H2 itself. Problems of this sort suggest drawing a distinction between evidence that supports or makes a hypothesis more credible and evidence that spreads its support to all instances of the hypothesis in question. Only a body of evidence that spreads its support throughout the hypothesis will count as confirming evidence. Obser4 For argument that any of the proposed conditions of adequacy can be given up, see Hanen [1971)].
Goodman and the Demise of the Syntactic and Semantic Models
395
vation of green emeralds, therefore, does confirm H1, but it does not confirm H*. It does not spread its support to the instances of H2 that fall within the full scope of H∗. Although the 999 green emeralds speak in H∗’s favor, lending it additional credibility, they do not confirm H∗. Another troubling conflict of intuitions arises from the so-called “Raven’s Paradox”. It seems obvious that a confirming instance of a hypothesis should count as confirming logical equivalents of that hypothesis. This condition of adequacy is usually labeled the “equivalence condition”. Now evidence of a black raven is surely a paradigm case of a confirming instance of the hypothesis “All ravens are black”. By parity of reason, evidence of a non-black object that is not a raven should count as a positive instance of the hypothesis “All non-black items are nonravens” which is logically equivalent to “All ravens are black”. It would follow then that an observation of a red herring (i.e. something that is non-black and not a raven), in confirming “All non-black things are non-ravens.” confirms the hypothesis “All ravens are black”. At first blush it strikes us as highly implausible that we should be able to confirm a hypothesis about the color of ravens by examining herrings for color.5 3
3 THE NEW RIDDLE OF INDUCTION
While attempts to resolve these issues were in full swing, Nelson Goodman [1946] pointed out a problem for both quantitative and qualitative theories of confirmation. He noted that the same observational data could support conflicting hypotheses, depending on how the evidence is described.6 Suppose a hitherto unknown machine were to toss up a total of 999 marbles in sequences of 2 red and 1 black. The calculated odds that the next toss is red will differ depending on whether the evidence is described simply as a total of 666 red and 333 black marbles or described in more detail as 333 sequences of 2 red and 1 black marbles. Another example of the problem can be constructed if, instead of a total of 666 red and 333 black tosses, the machine were to toss 999 marbles that were all red. The intuition is that this evidence supports the hypothesis that marble 1000 will be red, not some other color. Goodman argued, though, that there is another characterization of the evidence that predicts the 1000th marble is black. The trick depends on introducing a new, "peculiar" predicate, S. S is defined as applying to an item that is one of the marbles M1, M2, . . . , M999 and red, or is not one of M1–M999 and black. Describing the evidence as consisting of all and only Ss is no less true than describing each of the evidence instances as "red". Hence, the hypothesis "All marbles are S" has the same number of confirming instances as the hypothesis "All marbles are red". Projecting S leads to the prediction that marble 1000 and those encountered thereafter will be black, but this conflicts with the seemingly better prediction that future cases will be red. Actual inductive practice, of course, sanctions the projection of red and not the projection of black. The evidence of 999 red marbles does spread its support to all instances of the red-hypothesis; it does not do so in the case of the S-hypothesis. Projecting the S-hypothesis is not warranted; the evidence does not confirm it. Projecting the S-hypothesis would be a sign of some sort of inductive irrationality.

Carnap [1947] recognized the force of Goodman's examples. He also recognized that this meant there could not be a strictly syntactic inductive logic, as many had hoped. Semantic features of the predicates employed had to be taken into consideration. Carnap proposed that a sound logic of confirmation should rule out the use of Goodman's peculiar predicate and others like it. The challenge then was to specify a principle for determining those predicates that should be allowed in and those that should be excluded. Carnap's answer was that predicates mentioning particular objects, times and places were to be discounted. Only non-positional, purely qualitative predicates were to be employed in a theory of induction. In a brief reply Goodman [1947a] argued that Carnap's solution had serious shortcomings. Carnap's plan to divide the good predicates from the inadmissible ones depends on there being a set of properties that are absolutely simple into which all other acceptable predicates can be analyzed. Goodman countered that all analysis is relative to the categories or concepts a system takes as its base. Properties or predicates analyzable in one system may be primitive or unanalyzable in another. Goodman mentions, too, that various "good" predicates do in fact mention particulars (e.g. "arctic", "solar", "Sung") and are used in making projections.7 Goodman later elaborated and defended his position in his book, Fact, Fiction and Forecast. Here Goodman labels his puzzle the "New Riddle of Induction".

5. In addition to Hempel's work there is a vast and increasing literature concerning the proper analysis of this paradox. Of special relevance to the issues to be considered, see [Quine, 1970; Scheffler, 1963; Scheffler and Goodman, 1972].
6. To simplify presentation here and elsewhere I have slightly altered several of Goodman's particular examples.
It is the version of the puzzle laid out in this book that has become the primary focus of discussion. To explain the riddle, Goodman introduces the predicate "grue". The definition of "grue" is: x is grue = x is examined before (a future time) t and is green, or not so examined (before t) and is blue. Suppose again that all emeralds examined before t (e.g. emeralds 1–999) are observed to be green. Then they will each be grue as well. The hypotheses "All emeralds are green." and "All emeralds are grue." have equal support, 999 positive instances. Yet their predictions about the color of emeralds examined after t conflict. Goodman does not deny that the evidence warrants projecting the green-hypothesis and not the grue-hypothesis. The New Riddle of Induction is to explain and justify the choice. Why do we predict that emeralds that will be examined after t are green rather than blue, and what warrants our doing so? In Fact, Fiction and Forecast Goodman introduces several more "peculiar" predicates, "bleen" and "emeruby" among others. "Bleen" is defined as: x is bleen = x is examined before t and is blue, or not so examined and is green. "Emeruby" is defined as: x is an emeruby = x is an emerald examined before t or a ruby not so examined. These additional predicates are used to flesh out Goodman's arguments and highlight the variety of ways the New Riddle can arise.

7. Carnap [1948] is a response.
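Goodman's definitions can be made concrete in a short sketch. This is an illustration of my own, not Goodman's formalism; the numeric stand-in for the cutoff time t and the sample emerald records are assumptions made for the demonstration.

```python
# Hypothetical rendering of "grue" and "bleen" with a numeric cutoff T for t.

T = 1000  # assumed stand-in for the future time t

def grue(x):
    # x is grue = x is examined before t and green, or not so examined and blue.
    return x["color"] == ("green" if x["examined_at"] < T else "blue")

def bleen(x):
    # x is bleen = x is examined before t and blue, or not so examined and green.
    return x["color"] == ("blue" if x["examined_at"] < T else "green")

# All 999 emeralds examined before t and found green are positive instances
# of BOTH "All emeralds are green" and "All emeralds are grue":
emeralds = [{"color": "green", "examined_at": i} for i in range(1, 1000)]
assert all(e["color"] == "green" for e in emeralds)
assert all(grue(e) for e in emeralds)

# For an emerald first examined after t, the hypotheses come apart:
# the grue-hypothesis now demands blue, the green-hypothesis green.
later = {"color": "blue", "examined_at": T + 1}
assert grue(later) and not bleen(later)
```

Note that the evidence class satisfies both predicates equally; nothing in the data alone distinguishes the two projections.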
4 SOME MISUNDERSTANDINGS
Attempts to resolve the New Riddle have often taken a wrong turn because their proponents misread how "grue" and the other predicates introduced apply. To label an emerald "grue", before or after t, is not to claim that the emerald or any emerald ever changes its color from green to blue. An emerald examined before t and found green is tenselessly both green and grue. An emerald examined after t is (tenselessly) blue if it is grue, and green if it is green. The role "grue" plays in the puzzle is to pick out or separate two subsets of emeralds, the examined from the then unexamined. It is not meant to suggest a worry over whether any emerald ever turns from green to blue. Another point that is often unappreciated is that the riddle does not depend essentially on considerations of time. The version of Goodman's predicate, S, mentioned above, employs names for each examined case and makes no reference to time in separating items in the evidence class from those in the projected class. In fact, pretty much any property or way of characterizing the subsets will do. (See [Elgin, 1993; Scheffler, 1963].) A variant of the riddle can also be demonstrated with the use of graphs. (See [Hempel, 1960; Hullett and Schwartz, 1967].)
Figure 1. [Three conflicting curves, labeled X, Y and Z, drawn through the same data points.]

Each curve in Figure 1 can be taken to represent a distinct hypothesis. Indeed, an unbounded number of conflicting curves could be drawn that pass through the data points. All of these curves/hypotheses take into account the body of available evidence, and no curve is a more accurate characterization of the data than the others. Each is a true description of the evidence. The different curves, however, make conflicting predictions about items/values not as yet determined. Again, time need not enter the picture. Another questionable response to Goodman's puzzle is the argument that the predicate "grue" and others raised in generating the New Riddle are not really relevant to science. When the puzzle is formulated in terms of actual scientific properties and theories, it is held that the riddle can be readily solved or dissolved.
It is undoubtedly true that little science goes on at the level of the hypotheses "All emeralds are green." or "All rubies are red.", and discussion of the New Riddle in the context of high-level theory can be profitable. (See [Earman, 1985; Wilson, 1979].) Unfortunately, many of the attempts to escape the puzzle by citing its failure to be scientific enough tend to lose sight of the fundamental issue at stake. The real significance of the New Riddle is that it forces an examination of the way everyday and scientific concepts and vocabularies shape practices of inquiry. In particular, it focuses attention on how the choice of concepts and vocabularies plays a role in warranting the acceptance of some hypotheses and the rejection of others. There is another reason why simply setting the problem in the context of the more theoretical concepts and vocabularies of science cannot be the whole solution. These loftier predicates and properties can themselves be "grue-ified", and the problem then reappears with respect to high-level laws and generalizations. (See below.)

Over the years it has been periodically maintained that there is nothing new in the New Riddle. It is said that Goodman's puzzle is essentially no different from Hume's problem of induction.8 All it does, the claim goes, is reaffirm Hume's point that inductive claims are never certain, and that such inductive indeterminacy cannot be eliminated. The introduction of "grue", it is said, merely presents an old story in a new guise. Collapsing Hume's problem and Goodman's problem in this way misunderstands the significance of both. Give or take some alterations and modernization, Goodman accepts Hume's analyses of induction and causality. He contends, though, that Hume did not go far enough. The problem of induction runs much deeper. Hume argued that inductive reasoning cannot be grounded in a principle of the uniformity of nature, and that attempts to do so end in circularity.
Goodman wishes to show that the uniformity principle by itself is in a sense vacuous. Regularities are free for the asking. There are an unbounded number of past regularities that can be projected, and the projections conflict. The hypothesis "All emeralds are green." projects the regularity of observed cases of green emeralds into the future. The hypothesis "All emeralds are grue." projects the regularity of observed cases of grue emeralds into the future. The New Riddle is not meant to restate Hume's thesis of inductive indeterminacy. The New Riddle asks instead how it is that we project some past regularities and not others, and what guides and justifies such practices. This difference between the New Riddle and the old problem of induction can be explicated with the help of Figure 1. Hume's claim is that although we project curve X on the basis of past regularities, we have no logical guarantee that its predictions will be borne out. In empirical matters there is always risk. Goodman asks why we project curve X over the rest of the curves, given that there is equal evidence for them all. Granted that inductive indeterminacy is a factor with any projected curve, what is it about curve X that warrants its projection, rather than the other curves that conflict with X?

8. There have been many different arguments for this claim. For a quite recent one see Norton [2006].
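The point about Figure 1 can be put concretely in a small sketch. This is an illustration of my own; the particular data points and the two curves are assumptions chosen for the demonstration, not anything in the text.

```python
# Two curves that agree exactly on all observed data points yet make
# conflicting predictions about the next, undetermined value.

data = [(0, 0), (1, 1), (2, 2)]

def curve_x(x):
    # The "straight" projection of the observed regularity.
    return x

def curve_y(x):
    # Agrees with curve_x on every data point (the cubic term vanishes
    # at x = 0, 1, 2) but diverges thereafter.
    return x + x * (x - 1) * (x - 2)

# Both curves are true descriptions of the entire body of evidence:
assert all(curve_x(x) == y for x, y in data)
assert all(curve_y(x) == y for x, y in data)

# But they conflict about the undetermined value at x = 3:
print(curve_x(3))   # 3
print(curve_y(3))   # 9
```

Hume's worry about risk applies equally to both curves; Goodman's question is the further one of why projecting curve_x is warranted and projecting curve_y is not, given that the evidence for the two is the same.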
Hume's problem does not raise worries about the need to distinguish between justified predictions based on past regularities and conflicting predictions equally supported by past regularities. He is concerned with the justification for believing that observed regularities will continue to hold in the future. The New Riddle asks instead why, based on the same evidence, we project some past regularities and not others. Goodman and Hume both ask what warrants inductive practices. But they are concerned to explain and justify two different aspects of these practices.
5 PROPOSED ASYMMETRIES
Most of those who have grappled with the New Riddle have looked for a semantic/epistemic solution. Like Carnap they argue that there is a semantic asymmetry between "green" and "grue" that accounts for the difference in the projectibility of the two predicates. In Fact, Fiction and Forecast Goodman makes an effort to turn back several such solutions he thinks people might propose. I will mention a few of the most prominent. Carnap, we have seen, claims that his inductive logic is not to be applied to hypotheses that contain explicit reference to particulars (e.g. times, places, or individuals). Goodman's earlier response was that whether or not a predicate refers to particulars depends on the system of analysis adopted. Now Goodman fleshes out this argument with an example. Relative to a system that takes "green" and "blue" as primitives, the definition of "grue" makes explicit reference to a particular, i.e. time t. However, if "grue" and "bleen" are taken as primitives of the system and "green" and "blue" are defined using them (x is green = x is examined before t and grue, or not so examined and bleen), "green" and "blue" will be the predicates that make explicit reference to time t.

Another response Goodman considers concerns the use of higher-level properties to draw distinctions that could not be made in terms of less general properties. Granted, for example, "All emeralds are grue." does not assert that any emerald changes its color; it does, however, project blue to the unexamined emeralds, and blue is a color different from that of the observed emeralds. By contrast, the hypothesis "All emeralds are green." predicts that newly examined emeralds will be the same color as those in the evidence base. There are several problems with this sort of solution. First, it assumes that grue and bleen are not colors. Second, even if there is reason not to consider them colors, it is possible to introduce a new predicate, color*, that results from grue-ifying the general predicate "color".
Then the grue-hypothesis projects that new emeralds are the same color* as those in the past, and the green-hypothesis projects that unexamined emeralds are different in color* from those found in the evidence. (See [Scheffler, 1963].) This, of course, does not show that there is no way to distinguish "grue" from "green". For example, instances of green perceptually match, while grue items do not. The problem, though, is to explain how this perceptual distinction in color can be used to solve the riddle. Why should the failure of instances of a predicate to match perceptually rule out its use in projection? Sapphires and diamonds come in a variety of un-matching colors, but the predicates "sapphire" and "diamond" do
find their way into sound inductions. More significantly, an asymmetry based on perceptual matching is of very limited use. It cannot be employed with respect to an overwhelming number of scientific predicates (e.g. "conducts electricity", "gravity", "soluble"). Thus solutions to the New Riddle that depend on features peculiar to color, like those that depend particularly on time, are not adequate for the task.

I have switched back and forth between talking of properties and talking of predicates and have made no attempt to distinguish between them. In light of the many problems with the notion of "property", Goodman prefers to speak of "predicates". Some think, however, that it is a mistake not to separate the two. (See [Armstrong, 1978; Shoemaker, 1980].) Predicates are linguistic entities. Properties are abstract objects whose existence or non-existence is not a matter of language. Predicates are ours for the making; properties are mind-independent. Now this is no place to examine the pros and cons of the ontological, metaphysical and epistemic difficulties associated with properties. Be that as it may, merely countenancing abstract properties will not provide a solution to the New Riddle. Properties may be eternal occupants of Plato's heaven, but the rules for gaining entrance into this paradise are not clear. If "green" and "blue" have a place, what about "grue" and "bleen"? And if they too are properties, the New Riddle is reintroduced. Simply declaring that grue and bleen are peculiar and therefore cannot be "real" properties begs the question. Efforts to distinguish real and faux properties in terms of "what can be analyzed into simples" or on the basis of time, color and perceptual matching run into many of the same obstacles encountered in formulating criteria for distinguishing between projectible and unprojectible predicates. One response to the difficulty posed by the possible overpopulation of Platonic heaven is to distinguish between properties and universals.
(See [Lewis, 1983].) Only some properties are universals, and universals are projectible. Right off, this move does not solve the New Riddle. Settling the riddle now requires explaining why green and blue are universals, but grue and bleen are not. So unless there are acceptable criteria for determining which properties are universals and which are not, this proposal by itself does not significantly advance matters. There has also been a spate of attempts to establish an asymmetry between projectible and unprojectible hypotheses by appealing to counterfactuals. The details of these solutions differ, and I can give only the flavor of the approach. (See [Jackson, 1975; Godfrey-Smith, 2003].) On this account, the difference between "grue" and "green" is to be drawn along the following lines. We generally believe that observing an object does not change its properties. In the case of "All emeralds are green." it seems true to say that if an emerald in the evidence base, say emerald 12, had been first observed after t it would, nonetheless, be green. Not so with "All emeralds are grue". We are not inclined to assert that had emerald 12 been initially observed after t, it would be grue. To be grue it would have to be blue. But we do not believe that the color an object possesses depends on when it is first observed.
This approach, too, is not without difficulties. (See [Schwartz, 2005].) As discussed, to the extent that any solution depends crucially on supposed time or color features of grue, it is not general enough. The New Riddle can be raised without employing these features. Some versions of counterfactual solutions also seem to conflict with essentialist assumptions about identity when applied to our old example, "All marbles are S", where members of the evidence class are specified by name. And it must not be forgotten that these types of solutions depend on the analysis of counterfactual conditionals adopted. For many, especially Goodman, an analysis of counterfactuals involves an appeal to laws. In Chapter I of Fact, Fiction and Forecast, Goodman argues that true counterfactuals are those that have laws to back them up.9 Accidental generalizations do not offer such support. Goodman then goes on to claim that distinguishing lawlike generalizations from accidental generalizations requires a distinction on a par with that between projectible and unprojectible hypotheses. (More on this issue soon.) There have been numerous attempts to recast the New Riddle and the other paradoxes of induction along Bayesian lines (see [Good, 1975; Salmon, 1970; Jeffrey, 1983]).10 Examining the plusses and minuses of a Bayesian approach in general and its treatment of projectibility in particular is beyond the boundaries of this essay. A main feature of most such Bayesian analyses of the New Riddle is assigning the grue-hypothesis a prior probability that is clearly much less than the prior probability assigned the green-hypothesis. The question then is to account for this asymmetry in probability assignments. Depending on how this issue is settled there may be no incompatibility with Goodman's statement and solution of the riddle.
Many other proposed solutions to the New Riddle go wrong because they too depend on factors peculiar to "grue", or because they fail to heed Goodman's warnings about the "obvious" semantic solutions, or because they depend on tools of analysis Goodman eschews. Over the years more careful semantic/epistemic analyses and critiques of the New Riddle have emerged, and Goodman and his colleagues have made efforts to respond. There is now a very extensive literature discussing these solutions and their fate (see [Goodman, 1972; Stalker, 1994; Elgin, 1997] and the papers in the Journal of Philosophy, 1966 and 196711), and new solutions crop up on a regular basis.

6 THE ENTRENCHMENT SOLUTION

In Fact, Fiction, and Forecast Goodman offers his own solution to the New Riddle. It is neither syntactic nor semantic. It is pragmatic. According to Goodman, projectible predicates are just those that have a history of past use. "Green", for example, has played a role in many past projections. "Grue" has no such history of gainful employment. In Goodman's terms, "green" has become entrenched as a result of actual inductive practice; "grue" has not. Such differences in entrenchment, he claims, are a major, but not the only, factor that distinguishes projectible from unprojectible hypotheses. To fasten on entrenchment as a key to the New Riddle is one thing; to formulate a set of rules using entrenchment to separate projectible from unprojectible hypotheses is another. In the first edition of Fact, Fiction, and Forecast, Goodman defined a number of technical concepts and used them to formulate a system of rules intended to handle the sorts of cases discussed above, as well as to deal with various complications that I have ignored in laying out the New Riddle. In the book, Goodman also extends and clarifies his initial account of entrenchment. Predicates earn entrenchment both by their own use and by the projection of coextensive predicates. They can also inherit entrenchment from over-hypotheses. "Copper" and "aluminum" can gain entrenchment from projections of "metal". A notion of presumptive projectibility is introduced, offering additional routes for enhancing entrenchment. These rules were revised, simplified and made more intuitive in "An Improvement in the Theory of Projectibility" [Schwartz, Scheffler and Goodman, 1970]. Goodman's three original rules were reduced to one: a hypothesis is projectible if all conflicting hypotheses are overridden, unprojectible if overridden, and nonprojectible if it conflicts with another hypothesis and neither is overridden. An hypothesis H is said to override another hypothesis if the two conflict and H is the better entrenched.12 Adoption of these new rules showed that it was necessary to rethink several earlier issues and claims.

9. This chapter is based on Goodman [1947].
10. Kyburg [1964] reviews related issues and provides an extensive bibliography of work on inductive logic.
11. The Journal of Philosophy 1966 and 1967 each had a large section of papers, replies, etc. by a variety of people on the subject. See Volume 63, 1966 and Volume 64, 1967.
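The single rule can be sketched as follows. This is a toy formalization of my own, not Goodman's; the numeric entrenchment scores and the explicit conflict relation are assumed inputs supplied for the demonstration, whereas in the theory entrenchment is earned through actual projective practice.

```python
# Toy rendering of the one-rule version of projectibility (Schwartz,
# Scheffler and Goodman, 1970), as summarized in the text.

def overrides(h1, h2, entrenchment, conflicts):
    """h1 overrides h2 if the two conflict and h1 is the better entrenched."""
    return h2 in conflicts.get(h1, set()) and entrenchment[h1] > entrenchment[h2]

def classify(h, entrenchment, conflicts):
    rivals = conflicts.get(h, set())
    if any(overrides(r, h, entrenchment, conflicts) for r in rivals):
        return "unprojectible"      # h itself is overridden
    if all(overrides(h, r, entrenchment, conflicts) for r in rivals):
        return "projectible"        # every conflicting rival is overridden
    return "nonprojectible"         # a conflict in which neither overrides

# Toy data: the green- and grue-hypotheses conflict, and "green" is by
# far the better entrenched through its history of past projections.
green_h, grue_h = "All emeralds are green", "All emeralds are grue"
conflicts = {green_h: {grue_h}, grue_h: {green_h}}
entrenchment = {green_h: 1000, grue_h: 0}

print(classify(green_h, entrenchment, conflicts))   # projectible
print(classify(grue_h, entrenchment, conflicts))    # unprojectible
```

With equal entrenchment scores the same code returns "nonprojectible" for both hypotheses, matching the third clause of the rule.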
For instance, on the new version of the rules the assumed connection between confirmation, projectibility and lawlikeness is less direct (see [Schwartz, 1971]). In addition, alternative responses to earlier criticisms become available (see [Davidson, 1966]). The new rules, like the old, are given extensional formulations. Entrenchment is understood in terms of the entrenchment of extensions. The projection of coextensive predicates lifts the degree of entrenchment of them all. Hypotheses are assumed to conflict when it is thought that they assign an object to incompatible extensions, and so forth. Remaining within extensional boundaries does have its costs, but those who doubt that intensional notions are clear enough to use have little option. What is more, in the context of a theory of projection, extensionality can make sense independent of qualms over "meanings", "modalities" and "essences". If, for instance, the evidence warrants the projection of a hypothesis, it would seem safe to project a hypothesis that simply replaces one of its predicates with another predicate assumed to be coextensive. A similar case can be made for the propriety of the extensional definitions of the concepts "conflict," "positive instance" and the others employed.13 Continued efforts either to defend or to criticize the details of the proposed rules of projection, however, are likely to be unproductive (see [Schwartz, 1999]). The rules are at most tentative and of limited scope. They are formulated with respect to a simple language and apply strictly only to hypotheses of the form (x)(Px → Qx). The rules and definitions are not geared to deal with relational predicates and statistical hypotheses. Formulating rules that do take into account such richer realms of hypotheses awaits further development, and it would not be surprising if this results in a need to rework or to drop earlier principles and definitions. The current limitations on the applicability of the projection rules also mean that over-hypotheses not in the canonical form cannot be taken into account. This can have a significant effect on entrenchment values, and it also makes it especially hard to justify the introduction of more complex new predicates that have no entrenchment of their own. When all is said and done, though, the underlying idea, made more apparent in the new rules, seems to point in the right direction.

Entrenchment is only one among many properties (e.g. evidential support, conservatism, scope and simplicity) that contribute to a hypothesis's status. Current opinion is that any useful account of hypothesis acceptance must pay attention to a range of good-making properties and cannot rely only on observational support. All these values/virtues have a say in determining which among competing hypotheses to adopt. A hypothesis having sufficiently more of these "good-making" properties than its rivals will win out. Precise measurements of all these values, though, are notoriously hard to come by. And even when rough and ready assessments can be made, the virtues often compete. The more you have of one virtue, the less you can have of another. Trade-offs are required.

12. I have omitted here additional complications.
As a result, a hypothesis that comes out on top according to one weighting scheme may not be ranked as high by another. Choice among hypotheses relies as much on good scientific sense as on rules and calculation. The hope of accounting for hypothesis choice solely in terms of the evidence has not seemed a viable goal for some time (see [Hempel, 1966; Kuhn, 1977; Quine and Ullian, 1978]).
7 IMPLICATIONS
The issues raised by the New Riddle have significant implications well beyond the problem of induction and the development of a logic of confirmation. Goodman discussed several such issues in Chapter I of Fact, Fiction and Forecast before introducing the New Riddle. The first concerns the proper analysis of counterfactual conditionals. The statements "Had this piece of butter been heated to 100° F it would have melted." and "Were this piece of butter to be heated to 100° F it would melt." are true. The statements "Had this rock been heated to 100° F it would have melted." and "Were this rock to be heated to 100° F it would melt." are false. Wherein lies the difference? Goodman's answer, in part, is that in the first case there are accepted generalizations underpinning the counterfactual claims, while in the second case there are no comparable established generalizations on hand. But not all true generalizations are capable of supporting counterfactuals. Although all objects on desk d have weighed less than half a pound, we are not inclined on the basis of this observed regularity to maintain that any item that had been or were to be placed on the desk would have weighed, or will weigh, no more. The fact that all the items on d have been, and may in the future turn out to be, under half a pound is thought to be an "accident", and true accidental generalizations are incapable of supporting counterfactuals. By contrast, Goodman's example of a true counterfactual, "Had this piece of butter been heated to above 100° F it would have melted.", is backed up by a law, "All butter melts when heated above 100° F." The generalization "All items on d are under half a pound." may be true of the evidence cases and may turn out to be true of all objects ever to be on d. This generalization, however, is not lawlike, and it takes laws to support counterfactuals. This raises the problem of explaining what accounts for some general hypotheses being lawlike and others not. One clue is that although the so far observed instances of items on d lend some support to the tenseless hypothesis "All objects on d are under half a pound", the evidence does not confirm it. Positive instances do not spread their support to all items that could have fallen or might fall under the hypothesis. On this score, accidental generalizations are similar to the grue-hypothesis. The evidence that all so far encountered emeralds are grue does not make it credible that all the unexamined instances will be grue.

13. An extensive debate related to this issue began with a paper by Zabludowski [1974] and a reply by Ullian and Goodman [1975]. It continued on for a number of years, primarily in the Journal of Philosophy, with others joining in (e.g. [Scheffler, 1982; Kennedy and Chihara, 1975]).
We are not inclined to project "All emeralds are grue." even though it has as many positive instances as "All emeralds are green". As Goodman presents it, an evaluation of counterfactuals presupposes a distinction between lawlike hypotheses and accidental generalizations. And this distinction is itself intimately connected with the distinction between projectible and unprojectible hypotheses. In turn, these considerations impinge on the analysis of "causation". A regularity theory of causation depends on generalizations that are lawlike, not accidental. The main alternative to a regularity account is one that explicates the notion of "cause" in terms of counterfactuals, and on Goodman's analysis support for counterfactuals comes from laws. Possible world analyses of counterfactuals are thought to offer a wedge into these interrelations. One difficulty with this approach is its dependence on the existence of possible worlds, an ontological commitment many are leery to take on. Another difficulty is that possible world analyses usually depend on an ordering, similarity or accessibility relation among possible worlds. Typically, worlds that violate the laws of the actual world are thought to be farther from it than worlds that merely violate accidental generalizations. And as we have seen, Goodman argues that drawing this latter distinction rests on distinguishing projectible from unprojectible
hypotheses. For many, any pragmatic solution to the New Riddle and to related issues like that of "lawlikeness" is in principle unsatisfactory. Critics want and seek a firm, fixed foundation. To many the idea of "natural kinds" has seemed well-suited to the task. Natural kinds, as opposed to made-up or artifactual kinds, are said to be objective; their boundaries are mind-independent. "Grue," "bleen," and "emerubies", along with the kinds employed in accidental generalizations (e.g. all things on desk d), are peculiar and defective, because the kinds they pick out are not natural. That the properties/predicates "grue," "bleen" and "emerubies" are peculiar, defective and of little scientific import is undeniable. The question that awaits an answer is how to characterize or distinguish natural kinds from other kinds. There does not seem to be any purely syntactic or semantic means to accomplish the division. Nor does the world come ready-made with its ontological joints delineated. A pragmatic answer is that natural kinds are those that have been used and found useful in practice. The naturalness of kinds and of the predicates that denote them is not intrinsic. Rather, natural kinds are kinds we rely on in projections, and they become natural as a result of use/entrenchment. This solution is, of course, unpalatable to those who believe that pragmatic explanations are by nature subjective. They look instead for an objective, "naturalistic" account of natural kinds. It has been thought, for example, that it is possible to explain the naturalness of natural kinds in terms of the inherent similarity among the members of the kind. Members of a natural kind are similar one to another; not so with the kinds picked out by "grue", "emerubies," "items on desk d" and other peculiar predicates. The immediate problem then is to specify satisfactorily what "similarity" means in this context.
The items in the extension of “green” are similar in being green, and those in the extension of “emerald” are similar in their emerald-ness. But the same can be said of “grue” and “emerubies”. Grue objects are similar in being grue, and emerubies are similar in being emerubies. Drawing a useful distinction between natural and unnatural kinds, therefore, requires a narrower, more restrictive notion of “similarity”. One proposal is to define “similarity” in terms of perceptual matching. Green objects perceptually match one another, while items that are grue do not (see [Shoemaker, 1975]). Matching differences then can be used to explain why it is correct to claim that “green” picks out a natural kind and “grue” does not. Several things count against this solution to the projectibility puzzle. The most obvious is that instead of being too inclusive, a perceptual matching criterion is too exclusive. Once again, this sort of solution is overly dependent on features special to grue. A perceptual matching test for similarity and natural kinds is not applicable to the full range of kinds found in science and in everyday use. The difficulty with a similarity solution, however, runs much deeper. There is something wrong with the very idea of an absolute notion of similarity [Goodman, 1970]. Similarity is no more an intrinsic feature of the world than is the naturalness of kinds. To be asked to group a set of items according to their similarity pure and simple does not make much sense. Similarity judgments are relative
Robert Schwartz
to the task at hand; they are not absolute. There is no single correct way, say, to group a collection of pills on the basis of similarity. With equal justice, the pills may be grouped with respect to color, shape, chemical structure, the diseases they are used for, manufacturer, price range or whether they have been prescribed to Mrs. Smith. Judgments of similarity are relative to context, interests, past experience, perceptual skill and purpose in mind. Attempts have been made to salvage a fixed, non-relative standard of similarity by appealing to evolution and innateness.14 This approach, too, runs into obstacles. First, if the innate determinants or biases take the form of innate perceptual quality spaces, the solution will lack generality. Positing non-perceptual innate constraints runs into other difficulties. At present, there is no convincing account of how even comparatively low-level everyday and scientific predicates would or could be encoded in our genes. Moreover, one of the hallmarks of human cognition is the ability to forge concepts that pay no heed to innate or “natural” psychological groupings. Most frequently in science, kinds gain their importance by cutting across psychologically “primitive” boundaries. Second, evolutionary pressures, by themselves, cannot explain the survival of “green” projectors over “grue” projectors. Up until time t (or, more generally, until unexamined cases are confronted), there will be no difference in their survival value. Both the green-hypothesis and the grue-hypothesis jibe with the environment thus far encountered, and both make the same predictions for the period before t. The response that other predicates like “grue” and “bleen” have turned out to be bad for survival begs the question. For what does it mean to be “like grue” or “like bleen”? What is the property common to “grue” and “bleen” that makes them peculiar and unsuitable for projection?
Finally, it cannot be argued that natural, evolutionarily favored categories are necessarily better than those not so sanctioned. In retrospect, we can see that schemes that strike us as biologically unlikely could have had important advantages over those we found natural and did employ (see [Putnam, 1983; Schwartz, 1999; Elgin, 1996]). It is important to realize as well that solutions to the New Riddle that cite similarity judgments or other psychological factors offer pragmatic, not syntactic or semantic, solutions. Nor do such solutions necessarily challenge an entrenchment account. They might be part of an explanation of entrenchment, rather than a denial of its effects.
8 VALUES, VIRTUES AND HYPOTHESIS SELECTION
Entrenchment, evidential support, conservatism, simplicity and scope were each mentioned separately as “good-making” properties that influence choice among competing hypotheses. I wish now to discuss some relations among them. 14 Quine [1990] suggests an approach along this line while recognizing its limitations. Some, for example N. Stemmer [2004], have pursued and continue to pursue this approach. Other theorists have reversed this line of argument and cite the New Riddle to justify innateness claims. See the debates concerning this strategy in Piatelli-Palmarini [1980].
8.1 Evidential support:
Hume showed that no matter how much evidence there is for an empirical hypothesis, there is no guarantee that regularities of the past will carry into the future. The New Riddle grants Hume’s point but goes further. Not only are empirical hypotheses always subject to refutation, there will always be an unlimited number of conflicting hypotheses that encompass the available data. “All emeralds are grue” has as many positive instances as “All emeralds are green”. The difference is that the positive instances of the latter confirm its projection and positive instances of the former do not. Entrenchment is offered to help account for this disparity in practice.
8.2 Conservatism:
Pragmatists have long argued that a major problem with Cartesian and other prominent accounts of inquiry is their failure to appreciate the full force of scientific conservatism. Although science continually finds it necessary to break with past commitments, it needs a certain amount of stability in its background assumptions to progress. Inquiry would be stymied if everything were up for grabs at the same time. Inquiry, as we know it, always begins in the middle of things, against a corpus of accepted hypotheses. It is constrained by these beliefs and tries to preserve them. Entrenchment is a type of conservatism. Entrenched predicates are just those that make use of concepts and patterns of projection that have been relied upon in the past. Predicates that have occurred frequently in earlier projections are preferred to those that have not. To project “grue” where “green” is compatible with the data is to go against conservatism.
8.3 Scope:
Conservatism does not entail standing pat. Devising new hypotheses that fruitfully go beyond the old is the job of science and of inquiry more generally. But how far is it reasonable to leap? The decision is not obvious, and it is complicated by the fact that the desiderata for settling on an appropriate stopping place compete. In particular, the less a theory sticks its neck out, the less likely it is to be refuted. Adopting the most conservative, minimal-risk strategy, however, would prevent progress. It would countenance inertia or possibly retreat. Alternatively, maximizing the scope of projections maximizes risk. Welcome gains in coverage are offset by diminished credibility. A related problem was raised in discussing the support/confirmation distinction. Recall that the hypothesis H∗ is the conjunction of H1 and H2. The evidence favors H1, but seems to say nothing relevant to H2. H∗ does have wider scope than H1 and is made more credible by the evidence; nevertheless, the added coverage is not warranted. H∗ overshoots the mark. Goodman [1961] highlights the trade-off between scope and credibility in his paper “Safety, Strength, Simplicity”. Suppose, he says, we have examined a
large, widely distributed number of maple trees and determined that they were all deciduous. The following three hypotheses each incorporate all the evidence.

1. All maple trees, except perhaps those in Eagleville, are deciduous.
2. All maple trees are deciduous.
3. All maples whatsoever, and all sassafras trees in Eagleville, are deciduous.

Hypothesis 2 has wider scope than hypothesis 1 and therefore is more susceptible to refutation. Hypothesis 3 goes beyond 2, but its adoption also increases risk. Of the three hypotheses, the safest projection is hypothesis 1 and the strongest projection is hypothesis 3, yet hypothesis 2 appears to be the one that gets the tradeoff right. Why? What determines the correct balance between intellectual bravado and modesty? Goodman’s answer is that the predicate “maple tree” is entrenched while the predicates “maple trees except those in Eagleville” and “maple trees plus sassafras trees in Eagleville” are not. Hypothesis 2 is the simplest, and simplicity is a value or virtue, a good-making property of hypotheses.
8.4 Simplicity:

Simplicity can be understood in several distinct ways (see [Goodman, 1951; 1972]). On the one hand, there is formal simplicity, as exemplified when the number of primitive predicates or axioms of a system is reduced. Mere counting, though, is not a reliable measure of formal simplicity. Any set of axioms can be reduced to a single axiom by conjunction. The claim that such a conjunctive axiom is really complex, a compound of simpler ideas, encounters the difficulty already noted that such analyses are always relative to the set of predicates taken to be the primitives of a system. Hypotheses are neither intrinsically conjunctive nor intrinsically non-compound. Relative to one set of primitives, a hypothesis is simple. Start with a different set of primitives and the hypothesis is syntactically and semantically complex. Other proposed measures of formal simplicity run into serious roadblocks when they are employed in explanations of our actual practices of induction and hypothesis choice. Curve fitting provides an example of some of these issues. The claim that the line plotted by X in Figure 1 is simpler than the lines of the conflicting hypotheses seems reasonable. Yet how is such simplicity to be characterized and measured? One approach has been to develop a measure of the simplicity of curves on the basis of the complexity of the functions that describe them. Thus it has been proposed that straight-line functions are mathematically simpler than those of curves, and that the functions of periodic curves are simpler than those of curves that wander in no regular order. But again the problem of system relativity rears its head. Lines that are straight when using rectangular coordinates are not straight when plotted according to polar and other coordinate systems (see [Hempel, 1966]). Thus it
is questionable whether and how any absolute measure of geometric or mathematical simplicity can itself solve the New Riddle.

The alternative to formal simplicity is psychological simplicity. Some tasks are easier to do than others, some conceptual schemes easier to apply than others and some tools easier to use than others. The five logical connectives found in most introductory logic textbooks can be reduced to a single one, the Sheffer stroke, and this reduction may be taken to be a gain in formal simplicity. From a psychological standpoint, however, it is easier to state and prove arguments using the five standard logical connectives. Goodman argues that there is an intimate relationship between psychological simplicity and entrenchment. The more a concept is employed, the more natural and simpler it is to use. Innate biases surely have some effect on psychological simplicity, but once again their influence is limited. As science progresses, the concepts found most useful tend to cut across boundaries that are plausibly imposed by innate constraints. This does not mean that entrenchment and simplicity amount to the same thing, but it would seem to indicate that they cannot be easily separated. Psychological simplicity depends on entrenchment and vice versa.15

Although Goodman’s notion of “entrenchment” is spelled out with respect to languages, it suggests a natural extension to non-linguistic systems of representation. Perhaps curves plotted in Cartesian coordinates are more entrenched than those using non-Cartesian coordinates. At the same time, curves in different systems that are assumed “co-extensive” in content will have the same entrenchment. At present there has been little work attempting to develop a theory of projectibility for non-linguistic systems.16

I maintained earlier that it probably does not pay to worry much at present about the details of the old or newly proposed rules of induction. They only sketch an approach to the problem of projection.
As they stand, the rules are inadequate to deal with a good deal of science and everyday practice. When the effects of other recognized good-making properties of hypotheses (e.g. evidential support, conservatism, scope and simplicity) are added in, the inadequacy of the rules of projection to settle inductive conflicts on their own is even more apparent. Goodman’s work offers a pragmatic picture of a particular and important virtue of hypotheses, one that seems central to an analysis of inductive practice. There is no reason to assume future developments will accord entrenchment as crucial a role as it is now assigned in resolving the New Riddle and related problems. Nevertheless, as things now stand, taking heed of entrenchment does seem to provide more insight into a range of topics than the available alternatives.
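The reduction of the five connectives to the Sheffer stroke, mentioned above in connection with formal simplicity, can be verified mechanically. A sketch (the definitions are standard, but the function names are mine):

```python
def nand(p, q):
    """The Sheffer stroke: true except when both arguments are true."""
    return not (p and q)

# The five familiar connectives, each defined from nand alone:
def NOT(p):        return nand(p, p)
def AND(p, q):     return nand(nand(p, q), nand(p, q))
def OR(p, q):      return nand(nand(p, p), nand(q, q))
def IMPLIES(p, q): return nand(p, nand(q, q))
def IFF(p, q):     return OR(AND(p, q), AND(NOT(p), NOT(q)))

# Exhaustive truth-table check against the built-in connectives:
for p in (True, False):
    for q in (True, False):
        assert NOT(p) == (not p)
        assert AND(p, q) == (p and q)
        assert OR(p, q) == (p or q)
        assert IMPLIES(p, q) == ((not p) or q)
        assert IFF(p, q) == (p == q)
```

The formal economy is real (one primitive instead of five), but the nand-only definitions are plainly harder to read and use, which is exactly the contrast between formal and psychological simplicity drawn above.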
15 There is an extensive literature on the notion of “simplicity” and its uses. See A. Zellner et al. [2001]. 16 See [Goodman, 1976; Elgin, 1997] for use of an entrenchment approach to a wide range of issues, well beyond those mentioned here.
8.5 Justification and Norms:

Granted, entrenchment may help explain our inductive practices; there remains the knotty problem of justifying these practices. Here Goodman borrows a page from Hume. Habits not only explain the practice, they are also the basis for its justification. Predicates are projected because they are entrenched, and they are entrenched because they have been projected. There is no important metaphysical or epistemic difference between the two versions. Likewise it is equally correct to say:

(i) a hypothesis is lawlike because it is projected, and projected because it is taken to be lawlike;
(ii) a hypothesis is projectible because it is confirmable by its instances, and confirmable by its instances because it is projectible;
(iii) kinds are projectible because they are natural, and natural because they have been used in past projections;
(iv) hypotheses are entrenched because they are simple, and simple because they are entrenched.

Although these pairs of notions are not identical, they are inseparably linked. One props up the other, and together they provide a solid enough base to support weighty normative claims. According to Goodman, as the rules of deductive logic aim to capture and articulate the norms of deductive practice, so the rules of inductive logic aim to capture and articulate the norms of inductive practice. As we criticize arguments for failing to come up to the deductive standards the rules of logic lay down, so we criticize inductive arguments that violate the normative rules implicit in accepted inductive practices (pace [Stich and Nisbett, 1980]; see [Elgin, 1996]). There is no deeper foundation. There is no stepping out of the circle, and were it possible to do so, nothing helpful would be found. The idea that a primary justification for projecting “All emeralds are green” rather than “All emeralds are grue” is entrenchment remains very hard to swallow.
Entrenchment, it is felt, is simply the wrong kind of thing to underwrite norms. Habit is too subjective a factor to ground or warrant inductive policies. Justification of inductive practice cannot rest on the practice of induction itself, for then there would be no objective justification for the accepted practice. There would be nothing unassailably fixed to keep inquiry from spinning out of control. Pragmatic solutions, such as Goodman’s, blur the difference between descriptions of what a practice is and principles that determine what a practice should be. A firm, mind-independent, non-pragmatic foundation is required. Unless constraints are imposed from outside actual practice, whatever standards are adopted will merely express temporary subjective human commitments and preferences.17

17 Likewise it is thought that there must be something more than practice and habit to justify the distinctions between primitive ideas and those that are derived, between laws and accidental generalizations, between natural and unnatural kinds, between items that are similar and those that are not, and between simple and complex hypotheses. Descriptions of practice lack epistemic status.

Goodman and other pragmatists, of course, do not deny that there is an important difference between the is and the ought of practice. What they do claim is that norms emerge from critical reflection on accepted practice. There is neither an a priori nor a neutral perspective outside of practice from which to impose norms on practice. Norms, however, are not mere descriptions. They aim to capture and advance our understanding of best practices. The principles that emerge from reflection on practice are not absolute, necessary or eternal. If a principle’s ruling on cases is out of sync with informed intuitions and commitments, there will be pressure to rethink the authority of the principle. If the principle has served well and holds promise for continued fruitful application, its authority may exert enough force to withstand the discomfort provoked by some of its rulings. A constant need to strike a balance between the push of new cases and the pull of principle is to be expected, and it is most unlikely that there are algorithms for determining where to locate the equilibrium point. It is always possible, as well, that in setting norms more than one resolution of the trade-offs is acceptable. Goodman’s stance is that the norms of inductive logic are objective and justified in the only way it can make sense for them to be objective and justified. Hume was on solid ground in his appeal to habit both to explain and to justify inductive practice. Pragmatic factors cannot be eliminated, and there is no good reason to seek their removal. Pluralism and the impermanence of norms do not entail that anything goes. At any given time there is usually a reasonable amount of consensus around which practitioners can rally, if not completely agree. That the consensus may shift is not only possible but welcome. What was good practice yesterday may not be acceptable today. Still, standards can be adopted and a practice can be praised or criticized accordingly.

Goodman’s bootstrap approach to justification has gained many more supporters since Rawls [1971] proposed a “reflective equilibrium” rationale for his principles of a just society. In A Theory of Justice (p. 48) Rawls, in fact, notes that his views have much in common with Goodman’s views on the justification of inductive practices. This, perhaps, is another indication of how far afield problems and issues related to the New Riddle extend.

BIBLIOGRAPHY

[Achinstein, 1983] P. Achinstein, ed. The Concept of Evidence. Oxford: Oxford University Press, 1983.
[Armstrong, 1978] D. M. Armstrong. Universals and Scientific Realism, Vol. II of A Theory of Universals. Cambridge: Cambridge University Press, 1978.
[Carnap, 1947] R. Carnap. On the Application of Inductive Logic. Philosophy and Phenomenological Research 8: 133-148, 1947.
[Carnap, 1948] R. Carnap. Reply to Nelson Goodman. Philosophy and Phenomenological Research 8: 461-462, 1948.
[Carnap, 1950] R. Carnap. Logical Foundations of Probability. Chicago: University of Chicago Press, 1950.
[Davidson, 1966] D. Davidson. Emeroses By Other Names. Journal of Philosophy 63: 778-780, 1966.
[Earman, 1985] J. Earman. Concepts of Projectibility and the Problems of Induction. Nous 19: 521-535, 1985.
[Elgin, 1993] C. Elgin. Outstanding Problems: Replies to ZiF Critics. Synthese 95: 129-140, 1993.
[Elgin, 1997] C. Elgin, ed. The Philosophy of Nelson Goodman, V. 2. New York: Garland Publishing, 1997.
[Elgin, 1996] C. Elgin. Considered Judgment. Princeton: Princeton University Press, 1996.
[Elgin, 1997] C. Elgin. Between the Absolute and the Arbitrary. Ithaca: Cornell University Press, 1997.
[Foster and Martin, 1966] M. H. Foster and M. L. Martin, eds. Probability, Confirmation, and Simplicity: Readings in the Philosophy of Inductive Logic. New York: The Odyssey Press, Inc., 1966.
[Godfrey-Smith, 2003] P. Godfrey-Smith. Goodman’s Problem and Scientific Methodology. Journal of Philosophy 100: 573-90, 2003.
[Good, 1975] I. J. Good. Explicativity, Corroboration and the Relative Odds of Hypotheses. Synthese 30: 39-73, 1975.
[Goodman, 1946] N. Goodman. A Query on Confirmation. Journal of Philosophy 43: 383-385, 1946.
[Goodman, 1947a] N. Goodman. On Infirmities of Confirmation Theory. Philosophy and Phenomenological Research 8: 149-151, 1947.
[Goodman, 1947b] N. Goodman. The Problem of Counterfactual Conditionals. Journal of Philosophy 44: 113-128, 1947.
[Goodman, 1951] N. Goodman. The Structure of Appearance. Cambridge: Harvard University Press, 1951.
[Goodman, 1983] N. Goodman. Fact, Fiction, and Forecast. Cambridge: Harvard University Press, 1st edition 1955, 4th edition 1983.
[Goodman, 1961] N. Goodman. Safety, Strength, Simplicity. Philosophy of Science 28: 150-151, 1961.
[Goodman, 1970] N. Goodman. Seven Strictures on Similarity. In Experience and Theory, L. Foster and J. Swanson, eds., pp. 19-29. Boston: University of Massachusetts Press, 1970.
[Goodman, 1972] N. Goodman. Problems and Projects. Indianapolis: Bobbs-Merrill Co., 1972.
[Goodman, 1976] N. Goodman. Languages of Art (2nd ed.). Indianapolis: Hackett, 1976.
[Hanen, 1971] M. Hanen. Confirmation and Adequacy Conditions. Philosophy of Science 38: 361-368, 1971.
[Harman and Kulkarni, 2006] G. Harman and S. Kulkarni. The Problem of Induction. Philosophy and Phenomenological Research LXXII: 559-575, 2006.
[Hempel, 1943] C. G. Hempel. A Purely Syntactical Definition of Confirmation. Journal of Symbolic Logic 8: 122-143, 1943.
[Hempel, 1945] C. G. Hempel. Studies in the Logic of Confirmation. Mind 54: 1-26 and 97-121, 1945.
[Hempel, 1960] C. G. Hempel. Inductive Inconsistencies. Synthese 12: 439-469, 1960.
[Hempel, 1966] C. G. Hempel. Philosophy of Natural Science. Englewood Cliffs: Prentice Hall, 1966.
[Hullet and Schwartz, 1967] J. Hullet and R. Schwartz. Grue: Some Remarks. Journal of Philosophy 64: 259-271, 1967.
[Hume, 2006] D. Hume. An Enquiry Concerning Human Understanding, ed. T. Beauchamp. Oxford: Oxford University Press, 2006.
[Jackson, 1975] F. Jackson. Grue. Journal of Philosophy 72: 113-131, 1975.
[Jeffrey, 1983] R. Jeffrey. The Logic of Decision (2nd ed.). Chicago: University of Chicago Press, 1983.
[Kennedy and Chihara, 1978] R. Kennedy and C. Chihara. Beyond Zabludowski and Competitors. Philosophical Studies 33: 229-53, 1978.
[Kuhn, 1977] T. Kuhn. Objectivity, Value Judgment and Theory Choice. In The Essential Tension. Chicago: University of Chicago Press, 320-39, 1977.
[Kyburg, 1964] H. E. Kyburg. Recent Work in Inductive Logic. American Philosophical Quarterly 4: 249-287, 1964.
[Lewis, 1973] D. Lewis. Causation. Journal of Philosophy 70: 556-567, 1973.
[Lewis, 1983] D. Lewis. New Work for a Theory of Universals. Australasian Journal of Philosophy 61: 343-377, 1983.
[Norton, 2006] J. Norton. How Formal Equivalence of Grue and Green Defeats What is New in the New Riddle of Induction. Synthese 150: 185-207, 2006.
[Piatelli-Palmarini, 1980] M. Piatelli-Palmarini, ed. Language and Learning: The Debate between J. Piaget and N. Chomsky. Cambridge: Harvard University Press, 1980.
[Putnam, 1983] H. Putnam. Foreword to the Fourth Edition, Fact, Fiction, and Forecast. Cambridge: Harvard University Press, vii-xvi, 1983.
[Quine, 1970] W. V. Quine. Natural Kinds. In Essays in Honor of Carl G. Hempel, ed. Nicholas Rescher et al., pp. 5-23, 1970.
[Quine and Ullian, 1978] W. V. Quine and J. Ullian. The Web of Belief (2nd ed.). New York: Random House, 1978.
[Rawls, 1971] J. Rawls. A Theory of Justice. Cambridge: Harvard University Press, 1971.
[Salmon, 1970] W. C. Salmon. Bayes’s Theorem and the History of Science. In Historical and Philosophical Perspectives on Science, Minnesota Studies in the Philosophy of Science V, ed. R. Stuewer. Minneapolis: University of Minnesota Press, 68-86, 1970.
[Scheffler, 1963] I. Scheffler. The Anatomy of Inquiry: Philosophical Studies in the Theory of Science. New York: Alfred A. Knopf, Inc., 1963.
[Scheffler and Goodman, 1972] I. Scheffler and N. Goodman. Selective Confirmation and the Ravens. Journal of Philosophy 69: 78-83, 1972.
[Scheffler, 1982] I. Scheffler. Projectibility: A Postscript. Journal of Philosophy 79: 334-36, 1982.
[Schwartz, Scheffler and Goodman, 1970] R. Schwartz, I. Scheffler, and N. Goodman. An Improvement in the Theory of Projectibility. Journal of Philosophy 67: 605-608, 1970.
[Schwartz, 1971] R. Schwartz. Confirmation and Conflict. Journal of Philosophy 68: 483-487, 1971.
[Schwartz, 1999] R. Schwartz. Reflections on Projection. In Protosociology: After the Received View, eds. G. Preyer & A. Ulfag, 1999.
[Schwartz, 2005] R. Schwartz. Note on Goodman’s Problem. Journal of Philosophy, 375-79, 2005.
[Shoemaker, 1975] S. Shoemaker. On Projecting the Unprojectible. Philosophical Review 84: 178-219, 1975.
[Shoemaker, 1980] S. Shoemaker. Properties, Causation and Projectibility. In Applications of Inductive Logic, eds. L. J. Cohen & M. Hesse. Oxford: Oxford University Press, 291-312, 1980.
[Stalker, 1994] D. Stalker, ed. Grue: The New Riddle of Induction. Chicago: Open Court, 1994.
[Stemmer, 2004] N. Stemmer. The Goodman Paradox: Three Different Problems and a Naturalistic Solution of Two. Journal for General Philosophy of Science 35: 351-370, 2004.
[Stich and Nisbett, 1980] S. Stich and R. E. Nisbett. Justification and the Psychology of Reasoning. Philosophy of Science 47: 188-202, 1980.
[Swinburne, 1974] R. Swinburne, ed. The Justification of Induction. Oxford: Oxford University Press, 1974.
[Ullian and Goodman, 1975] J. Ullian and N. Goodman. Bad Company: A Reply to Mr. Zabludowski and Others. Journal of Philosophy 72: 142-145, 1975.
[Wilson, 1979] M. Wilson. Maxwell’s Condition - Goodman’s Problem. British Journal for the Philosophy of Science 30: 107-123, 1979.
[Zabludowski, 1974] A. Zabludowski. Concerning a Fiction about How Facts are Forecast. Journal of Philosophy 71: 97-112, 1974.
[Zellner, Keuzenkamp and McAleer, 2001] A. Zellner, H. Keuzenkamp, and M. McAleer, eds. Simplicity, Inference and Modeling. Cambridge: Cambridge University Press, 2001.
THE DEVELOPMENT OF SUBJECTIVE BAYESIANISM
James M. Joyce

The Bayesian approach to inductive reasoning originated in two brilliant insights. In 1654 Blaise Pascal, in the course of a correspondence with Fermat [1769], recognized that states of uncertainty can be quantified using probabilities and expectations. In the early 1760s Thomas Bayes [1763] first understood that learning can be represented probabilistically using what is now called Bayes’s Theorem. These ideas serve as the basis for all Bayesian thought.
1.1 Pascal’s Insights: Probability and Expectation
In modern terms, Pascal’s insight is that uncertainty about the occurrence of an event can be expressed as a probability and, more generally, that uncertainty about the value of a quantity can be expressed as a mathematical expectation. The basic objects of uncertainty can be thought of as propositions or events in a non-empty Boolean algebra Ω that is closed under negation and countable disjunction. A probability function on Ω is a mapping P of Ω into the real numbers that obeys these laws:

Normality. For any A ∈ Ω, P(A ∨ ¬A) = 1 and P(A ∧ ¬A) = 0.

Finite Additivity. P(A ∨ B) + P(A ∧ B) = P(A) + P(B).

Continuity. If A1 ⊆ A2 ⊆ A3 ⊆ . . . is a countable sequence of events with A = ∨n An, then P(An) converges to P(A).

These laws make probabilities countably additive, so that P(∨n An) = Σn P(An) for any countable set of contraries {A1, A2, . . .}. They also ensure that probabilities respect logical relationships, so that P(A) = 1 when A is a logical truth, P(A) = 0 when A is a contradiction, and P(A) ≥ P(B) when A entails B. A random variable is a function f that assigns a real number f(An) to each element of a partition1 {A1, A2, . . .} in Ω. f’s expected value relative to P is ExpP(f) = Σn P(An)·f(An). Pascal maintained that the expected value of a quantity provides the best estimate of its actual value. A useful example is provided by the puzzle that inspired Pascal to invent the concept of an expectation.

1 A partition is a set of contrary propositions whose disjunction is a logical truth.
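The definition of an expectation over a partition is easy to check numerically. A minimal sketch (the partition cells, probabilities and values below are made up for illustration):

```python
# Exp_P(f) = sum over partition cells An of P(An) * f(An).

def expectation(P, f):
    """Expected value of f relative to P; both map partition cells to numbers."""
    assert abs(sum(P.values()) - 1.0) < 1e-9, "P must sum to 1 over the partition"
    return sum(P[cell] * f[cell] for cell in P)

# A made-up three-cell partition {A1, A2, A3}:
P = {"A1": 0.5, "A2": 0.3, "A3": 0.2}
f = {"A1": 10.0, "A2": 0.0, "A3": -5.0}

print(expectation(P, f))  # 0.5*10 + 0.3*0 + 0.2*(-5) = 4.0
```

The same function computes each player’s fair share in the Problem of the Points below, with the endgames as cells and the payoffs as values of f.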
Handbook of the History of Logic. Volume 10: Inductive Logic. Volume editors: Dov M. Gabbay, Stephan Hartmann and John Woods. General editors: Dov M. Gabbay and John Woods. © 2011 Elsevier BV. All rights reserved.
Problem of the Points. Joe and Moe are tossing a fair coin until five heads or five tails come up. If five heads come up first, Joe wins a pot of $100; if five tails come up first, Moe wins the pot. After tosses of h, t, h, t, h, h the game is interrupted, and Joe and Moe must split the pot. An even split is unfair to Joe since he was likely to win, but giving it all to Joe would be unfair to Moe, who still had a chance of winning. Pascal proposed that each player’s fair share is his expected payoff. To obtain these values one finds probabilities for each endgame — P(h) = 1/2, P(t, h) = 1/4, P(t, t, h) = 1/8, P(t, t, t) = 1/8 — and computes Exp($Joe) = 1/2·$100 + 1/4·$100 + 1/8·$100 + 1/8·$0 = $87.50 and Exp($Moe) = 1/2·$0 + 1/4·$0 + 1/8·$0 + 1/8·$100 = $12.50.

Though not explicit about it, Pascal clearly understood that expectations satisfy the following laws:

Linearity. If f(•) = a·g(•) + b·h(•) + c, then Exp(f) = a·Exp(g) + b·Exp(h) + c.

Dominance. If f(An) ≥ g(An) for all An then Exp(f) ≥ Exp(g). If, in addition, f(An) > g(An) for some An with P(An) > 0, then Exp(f) > Exp(g).

Continuity. If A1 ⊆ A2 ⊆ A3 ⊆ . . . is a countable sequence of events with A = ∨n An, then Exp(An) converges to Exp(A).

These principles follow easily from the laws of probability. Conversely, they entail the laws of probability when applied to indicator functions or truth-valuations, which assign elements of Ω truth-values in {0, 1} in a consistent way, where v(A) = 1 signifies truth and v(A) = 0 indicates falsity. Dominance ensures Exp(v(⊤)) ≥ Exp(v(A)) ≥ Exp(v(⊥)) for all A. Linearity entails Exp(v(⊤)) = 1, Exp(v(⊥)) = 0, and Exp(v(A ∨ B)) + Exp(v(A ∧ B)) = Exp(v(A)) + Exp(v(B)) since v(A ∨ B) + v(A ∧ B) = v(A) + v(B). Exp(v(An)) converges to Exp(v(A)) by Continuity. So, expectations of truth-value obey the laws of probability. Expectations of truth-value, however, just are probabilities, since ExpP(v(A)) = P(A)·1 + P(¬A)·0 = P(A).
Thus we see that Pascal’s view that uncertain events should be evaluated using probabilities and his view that random variables should be evaluated using expectations are at root one idea.
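Pascal’s division of the stakes can also be checked by brute-force enumeration of the remaining tosses. This is a sketch, not anything in the original correspondence; the function name and interface are invented for illustration:

```python
from fractions import Fraction
from itertools import product

def joe_win_prob(heads_needed, tails_needed):
    """Probability that heads_needed heads occur before tails_needed tails,
    enumerating every fair-coin sequence long enough to decide the game."""
    n = heads_needed + tails_needed - 1  # the game is decided within n tosses
    wins = 0
    for seq in product("ht", repeat=n):
        h = t = 0
        for toss in seq:
            if toss == "h":
                h += 1
            else:
                t += 1
            if h == heads_needed:   # Joe clinches
                wins += 1
                break
            if t == tails_needed:   # Moe clinches
                break
    return Fraction(wins, 2 ** n)   # each length-n sequence has probability 1/2^n

# After h, t, h, t, h, h Joe needs 1 more head and Moe needs 3 more tails:
p = joe_win_prob(1, 3)
print(p)               # 7/8
print(float(p) * 100)  # Joe's fair share of the $100 pot: 87.5
```

Enumerating full-length sequences works because every length-n continuation is equally likely; Pascal’s endgame probabilities (1/2, 1/4, 1/8, 1/8) simply group these sequences by when the game ends.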
1.2 Bayes’s Insights: Conditional Probability and Bayesian Updating

Thomas Bayes’s insight was to recognize the central role that conditional probabilities play in learning. Unlike unconditional probabilities, which reflect all-things-considered uncertainties, conditional probabilities reflect uncertainties about one event on the supposition that another occurs. The probability of A conditional on C, written P(A|C), is required to satisfy the following law:

Conditional Probability. P(A ∧ C) = P(C)·P(A|C)
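This law can be verified on any finite joint distribution, as can Bayes’s Theorem, P(A|C) = P(A)·P(C|A)/P(C), which follows from applying the law twice. A minimal sketch with made-up numbers:

```python
# A made-up joint distribution over the four atoms A&C, A&~C, ~A&C, ~A&~C.
joint = {("A", "C"): 0.12, ("A", "~C"): 0.18,
         ("~A", "C"): 0.28, ("~A", "~C"): 0.42}

P_A = joint[("A", "C")] + joint[("A", "~C")]   # marginal P(A) = 0.30
P_C = joint[("A", "C")] + joint[("~A", "C")]   # marginal P(C) = 0.40
P_A_given_C = joint[("A", "C")] / P_C          # P(A|C)
P_C_given_A = joint[("A", "C")] / P_A          # P(C|A)

# The defining law: P(A & C) = P(C) * P(A|C)
assert abs(joint[("A", "C")] - P_C * P_A_given_C) < 1e-12

# Bayes's Theorem: P(A|C) = P(A) * P(C|A) / P(C)
assert abs(P_A_given_C - P_A * P_C_given_A / P_C) < 1e-12
```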
The Development of Subjective Bayesianism
Clearly, P(A|C) can be expressed as P(A ∧ C)/P(C) when C has positive probability, but the law leaves P(A|C) unspecified when P(C) = 0. Some theories allow for the assignment of probabilities conditional on the supposition of events of probability zero. See, e.g., [Popper, 1959; Rényi, 1955]. Bayesians use conditional probabilities both to capture evidential relationships and to describe the effects of learning. The following five principles are essential to understanding the role of conditional probabilities in Bayesian accounts of inductive reasoning.

Conditional Probability. P(·|C) is a probability on Ω for which P(C|C) = 1.

Total Probability. If {C1, C2, . . .} is a partition, then P(A) = Σn P(Cn)·P(A|Cn).

Correlation. A and C are positively correlated (i.e., P(A ∧ C) > P(A)·P(C)) exactly if P(A|C) > P(A), and they are uncorrelated exactly if P(A|C) = P(A).

Preservation. Conditioning on C does not disturb ratios of probabilities for events entailed by C, so that P(A ∧ C|C)/P(B ∧ C|C) = P(A ∧ C)/P(B ∧ C) for all A, B.

Bayes's Theorem. P(A|C) = P(A)·(P(C|A)/P(C))

Conditional Probability tells us that conditioning always produces a new probability function that makes the condition certain. Total Probability expresses A's unconditional probability as a weighted average of its conditional probabilities. Correlation, a key element in Bayesian theories of evidence, captures the idea that one event is positively/negatively correlated with another to the extent that the occurrence of the first raises/lowers the second's probability. The "Bayes factor" βP(A, C) = P(A|C)/P(A), which provides one way of expressing the change that conditioning on C makes to A's probability, is a measure of this correlation. A and C are positively correlated when βP(A, C) > 1, and perfectly correlated when βP(A, C) = βP(A, A) = 1/P(A). A and C are independent when βP(A, C) = 1.
They are anti-correlated when βP(A, C) < 1, and perfectly so when βP(A, C) = βP(A, ¬A) = 0. Preservation ensures that conditioning on C produces a probability that "minimally departs" from P: if Q is a probability with Q(C) = 1 and Q(A ∧ C)/Q(B ∧ C) = P(A ∧ C)/P(B ∧ C) for all A, B, then Q(•) = P(•|C). Bayes's Theorem sits at the heart of Bayesian approaches to inductive reasoning. Bayes is remembered not so much for discovering the theorem, a mathematical triviality, but for recognizing its significance. It relates the "direct" probability of one event conditional on another to the unconditional probabilities of the two events and the "inverse probability" of the second event conditional on the first. As Bayes realized, there are many circumstances in which (a) one is interested in knowing the "direct" probability of some hypothesis conditional on certain data, (b) it is fairly easy to discover or deduce the "inverse" probability of the data
James M. Joyce
conditional on the hypothesis, and (c) one has "prior" information that allows one to estimate the probability of the hypothesis in the absence of the data. In such situations, Bayes's little theorem provides a way of arriving at the desired quantity in (a) from the information in (b) and (c). More generally, imagine that one might receive an item of data x that is relevant to assessing the probability distribution over a partition of hypotheses H. If one knows each of the "inverse probabilities" P(x|h) and has a "prior" probability P(h) for each h ∈ H, then Bayes's Theorem allows one to compute

P(h|x) = P(h)·P(x|h)/[Σg∈H P(g)·P(x|g)]
In this way, "posterior" probabilities for hypotheses conditional on data are entirely determined by "prior" probabilities of hypotheses and "inverse probabilities" of data given hypotheses. Notice also that the expression P(x|h)/[Σg P(g)·P(x|g)] is just the Bayes factor β(h, x). So, the theorem says P(h|x) = P(h)·β(h, x), which makes it clear that the Bayes factor measures the change that conditioning on x makes to h's probability. Another illuminating form of the theorem reveals itself when we focus on odds and likelihoods rather than probabilities. The unconditional (conditional) odds of h to g is the ratio of P(h) to P(g) (or of P(h|x) to P(g|x)). Statisticians use the term "likelihood" to denote inverse probabilities. They call the map Lx : H → [0, 1] defined by Lx(•) = P(x|•) the likelihood function for x, and Lx(h)/Lx(g) = P(x|h)/P(x|g) the likelihood ratio of h to g. In this jargon, Bayes's Theorem says that the ratio of the posterior odds to the prior odds is the likelihood ratio: [P(h|x)/P(g|x)]/[P(h)/P(g)] = Lx(h)/Lx(g). The likelihood ratio is thus the factor by which we multiply unconditional odds to get conditional odds. In terms of Bayes factors, Lx(h)/Lx(g) = β(h, x)/β(g, x). This formulation is noteworthy because it shows that, in contrast with probabilities, changes in odds among hypotheses produced by conditioning on x do not depend on the prior probability over H: ratios of likelihoods suffice.

Example. Joe is being tested for the presence of a rare gene. We want to know how a positive result should affect our estimate of the chances that he has it. We know the test's true positive rate L+(gene) = P(+test | gene) = 0.9, and its false positive rate L+(¬gene) = P(+test | ¬gene) = 0.3. We also know the gene occurs naturally in only one in a thousand cases, and have no reason to think Joe is special. So, P(gene) = 0.001.
Under these circumstances, Bayes's theorem tells us that P(gene | +test) = 0.001·0.9 / [0.001·0.9 + 0.999·0.3] ≈ 0.003. We can also use the likelihood ratio to determine how a positive test will alter the odds of Joe having the gene. Since L+(gene)/L+(¬gene) = 3, the odds will triple. If our pre-test situation had been different and, say, we knew that Joe's mother has the gene, and so P(gene) = 0.5, then P(gene | +test) = 3/4 owing to the higher unconditional probability. But, the likelihood ratio is still 3.
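The arithmetic in this example is easy to check directly. A sketch (the function name is ours, not the text's):

```python
def posterior(prior, true_pos, false_pos):
    # Bayes's Theorem over the two-cell partition {gene, no gene}:
    # P(gene | +) = P(gene)·P(+|gene) / [P(gene)·P(+|gene) + P(¬gene)·P(+|¬gene)]
    joint = prior * true_pos
    return joint / (joint + (1 - prior) * false_pos)

low = posterior(0.001, 0.9, 0.3)    # rare-gene prior: posterior ≈ 0.003
high = posterior(0.5, 0.9, 0.3)     # mother-has-gene prior: posterior = 0.75
# In both cases the test contributes the same likelihood ratio
# L+(gene)/L+(¬gene) = 0.9/0.3 = 3; the priors account for the difference.
```

The two posteriors differ by more than two orders of magnitude even though the evidential import of the test, measured by the likelihood ratio, is identical.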
Bayes's other great insight was to recognize that the conditional probabilities governed by his theorem are closely tied to learning. In modern terms, we would express his idea like this:

Learning as Bayesian Updating. Imagine a person whose state of uncertainty is characterized by a "prior" probability P0 on Ω, and who is not dogmatic about x, so that 1 > P0(x) > 0. If the person undergoes a learning experience in which the only new information she acquires is that x is certainly true, then her post-learning "posterior" probability P1 should coincide with her pre-learning probability conditional on x, P1(•) = P0(•|x).

Some people call this "conditioning," others "conditionalization," but the basic idea is the same: dogmatic learning, the kind in which one becomes certain of x (and this is all one learns), involves reapportioning probabilities so that all probability that was previously invested in ¬x is shifted onto x in a way that preserves probability ratios among propositions that entail x. The laws of conditional probability ensure that Bayesian updating has features that seem desirable in any dogmatic learning rule.

Dogmatism. Updating on x makes x certain: P1(x) = 1.

Preservation. Updating on x leaves certainties intact: P1(y) = 1 whenever P0(y) = 1.

Coherence. P1 is a probability if P0 is a probability.

Responsiveness to Evidence. If x is evidence for (against) h according to the pre-learning probability, then updating on x raises (lowers) h's probability.

Minimal Change. Updating on x does not disturb ratios of probabilities of events entailed by x, i.e., the Bayes update factors P1(y ∧ x)/P0(y ∧ x) = 1/P0(x) are constant.

Accumulation. Learning is accumulative in the sense that updating on x1 and then x2 is equivalent to updating on their conjunction.²

Commutativity. The temporal order in which data is acquired is irrelevant to its evidential import.
If Q is obtained from P0 by updating on x1 and then on x2, and if Q∗ is obtained from P0 by conditioning on x2 and then on x1, then Q(•) = Q∗(•). All these features seem like strengths given that Bayesian updating on x is only appropriate as a response to learning experiences whose entire content is that x, and nothing stronger, is certainly true, and where x does not contradict anything previously known. One might have worries about how common these sorts of experiences are (see §3.2), but Bayesian updating is the right way to model them when they do occur.

² This should not be confused with monotonicity, the idea that if learning x supports h then learning x and y will support h as well. Bayesian updating is highly non-monotonic.
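Dogmatism, Accumulation, and Commutativity can all be checked on a small discrete example. The probability space and numbers below are illustrative, not from the text:

```python
def condition(P, event):
    """Bayesian updating: shift all mass onto `event`, preserving
    probability ratios among outcomes that fall inside it."""
    z = sum(p for w, p in P.items() if event(w))
    return {w: (p / z if event(w) else 0.0) for w, p in P.items()}

# Four atomic outcomes with an arbitrary (illustrative) prior
P0 = {('a', 1): 0.1, ('a', 2): 0.2, ('b', 1): 0.3, ('b', 2): 0.4}
x1 = lambda w: w[0] == 'a'                        # learn the first coordinate is 'a'
x2 = lambda w: w[1] == 1                          # learn the second coordinate is 1

seq   = condition(condition(P0, x1), x2)          # update on x1, then x2
rev   = condition(condition(P0, x2), x1)          # update on x2, then x1
joint = condition(P0, lambda w: x1(w) and x2(w))  # update on the conjunction at once
# All three posteriors coincide, and each makes the learned event certain.
```

Here the three routes all end at the same posterior, which puts probability 1 on the single outcome ('a', 1) compatible with everything learned.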
1.3 The Basic Bayesian Apparatus for Inductive Reasoning
As we have seen, Pascal and Bayes had four big ideas:

• Uncertainty is best represented using probabilities.
• Estimates of uncertain quantities are expectations.
• Conditional probabilities are fruitfully interpreted using Bayes's Theorem.
• Dogmatic learning experiences involve conditioning on the data received.

Let's call this the basic Bayesian apparatus for inductive reasoning. To see how it works, consider an abstract model of inductive problems. A Bayesian experimental setup is a triple ⟨H, X, P⟩ where H is a partition of hypotheses, X is a partition of potential data, and P is a probability defined over the Boolean algebra H ∧ X generated by all conjunctions h ∧ x with h ∈ H and x ∈ X. A learning experience is an exogenous change in the probability distribution on X whose direct effect is to replace each "prior probability" P(x) by a "posterior probability" Q(x).³ In some experiences a single data item is learned for certain, in which case Q(x) = 1 for some x, but the model allows for experiences that readjust probabilities over X in other ways as well. It will, however, be required that learning never "wakes the dead" by raising an event of prior probability zero to positive probability. The goal of Bayesian inference is to explain how learning-induced changes in the probabilities over X ramify through the rest of H ∧ X, in particular how they alter probabilities on H. Given a "prior" P defined over H ∧ X and a partial posterior QX defined over X alone, the goal is to find the extension Q of QX over all of H ∧ X that is best justified in light of both the information encoded in the prior and the new evidence. Bayesian updating rules amalgamate this evidence into a single posterior that agrees with the new observations about X and reflects the input of prior information. Different rules are appropriate for different evidential inputs. In the simplest learning experiences the experiment will show only that the truth lies within some subset X of X.
Learning as Conditioning then requires that Q(•) = P(•|X) and Bayes's Theorem tells us that, for each h ∈ H,

Q(h) = P(h)·[P(X|h)/P(X)]
     = P(h)·[Σx P(x|X)·(P(x|h)/P(x))]
     = P(h)·[Σx (Q(x)/P(x))·P(x|h)]
³ I am slurring over the distinction between probability functions and probability densities. One needs to be careful about this when either H or X is uncountable.
If we wanted to estimate the value of some random variable f defined on H in light of this information we would use

Exp_Q(f) = Σh f(h)·Q(h)
         = Σh f(h)·P(h)·[P(X|h)/P(X)]
         = Σh f(h)·P(h)·[Σx (Q(x)/P(x))·P(x|h)]

Example. Three balls are about to be drawn, with replacement, from an urn that contains black and white balls. You know that the urn can be of two types: Type-1 urns contain 20% black balls; Type-2 urns contain 60% black balls. You want to know the urn's type, and are also interested in estimating the number of black balls that will appear among the last two balls on the basis of information about the first ball. Here H = {Type1, Type2} and X = {Black, White}. Let's also suppose that you have information that leads you to think that P(Type1) = 0.25 and P(Type2) = 0.75. The prior probability then looks like this:

∧               Black (0.5)    White (0.5)
Type1 (0.25)    0.05           0.20
Type2 (0.75)    0.45           0.30
Suppose the first ball drawn is black. Conditioning on this information requires you to move all the posterior probability on to Black and to preserve ratios among events of the form Typem ∧ Black, so that the posterior looks like this:

∧              Black (1)    White (0)
Type1 (0.1)    0.1          0
Type2 (0.9)    0.9          0
To estimate the number of blacks that will appear in the next two draws, first use the law of total probability to compute the probabilities: Q(BB) = 0.328, Q(BW) = Q(WB) = 0.232, Q(WW) = 0.208, and then the expectation is given by

Exp_Q(#B) = 2·Q(BB) + 1·Q(BW) + 1·Q(WB) + 0·Q(WW) = 1.12
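The urn example can be verified in a few lines. The sketch below conditions the stated prior on a black first draw and then uses linearity of expectation (each later draw is black with probability Σh Q(h)·P(Black|h)) instead of enumerating BB, BW, WB, WW; both routes give the same estimate:

```python
prior   = {'Type1': 0.25, 'Type2': 0.75}   # prior over urn types
p_black = {'Type1': 0.20, 'Type2': 0.60}   # chance of black on any one draw, per type

# Condition on a black first draw: Q(h) = P(h)·P(Black|h) / P(Black)
p_b  = sum(prior[h] * p_black[h] for h in prior)          # total probability: 0.5
post = {h: prior[h] * p_black[h] / p_b for h in prior}    # Type1 -> 0.1, Type2 -> 0.9

# Expected blacks among the next two (conditionally independent) draws
q_next     = sum(post[h] * p_black[h] for h in post)      # P(black on a later draw) = 0.56
exp_blacks = 2 * q_next                                   # = 1.12
```

Enumerating the four two-draw outcomes under the posterior gives the same 1.12, as linearity of expectation guarantees.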
1.4 Comparison with Frequentist Approaches
It may be instructive to contrast the Bayesian approach to inductive inference with the likelihood based approaches of "frequentist" statisticians. One can think of the probability that appears in a Bayesian experimental setup as being factorable into a prior probability P^H over H and a family of normalized⁴ likelihood functions Lx : H → [0, ∞), one for each x ∈ X. The likelihoods must agree with P^H in the sense that Σx Σh P^H(h)·Lx(h) = 1. Together the prior and likelihoods determine an unconditional probability for each atomic element of H ∧ X via the rule P(x ∧ h) = P^H(h)·Lx(h), and this fixes P over all of H ∧ X. For example, the probability of a data item x ∈ X is the expected value of its likelihood function P(x) = Σn P^H(hn)·Lx(hn). Frequentist statisticians also draw inductive inferences using likelihoods, but they eschew priors. In frequentist experiments the probability P is replaced by a family of general likelihood functions lx : H → [0, ∞) that are only determined up to multiplication by a positive constant, so that lx(h) = λx·P(x|h) where λx > 0 is a constant that can depend on x but not h. The values of lx have no meaning taken individually. In particular, they cannot be directly identified with inverse probabilities in a Bayesian model. They can, however, be used to express facts about the relative degree to which a datum x is predicted by hypotheses in H. For example, one can say that a given lx assumes its maximum at hx, or one can compute likelihood ratios, lx(h)/lx(g) = P(x|h)/P(x|g), which compare x's expectedness given distinct hypotheses. Since lx(h)/lx(g) = β(h, x)/β(g, x), this means that classical statisticians can, if they want, describe certain kinds of changes in the probabilities of hypotheses. They can say, e.g., that learning x increases the probability of h as a proportion of its prior more than it increases the probability of g as a proportion of its prior. But, whatever happens, they will not say anything about the absolute probabilities of hypotheses in light of the data since this information cannot be extracted from the likelihood function without invoking unconditional probabilities for elements of H.
On a classical picture, all inductive reasoning boils down to drawing inferences from observed data based on facts about likelihoods. This suggests a very different picture of inductive inference from what one finds in the Bayesian approach. For example, to estimate the value of a random variable f : H → ℝ in light of datum x the classical statistician cannot calculate f's expected value conditional on x since that invokes a prior over H. Instead, she might use maximum likelihood estimation and estimate f's value as f(hx) where hx is the hypothesis in H for which lx(h) attains its maximum. Likewise, the classical statistician cannot adopt a broadly Bayesian policy of assessing the acceptability of hypotheses on the basis of their posterior probabilities, since these depend on priors. She might, instead, decide to reject a hypothesis if, in the event it were true, the probability of observing the data actually observed or even more unlikely data falls below some threshold value. By adopting a model that permits both likelihoods and priors, Bayesians have been able to secure a far richer and more coherent theory of inductive inference than anything to which frequentist statisticians might aspire. As frequentists like Fisher [1959] and Neyman [1950] have been quick to reply, however, there are substantial costs associated with these benefits. The use of a prior in drawing inductive inferences requires one to trust its probabilities. Sometimes this is fine. In situations where there are determinate, objectively measurable and agreed upon probabilities for hypotheses and potential data items, everyone will agree that there is no better way to draw conclusions than by using the Bayesian apparatus. Everybody should be a Bayesian in the casino! Even so, there is no getting around the fact that Bayesianism is a garbage-in, garbage-out enterprise: if one applies the apparatus using a prior that is accurate and well justified, the conclusions derived will be accurate and well justified as well; if one applies the apparatus using a prior that is inaccurate or unjustified, the conclusions will also be inaccurate or unjustified. This is the heart of frequentist misgivings. From the perspective of frequentist statisticians, Bayesian methods carry a massive uncollateralized risk of error. By analogy, suppose that a rogue group of logicians proposed to redefine the notion of logical consequence as follows: p counts as a logical consequence of q not only when p can be deduced from q by the laws of logic, but also when p can be deduced from q together with r, s, t, . . ., where r, s, t, . . . are "prior" premises the rogue logicians find reasonable. While the rogues will be able to deduce far more than any classical logician can, their reliance on "priors" introduces an objectionable new source of error. Indeed, it would seem that part of the reason to have a logical consequence relation is to avoid risking such errors. Frequentist statisticians see Bayesians as rogues of the same sort.

⁴ L is normalized when there are non-negative constants μx, one for each x ∈ X, with Σx μx = 1, such that the numbers px = Σn μx·Lx(hn) sum to one. This ensures that the Lx(hn) can be consistently thought of as being equal to the probability of x conditional on hn.
They are skeptical of Bayesian methods because they doubt that prior probabilities can be made epistemologically respectable, and feel that their introduction threatens to undermine the accuracy and objectivity of our inductive reasoning.

2 THE PROBLEM OF THE PRIORS

Bayesians must address this criticism head-on by offering some rationale for the use of priors in inductive reasoning. Three broad sorts of rationales have been proposed. Objective Bayesians maintain that certain priors can be justified a priori as the uniquely correct way to represent uncertainty. Subjectivists argue, in contrast, that priors reflect the subjective degrees of belief, or credences, of agents. They are bound by no requirements, save the laws of probability. Tempered Bayesians believe that priors reflect subjective credences, but suggest that agents who update in light of evidence will, at least eventually, end up using "priors" that are well justified and likely to be accurate. Let's consider these three approaches in turn.
2.1 Objective Bayesianism and Ignorance Priors
The first objective Bayesian was Bayes himself, but the driving force behind the approach was unquestionably Pierre-Simon Laplace, whose Principle of Insufficient Reason⁵ was meant to provide an a priori rationale for imposing uniform prior distributions in contexts where the data provides no basis for distinguishing among rival hypotheses. In Laplace's terminology, the hypotheses in H are "equipossible" when nothing in the available evidence favors any one over any other. When the evidence is symmetrical in this way, Laplace reasoned, the probability assignment that best reflects this evidence is symmetrical as well. He codified this insight in the following principle:

PIR. If the available evidence provides no reasons to favor any hypothesis in H over any other, then the uniquely correct prior to assign is the uniform distribution in which P(h) = P(g) for all h, g ∈ H.

Notice that the Principle does not say anything about the quantity or quality of the "available evidence". This is one of its most controversial features since the main purpose of PIR is to allow for the assignment of prior probabilities in contexts where little or no evidence exists. These assignments — which have been called "ignorance priors," "uninformative priors," "informationless priors" or "reference priors" — are supposed to provide the input for the Bayesian apparatus in contexts where there is not much hard data to be had.

Example. An urn was selected from a population {U0, U1, . . . , U10} where Ui contains i black balls and 10 − i white balls. What is the prior probability that a ball randomly drawn from the urn will be black? Before answering, consider three ways of fleshing out the story.

Case 1. We know that our urn is U5.

Case 2. We know that our urn was selected via a random process in which each Ui had an equal objective chance of being chosen.

Case 3. We know nothing about the identity of the urn or about the process by which it was selected.
Since our evidence for Black and ¬Black is symmetrical in all three cases, PIR tells us that the right probability is P(Black) = P(¬Black) = 1/2 in all three cases. Case 3 is the controversial one. In Case 1 and Case 2 PIR gets things right, but the answer can be derived independently from the plausible requirement that priors should line up with known objective chances (see §4.2). In Case 3, however, we know almost nothing about the chances of Black. How are we supposed to go from this sparse evidential basis to the same probability assignment that is warranted in the other two cases? The standard rationale goes like this: Consider all possible chance hypotheses about the way the urn was selected, i.e., the set Π of probability distributions on {0, 1, . . . , 10}. Since our evidence provides no grounds for preferring any π ∈ Π to any other, our prior should not play favorites among Π's elements. The only way to avoid playing favorites is by assigning each π ∈ Π the same probability. Thus, the right prior for this problem is uniformly distributed over Π, which forces P(Black) to be 1/2. In effect, Case 3 is reduced to Case 2.

This reasoning offers the glittering prospect of a Bayesian inductive logic in which prior probabilities are justified a priori on the basis of the sound epistemological principle that one should not include information in one's prior that is not found in the data. Once such a uniquely correct "ignorance prior" is in place, the Bayesian apparatus tells us everything there is to know about inductive reasoning. To obtain the posterior probability distribution that is best justified in light of one's data one should update the ignorance prior by conditioning on that data. Here is a noteworthy application of this method.

Example. You are presented with a coin of unknown bias. You toss it once and it comes up heads. What is the probability that it will land heads on the second toss? Laplace [1774] argued that this probability should be exactly 2/3. Since you know nothing about the coin's bias p, he reasoned, you should invoke PIR and adopt a prior with the uniform density dp over the unit interval. The probability of a head on the first toss is then ∫₀¹ p dp = 1/2. If you observe a head and update via Bayes's theorem, you obtain a posterior density of 2p dp, and the probability of getting a second head is ∫₀¹ p·(2p dp) = 2/3. If you observe a second head and update, you get a density of 3p² dp and the probability of a third head is ∫₀¹ p·(3p² dp) = 3/4.

⁵ The name comes from von Kries [1871]. The name Principle of Indifference is used by Keynes [1921].
More generally, if you keep tossing and conditioning you will emulate Laplace's rule of succession, which says that the probability of observing a head on the N+1st trial given s heads and N−s tails on previous trials is (s+1)/(N+2). If this broad approach to inductive reasoning is correct, then objective Bayesians have an answer for frequentist critics. It is legitimate to invoke ignorance priors when assessing the impact of data on hypotheses, they will argue, because the "added premises" can be justified a priori. Among all the probability distributions that could be applied to a given inductive problem, the ignorance prior is the one that introduces the least amount of additional information: any other way of proceeding would require drawing distinctions among hypotheses that are not justified by the data. This Laplacian picture of inductive logic, which was greatly advanced by Jeffreys [1939], finds its fullest expression in the work of E. T. Jaynes [2003]. Jaynes, a militant objectivist about priors, writes:

Consistency demands that two persons with the same relevant prior information should assign the same prior probabilities. . . Objectivity requires that a statistical analysis should make use, not of anybody's personal opinions, but rather the specific factual data on which those opinions are based. [1968, p. 53]
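Laplace's rule of succession can be cross-checked numerically: with a uniform prior density over the bias p, the predictive probability of a head after s heads in N tosses is a ratio of two integrals, which the sketch below approximates by a midpoint sum (the function names are ours):

```python
from fractions import Fraction

def succession(s, N):
    """Laplace's rule of succession: P(head on trial N+1 | s heads in N trials)."""
    return Fraction(s + 1, N + 2)

def predictive(s, N, steps=20000):
    """Midpoint-rule approximation of ∫ p·p^s·(1−p)^(N−s) dp / ∫ p^s·(1−p)^(N−s) dp."""
    dp = 1.0 / steps
    num = den = 0.0
    for i in range(steps):
        p = (i + 0.5) * dp
        w = p**s * (1 - p)**(N - s)
        num += p * w          # numerator: posterior expectation of the bias
        den += w              # denominator: normalizing constant
    return num / den

# One observed head gives 2/3; two give 3/4, matching the integrals in the example.
assert abs(predictive(1, 1) - float(succession(1, 1))) < 1e-6
assert abs(predictive(2, 2) - float(succession(2, 2))) < 1e-6
```

The predictive probability is just the posterior expectation of the bias, which is why conditioning and estimating by expectation mesh so neatly here.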
Jaynes seeks to secure consistency and objectivity by using information theory to generalize PIR. In Jaynes's [1973] picture, an inductive problem is defined by a set of objective constraints that stipulate expected values⁶ c1, . . . , cK for random variables f1, . . . , fK defined on H. An ignorance prior P∗ for such a problem must satisfy two conditions: it must yield the required expectations, so that Exp_P∗(fm) = cm for all m; and it must maximize entropy Ent(P) = −Σm P(hm)·ln(P(hm)) across all probabilities P that yield the required expectations. Under broad conditions, there will always be a unique such P∗. Moreover, since Ent measures the amount of information that a probability conveys about elements of H, Jaynes argues that using P∗ as one's prior introduces the least amount of additional information into the problem consistent with the constraints. The proposal, then, is this:

MaxEnt. If the available evidence specifies that Exp_P(fk) = ck for all k, then the uniquely correct prior P∗ is such that Ent(P∗) > Ent(P) for all other P that produce the required expectations.

This generalizes PIR since, in the absence of constraints, the uniform distribution uniquely maximizes entropy. But, MaxEnt has wider application.

Example. In the strange country of Bulmania the expected number of boys among families with two children is 1.5. To find the MaxEnt prior over sex-pairs BB, BG, GB, GG, use Lagrange multipliers to obtain P∗ = (9/16, 3/16, 3/16, 1/16) ≈ (0.563, 0.188, 0.188, 0.063).

Should we embrace this "objective" method for selecting priors? Many philosophers and statisticians reject PIR and MaxEnt on the grounds that they produce inconsistent results when applied to a single situation under different descriptions. This objection was first raised by John Venn [1866] and has been recapitulated many times since, perhaps most forcefully in Keynes [1921]. Here are two famous versions of it:

Example (Venn's Paradox).
Your car's gas tank is a cube with sides between 20 and 40 centimeters in length. This is all you know. Partitioning the possibilities by side-length and distributing your prior probability uniformly over [20cm, 40cm] yields an expected side-length of 30cm and an expected volume of 27 liters. But, partitioning by volume, and distributing your prior uniformly over [8 liters, 64 liters] produces an expected volume of 36 liters and an expected side length of 33cm.

Example (Bertrand's Paradox). Imagine an equilateral triangle of side length √3 that is inscribed within a circle of radius 1. What is the probability p that a chord drawn at random across the circle will have length greater than √3? Here are three plausible answers:

⁶ More generally, the constraints might specify an allowable range of expected values for each variable.
• Chords with midpoints equidistant from the circle's center are the same length, and those for which this distance d is less than 1/2 are longer than √3. Thus, if we apply PIR to the possible values of d ∈ [0, 1], we get p = 1/2.

• Each chord splits the circle into two arcs, a short one of length a ∈ [0, π] and a long one of length 2π − a. Chords with the same short arc length have the same length, and those for which 2π/3 < a ≤ π are longer than √3. If we apply PIR to the possible values of a, the interval [0, π], we get p = 1/3.

• Chords with midpoints that fall inside the circle of radius 1/2 that is inscribed within the triangle have lengths that exceed √3. Imposing a uniform density over all possible midpoints for the chord (points in the circumscribing circle) produces a probability equal to the ratio of the area of the smaller circle, π/4, to the area of the larger circle, π. Thus, p = 1/4.

While opponents of PIR often regard these objections as dispositive, its proponents argue that the principle is being misapplied. In response to Venn's objection, Jeffreys [1939] argued that an ignorance prior should not be uniform over either length in centimeters or volume in liters since these involve arbitrarily chosen measuring units, and it is clear a priori that policies for selecting priors should not rely on arbitrary choices. An acceptable prior should be scale-invariant: if T(x) = u·x (for u > 0) is a transformation that alters the unit of distance from centimeters x into, say, inches (u = 0.3937), then the rule for assigning priors should yield the same results whether applied to x or to T(x). More precisely, the prior should be defined by a density p(·) such that P(a < x < b) = ∫ab p(x)dx = ∫ab p(T(x))dx for all a and b. It can be shown, see [Lee, 1997, p. 101], that the only density that fits the bill for all u > 0 is p(u·x) = (u·x)⁻¹ / ∫₂₀⁴⁰ (u·x)⁻¹ dx. This makes P(u·a < u·x < u·b) = ln(b/a)/ln(2) the unique scale invariant prior for lengths. In effect, Jeffreys applies PIR not to length itself, but to its logarithm. By similar reasoning, P(u·c < u·v < u·d) = ln(d/c)/ln(8) is the unique scale-invariant prior for volume. With these priors the contradiction vanishes since for any side lengths a < b and any unit of length u > 0, P(u³·a³ < u³·v < u³·b³) = ln(b/a)/ln(2) = P(u·a < u·x < u·b). Jeffreys and his followers have generalized this sort of maneuver to cover a variety of applications. A particularly beautiful example is Jaynes's [1973] solution to Bertrand's paradox. Jaynes argues that, in addition to rotational symmetry (which all three proposed solutions have), an adequate rule for choosing priors should not vary with changes in the size of the circle or its position in space. Thus, we should look for a prior density that is invariant under rotations, under variation in scales for measuring lengths, and under translations of the midpoint of the circle in space. Remarkably, this suffices to fix p = 1/4 as the uniquely correct answer! The moral is supposed to be that PIR/MaxEnt can only be applied after one has identified all relevant symmetries that apply in a situation. A "well-posed" inductive problem will include evidential constraints that express all of these various symmetries and will treat symmetrical alternatives as "equipossible". The paradoxical character of PIR/MaxEnt then disappears, says Jaynes.

The Jeffreys/Jaynes approach is subject to a number of criticisms. First, there is the technical worry that ignorance priors are often "improper" in the sense that their probability densities integrate to infinity. For example, the Jeffreys density for the Venn problem blows up both when zero is in x's range and when the range is unbounded.⁷ A more serious issue concerns the status of symmetry principles. While objective Bayesians portray these as a priori constraints on priors, they clearly import a posteriori information into the situation. There is no reason, in principle, why the units in which length or volume is measured could not matter to the probabilities of various results.

Example. You know that all Bulmanian cars have cubical gas tanks with a capacity of B buliliters, but you have no idea whether the buliliter is a unit of length or volume. When you ask how much gas is in your rental car you are told only that the attendant who filled it always chooses some number b ∈ [0, B] that strikes her as lucky and puts b buliliters in the tank. It seems in the spirit of PIR to impose a uniform distribution over b, rather than ln(b). The units matter since the attendant makes her choice on the basis of the unitless number.

The imposition of rotational and translational symmetries seems even less a priori. Take the Bertrand paradox. Jaynes suggests that one can deduce the physically relevant symmetries a priori, and even argues on this basis that the observed frequencies must conform to p = 1/4. He reports an experiment that confirms this: "The Bertrand experiment has, in fact, been performed by. . . tossing broom straws from a standing position onto a 5-in.-diameter circle drawn on the floor. . . .
128 successful tosses confirmed [p = 1/4] with an embarrassingly low value of chi-squared.” The key word in this quote is the “the”. While p = 1/4 is a natural solution when we think of the chords being generated by throwing straws at random across a circle on the floor, there is no reason to think of this as the Bertrand experiment. It is one chance setup that fits with Bertrand’s story, but there are physically possible random experiments for which the other two solutions make sense. Suppose that, instead of tossing straws, we think of the circle as a rapidly spinning wheel of fortune that repeatedly carries a distinguished point past a stationary pointer. If the wheel spins at a constant rate until, at some random time, it suddenly stops and the chord is identified with the line segment from the fixed point to the pointer, then p = 1/3 is right since the fixed point spends a third
7 There are ways to finesse this. For instance, the uncertainty inherent in a problem can often be confined to a finite interval. It is also possible to show that certain improper priors generate proper posteriors when updated. For example, if you begin absolutely uncertain where a length falls in the interval [0, ∞), your Jeffreys prior will be improper, but conditioning on any item of data of the form “x ∈ [a, b]”, for 0 < a < b < ∞, yields a proper posterior. It is controversial whether these maneuvers succeed; see Howson [2002, 53-56] for relevant discussion.
The Development of Subjective Bayesianism
of its time farther than √3 away from the pointer. Alternatively, if we imagine ourselves tossing circles of fixed size onto a single line painted on the floor, p = 1/2 is correct. Since nothing precludes these scenarios a priori, symmetry conditions cannot be a priori either. All Bayesians are, of course, happy to impose empirically motivated symmetry conditions when appropriate, but most do not believe that they can be deduced a priori. A deeper problem with PIR/MaxEnt, whether augmented with symmetry principles or not, is that these methods seek to capture states of ambiguous or incomplete evidence using a single probability function. Many people see this as an illegitimate way of smuggling in information. Recall Jaynes’s assertion that “consistency demands that two persons with the same relevant prior information should assign the same prior probabilities.” As we shall see, some Bayesians reject the idea that believers with the same objective evidence should end up in the same epistemic state. But, even if one grants that perfect symmetry in one’s evidence requires symmetry in one’s epistemic attitudes, it is a further step to say that the best way to capture these attitudes is by assigning equal prior probabilities. When your evidence is highly unspecific (see Joyce [2005]), it might be better not to assign any determinate prior probabilities at all. Recall the example of the eleven urns. Suppose the urn in front of you is painted red, and that you started out knowing that U5 is red and being confident (but mistaken) that it is the only red urn. Imagine that you undergo a series of experiences in which you become increasingly uncertain about the number of red urns. First, you learn that U4 and U6 are red. Then, you learn that U3 and U7 are red, and so on until you end up knowing that all eleven urns are red. PIR and MaxEnt say that all these symmetrical states of evidence should be represented by a P(Black) = 1/2 prior.
While this is proper in the first case, it seems increasingly inappropriate as we move down the line. As the number of red urns grows you steadily lose information about the proportion of balls in the urn in front of you, but this loss is nowhere reflected in your unconditional probabilities.8 The problem with this, as R. A. Fisher [1922, p. 326] forcefully argued, is that it extracts “a vitally important piece of knowledge, that of the exact form of the distribution. . . out of complete ignorance.” Fisher’s point is that using a single probability function to represent your ignorance in all these different evidential situations requires smuggling in information that is not found in the data. This
8 There will be differences in the resilience of conditional probabilities in light of various potential data items. When one knows that U5 is the only red urn, P(black | data) will remain fixed at 1/2 for every pattern of black and white balls that might be drawn (with replacement). For any other case, black’s probability will vary with changes in the initial data, and greater variation will occur for less specific data. This difference, though important for other purposes, does not answer the objection being pressed here. If the evidence is that n black and N − n white balls are observed in the first N draws, then the conditional probability in the least specific case (= 11 red urns) is given by P(Uk | N, n) = P(N, n)⁻¹ · (1/11) · (k/10)^n · ((10−k)/10)^(N−n), where P(N, n) = [N!/(n!·(N−n)!)] · Σk (1/11) · (k/10)^n · ((10−k)/10)^(N−n) is the prior probability of receiving that particular sequence of data. P(N, n) reflects the prior information contained in the uniform distribution. In this context, Fisher’s concern resurfaces as an objection to using the prior P(N, n) over the data sequences to compute conditional probabilities.
information is vividly revealed when we focus on the prior distribution over possible data sequences. If P(N, n) denotes the prior probability of an arbitrary sequence of N draws, with replacement, in which n black balls are observed, then P(N, n) = [N!/(n!·(N−n)!)] · (1/2)^N when we know that U5 is the only red urn. In the least specific case where we know all the urns are red, however, P(N, n) = [N!/(n!·(N−n)!)] · Σk (1/11) · (k/10)^n · ((10−k)/10)^(N−n), for k ∈ {0, 1, . . ., 10}. This is a very specific set of numbers.9 The reason we get such a specific distribution, of course, is that we are calculating it exactly as we would if we knew that each urn had an equal objective chance of being selected. But, since we know no such thing, we have no right to such specific numbers. Objective Bayesianism is just bad epistemology from Fisher’s perspective. It may be that the amount of added information encoded in the PIR/MaxEnt prior is, in some sense (e.g., in terms of entropy), the minimum that can be achieved using a single probability function, but this does not change the fact that the decision to represent uncertainty using a single probability function often involves adding information. In the urn case, for example, you smuggle in information in every case except the first, and the more red urns there are the more information you smuggle in. Bayesians who share Fisher’s worries can go a number of ways. Pure subjectivists feel that any probability not directly contradicted by the constraints of an inductive problem may legitimately be used as a prior, so that, e.g., if you know the urn is in {U3, U5, U7, U8} then it is permissible to set P(Black) to 0.3, 0.5, 0.7, 0.8 or to any mixture of these values.
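These P(N, n) values can be computed directly; a minimal sketch, assuming eleven urns U0, …, U10 with urn Uk holding k black and 10 − k white balls, sampled with replacement:

```python
from math import comb

def p_data(N, n, urns=11):
    """Prior probability of observing n black balls in N draws,
    under a uniform (1/11) prior over urns U0..U10, urn Uk holding k black of 10."""
    return comb(N, n) * sum((1 / urns) * (k / 10) ** n * ((10 - k) / 10) ** (N - n)
                            for k in range(urns))

def p_data_u5(N, n):
    # knowing U5 is the only red urn gives the binomial prior instead
    return comb(N, n) * 0.5 ** N

dist = [round(p_data(5, n), 5) for n in range(6)]
print(dist)  # [0.20075, 0.14775, 0.1515, 0.1515, 0.14775, 0.20075], as in footnote 9
```

The "very specific set of numbers" emerges exactly as it would if each urn had an equal objective chance of being selected, which is Fisher's complaint.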
Subjectivists thus agree with objective Bayesians that it is permissible to introduce information not included in the prior data for purposes of making inductive inferences, but they deny that we should attach any special importance to any one way of adding information as opposed to any other. Other Bayesians (see [Levi, 1980; Jeffrey, 1983; Walley, 1991; Kaplan, 1996; Joyce, 1999]) represent symmetrical but incomplete states of evidence not with symmetrical probability values, but by symmetrical sets of probability functions. Instead of using one probability function to capture prior uncertainty, this approach uses the family of all probability functions that the evidence does not explicitly exclude. When one knows only that the urn is in {U3, U5, U7, U8}, for example, one’s “prior” would be the set of all probability functions defined over {U3, U5, U7, U8}, and the event of a black ball being drawn would have the imprecise, “interval-valued” probability P(black) ∈ [0.3, 0.8].10 Proponents of such “imprecise” probabilities argue both that they are more psychologically realistic than sharp probabilities for representing degrees of belief, and that they are the proper response to incomplete, ambiguous, or unspecific evidence.
9 For N = 5 it generates the distribution 0.20075, 0.14775, 0.1515, 0.1515, 0.14775, 0.20075.
10 Not all imprecise probabilities will be interval-valued, and not all families of probability functions that may be used to represent uncertainty will be convex. While some have required convexity (see [Levi, 1980, p. 402]), it is not plausible in the (fairly common) case in which the prior information specifies that two events are independent without specifying any definite probability for either one. For relevant discussion see [Jeffrey, 1987].
Even if one adopts one of these “non-objectivist” approaches, however, there is no denying that PIR, MaxEnt and other methods for assigning priors have often been successful in practical applications. As a result, many Bayesians who reject the idea that “ignorance priors” can be justified a priori will still agree with Gillies [2000, p. 48] that PIR and MaxEnt are very fruitful heuristic devices even if they are not valid as logical principles. Indeed, when approaching an inductive problem from a Bayesian perspective, it is often useful to start with an ignorance prior, and then to be willing to modify one’s thinking in light of empirical information as well as expert opinion. There is nothing wrong with this, anti-objectivist Bayesians will say, provided that one always keeps in mind that “ignorance priors” are empirical hypotheses like any other.
2.2
Subjective Bayesianism and the Requirement of Coherence
Subjective Bayesianism is the work of many hands, but key contributions have been made by Frank Ramsey, Bruno de Finetti, Leonard Savage, I. J. Good, Dennis Lindley and Richard Jeffrey. The unifying ideas of the subjectivist approach are these:
• Beliefs come in varying gradations of strength. Instead of asking whether a person accepts or rejects a proposition outright, we must speak of her level of confidence in it.
• A person’s level of confidence in a proposition corresponds to the extent to which she is disposed to presuppose its truth in her theoretical and practical reasoning.
• The goal of an account of inductive reasoning is to explain how the gradational beliefs, or credences, of rational agents change in light of changes in their evidence.
• In a Bayesian experimental setup, the prior distribution of credences over H reflects the initial amount of confidence that an agent invests in the various hypotheses in H. The likelihood function represents her subjective probabilistic predictions about what data from X she is likely to receive conditional on various hypotheses obtaining.
• In idealized cases, a person’s credences can be represented by a single function b from H ∧ X into [0, 1]. (In more realistic cases, sets of functions will be employed.)
• The Requirement of Probabilistic Coherence. A rationally permissible credence assignment must conform to the laws of probability.11
11 In the idealized case, b must be a probability. In less ideal cases, every credence function consistent with an agent’s attitudes must be a probability. For example, the agent cannot judge that A is more probable than B and that A ∨ C is less probable than B ∨ C when C is incompatible with A and B, since no probability function condones these judgments.
• The Requirement of Conditioning. A rationally permissible method of belief revision must involve conditioning on data using a probabilistically coherent prior.
• Radical Subjectivism. Any prior credences that satisfy the laws of probability may be permissibly held, and any posterior credences that are arrived at by conditioning on data using a probabilistically coherent prior may be permissibly held.
This last principle repudiates any “objectivist” element in inductive logic. It says that a person’s reasoning cannot be criticized as long as her prior credences present a probabilistically coherent picture of the world and she updates using Bayes’s Theorem. This rejects demands, by frequentists and objective Bayesians, to “solve” the problem of the priors by identifying criteria that portray certain coherent credence assignments as better than others. According to the radical subjectivist, there is no problem of the priors. It is simply wrong to think, as Jaynes does, that “consistency demands that two persons with the same relevant prior information should assign the same prior probabilities.” Consistency only demands that priors obey the laws of probability: everything else is a matter of “inductive taste”.
Example. Pierre and Bruno have seen a coin tossed ten times and have observed seven heads. Neither has any other information about the coin’s bias. Pierre begins with a uniform prior over the possible biases, and so concludes that the probability of a head on the next toss is 2/3 (by the rule of succession). Bruno, for no particular reason, is certain that the bias is either 1/10, 1/2 or 9/10, and, on a hunch, assigns these hypotheses priors of 0.01, 0.01 and 0.98, respectively. He deduces that the probability of a head on the next toss is about 0.89. For the radical subjectivist, there is no disparaging either Pierre or Bruno. Each began with a probabilistically coherent prior, and each arrived at his estimate by conditioning on the evidence received.
So, each has reasoned perfectly. Bruno, of course, plays favorites while Pierre is evenhanded, but these are merely inductive predilections. Since the only substantive constraints radical subjectivists impose on believers are those of probabilistic coherence and Bayesian updating, these requirements bear a lot of normative weight. Why should we accept them? If the assignment of priors is a matter of taste, why isn’t it also a matter of taste, say, that one’s credences for A and ¬A cannot sum to more than one, or that one cannot update except by conditioning? Bayesians have offered a variety of justifications for both the probabilistic coherence and updating requirements. In the rest of this section we will discuss four rationales for coherence: Dutch book arguments, R. T. Cox’s a priori derivation, rationales generated from qualitative constraints on credences, and accuracy-based justifications. Reasons for updating using Bayes’s Theorem will be discussed in §3.
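Pierre's and Bruno's numbers in the example above can be reproduced directly; a minimal sketch, where Bruno's predictive probability is the posterior-weighted average of his three candidate biases:

```python
def predictive(biases, priors, heads, tails):
    """P(heads on next toss | data) for a discrete prior over coin biases."""
    weights = [p * b ** heads * (1 - b) ** tails for p, b in zip(priors, biases)]
    total = sum(weights)
    return sum(b * w / total for b, w in zip(biases, weights))

# Pierre: uniform prior over the bias, 7 heads in 10 tosses -> Laplace's rule of succession
pierre = (7 + 1) / (10 + 2)   # 2/3

# Bruno: priors 0.01, 0.01, 0.98 on the biases 1/10, 1/2, 9/10
bruno = predictive([0.1, 0.5, 0.9], [0.01, 0.01, 0.98], 7, 3)
print(round(bruno, 2))  # 0.89
```

Bruno's heavy prior on the 9/10 bias pulls his prediction well above Pierre's, even though both condition on the same ten tosses.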
2.2.1
Dutch Book Arguments
Dutch book arguments purport to show that having probabilistically incoherent credences will, of necessity, lead believers to make unwise decisions. This approach was pioneered by Ramsey in his [1931] and has been developed by many authors. Perhaps the most sophisticated presentation is found in De Finetti [1974]. De Finetti imagines an agent who announces a real number pk for each member of an arbitrary set of propositions A1, A2, . . ., AK. For each k, the agent receives a prize of S(pk, v(Ak)) = 1 − (pk − v(Ak))² in units of something she values, where v(Ak) is one or zero depending upon whether Ak is true or false. In effect, the agent is offered a choice among all wagers of the form
Win S(pk, 1) = 2·pk − pk² if Ak; Win S(pk, 0) = 1 − pk² if ¬Ak
where she selects the value of pk. Since any pk ∈ (0, 1) assures a positive outcome whether Ak is true or false, it will always be in the agent’s interest to specify a vector p = (p1, . . ., pK). De Finetti calls these numbers her previsions for the Ak. In conjunction with any logically consistent truth-value assignment v to the Ak, the act of specifying a prevision vector will produce a total prize of Σk S(pk, v(Ak)) = K − Σk (pk − v(Ak))². Notice that this prize decreases as a function of the distance between previsions and truth-values. Of course, some acts produce larger prizes than others, and the agent will try to secure the best prize possible. This is easy if she knows the truth-values, for she can simply announce these as her previsions and claim the maximum prize of K. When she is unsure about the truth-values it will typically not be in her interest to set each pk to zero or one. While such a strategy offers the possibility of obtaining the maximum payoff, it also threatens to yield the minimum prize of 0.
Depending on the agent’s views about the chances that various truth-value assignments have of being actualized, she will usually be better off “hedging her bets” by selecting intermediate previsions that have high estimated payoffs. So, if A is the proposition that it will rain in Ann Arbor on June 11, 2020 then, depending on what the agent knows about the summer weather in Michigan, it might be best for her to announce previsions of 0.3, 0.7 for A, ¬A. It is no sin against practical rationality, of course, if she does not end up receiving the largest possible prize or if someone else secures a larger prize because they happen to know more; one does the best one can given the information one has. It would be a sin against rationality, however, if someone who knew no more than the believer could identify an act that was sure to secure a larger prize no matter what truth-value assignment is actual.
Example. Suppose a believer chooses previsions of 0.3, 0.6 for A, ¬A. Someone can then do better, whether A is true or false, by choosing 0.35, 0.65.

                     A true, v = 1, 0    A false, v = 0, 1
Choose 0.35, 0.65    Prize = 1.155       Prize = 1.755
Choose 0.3, 0.6      Prize = 1.15        Prize = 1.75
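The prizes in the table follow from the quadratic score; a sketch (the dominating pair 0.35, 0.65 is, not coincidentally, the nearest coherent pair: the projection of 0.3, 0.6 onto the line p + q = 1):

```python
def prize(previsions, truth_values):
    # total De Finetti prize: K - sum of (p - v)^2
    return sum(1 - (p - v) ** 2 for p, v in zip(previsions, truth_values))

incoherent = (0.3, 0.6)    # previsions for A, not-A; they sum to 0.9
coherent = (0.35, 0.65)    # previsions summing to 1

for v in [(1, 0), (0, 1)]:  # A true / A false
    print(v, round(prize(coherent, v), 3), round(prize(incoherent, v), 3))
# the coherent pair earns the strictly larger prize in both worlds
```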
434
James M. Joyce
Notice how the upper act dominates the lower act. De Finetti, like Ramsey before him, saw a general principle in this. Say that a set of previsions pk is practically incoherent12 if there are alternative previsions qk that dominate it in the sense that Prize(pk, v) < Prize(qk, v) for all logically consistent assignments v of truth-values to the propositions A1, . . ., AK. De Finetti maintained that practical incoherence is a defect of rationality, and so imposed the following:
Requirement of Practical Coherence. A rational agent will report practically coherent previsions. If she reports pk, then for each alternative set of previsions qk there is a logically consistent truth-value assignment v with Prize(pk, v) > Prize(qk, v).
The challenge is to determine the conditions under which previsions are practically coherent. The celebrated Dutch book theorem provides the answer. As in our example, it turns out that probabilistic incoherence is the hallmark of practical incoherence.
Dutch Book Theorem (De Finetti’s version): A set of previsions pk for Ak is practically coherent if and only if there is a Boolean algebra Ω containing all the Ak and a finitely additive probability function P on Ω such that P(Ak) = pk for all k.13
In other words, announcing dominated previsions is equivalent to announcing previsions that violate the laws of probability. This lovely piece of mathematics does not yet establish the requirement of probabilistic coherence since we do not yet know how an agent’s previsions relate to her credences. To close the loop, defenders of Dutch book arguments maintain that previsions reveal credences directly.
Elicitation. A practically rational agent will report previsions that coincide with her credences, so that pk = b(Ak) for all k.
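Elicitation works because the quadratic prize S(p, v) = 1 − (p − v)² is a strictly proper scoring rule: an agent who maximizes expected prize does best by announcing her actual credence. A grid-search sketch, with an illustrative credence of 0.7:

```python
def expected_prize(report, credence):
    # expected prize of announcing `report` when your credence in A is `credence`
    return credence * (1 - (report - 1) ** 2) + (1 - credence) * (1 - report ** 2)

credence = 0.7  # an agent 70% confident in A (illustrative value)
grid = [i / 1000 for i in range(1001)]
best_report = max(grid, key=lambda p: expected_prize(p, credence))
print(best_report)  # 0.7: honest reporting maximizes the expected prize
```

Any other report, higher or lower, costs the agent expected prize, which is what licenses reading announced previsions as credences.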
Some early Bayesians, including Ramsey [1931] and de Finetti [1937], sought to justify this principle by arguing that talk of credences can only be scientifically respectable if degrees of belief are operationally defined as the previsions that believers in fact announce.14 Later on, many Bayesians, including Savage [1971] and de Finetti [1974], sought to justify Elicitation by invoking the idea that practically
12 Most authors just use the term “coherence” for both the practical and probabilistic requirements, but they are conceptually distinct ideas.
13 For a version of the Dutch book theorem that yields countable additivity see [Skyrms, 1984].
14 There are many ways of eliciting credences. Here are two: (i) b(A) is the agent’s fair price for a wager that pays one unit of utility if A and zero units if ¬A; (ii) b(A) = l/(w + l), where l, w > 0 are any numbers such that the agent is indifferent between owning or covering a bet that pays w utiles if A and costs l utiles if ¬A. The Bayesian picture assumes that all of these methods will elicit identical credences. This is tantamount to assuming that rational agents act to maximize expected payoffs.
rational agents will always seek the highest expected payoffs by reporting previsions that maximize Exp(Prize_p) = Σv b(v)·[Σk S(pk, v(Ak))], where v ranges over truth-value assignments. De Finetti was careful to choose S to be a strictly proper scoring rule, a rule for which the choice of pk = b(Ak) uniquely maximizes Exp(Prize_p). So, on the assumption that practical rationality involves maximizing one’s expected prize, it follows that rational believers will announce previsions that reveal their degrees of belief, and the Dutch book theorem then shows that these degrees of belief must be probabilistically coherent on pain of practical incoherence. Two main types of objections are raised against Dutch book arguments. One relates to Elicitation. For this principle to be plausible it must be true that (a) prizes are paid out in some quantity that can be assigned units of utility that the agent values linearly and whose value does not depend on what previsions she might be asked to state, and (b) the agent assesses potential prizes on the basis of their expected utility. Point (b) is especially pressing because assessing a quantity on the basis of its expected value is equivalent to assigning probabilities to its possible values. This concern is ameliorated, to some extent, by the use of representation theorems, along the lines of Ramsey [1931] and Savage [1954], which aim to simultaneously derive both the requirements of probabilistic coherence and expected utility maximization from plausible constraints on rational preferences. Unfortunately, as a number of authors have noted, e.g., [Joyce, 1999], such representation theorems rely on strong structural assumptions about the richness of an agent’s preferences that cannot be convincingly motivated as requirements of practical or epistemic rationality. A second type of objection concerns the normative force of Dutch book arguments.
Even if one agrees that having credences that sanction dominated choices is practically irrational, one might still wonder what specifically epistemic sin is committed. Some, notably Skyrms [1980], stress the penultimate sentence of Ramsey’s famous statement of the Dutch book argument:
“These are the laws of probability, which we have proved to be necessarily true of any consistent set of degrees of belief. . . . If anyone’s mental condition violated these laws, his choice would depend on the precise form in which the options were offered him, which would be absurd. He could have a book made against him by a cunning better and would then stand to lose in any event.” [1931, p. 182]
The idea is that probabilistically incoherent credences are defective not so much because they leave agents open to sure losses, but because they cause agents to assess actions differently when viewed one-by-one than when viewed as a package. For example, an incoherent agent presented with choices a1 = [Set pA to 0.3 or 0.35] and a2 = [Set p¬A to 0.6 or 0.65] might prefer 0.3 and 0.6, but then pick 0.35, 0.65 when given choice a3 = [Set pA, p¬A to 0.3, 0.6 or 0.35, 0.65]. In a similar vein, Howson and Urbach [1989] and Christensen [1996] suggest that agents with probabilistically incoherent beliefs are committed to logically inconsistent value
judgments. To paraphrase Howson and Urbach (p. 57), to see p as the best prevision to report for A is to make a kind of intellectual value judgment, not to possess a disposition to accept particular bets when offered or to take particular actions when available. The problem with probabilistic incoherence is that it forces these value judgments to be inconsistent. These arguments have had a mixed reception, and the normative significance of Dutch book arguments remains an active topic of philosophical controversy. For recent discussions see Joyce [1999; 2009], Hájek [2008], Howson [2008].
2.2.2
Cox’s Theorem
Another influential argument for probabilistic coherence is found in Cox [1961]. Cox imagines a conditional credence function (a “plausibility”) that maps each pair of propositions A and C in an algebra Ω to a number b(A|C) ∈ [0, 1]. He shows that any b that obeys four seemingly reasonable conditions is order-isomorphic to a probability. The first two conditions are:15
C1
If A and B are logically equivalent, then b(•|A) = b(•|B) and b(A|•) = b(B|•).
C2
b(A∧B|C) is exclusively a function of b(A|C) and b(B|A∧C). More precisely, there exists a continuous binary operation ⊗ that is strictly increasing in each coordinate and such that b(A ∧ B|C) = b(A|C) ⊗ b(B|A ∧ C).
One example of such an operation is ordinary multiplication, which turns the equation into the usual definition of conditional probability. Since conjunction is associative, the combination of C1 and C2 entails
(Assoc) b(A ∧ B ∧ C|D) = (x ⊗ y) ⊗ z = x ⊗ (y ⊗ z)
for x = b(A|D), y = b(B|A ∧ D) and z = b(C|A ∧ B ∧ D). A function F for which F(F(x, y), z) = F(x, F(y, z)) is called associative. Aczél [1966, p. 256] proves a theorem which has the consequence that any continuous, strictly increasing, associative function F defined everywhere on an interval [a, b]² is order-isomorphic to multiplication, i.e., there exists a strictly increasing, non-negative continuous function h defined on F’s range with h(F(x, y)) = h(x)·h(y). So, if ⊗ is defined everywhere on [0, 1]², then there is an increasing, non-negative continuous function m such that m(b(A ∧ B|C)) = m(b(A|C))·m(b(B|A ∧ C)). In light of its continuity, we can ensure that ⊗ is defined everywhere on [0, 1]² by requiring b’s range to be dense in (0, 1):
C3
For any rational numbers r1 , r2 , r3 ∈ (0, 1) there are A, B, C, D ∈ Ω such that r1 = b(A|D), r2 = b(B|A ∧ D) and r3 = b(C|A ∧ B ∧ D).
15 I am not presenting Cox’s conditions or result exactly as he expressed them. My presentation benefits from Halpern [1999], Paris [2004] and Jaynes [2003]. Also Cox’s axioms are understood to hold for all elements of Ω subject to the proviso that propositions conditioned upon are never contradictory.
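The Aczél representation can be illustrated concretely: any operation of the form x ⊗ y = h⁻¹(h(x)·h(y)), for an increasing h, is automatically associative, and h carries it to ordinary multiplication. A sketch with the illustrative choice h(x) = x/(2 − x):

```python
from math import isclose

h = lambda x: x / (2 - x)          # increasing on [0, 1], h(0) = 0, h(1) = 1
h_inv = lambda u: 2 * u / (1 + u)

def combine(x, y):
    # x (x) y = h_inv(h(x) * h(y)): continuous, strictly increasing, associative
    return h_inv(h(x) * h(y))

x, y, z = 0.9, 0.8, 0.7
print(isclose(combine(combine(x, y), z), combine(x, combine(y, z))))  # True: (Assoc)
print(isclose(h(combine(x, y)), h(x) * h(y)))  # True: h carries the operation to multiplication
print(combine(x, y) == x * y)  # False: yet the operation is not plain multiplication
```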
Under these conditions, it becomes possible to scale the function m so that m(b(⊤|C)) = 1 and m(b(⊥|C)) = 0 for all C. Completing the proof requires one further principle to govern negation:
C4
There is a continuous, non-negative, non-increasing function N : [0, 1] → [0, 1] such that b(¬A|C) = N(b(A|C)).
One example of such an operation is N(b(A|C)) = 1 − b(A|C), a choice which would ensure that b(A|C) + b(¬A|C) = 1, an important special case of the additivity law. Given the existence of functions m and N with these properties, Cox establishes the following:
Cox’s Theorem. Given a credence function b on Ω that satisfies C1–C4, there exists a continuous, strictly increasing function p : [0, 1] → [0, 1] such that the composite mapping p ∘ b is a probability.
In other words, every belief function b that satisfies Cox’s axioms is a member of a class of functions with domain [0, 1] that contains a unique probability as well as every continuous, increasing function of that probability which preserves endpoints. A rational belief function, in other words, is always order-isomorphic to a probability. This is not the same as saying that b is a probability. For all Cox shows, it might be that b = P^k for some probability P and some k > 0. In light of this one might wonder how Cox’s result serves to justify probabilistic coherence. The usual answer is to say that there is no substantive difference between representing a belief function using a probability or using any other function that is order-isomorphic to that probability. Suppose, say, that we represented beliefs not with probability functions but with their squares, so that b = P². The laws of rational belief would seem different. Instead of being additive, “credences” would be “square-additive,” i.e., they would obey the law b(A ∨ B) = b(A) + b(B) + 2·(b(A)·b(B))^(1/2) for contraries A and B. But, the argument continues, the difference between this law and additivity is superficial: in terms of real constraints imposed on rational beliefs, the two are identical. Instead of saying that credences must be additive, we now say that their square roots must be additive; instead of saying that the credences of A and ¬A must sum to one, we say that their square roots must sum to one; and so on.
The form of expression is different, but the content is the same. To make this methodological stance explicit, let’s add a principle to Cox’s argument: C5
If two functions f, g : [0, 1] → [0, 1] with f(0) = g(0) and f(1) = g(1) are order-isomorphic (i.e., if each is a continuous strictly increasing function of the other), then the two provide equivalent representations of belief states.
With C5 in place there is no harm in saying that credences must be probabilities, provided that one understands that this only means that they must be orderisomorphic to probabilities. Cox proves no more.
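The equivalence of the additive and "square-additive" formulations can be checked on a toy algebra; a sketch with b = P²:

```python
from math import sqrt, isclose

# an additive probability over three mutually exclusive atoms
P = {"A": 0.2, "B": 0.5, "C": 0.3}
b = {k: v ** 2 for k, v in P.items()}   # the order-isomorphic representation b = P squared

# ordinary additivity for P on the contraries A and B, re-expressed in b's terms
lhs = (P["A"] + P["B"]) ** 2            # b(A or B) under b = P squared
# the square-additive law from the text
rhs = b["A"] + b["B"] + 2 * sqrt(b["A"] * b["B"])

print(isclose(lhs, rhs))  # True: the two laws impose the same constraint
# and the rescaling leaves the ordering of propositions untouched
print(sorted(P, key=P.get) == sorted(b, key=b.get))  # True
```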
Cox’s theorem has a “beauty is in the eyes of the beholder” quality. Some, e.g., Jaynes [2003] and Howson [2008], take it to decisively justify probabilistic coherence. Others, e.g., Halpern [1999], think it delivers less than advertised because it ignores credence functions that fail to cover a dense subset of [0, 1]. In the end, the issue comes down to how compelling one finds the axioms. While they seem innocuous on a first reading, C1–C5 do impose substantial constraints on rational credences. Indeed, it is hard to see how to justify C2 or C4 without appealing to explicitly probabilistic considerations. Consider, for example, Jaynes’s [2003, p. 24-25] justification of C2:
For A ∧ B to be a true proposition, it is necessary that A is true. Thus, [b(A|C)] should be involved [as a component of b(A ∧ B|C)]. In addition, if A is true it is further necessary that B should be true, so [b(B|A ∧ C)] is also needed. But, if A is false, then of course A ∧ B is false independently of whatever one knows about B. . . so if the [agent] first reasons about A, then the plausibility of B will only be relevant if A is true. Thus, if the [agent] has b(A|C) and b(B|A ∧ C) [she] will not need b(B|C).
Similarly, Howson [2008, p. 18] writes:
Why should [C2] be the case? Well, knowing how likely B would be if I could assume A was true will not of course tell me how likely B is; for that I would also need to know how likely A is. But once I know that then it seems that I should know, at any rate in principle, how likely both B and A are. Nothing in this piece of informal conditional reasoning depends on any scale of measurement.
This is right as far as it goes, but misleading. Given that it is being imposed in a context where C1, C3 and C5 hold, it is unclear how C2 differs from saying outright that b(A ∧ B|C) is directly proportional to b(A|C) and b(B|A ∧ C).
After all, we know that a continuous, non-negative function whose domain is dense in [0, 1]² is associative if and only if it is order-isomorphic to multiplication, and we have accepted the idea that there is no meaningful difference between a probability and its order-isomorphisms. In such a context, saying b(A ∧ B|C) is a function of b(A|C) and b(B|A ∧ C) is no different from saying that, up to order-isomorphism, b(A ∧ B|C) is b(A|C)·b(B|A ∧ C). And, if this is all we are saying, then it seems that we have given up on trying to justify the laws of probability from more basic principles: we are imposing the laws directly, up to order-isomorphism. The situation is the same for C4. In the presence of the other requirements, if we insist that b(¬A|C) be a non-increasing continuous function of b(A|C), then we are stipulating that b(¬A|C) is order-isomorphic to 1 − b(A|C). Again, in the presence of C5, this seems no different from an overt invocation of the probabilistic requirement that b(A|C) and b(¬A|C) must sum to one. Overall, then, while Cox’s result offers an illuminating way of rewriting the requirement of probabilistic coherence, it is not clear how much it provides in the way of an independent justification for that requirement.
The Development of Subjective Bayesianism
2.2.3 Quantitative Probability from Qualitative Probability
Another approach begins by assuming that believers make comparative probability judgments that can be represented by orderings over an algebra Ω. Intuitively, A ⪰ B means that the believer is at least as confident in A as she is in B. Relations of strict confidence, ≻, and equi-confidence, ≈, are defined as one would expect. Such comparative probability rankings are coherent when they conform to the judgments that some probability function would sanction, i.e., when there exists a probability P on Ω such that A ⪰ B if and only if P(A) ≥ P(B). The challenge is to determine the conditions under which the ordering can be represented in this way, and to decide whether these conditions are requirements of rationality. De Finetti and others initially thought that the following laws of comparative probability would be both necessary and sufficient for probabilistic representability:
CP1 (Normalization): ⊤ ⪰ A ⪰ ⊥, and ⊤ ≻ ⊥.
CP2 (Completeness): A ⪰ B or B ⪰ A.
CP3 (Transitivity): If A ⪰ B and B ⪰ C, then A ⪰ C.
CP4 (Quasi-additivity): If C is incompatible with both A and B, then A ⪰ B if and only if A ∨ C ⪰ B ∨ C.
CP5 (Archimedean): If A ≻ ⊥ for every A in some set A of mutually incompatible events in Ω, then A is countable. In addition, if A ≈ B for all A, B ∈ A, then A is finite.
CP6 (Continuity): If {A1, A2, A3, ...} is a countable set of mutually incompatible events in Ω and if B ⪰ (A1 ∨ ... ∨ An) ⪰ C for all n, then B ⪰ ∨n An ⪰ C.
Unfortunately, Kraft et al. [1959] exhibited a finite ranking that obeys CP1–CP6 and yet cannot be ordinally represented by any probability function. There are two ways to go at this point. One can seek sufficient but non-necessary conditions for probabilistic representability by restricting one’s attention to orderings defined over rich structures, or one can try to identify stronger necessary conditions that end up being jointly sufficient. The best result of the first type is found in Villegas [1964].
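A probability function always induces a comparative ranking satisfying these axioms; a brute-force sketch on a hypothetical three-atom space (the numbers are arbitrary) checks CP1 and CP4 directly, with CP2 and CP3 automatic for any numerically induced order:

```python
from itertools import chain, combinations

# Hypothetical three-atom space; events are subsets (frozensets) of atoms.
atoms = ['a', 'b', 'c']
P_atom = {'a': 0.5, 'b': 0.3, 'c': 0.2}   # an arbitrary probability

events = [frozenset(s) for s in chain.from_iterable(
    combinations(atoms, r) for r in range(len(atoms) + 1))]

def P(A):                       # probability of an event
    return sum(P_atom[x] for x in A)

def geq(A, B):                  # the induced ranking: A at least as likely as B
    return P(A) >= P(B)

top, bot = frozenset(atoms), frozenset()

# CP1 (Normalization): top >= A >= bottom for all A, and top > bottom.
assert all(geq(top, A) and geq(A, bot) for A in events) and not geq(bot, top)

# CP4 (Quasi-additivity): for C disjoint from both A and B,
# A >= B iff (A or C) >= (B or C).
for A in events:
    for B in events:
        for C in events:
            if C & A or C & B:
                continue
            assert geq(A, B) == geq(A | C, B | C)
print("induced ranking satisfies CP1 and CP4")
```

The converse direction is the hard one: the Kraft et al. result shows that a finite ranking can pass such axiom checks without any probability generating it.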
Building on earlier work of Koopman [1940] and Savage [1954], Villegas showed that CP1–CP6 suffice to determine a unique countably additive probability representation when (Ω, ⪰) is atomless, i.e., when for each A ≻ ⊥ there are disjoint A1 and A2 with A = A1 ∨ A2 and A1, A2 ≻ ⊥. Important results of the second type are found in Scott [1964] and Suppes and Zanotti [1976]. Scott, who was inspired by Kraft et al. [1959], imposes the following condition (which makes a number of the others redundant):
CP7 (Scott): If A1, A2, ..., AN and B1, B2, ..., BN are sequences of propositions from Ω, possibly with repeats, that contain the same number of truths as a matter of logic, then An ⪰ Bn for all n < N only if BN ⪰ AN.
James M. Joyce
Suppes and Zanotti [1976] impose a somewhat different requirement, and both show that their axioms are necessary and sufficient for (non-unique) probabilistic representation.16 Of course, the significance of these results depends on what justifications can be given for the various constraints imposed on comparative probability rankings. Some, e.g. Savage [1954], seek to derive the constraints from decision-theoretic considerations. On this approach, A ⪰ B is taken to mean that the believer prefers a prospect in which she stands to enjoy a prize if A and to suffer a specified penalty if ¬A to a prospect in which she stands to enjoy the prize if B and to suffer the penalty if ¬B. Constraints on ⪰ are then shown to follow from allegedly reasonable principles of rational preference. An alternative strategy is to argue that the constraints capture something central to rational belief. Joyce [1998], for example, has argued for Scott’s Axiom on the grounds that (a) a person’s credences commit her to making estimates of truth-values, (b) in particular, given any sequence of propositions A1, A2, ..., AN, a person with the sharp credences b(A1), ..., b(AN) is committed to using b(A1) + ... + b(AN) as her estimate of the number of truths among the An, and (c) under these conditions Scott’s Axiom is a straightforward non-dominance principle that forbids believers from making different estimates for the number of truths among the An and among the Bn when, as a matter of logic, these numbers must be the same.
2.2.4 Nonpragmatic Vindications of Probabilism
A fourth sort of justification for probabilistic coherence is proposed in Shimony [1988] and van Fraassen [1988], and developed in Joyce [1998; 2009]. The driving idea behind these “nonpragmatic vindications” is that because credences sanction estimates of epistemically salient quantities – truth-values, objective chances, frequencies – one can evaluate the quality of a person’s credences in terms of the accuracy of the estimates they sanction. Credences that sanction estimates that are necessarily less accurate than they need to be are epistemically defective. This approach has affinities to de Finetti’s prevision-based Dutch book argument, which explicitly rewards accurate truth-value estimates. But, whereas de Finetti’s approach rests on Elicitation, this method aims to assess the accuracy of credences without the mediation of desires or choices. Following Joyce [2009], one can think of these assessments as codified in an epistemic scoring rule S(b, v) that maps each credence function b and truth-value assignment v into a non-negative number S(b, v) that measures the epistemic disutility17 of having credences b when the truth-values are given by v. S’s values should reflect the sorts of traits that make beliefs worth holding from an epistemic perspective. One such trait, indeed the cardinal one, is accuracy. S should produce lower (= better) values when b
16 To secure uniqueness, either atomlessness or the existence of a set of atoms A1, A2, ..., AN with Am ⪰ An for all m, n is required.
17 I use epistemic disutility rather than epistemic utility so as to more easily relate this investigation to the work on proper scoring rules in statistics and economics.
sanctions accurate estimates of epistemically salient quantities than when it sanctions inaccurate estimates of those quantities. Once an acceptable epistemic scoring rule S is identified, the objective of a nonpragmatic vindication of probabilism is to show that probabilistically incoherent credences sanction estimates of epistemically important quantities that are S-dominated, whereas coherent credences never have this undesirable feature. This is supposed to establish that incoherent credences are defective from an epistemic perspective because they fare less well, in terms of epistemic utility, than some alternative set of credences no matter what the world is like. Shimony and van Fraassen feel that credences are best assessed by looking at the degree to which they generate well-calibrated frequency estimates. The laws of probability declare that a coherent believer’s estimate for the proportion of truths in a set A must be ΣA b(A)/#(A), where A ranges over the propositions in A and #(A) is A’s cardinality. When trying to justify probabilistic coherence, however, one cannot construe estimates as expectations without begging the question. One must begin instead from the thought that, whether coherent or not, b sanctions x as the estimate of the relative frequency of truths in any finite set of propositions all of which have credence x. In other words, setting b(A) = x commits one to estimating that propositions with A’s credence are likely to be true x proportion of the time. In light of this, one can partition any finite set of propositions A into subsets Ax = {A ∈ A : b(A) = x}, and can measure the accuracy of b’s truth-frequency estimate for Ax relative to a truth-value assignment v using the quantity Cx(b, v) = (Fv(Ax) − x)², where Fv(Ax) is the proportion of Ax’s elements that are true in the assignment v. One can then obtain an overall measure of the accuracy of b’s frequency estimates using the calibration score Cal(b, v) = Σx [#(Ax)/#(A)]·Cx(b, v).
This is minimized when all the propositions with credence 1 are true, three-quarters of the propositions with credence 3/4 are true, two-thirds of the propositions with credence 2/3 are true, and so on. By invoking some structural assumptions, Shimony and van Fraassen are able to show, in slightly different ways, that incoherent credences sanction poorly calibrated frequency estimates. In particular, for any incoherent credence function b there is a coherent credence function c such that Cal(b, v) > Cal(c, v) for all logically possible truth-value assignments v. To the extent that one believes that calibration is a reasonable measure of epistemic accuracy (and accepts the structural assumptions), this shows that incoherent credences are inherently defective: they lead to frequency estimates that are, as a matter of necessity, farther from the actual frequencies than they need to be. The problem with this argument, as pointed out in Seidenfeld [1985] and Joyce [1998], is that calibration is a poor measure of epistemic accuracy. There are a variety of problems, but the decisive one is that a believer can, in some circumstances, improve her calibration score, relative to a given truth-value assignment, by decreasing her credence in every truth and increasing her credence in every falsehood. Despite the fact that the person seems to have made herself “more wrong” about every single proposition, Cal says that she has improved accuracy overall. This
happens, in part, because a credence function affects Cal both by fixing frequency estimates for the various Ax and by determining the composition of the Ax. To avoid this sort of thing, Joyce [1998] advocates focusing on the role of credences in estimating truth-values rather than frequencies. A clear requirement is then:
Truth-Directedness. If b’s credences are uniformly closer than c’s are to the truth-values in v, then S(c, v) > S(b, v).
One rule like this is the Brier score: Brier(b, v) = [1/#(A)]·ΣA (v(A) − b(A))². This score, Brier [1950], has been used by meteorologists to evaluate the accuracy of probabilistic weather forecasts, and de Finetti relied on it to elicit previsions in his version of the Dutch Book argument. In addition to being truth-directed, the Brier score has a variety of features that make it an excellent measure of epistemic accuracy, including the following:
Extensionality. S(b, v) depends only on the credences in b and the truth-values in v.
Normality. S(b, v) depends only on the absolute differences between the credences in b and the truth-values in v.
Convexity. λ·S(b, v) + (1 − λ)·S(c, v) > S(λ·b + (1 − λ)·c, v) for all λ ∈ (0, 1).
Symmetry. If S(b, v) = S(c, v), then S(λ·b + (1 − λ)·c, v) = S((1 − λ)·b + λ·c, v) for all λ ∈ (0, 1).
Propriety. S is a proper scoring rule: if b is coherent, then the minimum expected Brier score is uniquely attained by b itself when expected values are computed using b.
Coherent Admissibility. If b is coherent, then for any alternative c there is a truth-value assignment v such that S(c, v) > S(b, v).
In addition, the Brier score can be decomposed, see Murphy [1973], into a sum of the calibration score and another epistemically significant quantity called the discrimination index, which measures the degree to which the credences in b discriminate truths from falsehoods.
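Both the calibration pathology described above and the contrast with the Brier score can be checked numerically. A sketch with hypothetical numbers (two truths and one falsehood; `cal` and `brier` follow the Cal and Brier formulas in the text):

```python
from collections import defaultdict

def cal(b, v):
    """Calibration score Cal(b, v): group propositions by credence and
    compare each group's shared credence to its actual truth-frequency."""
    groups = defaultdict(list)
    for prop, cred in b.items():
        groups[cred].append(prop)
    n = len(b)
    return sum((len(props) / n) * (sum(v[p] for p in props) / len(props) - x) ** 2
               for x, props in groups.items())

def brier(b, v):
    """Brier(b, v) = (1/#A) * sum of squared credence/truth-value gaps."""
    return sum((v[p] - b[p]) ** 2 for p in b) / len(b)

# Two truths (A1, A2) and one falsehood (B).
v = {'A1': 1, 'A2': 1, 'B': 0}
b1 = {'A1': 0.9, 'A2': 0.6, 'B': 0.4}   # reasonably close to the truth
b2 = {'A1': 0.5, 'A2': 0.5, 'B': 0.5}   # uniformly *further* from the truth

# b2 lowers the credence in every truth and raises it in the falsehood,
# yet its calibration score improves (0.11 down to 1/36), while the
# truth-directed Brier score correctly ranks b1 as more accurate.
assert cal(b2, v) < cal(b1, v)
assert brier(b1, v) < brier(b2, v)
print(cal(b1, v), cal(b2, v))   # ~0.11 and ~0.0278
```

The pathology arises because moving every credence to 0.5 merges the three propositions into a single, well-calibrated group, illustrating the point that credences affect Cal by determining the composition of the groups as well as the estimates for them.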
Joyce [1998], inspired by both de Finetti’s Dutch book argument and the Shimony/van Fraassen approach, tries to improve on these results. He imposes a series of constraints on epistemic scoring rules — including Truth-Directedness, Extensionality, Normality, Symmetry and Convexity — and argues that the property of being S-dominated relative to any rule S which meets the requirements is an epistemic defect.18 Joyce then proves that incoherent credences are S-dominated relative to any rule that satisfies the constraints.
18 More precisely, the claim is that b is defective when, for each scoring rule S that satisfies the constraints, there is a credence function bS such that S(b, v) > S(bS, v) for every logically consistent truth-value assignment v.
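The dominance phenomenon can be illustrated for the Brier score on a two-cell partition. A minimal sketch with hypothetical numbers (the dominating coherent credence is obtained here by orthogonal projection onto the set of coherent credences, one standard construction):

```python
def brier(b, v):
    """Brier score over a partition: mean squared credence/truth gap."""
    return sum((vi - bi) ** 2 for bi, vi in zip(b, v)) / len(b)

# An incoherent credence over {h, not-h}: the two credences sum to 0.4.
b = (0.2, 0.2)

# Its orthogonal (Euclidean) projection onto the coherent line x + y = 1.
t = (1 - sum(b)) / 2
c = (b[0] + t, b[1] + t)          # (0.5, 0.5), which is coherent

# c Brier-dominates b: it scores strictly better whichever cell is true.
for v in [(1, 0), (0, 1)]:
    assert brier(c, v) < brier(b, v)
print(c, brier(b, (1, 0)), brier(c, (1, 0)))
```

Whichever way the world turns out, the coherent credence (0.5, 0.5) is more accurate than (0.2, 0.2), which is the sense in which the incoherent credences are defective no matter what the world is like.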
Maher [2002] criticizes some of Joyce’s constraints, especially Convexity, by claiming that the non-convex, improper absolute value score Abs(b, v) = [1/#(A)]·ΣA |v(A) − b(A)| is the best measure of epistemic disutility. Joyce [2009] rejects this on the basis of the following sort of case:
Example. A fair three-sided die will be tossed. Consider credence functions b = (1/3, 1/3, 1/3) and c = (0, 0, 0). Calculation shows Abs(b, (1, 0, 0)) = Abs(b, (0, 1, 0)) = Abs(b, (0, 0, 1)) = 4/9 and Abs(c, (1, 0, 0)) = Abs(c, (0, 1, 0)) = Abs(c, (0, 0, 1)) = 1/3. The logically inconsistent assignment c thus dominates the assignment b.
The problem here is not the assignment b, but the rule Abs, which does not do enough to penalize error and so portrays the obviously correct credence assignment (the one that agrees with the known objective chances) as less accurate than the logically inconsistent assignment that assigns credence zero to each of three possibilities that are known to be collectively exhaustive.
Gibbard [2008] also disputes some of Joyce’s constraints, notably Symmetry and Normality, and argues that scoring rules can only be useful for purposes of guiding action if they are proper. He thus wonders whether a Joyce-styled argument for probabilism can be mounted on the basis of Propriety. It turns out that Lindley [1982] had already provided an argument of this sort, albeit not billed as such. Lindley imagines a believer who assigns credences bn to members of a finite partition. These credences are scored using a rule S with these features:
• Truth-Directedness.
• Additive form: S(b, v) = Σn λn·sn(bn, vn).
• sn(b, 1) and sn(b, 0) are defined and have continuous first derivatives on [0, 1].
• (d/db)sn(b, 0) / (d/db)sn(b, 1) approaches 0 as b approaches 0 from above.
• (d/db)sn(b, 1) / (d/db)sn(b, 0) approaches 0 as b approaches 1 from below.
Given these assumptions, Lindley proved the following:
Lindley’s Theorem: bn is undominated relative to S only if the numbers
L(bn) = (d/db)sn(bn, 0) / [(d/db)sn(bn, 0) − (d/db)sn(bn, 1)]
obey the laws of probability. Additionally, if the map L is one-to-one, then the L(bn) obey the laws of probability only if bn is undominated.
So, every undominated credence function has a “known transform” into a probability, and if this transformation is one-to-one, then any set of credences with a coherent transform is undominated. Lindley then observes, in passing, that if S is proper then bn = L(bn) for each n and the transform is one-to-one. Thus, for
proper rules bn is incoherent (coherent) if and only if there is a (is no) credence function cS such that S(b, v) ≥ S(cS, v) for all truth-value assignments v, with the inequality holding strictly for at least one v. This is a lovely result, and very close to what Gibbard hoped for, but it does not give us quite everything. First, it does not guarantee that incoherent credences are strictly dominated, since it might be that S(b, v) = S(cS, v) for all v. This issue has been largely resolved by Leib et al. (manuscript), who derive the stronger conclusion from similar assumptions. Second, and more important, the requirement that S be a proper scoring rule is not appropriate in contexts where the aim is to provide a vindication of probabilistic coherence that might convince non-Bayesians. Propriety makes sense on the assumption that rationality requires maximizing expected epistemic utility, but, as has already been emphasized, the concept of an expectation only makes sense when expectations are generated by probabilities. Joyce [2009] shows how to use Coherent Admissibility, a weakening of Propriety that invokes only dominance considerations, to obtain the desired result. Specifically, it is possible to prove:
Theorem:19 If S satisfies Truth-Directedness and Coherent Admissibility, and is bounded and continuous for all credence functions and truth-value assignments, then
• For every incoherent credence function b defined on a finite partition of propositions there is a coherent credence function cS such that S(b, v) ≥ S(cS, v) for all logically consistent truth-value assignments v, with at least one inequality being strict.
• No coherent credence function is ever S-dominated in this way.
While this is close to what one would want, it would be desirable to remove the restriction to partitions, to relax the requirement that S be bounded, and to find a way of deriving Coherent Admissibility from more basic principles.
For further critical discussion of results like these see [Joyce, 2009; Hájek, 2008].
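Lindley’s transform can be evaluated numerically for familiar proper rules. A sketch (the central-difference derivative and the two example scores are illustrative choices) confirming that for both the Brier and logarithmic scores the transform is the identity, which is what Lindley’s observation that propriety gives bn = L(bn) predicts:

```python
import math

def numerical_L(s, b, h=1e-6):
    """Lindley's transform L(b) = s0'(b) / (s0'(b) - s1'(b)), where s(b, 1)
    and s(b, 0) score credence b against a truth and a falsehood; the
    derivatives are taken numerically by central differences."""
    d0 = (s(b + h, 0) - s(b - h, 0)) / (2 * h)
    d1 = (s(b + h, 1) - s(b - h, 1)) / (2 * h)
    return d0 / (d0 - d1)

def brier(b, v):
    """Quadratic (Brier) score for a single proposition."""
    return (v - b) ** 2

def log_score(b, v):
    """Logarithmic score: minus the log of the credence in the actual truth-value."""
    return -math.log(b if v == 1 else 1 - b)

for b in (0.1, 0.3, 0.5, 0.8):
    assert abs(numerical_L(brier, b) - b) < 1e-5
    assert abs(numerical_L(log_score, b) - b) < 1e-5
print("L(b) = b for the Brier and logarithmic scores")
```

For the Brier score this can also be checked by hand: s(b, 0) = b² and s(b, 1) = (1 − b)², so L(b) = 2b / (2b + 2(1 − b)) = b.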
2.3 Is Probabilistic Coherence Enough?
Clearly, there is no shortage of proposed justifications for probabilistic coherence. Even if we are entirely convinced by one (or more) of them, however, subjective Bayesianism remains a “garbage-in-garbage-out” theory. Coherent beliefs can be absurd, but subjective Bayesians tolerate them no matter how crazy they get (short of violating the laws of probability). Many find this objectionable. Alan Chalmers [1999, p. 188] expresses the worry nicely:
19 Joyce [2009] actually invokes a weaker version of Coherent Admissibility in which the “>” is replaced by “≥”. This, however, was a mistake since the proof makes use of the stronger version.
Once we take probabilities as subjective degrees of belief... a range of unfortunate consequences follow. The Bayesian calculus is portrayed as an objective mode of inference that serves to transform prior probabilities into posterior probabilities in light of given evidence. Once we see things this way, it follows that any disagreements in science must have their source in the prior probabilities held by the scientists. But these prior probabilities themselves are totally subjective and not subject to critical analysis. Consequently, those of us who raise questions about the relative merits of competing theories... will not have our questions answered by the subjective Bayesian, unless we are satisfied with an answer that refers to the beliefs that individual scientists just happen to have started out with.
Hard-core subjectivists, like de Finetti, will reject such concerns out of hand. “It’s too bad,” they will say, “but all we have to go on in science are ‘the beliefs that individual scientists just happen to have started out with’. To think otherwise is to pretend that you have access to information you cannot possibly possess.” Other Bayesians will suggest that things are not nearly so bad as Chalmers implies since (a) disagreements in science should ultimately be resolved by acquiring further evidence, and (b) as more evidence is acquired disagreements become less severe and less frequent. The next section explains how this “tempered” Bayesian response to the problem of the priors is supposed to work.
2.4 Tempered Bayesianism and “Washing Out”
Tempered Bayesians maintain that prior opinion will tend to “wash out” as believers acquire more and more information. Here is a famous statement of the view from Edwards et al. [1963, p. 201], one of the classic papers on the topic:
If observations are precise... then the form and properties of the prior distribution have negligible influence on the posterior distribution. From a practical point of view, then, the untrammeled subjectivity of opinion... ceases to apply as soon as much data becomes available. More generally, two people with widely divergent prior opinions but reasonably open minds will be forced into arbitrarily close agreement about future observations by a sufficient amount of data.
This “merger of opinion” is what passes for objectivity in the tempered Bayesian’s worldview, which sees objectivity as intersubjective agreement in the long run. The basis of such claims is found in the “washing out” theorems, which purport to show that believers who start with widely divergent, but coherent, priors and who update on the same data streams will eventually converge on a “consensus posterior.” Since the structure of all the washing out theorems is similar, we will consider only the most general version, which relies on the martingale convergence theorem of Doob [1971].
Consider two believers with priors b and c such that 0 < b(h), c(h) < 1 for some hypothesis h. These believers will undergo a potentially infinite sequence of learning experiences involving the random variables X1, X2, X3, ..., each of which may take finitely many values. Assume that:
a. Neither subject is closed-minded about the data: both b and c assign each finite data sequence dj = (X1 = x1) ∧ (X2 = x2) ∧ ... ∧ (Xj = xj) an intermediate probability.
b. At time j each subject learns the actual value of Xj.
c. Both subjects know they will condition on the evidence they receive, i.e., they know that, for each time j, bj(h) = b(h|dj) and cj(h) = c(h|dj).
(a)–(c) entail that each subject’s credences form a martingale sequence in which every term is the expected value of its successor: bj(h) = Σx bj(bj+1(h) = x)·x and cj(h) = Σx cj(cj+1(h) = x)·x. Doob’s Theorem says that, with probability one, these sequences converge to a definite limit. To show that these limits coincide, additional assumptions are required. Here are two possibilities:
• Each possible infinite data stream determines a truth-value v ∈ {0, 1} for h.
• Each possible infinite data stream determines an objective chance p ∈ [0, 1] for h.
If the first assumption holds, limj bj(h) = limj cj(h) = v. If the second holds, and if the subjects align their credences with their best estimates of the objective chances (of which more below), then limj bj(h) = limj cj(h) = p. Success? Unfortunately not. Washing out theorems that rely on either of the above assumptions will not assuage worries about subjectivity. Here prior opinion “washes out” only because the data is so incredibly informative in the limit that the subject’s prior beliefs are irrelevant to her final view as a matter of logic. In the chance case, for example, each learning experience might reveal successive digits of the hypothesis’s objective chance.
Priors will then wash out, of course, but only because they play no real role in the posterior at all. If all learning situations were like this, there would be no call for priors at all, at least in the limit. To obtain washing out theorems that might have a chance at quelling worries about subjectivism, we must start from weaker data and derive convergence from commonalities among the priors alone. One way to do this, pioneered by Savage [1954, pp. 46-50], is to suppose that the subjects agree that the data statements are statistically independent and identically distributed (IID) conditional on both h and ¬h.
• If dj is any possible data stream, and x is any possible value of Xk with k > j, then b(Xk = x|h) = b(Xk = x|h ∧ dj) and b(Xk = x|¬h) = b(Xk = x|¬h ∧ dj). The same holds with b replaced by c.
Under this assumption, it can be shown that bj(h) and cj(h) will converge to a common value with probability one according to both b and c. Many Bayesians see this as an answer to the concerns that people like Chalmers have raised about the “anything goes” aspect of subjective Bayesianism. The case of repeated IID trials is a common one, and the result shows that subjects who believe they are watching a sequence of such trials will eventually come to agree to a greater and greater extent.
Example. Neither Pierre nor Bruno knows anything about the bias of a coin. Pierre begins with a uniform prior over [0, 1], while Bruno starts with a sharply peaked normal distribution, with mean 1/10 and variance 1/100. After fifteen tosses in which ten heads appear, the two will still be far apart in their estimates of the bias of the coin. But, after 3,000 tosses in which 2,000 heads have appeared, both will have arrived at posterior distributions that differ only in very distant decimal places.
While this is comforting, it is important to recognize that the theorem requires a great deal of initial agreement. All parties must concur about the possible hypotheses. If, e.g., Bruno does not spread his prior over all of [0, 1], thereby excluding some biases from consideration, convergence with Pierre is no longer assured. Likewise, all parties must agree about the possible data sequences, and they must all know that they will condition on whatever data they receive. Believers who disagree about these things will not tend toward consensus as evidence accumulates simply because they will not agree about what counts as evidence. If, e.g., after seeing 3,000 straight heads Bruno concludes with certainty that the coin is two-headed and ignores all future data (a conclusion not sanctioned by conditioning), then the convergence result does not apply. If either party does not believe he or she is observing an IID process, then the result does not apply.
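The Pierre/Bruno example can be simulated by discretizing the bias. A sketch (the grid and Bruno’s discretized normal prior are illustrative approximations; the toss counts come from the example):

```python
import math

grid = [i / 100 for i in range(1, 100)]   # candidate biases 0.01 .. 0.99

# Log-priors over the bias: Pierre is flat; Bruno is a discretized normal
# peaked at 0.1 with variance 1/100, as in the example.
pierre = [0.0 for _ in grid]
bruno = [-(p - 0.1) ** 2 / (2 * 0.01) for p in grid]

def posterior_mean(log_prior, heads, tosses):
    """Condition on `heads` heads in `tosses` tosses (log-space for numerical
    stability), then return the posterior estimate of the bias."""
    logw = [lp + heads * math.log(p) + (tosses - heads) * math.log(1 - p)
            for lp, p in zip(log_prior, grid)]
    m = max(logw)
    w = [math.exp(x - m) for x in logw]
    total = sum(w)
    return sum(p * wi for p, wi in zip(grid, w)) / total

for tosses, heads in [(15, 10), (3000, 2000)]:
    mp = posterior_mean(pierre, heads, tosses)
    mb = posterior_mean(bruno, heads, tosses)
    print(tosses, round(mp, 3), round(mb, 3), round(abs(mp - mb), 3))
```

After 15 tosses the two posterior means still differ substantially; after 3,000 tosses they nearly coincide, as the example asserts.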
While these restrictions do not render the washing out theorems irrelevant to questions about subjectivism, they underscore a significant limitation. The washing out results only secure agreement in the limit by assuming substantial amounts of agreement at the start. It’s a case of no agreement in, no agreement out!
3 INDUCTIVE INFERENCE AS UPDATING SUBJECTIVE PROBABILITY
Whatever their views about the status of prior probabilities, all Bayesians see inductive inference as a matter of updating probabilities in light of new evidence. Abstractly, this process involves a probability P that represents some prior state of evidence, a learning experience λ that imposes constraints on a posterior probability, and an update rule, P − λ → Pλ, which maps the prior to a unique posterior satisfying the constraints. As before, we think of the prior as derived from an unconditional probability distribution over a partition of hypotheses H, and a family of normalized likelihood functions Lx : H → [0, ∞), where Lx(h) = P(x|h) is the probability of observing datum x ∈ X if h ∈ H holds. A learning experience λ
conveys information which requires that this prior be supplanted by a posterior Pλ that satisfies certain constraints. In most cases, these constraints can be represented by a set of equations Expλ(fk) = Σh,x Pλ(h ∧ x)·fk(h ∧ x) = ck, where each fk is a random variable on H ∧ X and each ck is a real number. Here are some possible constraints (with H ∧ X assumed finite):
a. {Pλ(x2) = 1}
b. {Pλ(x1) = 0.2, Pλ(x2) = 0.3, Pλ(x3) = 0.5, Pλ(xn) = 0 for n > 3}
c. {Pλ(x1) = 2·P(x1), Pλ(x2) = 1/2·P(x2), Pλ(x3) = 5·P(x3), Pλ(xn) = 0·P(xn) for n > 3}
In (a) experience delivers the verdict that x2 is certainly true. In (b) it directly sets probabilities for the various data statements. In (c) new probabilities for the data statements are specified as a function of their priors. Notice that (b) and (c) impose identical constraints when P(x1) = 0.1, P(x2) = 0.6, P(x3) = 0.1 and Σn>3 P(xn) = 0.2. Even so, it is crucial to understand that they describe different learning experiences. One might call (b) a hard experience because it ignores the prior and resets each Pλ(xn) de novo, thereby requiring the posterior to satisfy {Pλ(x1) = 0.2, Pλ(x2) = 0.3, Pλ(x3) = 0.5} for any prior. In contrast, (c) fixes posterior probabilities indirectly via a specification of Bayes updating factors βP(·, λ) = Pλ(·)/P(·) for all xn. As a result, (c) and other soft experiences impose constraints on posteriors whose effects vary with changes in the priors. As these examples illustrate, experiences constrain posterior probabilities incompletely. The role of the update rule is to select that probability, from among the many that satisfy the constraints, which best preserves the information encoded in the prior. This process is governed by a kind of “minimal change” ethos which prohibits the posterior from introducing distinctions in probability among hypotheses that are not already inherent in the prior or explicitly mandated by the new evidence.
There should be no “jumping to conclusions.”
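The difference between the hard constraint (b) and the soft constraint (c) can be sketched in code. The first prior is the one from the text; the second, q, is a hypothetical alternative, chosen so that (c) still yields a probability:

```python
def hard(prior):
    """Constraint (b): reset the first three data probabilities outright,
    ignoring the prior entirely."""
    return {1: 0.2, 2: 0.3, 3: 0.5}

def soft(prior):
    """Constraint (c): rescale the prior by fixed Bayes updating factors."""
    return {1: 2 * prior[1], 2: 0.5 * prior[2], 3: 5 * prior[3]}

p = {1: 0.1, 2: 0.6, 3: 0.1}   # the prior from the text (remaining mass 0.2)
q = {1: 0.2, 2: 0.2, 3: 0.1}   # a hypothetical different prior

assert hard(p) == soft(p)      # identical posteriors for the prior p ...
assert hard(q) != soft(q)      # ... but the two constraints come apart for q
print(soft(q))                 # {1: 0.4, 2: 0.1, 3: 0.5}
```

This is the sense in which a soft experience, unlike a hard one, imposes constraints whose effects vary with the prior.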
3.1 Simple Bayesian Conditioning
In the simplest learning experiences a person comes to know some X ⊂ X with certainty. If this is all she learns, her experience will not distinguish among possibilities that entail X, and the relevant minimal change principle requires that X’s probability be raised to Pλ(X) = 1 in a way that does not disturb probability ratios among propositions that entail X.
MC1: If a learning experience’s effect on a posterior is confined to fixing Pλ(X) = 1, then ratios of probabilities among propositions that entail X should remain fixed, so that Pλ(A ∧ X)/Pλ(B ∧ X) = P(A ∧ X)/P(B ∧ X) for A, B ∈ H ∧ X.
This means that the absolute Bayes updating factor for propositions that entail X is βP(A, λ) = 1/P(X), and the relative Bayes updating factor is constant: βP(A, λ)/βP(B, λ) = 1. It should come as no surprise that the only probability that fits the bill is Pλ(•) = P(•|X). In this way, we secure the most basic principle of Bayesian inductive logic:
Updating by Conditioning. Suppose a person’s prior epistemic state is represented by a prior P that is not dogmatic about X, so that 1 > P(X) > 0. If the person has a learning experience λ in which she receives the information that X is certainly true, and if this is all she learns, her posterior should be Pλ(•) = P(•|X).
As we have already seen, conditional probabilities have features that make them ideal for modeling inductive learning. For example, Pλ(•) = P(•|X) is a probability that is dogmatic about X. In addition, the temporal order in which data is acquired is irrelevant to its evidential import.
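Updating by Conditioning and MC1 can be checked on a toy four-state space (the prior is hypothetical):

```python
# Toy space: four states with a prior; X is the event {s1, s2, s3}.
prior = {'s1': 0.1, 's2': 0.2, 's3': 0.3, 's4': 0.4}
X = {'s1', 's2', 's3'}

# Updating by conditioning: renormalize inside X, zero outside.
PX = sum(prior[s] for s in X)
post = {s: (prior[s] / PX if s in X else 0.0) for s in prior}

assert abs(sum(post.values()) - 1) < 1e-12       # the posterior is a probability
assert post['s4'] == 0.0                          # and it is dogmatic about X

# MC1: ratios among propositions that entail X are preserved, because
# every credence inside X is scaled by the same factor 1/P(X).
assert abs(post['s1'] / post['s2'] - prior['s1'] / prior['s2']) < 1e-12
assert all(abs(post[s] / prior[s] - 1 / PX) < 1e-12 for s in X)
print(post)
```

The uniform scaling factor 1/P(X) is exactly the constant absolute Bayes updating factor described above.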
3.2 Jeffrey Conditioning
The weakness of Bayesian conditioning as an update rule is that it only applies to learning experiences that raise the probability of some proposition to one. This dogmatic aspect of the process can seem implausible, especially given that propositions learned via conditioning cannot be unlearned by subsequent conditioning. Richard Jeffrey [1983], a forceful critic of dogmatist epistemologies, has advocated a model of learning that allows experiences to alter probabilities without raising them to certainty or lowering them to zero.
Example. Consider a wine drinker who orders the house red with the expectation that it is a cabernet. Taking a sip, he has a gustatory experience that causes him to doubt his judgment. This experience might move his subjective probabilities from a prior in which P(cabernet) = 0.95 to a posterior in which Pλ(cabernet) = 0.6.
One might try to explain this change by positing a proposition, say x = “The wine has a fruity taste”, that the drinker comes to learn with certainty during the experience, and to suppose his priors were such that P(cabernet|x) = 0.6. While this might make sense for sophisticated oenophiles, it is implausible for most people. If asked, the drinker might not be able to articulate any specific feature of the wine that leads him to alter his views, and even if he comes up with something vague — “It doesn’t taste like other cabs” — it is even less plausible that he will have the required prior conditional probabilities. A better model, Jeffrey suggests, portrays the change as being a direct effect of the experience, without the mediation of any knowledge gained with certainty. The gustatory experience causes the drinker to move directly from P(cabernet) = 0.95 to Pλ(cabernet) = 0.6. If this is its only immediate effect, Jeffrey argues, then probabilities of events conditional
450
James M. Joyce
on {cabernet, not-cabernet} should remain fixed, so that Pλ(·) = 0.6·P(·|cabernet) + 0.4·P(·|¬cabernet). More generally, a Jeffrey learning experience on a partition X is represented by a countable vector of ordered pairs λ = ⟨xn, λn⟩, where 1 > P(xn) > 0 for all n and λ1, λ2, ... is a sequence of non-negative real numbers summing to one. In Jeffrey's picture, the learning experience directly fixes each λn as the posterior for xn. An experience with this as its only immediate effect will introduce no new distinctions in probability among propositions that entail the same xn. This leads to the following minimal change principle: MC2: If a Jeffrey learning experience λ's effect on a posterior is confined to fixing probabilities for elements of X, then the ratios of probabilities of events that entail each xn should remain fixed, so that P(A ∧ xn)/P(B ∧ xn) = Pλ(A ∧ xn)/Pλ(B ∧ xn) for A, B ∈ H ∧ X. The only probability that obeys MC2 subject to the constraint that Pλ(xn) = λn for each n is the Jeffrey shift: Pλ(•) = P(•|⟨xn, λn⟩) = Σn λn·P(•|xn). Despite the notation, one should not think of λ = ⟨xn, λn⟩ as a proposition or of P(•|⟨xn, λn⟩) as a conditional probability. Rather, λ is a direct specification of posterior probabilities. Ordinary Bayesian conditioning can be seen as a special case of Jeffrey conditioning, since P(•|X) = P(•|⟨xn, λn⟩) when λn = P(xn|X) for all n, but not every instance of Jeffrey conditioning involves conditioning on a proposition. Like ordinary conditioning, Jeffrey conditioning has a number of features that recommend it as a model for learning. In particular, it satisfies: Sufficiency. Pλ(•|xn) = P(•|xn) for all n. Conversely, if Q(•|xn) = P(•|xn) for all n, then Q(•) = P(•|⟨xn, Q(xn)⟩). Refinement. If Pλ,μ is the result of one Jeffrey shift P −⟨xn, λn⟩→ Pλ followed by a second Jeffrey shift Pλ −⟨yk, μk⟩→ Pλ,μ, then Pλ,μ can always be represented as a single Jeffrey shift based on the refined partition {xn ∧ yk}.
Specifically, Pλ,μ(•) = Σn,k βP(xn, λ)·βPλ(yk, μ)·P(• ∧ xn ∧ yk),
where βP(xn, λ) = λn/P(xn) is the Bayes factor for xn in the first shift, and βPλ(yk, μ) = μk/Pλ(yk) is the Bayes factor for yk in the second shift. There is an asymmetry lurking in Refinement. Since these Bayes factors are defined relative to a given prior and posterior, reversing the order of updating can alter the final result. Example. You discover Colonel Mustard lying on the floor of the Conservatory. "Either Mrs. Peacock (x1), Mrs. White (x2) or Miss Scarlet
The Development of Subjective Bayesianism
451
(x3) did it," he mutters, before dying. You set about trying to determine the culprit, starting from equal priors ⟨P(x1), P(x2), P(x3)⟩ = ⟨1/3, 1/3, 1/3⟩. The air is thick with perfume, and (though you can't say exactly why) it strikes you as smelling unlike Mrs. Peacock. On the basis of this olfactory experience you become less inclined to suspect her, and your subjective probability for x1 falls to 1/5. Since this is the only effect of the experience (the scent fails to discriminate Mrs. White from Miss Scarlet), you Jeffrey condition to arrive at ⟨Pλ(x1), Pλ(x2), Pλ(x3)⟩ = ⟨0.2, 0.4, 0.4⟩. The coroner arrives and examines the body. "Aha!," he says, "the Colonel was killed with a r...," but just then, as he is about to name the weapon, a lead pipe crashes through the window and kills him dead. Even though you could not make out what was said, you have the sense that the coroner was closer to saying "rō", which would indicate that a rope was used, than he was to saying "rǒ", which would indicate a wrench, or "rē", which would suggest a revolver. As it happens, you know that Mrs. Peacock kills with a revolver, Mrs. White prefers a wrench, and Miss Scarlet uses a rope. On the basis of your auditory experience, your probability for Miss Scarlet being the culprit rises to 0.7, and you Jeffrey condition to get ⟨Pλ,μ(x1), Pλ,μ(x2), Pλ,μ(x3)⟩ = ⟨0.1, 0.2, 0.7⟩. Surprisingly, if we imagine the evidence coming in the reverse order we get a different answer. The auditory experience, if it occurs first, induces a shift from P to ⟨Pμ(x1), Pμ(x2), Pμ(x3)⟩ = ⟨0.15, 0.15, 0.7⟩. If this is followed by the olfactory experience, a second Jeffrey shift yields ⟨Pμ,λ(x1), Pμ,λ(x2), Pμ,λ(x3)⟩ = ⟨0.2, 0.1412, 0.6588⟩. Jeffrey conditioning clearly does not commute as ordinary Bayesian conditioning does. One experience-induced shift P → Pλ followed by another Pλ → Pλ,μ need not be evidentially equivalent to P → Pμ followed by Pμ → Pμ,λ.
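The two orders of updating in this example can be reproduced with a short computation (a sketch; `jeffrey_shift` is our own hypothetical helper implementing the Jeffrey rule on the three-suspect partition):

```python
def jeffrey_shift(prior, fixed):
    """Jeffrey conditioning on the suspect partition: cells listed in
    `fixed` get their stipulated posteriors; the remaining probability
    mass is divided among the other cells in proportion to their priors."""
    rest = [k for k in prior if k not in fixed]
    spare = 1.0 - sum(fixed.values())
    total = sum(prior[k] for k in rest)
    post = dict(fixed)
    post.update({k: spare * prior[k] / total for k in rest})
    return post

P = {"Peacock": 1/3, "White": 1/3, "Scarlet": 1/3}

# Olfactory experience first, then auditory:
P_lm = jeffrey_shift(jeffrey_shift(P, {"Peacock": 0.2}), {"Scarlet": 0.7})
# Auditory first, then olfactory:
P_ml = jeffrey_shift(jeffrey_shift(P, {"Scarlet": 0.7}), {"Peacock": 0.2})

# The two orders disagree, exactly as in the text.
assert abs(P_lm["Peacock"] - 0.1) < 1e-9 and abs(P_lm["White"] - 0.2) < 1e-9
assert abs(P_ml["Peacock"] - 0.2) < 1e-9
assert abs(P_ml["White"] - 0.8 * 0.15 / 0.85) < 1e-9   # ≈ 0.1412
```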
Indeed, Refinement entails that Jeffrey shifts commute only when βP(xn, λ)·βPλ(yk, μ) = βP(yk, μ)·βPμ(xn, λ). Many commentators object to this non-commutative aspect of Jeffrey's approach. For example, Kelly [2008, p. 616] rejects Jeffrey conditioning on the basis of: Commutativity of Evidence Principle (CEP). "To the extent that what it is reasonable for one to believe depends on one's total evidence, historical facts about the order in which that evidence is acquired make no difference to what it is reasonable for one to believe." Others have made similar claims, including Döring [1999] and Lange [2000]. It turns out, however, that CEP is mistaken, and that Jeffrey conditioning fails to commute in precisely those circumstances where CEP fails. Refer back to the distinction between "hard" and "soft" learning experiences. It is tempting to think that all Jeffrey shifts reflect hard experiences, so that the posterior for each xn is set to λn no matter what prior is in play, i.e., Pλ(xn) = Qλ(xn) for all priors
P and Q. Such experiences obliterate any information about X that might have been contained in the prior: except in the very special case where P(xn) = Pλ(xn) for all xn, it will be impossible to infer anything about the prior distribution on X from the posterior distribution on X. Consequently, for hard Jeffrey shifts λ and μ, one has Pλ,μ(yk) = Pμ(yk) = μk and Pμ,λ(xn) = Pλ(xn) = λn. When μ occurs last it cancels out any information λ might have conveyed about Y. When λ occurs last it cancels any information μ might have conveyed about X. We should not generally expect updating rules to commute in such situations: CEP should fail when the second experience expunges information provided by the first. In general, we should only want commutation among hard Jeffrey shifts when the information in the first experience is preserved in the second, so that λn = Pλ,μ(xn) and μk = Pμ,λ(yk). As Diaconis and Zabell [1982] show, this happens exactly when λ and μ are Jeffrey independent relative to P in the sense that P(xn) = Pμ(xn) for xn ∈ X and P(yk) = Pλ(yk) for yk ∈ Y. Indeed, they prove that Jeffrey independence is necessary and sufficient for commutation when λ and μ are hard Jeffrey shifts.20 So, in the case of hard learning, Jeffrey conditioning commutes in exactly those cases in which it should, viz., when the second experience does not nullify the information provided by the first. Now, one might wonder how often "hard" Jeffrey shift experiences occur, especially in sequences that generate commutativity failures. How plausible is it to think, e.g., that hearing the coroner intone "r..." affects one's beliefs the same way whether one hears it before or after smelling the perfume? It seems likely that differences in one's prior state of opinion will affect one's responses to evidence. There are two ways to cash this out.
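The Diaconis–Zabell equivalence can be illustrated numerically on a toy joint distribution over two binary partitions (our own sketch; the priors and shift values are assumed for illustration, and `jeffrey_on_rows`/`jeffrey_on_cols` are hypothetical helpers):

```python
def jeffrey_on_rows(P, row_post):
    """Hard Jeffrey shift fixing the row marginals of a 2-D joint."""
    row_mass = [sum(r) for r in P]
    return [[P[i][j] * row_post[i] / row_mass[i] for j in range(len(P[i]))]
            for i in range(len(P))]

def jeffrey_on_cols(P, col_post):
    transpose = [list(c) for c in zip(*P)]
    return [list(c) for c in zip(*jeffrey_on_rows(transpose, col_post))]

lam, mu = [0.8, 0.2], [0.3, 0.7]

# Case 1: rows and columns independent under the prior. A hard shift on one
# partition then leaves the other's marginals alone, so Jeffrey independence
# holds and the two orders of updating agree.
P_ind = [[0.25, 0.25], [0.25, 0.25]]
a = jeffrey_on_cols(jeffrey_on_rows(P_ind, lam), mu)
b = jeffrey_on_rows(jeffrey_on_cols(P_ind, mu), lam)
assert all(abs(a[i][j] - b[i][j]) < 1e-12 for i in range(2) for j in range(2))

# Case 2: correlated prior. Each shift disturbs the other partition's
# probabilities, Jeffrey independence fails, and so does commutativity.
P_cor = [[0.40, 0.10], [0.10, 0.40]]
a = jeffrey_on_cols(jeffrey_on_rows(P_cor, lam), mu)
b = jeffrey_on_rows(jeffrey_on_cols(P_cor, mu), lam)
assert any(abs(a[i][j] - b[i][j]) > 1e-3 for i in range(2) for j in range(2))
```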
Lange [2000] has argued that seeming commutativity failures of Jeffrey conditioning arise only in cases where the subject is not really undergoing the same experiences in reverse order. Rather, the character of the second experience is altered as a result of the first, so that, e.g., smelling the perfume is a different experience after one has heard the coroner than when experienced de novo. This preserves the "hard" aspect of experience — if λ and μ really are the same experiences, independent of the order in which they occur, then Pλ,μ(yk) = μk and Pμ,λ(xn) = λn — but it suggests that apparent failures of commutativity have the form Pλ,μ∗ ≠ Pμ,λ∗, where λ*, the experience one has after μ, is not the same as λ, and μ*, the experience one has after λ, is not the same as μ. Lange's approach has the disadvantage of requiring the qualitative character or content of experiences themselves to vary depending on the subject's prior epistemic state, which seems like an instance of the Hanson/Kuhn "theory ladenness of observation" fallacy. A more plausible, and more comfortably Bayesian, picture of the situation will portray subjects as drawing different conclusions from the same experiential data depending on their prior beliefs. 20 While Diaconis and Zabell do not make the hard/soft distinction, their results are clearly meant to apply only to hard learning experiences.
Such a picture is found in Field [1978]. Field recognized that there can be "soft" Jeffrey shifts in which Pλ(xn) and Qλ(xn) diverge. Such shifts have the form λ = ⟨xn, λn(P)⟩, where the λn(P) are weighting coefficients that may depend on P. Experience sets a posterior for each xn indirectly by specifying a Bayes factor βP(xn, λ) that is unique up to multiplication by a positive constant. Then, independent of P, experience stipulates that Pλ(xn)/P(xn) ∝ βn for some set of non-negative real numbers β1, β2, ....21 For any such "Field shift" ⟨xn, βn⟩ there is always an associated Jeffrey shift λ where λn(P) = βn·P(xn)/Σj βj·P(xj), with the associated posterior Pλ(•) = Σn βn·P(• ∧ xn)/[Σj βj·P(xj)]. In addition to being soft, experiences that produce Field shifts preserve prior information about the distribution over X. Given knowledge of the posterior over X and the fact that it arose from P via a Field shift ⟨xn, βn⟩, the prior over X can be deduced. This ensures that the effects of successive Field shifts λ = ⟨xn, βn⟩ and μ = ⟨yk, χk⟩ are always independent of the order in which they occur. In general, one has

Pλ,μ(•) = Σk,n χk·βn·P(• ∧ xn ∧ yk)/[Σi,j χi·βj·P(xj ∧ yi)]
Since this is a Field shift on the refined partition {xn ∧ yk} that is symmetric in n and k, it follows that Pλ,μ(·) = Pμ,λ(·). Moreover, as Wagner [2002] showed, the converse holds: soft Jeffrey shifts commute only if they can be represented as successive Field shifts. The moral is that Field conditioning differs from forms of Jeffrey conditioning that do not commute because subsequent Field conditioning on events in one partition does not expunge information already received about the other partition. So, to emphasize the key points:
• Jeffrey conditioning commutes when it should, i.e., when subsequent updatings preserve information acquired in earlier updatings.
• Jeffrey conditioning does not commute when it should not, i.e., when subsequent updatings destroy information acquired in earlier updatings.
Turning now to a different facet of Jeffrey conditioning, let's ask whether there are good reasons for thinking that Jeffrey's rule is the correct way to generalize Bayesian conditioning for non-dogmatic learning experiences. An affirmative answer is provided by Diaconis and Zabell [1982], who show that, relative to a range 21 Field portrayed ⟨xn, bn⟩ as an "input parameter" that captures the information conveyed by experience itself, so that "same experience" = "same sequence of bn". As shown in Garber [1980], this interpretation founders on cases of repeated sampling. In our "Clue" example, suppose sniffing the perfume causes a small shift from ⟨1/3, 1/3, 1/3⟩ to ⟨0.34, 0.35, 0.31⟩, so that b1 = 34/300, b2 = 35/300 and b3 = 31/300. If ⟨b1, b2, b3⟩ were the experiential input itself, independent of the evidential context in which it occurred, then a second sniff would cause a shift to ⟨0.3459, 0.3665, 0.2875⟩, a third would cause a shift to ⟨0.351, 0.383, 0.266⟩, ..., and a fiftieth would cause a shift to ⟨0.194, 0.804, 0.002⟩. But, repeatedly undergoing the same uninformative experience should not have such a dramatic effect on probabilities.
of ways of measuring divergences among probability functions, the Jeffrey shift Pλ is the "closest" probability Q to P that satisfies the constraints Q(xn) = λn for all n. These measures include all of the following (with ω ranging over the atomic elements of the algebra Ω over which the probabilities are defined):
Variational Distance. VP(Q) = sup{|P(A) − Q(A)| : A ∈ Ω}
Brier Score. BP(Q) = Σω (P(ω) − Q(ω))²
Hellinger Distance. HP(Q) = Σω [P(ω) + Q(ω) − 2·(P(ω)·Q(ω))^1/2]
Kullback-Leibler Entropy. KP(Q) = Σω Q(ω)·log(Q(ω)/P(ω))
In all cases except the first, Pλ is the unique minimum. To the extent that one is impressed with the idea that belief updating rules should make the smallest change in the prior consistent with the new data, these results make Jeffrey conditioning look like an excellent rule for updating subjective probabilities in the kinds of situation to which it applies.
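For the Hellinger and Kullback-Leibler measures, the minimal-divergence claim can be checked by brute force on a small example (a sketch with assumed numbers based on the wine case; the grid search only approximates the true minimizer):

```python
from math import log, sqrt

# Prior over four atoms: cab&fruity, cab&dry, other&fruity, other&dry.
# Illustrative numbers with P(cab) = 0.95, as in the wine example.
P = [0.10, 0.85, 0.03, 0.02]

# Jeffrey shift moving P(cab) to 0.6: scale each cell proportionally.
jeffrey = [0.6 * P[0] / 0.95, 0.6 * P[1] / 0.95,
           0.4 * P[2] / 0.05, 0.4 * P[3] / 0.05]

def hellinger(Q): return sum((sqrt(p) - sqrt(q)) ** 2 for p, q in zip(P, Q))
def kl(Q):        return sum(q * log(q / p) for p, q in zip(P, Q) if q > 0)

# All Q meeting the constraint Q(cab) = 0.6 can be parametrized as
# Q = (0.6t, 0.6(1-t), 0.4s, 0.4(1-s)); search a fine grid over (t, s).
candidates = [[0.6 * t, 0.6 * (1 - t), 0.4 * s, 0.4 * (1 - s)]
              for t in (i / 200 for i in range(1, 200))
              for s in (i / 200 for i in range(1, 200))]

for divergence in (hellinger, kl):
    best = min(candidates, key=divergence)
    # Up to grid resolution, the divergence-minimizing posterior is
    # exactly the Jeffrey shift.
    assert all(abs(b - j) < 0.01 for b, j in zip(best, jeffrey))
```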
3.3 Other Forms of Bayesian Conditioning
There has been little systematic investigation of updating rules that apply when ordinary Bayesian conditioning and Jeffrey conditioning are inapplicable. In theory, a believer should be able to update on any informational constraint of the form Exp(f) = S, where f is a random variable and S is a set of real numbers. In practice, however, it is hard to know how to do this since the various measures of probabilistic divergence do not speak unequivocally for evidential constraints that cannot be expressed in terms of assignments of new probabilities to propositions. An excellent illustration of the phenomenon is provided by the "Judy Benjamin Problem" of van Fraassen [1981]. Here experience directly provides the value of a conditional probability, and the question is how to update beliefs in light of such information. Example. Judy, a paratrooper, has just been dropped in the dead of night into a square region that is aligned along the compass points, and is divided into four equal-sized regions: NE, NW, SE and SW. Initially, she believes that she might have landed anywhere in the region, and that her chances of being in the east or west do not depend on her north/south location, and conversely. So, Judy begins with these priors:

              West [0.5]   East [0.5]
North [0.5]      0.25         0.25
South [0.5]      0.25         0.25
Judy knows just two things about the region: (i) there are three times as many bears in the northwest as in the southwest, and (ii) her chances of seeing a bear in any region depend only on the number of bears in that region. If we let #(R) be the number of bears in region R, (i)
says #(NW) = 3·#(SW), and (ii) says that, for any regions R and R*, P(Bear|R)/P(Bear|R*) = #(R)/#(R*). This means that Judy's posterior must satisfy Pλ(R)/Pλ(R*) = [#(R)·P(R)]/[#(R*)·P(R*)] when she knows the value of #(R)/#(R*) and encounters a bear. Judy sees a bear. Being a good Bayesian, she decides to update before fleeing. Since she assigns NW and SW equal priors her posterior should satisfy Pλ(NW)/Pλ(SW) = 3. In effect, she has acquired the conditional information Pλ(N|W) = 3/4, which constrains her posterior so that for some p, q ∈ [0, 1]

                                  West [p]   East [1 − p]
North [q + (0.75 − q)·p]           0.75·p     q·(1 − p)
South [(1 − q) + (q − 0.75)·p]     0.25·p     (1 − q)·(1 − p)
The challenge is to determine what p and q should be. People tend to have two strong intuitions here. First, given that Judy is as likely to be in any one quadrant as any other, it seems that seeing a single bear cannot indicate anything about her east/west location. So, p should remain fixed at 1/2. Likewise, since Judy has no information about the distribution of bears in the eastern regions, it seems that seeing a single bear should not alter her views about the relative probabilities of NE and SE. So, q should be 1/2. Together these intuitions completely determine the posterior.

               West [0.5]   East [0.5]
North [0.625]     0.375        0.25
South [0.375]     0.125        0.25
Surprisingly, this answer cannot be justified on the basis of standard "minimum change" principles. The Brier, Hellinger and Kullback-Leibler measures of probabilistic divergence produce the following results:

             Pλ(NW)   Pλ(SW)   Pλ(NE)   Pλ(SE)
Brier        0.3333   0.1111   0.2778   0.2778
Hellinger    0.363    0.121    0.258    0.258
K-L Entropy  0.3525   0.1175   0.265    0.265
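The Kullback-Leibler row of this table can be recovered, approximately, by a direct grid search over the constrained posteriors (our own sketch; the (p, q) parametrization follows the table above):

```python
from math import log

# Judy's prior: 1/4 in each quadrant. The constraint P(N|W) = 3/4 leaves a
# posterior of the form (NW, SW, NE, SE) = (0.75p, 0.25p, q(1-p), (1-q)(1-p)).
prior = 0.25

def kl(post):
    return sum(x * log(x / prior) for x in post if x > 0)

steps = [i / 500 for i in range(1, 500)]
best, p_star, q_star = float("inf"), None, None
for p in steps:
    for q in steps:
        d = kl([0.75 * p, 0.25 * p, q * (1 - p), (1 - q) * (1 - p)])
        if d < best:
            best, p_star, q_star = d, p, q

# q stays at 1/2, but p drops below 1/2: on the minimum-change analysis,
# seeing the bear does carry information about Judy's east/west location.
assert abs(q_star - 0.5) < 0.01
assert p_star < 0.5
assert abs(0.75 * p_star - 0.3525) < 0.01   # close to the K-L row above
```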
While these are consonant with q = 1/2, none supports p = 1/2. Indeed, in every case Judy’s confidence that she is in the east increases when she sees the bear. Some, e.g., Grove and Halpern [1997], have argued that the failure to secure p = 1/2 shows that it is misguided to use divergence measures as a guide to updating. However, for all its a priori appeal, the p = 1/2 answer is incorrect. The intuitions in its favor rest on the claim that seeing a bear cannot convey any information to Judy about her east/west location. This is plainly wrong because it ignores the fact that learning a conditional probability almost always conveys information about underlying unconditional probabilities (the only exception being when the new value for the conditional probability is the same as the old). For example, as
van Fraassen noted, a learning experience that forces Pλ(N|W) to be near one also forces Pλ(SW) to be near zero. In the extreme case where Judy learns Pλ(N|W) = 1 (because she knows #(NW) > #(SW) = 0), the evidential impact of seeing a bear is straightforward: Judy learns she is not in the southwest, and her posterior is Pλ(·) = P(·|¬SW). This effect is not confined to extreme probabilities. For example, learning Pλ(N|W) = 4/5 provides Judy with information that forces Pλ(SW) below 1/5. It might seem that no such information can be conveyed when the prior for SW already falls below the upper bound that experience dictates, but this is not so. Even when P(SW) = 1/4 Judy acquires data about unconditional probabilities. She learns that there is an increment δ ∈ [0, 1/4] such that SW's probability must decrease by δ and NW's probability must increase by 1/2 − 3δ. While she does not learn δ's value, its existence tells Judy things about the unconditional probabilities of NW and SW, e.g., that at least one of them has to change and that Pλ(NW) must end up three times larger than Pλ(SW). An updating rule that alters the unconditional probabilities of NW or SW must treat these changes in the way Bayesians usually treat changes in unconditional probabilities. In particular, the rule should respect the Bayesian Updating Insight (BUI). If a proposition has its probability diminished as the result of experience then, subject to obeying whatever other constraints experience imposes, the lost probability should be distributed among the other propositions in proportion to their probabilities. For an example of an updating rule that respects BUI, focus on the increment δ.
If the decrease in SW's probability was due to Jeffrey conditioning, then the posterior could be specified by the following recipe: (a) assign each of the other regions a number, mR = P(R)/P(SW), equal to the factor by which its prior exceeds or falls short of P(SW); (b) normalize to obtain weighting coefficients wR = mR/(ΣR′≠SW mR′); and (c) apportion δ so that each region other than SW increases according to its weight, Pλ(R) = P(R) + δ·wR for R ≠ SW. This way of thinking about Jeffrey conditioning makes it feasible to process the new information Judy receives. When #(NW) = 3·#(SW) is known, seeing the bear will affect Judy's updating process in two related ways. In addition to requiring her posterior to satisfy Pλ(N|W) = 3/4, it also corrects her view of the relationship between the unconditional probabilities of NW and SW. Whereas Judy had thought mNW = 1, she now knows mNW = 3. In light of this information, Judy should choose a posterior that can be obtained from the prior using Jeffrey conditioning with values mNW = 3 and mNE = mSE = 1, and that satisfies Pλ(N|W) = 3/4. This process both respects BUI, subject to Judy's new views about the unconditional probabilities, and yields a determinate posterior. In fact, it yields Pλ(NW) = 1/3, Pλ(SW) = 1/9, Pλ(NE) = Pλ(SE) = 5/18, the Brier score's answer. For another process that satisfies BUI, imagine that Judy chooses a posterior for which Pλ(N|W) = 3/4 via Jeffrey conditioning using her initial values mNW = mNE = mSE = 1. This produces Pλ(SW) = 1/10, Pλ(NW) = Pλ(NE) =
Pλ(SE) = 3/10, so that all alternatives to SW receive an equal bump. Unlike the previous approach, which treats the data Pλ(NW) = 3·Pλ(SW) as an input to the conditioning rule, this approach treats it as a piece of information about SW alone, holding fixed the idea that NW, NE and SE are equiprobable. In effect, Judy is saying, "even though I now know that I am three times more likely to be in NW than in SW, and even though I have received no new evidence about the relative probabilities of SW, NE or SE, I am going to treat them all as equally probable for purposes of updating. So, I will satisfy the constraint by lowering SW's probability just the right amount to make it true that Pλ(¬SW) = 9·Pλ(SW)." The flaw in this reasoning is that it treats the new data Judy receives as being entirely about the distribution of probabilities over {SW, ¬SW}, but it concerns {NW, ¬NW} as well. There are other updating rules that satisfy BUI, but none yields p = 1/2. BUI requires every proposition in ¬SW
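The two BUI-respecting recipes discussed above can be verified with a short computation (a sketch; `bui_update` is a hypothetical helper that solves for the increment δ from the ratio constraint):

```python
# Judy's prior: each quadrant 1/4; constraint: P_post(NW) = 3 * P_post(SW).
# Each recipe lowers SW by delta and distributes delta among the other
# regions in proportion to weights m_R, then solves for delta.
def bui_update(m_nw, m_ne, m_se):
    total = m_nw + m_ne + m_se
    w_nw, w_ne, w_se = m_nw / total, m_ne / total, m_se / total
    # Solve 1/4 + delta*w_nw = 3*(1/4 - delta) for delta.
    delta = (1 / 2) / (3 + w_nw)
    return {"NW": 1 / 4 + delta * w_nw, "SW": 1 / 4 - delta,
            "NE": 1 / 4 + delta * w_ne, "SE": 1 / 4 + delta * w_se}

# Recipe 1: use the corrected weight m_NW = 3 (the Brier score's answer).
r1 = bui_update(3, 1, 1)
assert abs(r1["NW"] - 1 / 3) < 1e-9 and abs(r1["SW"] - 1 / 9) < 1e-9
assert abs(r1["NE"] - 5 / 18) < 1e-9

# Recipe 2: keep the initial equal weights m_NW = m_NE = m_SE = 1.
r2 = bui_update(1, 1, 1)
assert abs(r2["SW"] - 1 / 10) < 1e-9 and abs(r2["NW"] - 3 / 10) < 1e-9
```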
There is still much work to be done on generalizing Bayesian conditioning to learning situations in which the data does not fix probability assignments for propositions. The method of minimizing divergence seems fruitful, but the plethora of measures, and the lack of agreement among them, means that further efforts are required. A promising approach, taken above for the Brier score, is to use BUI to relate the posterior probabilities recommended by various divergence measures to more familiar updating rules.

4 SUBJECTIVE PROBABILITY AND OBJECTIVE CHANCE

Even if probabilities can be understood as rational degrees of confidence, and belief change can be modeled in full generality using Bayesian methods of updating, it remains to ask whether other interpretations of probability might also be legitimate, and how they might be encompassed within a Bayesian framework. As noted, there is a Bayesian tradition that allows for objective a priori probabilities, but the concept of probability can also be given an empirical gloss. Simplifying greatly and blurring distinctions, two main empirical interpretations of probability have been proposed: probabilities as relative frequencies and probabilities as single-case propensities. Frequentist views have been dominant for about 150 years, beginning with Venn [1866], and buttressed by seminal contributions from Von Mises [1939], Reichenbach [1948], Neyman [1950] and many others. Roughly, frequentism identifies the probability of an event with the frequency of occurrences of its general type within a reference class of similar events generated by a long run of repetitions of a well-structured random experiment. While there is disagreement about what the "long run" involves, the most plausible views identify probabilities with limits of relative frequencies in the infinitely long run. To see the idea, imagine an idealized experiment involving a random variable X with values in some finite set V = {v1, . .
., vK} that can be sampled indefinitely to yield a sequence of observations x1, x2, ..., xN for arbitrarily large N. In such a scenario frequentism says that the probability of X having the value v is, by definition, P(X = v) = limN→∞ nv/N, where nv is the number of trials among the first N on which X had value v. The probability exists if and only if this limit does.22 The propensity view interprets probabilities as measurements of the dispositions that experimental set-ups have to produce outcomes that exhibit stable frequencies. These are explicitly single-case probabilities. The propensity of a coin to 22 Some frequentists, e.g., Von Mises [1939] and Martin-Löf [1966], augment this account with a "randomness" requirement to ensure that the limiting operation does not conceal patterns that should be taken into account. If a coin tossing experiment turns out h, t, h, t, h, t, h, ..., it seems wrong to set the probability of heads at 1/2. It is clearly in the spirit of the frequency view that, as N increases, there should eventually come a point at which nv/N is almost as good an estimate of the outcome of the next coin toss as would be provided by any other plausible method. Frequentists need not specify when this will happen, only that it eventually will. Here a "plausible method" is a recursive function that maps initial sequences of heads and tails into {h, t}.
come up heads on a given toss is a physical property of the coin, the tossing apparatus, and the particular situation. It is independent of what happens in other coin tossings. It explains why we observe the patterns of frequencies we do, but it is not defined in terms of those patterns. There are a variety of theories about how single-case chances are determined, but we shall not pursue the matter.
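The frequentist's limiting relative frequency, discussed above, can be illustrated, though of course not established, by simulation (our own sketch with an assumed bias and seed):

```python
import random

# Simulate tosses of a coin with bias 0.5 toward heads and record the
# relative frequency n_h/N at a few checkpoints as N grows.
random.seed(42)

theta = 0.5
n_h = 0
estimates = {}
for n in range(1, 100_001):
    n_h += random.random() < theta
    if n in (100, 10_000, 100_000):
        estimates[n] = n_h / n

# The relative frequency typically settles near theta for large N; a finite
# simulation can only suggest, not demonstrate, the infinite limit.
assert abs(estimates[100_000] - theta) < 0.01
```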
4.1 De Finetti's Rejection of Chance
Bayesians have complicated relationships with objective probabilities. Some maintain that there is no such thing. In his [1974, p. x], Bruno de Finetti, perhaps the most strident anti-objectivist, writes PROBABILITY DOES NOT EXIST. The abandonment of superstitious beliefs about the existence of the Phlogiston, the Cosmic Ether, ...or Fairies and Witches was an essential step along the road to scientific thinking. Probability, too, if regarded as something endowed with some kind of objective existence, is no less a misleading misconception. De Finetti believes that we are led, mistakenly, to believe in objective chances by a kind of psychological projection fallacy in which we misidentify symmetries found in our subjective beliefs with objective features of the world. It takes a bit of apparatus to explain de Finetti's idea. Let X1, X2, X3, ... be an infinite sequence of random variables each with domain V = {v1, v2, ..., vK}, and let X∗ = V ∪ V² ∪ ... denote the set of all finite sequences drawn from V. Each x = ⟨x1, x2, ..., xN⟩ ∈ X∗ can be thought of as the result of an experiment that samples each of the first N variables. A reordering of x is an element xσ = ⟨xσ(1), xσ(2), ..., xσ(N)⟩ of X∗ that can be obtained by rearranging x's values using a permutation σ of {1, 2, ..., N}. Note that not every permutation yields a distinct reordering. In the binary case where V = {0, 1}, if x = ⟨1, 1, 0, 0⟩, then the permutation that swaps the even indices and the odd indices produces the same reordering as the one that swaps the first and last indices and the middle indices — both produce ⟨0, 0, 1, 1⟩. In general, if σ and τ are permutations and xσ(n) = xτ(n) for all n ≤ N, then xσ and xτ are the same reordering of x. By this criterion, the number of distinct reorderings of x is N!/(n1!·n2!·...·nK!) where n1 is the number of times v1 appears in x, n2 is the number of times v2 appears in x, and so on.
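The counting claim is easy to confirm by brute force (a sketch for the binary case):

```python
from itertools import permutations
from math import factorial

# Count the distinct reorderings of x = <1, 1, 0, 0> by enumeration and
# compare with the formula N!/(n_1! * ... * n_K!).
x = (1, 1, 0, 0)
distinct = set(permutations(x))

N = len(x)
n0, n1 = x.count(0), x.count(1)
assert len(distinct) == factorial(N) // (factorial(n0) * factorial(n1))  # 6
```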
In de Finetti's terminology, a probability P over X∗ is exchangeable when P(x) = P(xσ) for every reordering xσ of x. Example (IID Coin Tossing). If we assume that tosses of a coin are independent and that there is a fixed bias θ toward heads on any toss whatever the outcome of previous tosses, then, irrespective of the order in which heads and tails appear, the exchangeable probability of obtaining n heads and N − n tails is θ^n·(1 − θ)^(N−n).
Example (Sampling with Replacement). We sample randomly from an urn containing b black balls, w white balls and r red balls, and we always replace the drawn ball before the next trial. Then, independent of the order in which the various colors appear, the exchangeable probability of any sequence of j black, k white and N − (j + k) red is (b^j·w^k·r^(N−(j+k)))/(b + w + r)^N. In these examples the trials X1, X2, ... form an ordered sequence of independent, identically distributed (IID) random variables relative to P. This automatically makes P exchangeable because the probability of any given sequence of results from IID random variables is the product of the probabilities of the results taken individually. The converse does not hold: P can be exchangeable even when not IID. Example (Polya's Urn). Imagine an urn of unlimited capacity that initially holds b0 black balls (color 0) and w0 white balls (color 1). The proportion of balls in the urn is altered via an iterative process that, at each stage n, involves randomly drawing a ball, observing its color, and returning the ball to the urn with another of the same color, so that
• If the nth ball is black, then bn+1 = bn + 1, wn+1 = wn.
• If the nth ball is white, then bn+1 = bn, wn+1 = wn + 1.
If we begin with five blacks and three whites, and draw 0, 1, 1, 0, 1, the contents of the urn will be
(b0 = 5, w0 = 3), initially
(b1 = 6, w1 = 3), after black
(b2 = 6, w2 = 4), after black, white
(b3 = 6, w3 = 5), after black, white, white
(b4 = 7, w4 = 5), after black, white, white, black
(b5 = 7, w5 = 6), after black, white, white, black, white
The probability of making this precise ordered sequence of draws is
P(0, 1, 1, 0, 1) = [b0/(b0 + w0)]·[w0/(b0 + w0 + 1)]·[(w0 + 1)/(b0 + w0 + 2)]·[(b0 + 1)/(b0 + w0 + 3)]·[(w0 + 2)/(b0 + w0 + 4)] = 5/8 · 3/9 · 4/10 · 6/11 · 5/12 = 5/264
Permuting the order changes the calculation but not the result, e.g., P(1, 0, 0, 1, 1) = 3/8 · 5/9 · 6/10 · 4/11 · 5/12 = 5/264. More generally, the probability of any sequence of B blacks and W whites, taken in any order, is

[(b0 + B − 1)!/(b0 − 1)!]·[(w0 + W − 1)!/(w0 − 1)!] / [(b0 + w0 + B + W − 1)!/(b0 + w0 − 1)!]
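Both the exchangeability of the Polya urn and the closed form above can be checked numerically (a sketch; `polya_prob` and `closed_form` are our own illustrative helpers):

```python
from itertools import permutations
from math import factorial

def polya_prob(draws, b0=5, w0=3):
    """Probability of an exact ordered sequence of draws (0 = black,
    1 = white) from a Polya urn starting with b0 black, w0 white."""
    b, w, prob = b0, w0, 1.0
    for d in draws:
        prob *= (w if d else b) / (b + w)
        if d:
            w += 1
        else:
            b += 1
    return prob

# Exchangeability: every reordering of <0,1,1,0,1> has probability 5/264.
for perm in set(permutations((0, 1, 1, 0, 1))):
    assert abs(polya_prob(perm) - 5 / 264) < 1e-12

# The closed form for B blacks and W whites agrees.
def closed_form(B, W, b0=5, w0=3):
    rise = lambda a, k: factorial(a + k - 1) // factorial(a - 1)
    return rise(b0, B) * rise(w0, W) / rise(b0 + w0, B + W)

assert abs(polya_prob((0, 1, 1, 0, 1)) - closed_form(2, 3)) < 1e-12
```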
So, the probability distribution generated by a Polya urn is exchangeable. It is not IID, however, since the probability of drawing, say, a black ball on the fourth trial depends on how many blacks were drawn on the first three trials. For an example of a non-exchangeable process consider the following variant of the Polya urn. Example (Polya Urn with Asymmetrical Replacement). Imagine a process like the Polya urn (b0 = 5, w0 = 3) except with a bias toward white, so that
• If the nth ball is black, then bn+1 = bn + 1, wn+1 = wn.
• If the nth ball is white, then bn+1 = bn, wn+1 = wn + 5.
Exchangeability fails since, e.g.,
P(0, 0, 1, 1) = 5/8 · 6/9 · 3/10 · 8/15 = 1/15
P(1, 1, 0, 0) = 3/8 · 8/13 · 5/18 · 6/19 = 5/247
Markov processes provide another instructive example of non-exchangeability. Example (Markov Urns). Begin drawing, with replacement, from a black urn that contains 6 black balls and 4 white balls. Continue drawing from the black urn until a white ball is selected. Then begin drawing, with replacement, from a white urn that contains 2 black balls and 8 white balls. Continue drawing from the white urn until a black ball is selected. Then repeat the whole procedure. This is a Markov process with transition probabilities:
P(black on draw n + 1 | black on draw n) = 3/5
P(black on draw n + 1 | white on draw n) = 1/5
Exchangeability fails since, e.g., P(0, 0, 1) = 3/5 · 3/5 · 2/5 = 18/125 and P(0, 1, 0) = 3/5 · 2/5 · 1/5 = 6/125. Though Markov processes are not generally exchangeable, they can possess a property much like exchangeability. To see the idea consider that
P(0, 0, 1, 0, 1) = 3/5 · 3/5 · 2/5 · 1/5 · 2/5 = 36/3125
P(0, 1, 0, 0, 1) = 3/5 · 2/5 · 1/5 · 3/5 · 2/5 = 36/3125
It is no accident that these probabilities agree: (a) both sequences start out with black, which contributes a factor of 3/5 to the probability; (b) each contains two switches from black to white, each of which contributes a factor of 2/5; and (c) each contains one switch from white to black, which contributes a factor of 1/5. Indeed, take a sequence of length N that starts with a black, and count the transitions of each type: i black-to-white transitions, j white-to-black transitions (where j = i or i − 1), b black-to-black transitions, and w white-to-white transitions, so that i + j + b + w = N − 1. The sequence's probability is then 3/5 · (2/5)^i · (1/5)^j · (3/5)^b · (4/5)^w. So, all that matters to the probability of a sequence is its first element and the number of transitions of each type. This suggests a fruitful chain of definitions, following Diaconis and Freedman [1980]. Given a sequence x ∈ X^N and distinct vi, vj ∈ V, let τx(vi, vj) be the number of vi-to-vj transitions in x, i.e., the number of indices m such that xm = vi and xm+1 = vj. Say that two sequences x, y ∈ X^N are similar when they have the same first element, so x1 = y1, and the same transition numbers, so that τx(vi, vj) = τy(vi, vj) for all vi, vj. P is partially exchangeable when each pair of similar sequences has the same probability. Every exchangeable distribution is partially exchangeable since similar sequences can always be obtained from one another via reordering. But not every partially exchangeable distribution is exchangeable, since not every reordering preserves initial elements and transition numbers. It is illuminating to interpret these concepts in terms of sufficient statistics. A necessary and sufficient condition for P to be exchangeable is that the frequencies of occurrence of the sample values constitute a sufficient statistic. Given x = x1, . . ., xN and vk ∈ V, let Fx(vk) = nk/N be the frequency with which vk appears among x's elements. To say that these frequencies are sufficient statistics means that any extension x+ = x1, . . ., xN, . . ., xN+M of x satisfies P(x+ | x) = P(x+ | Fx(v1), Fx(v2), . . ., Fx(vK), N) = P(x+ | n1, n2, . . ., nK). So, if you know x's length and the frequencies with which the various possible v-values appear in x, then further information about the sequence (e.g., information about the order in which the vi appear) is immaterial to any conclusions you might draw about future trials.
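Partial exchangeability for the Markov urn can likewise be verified directly. In this illustrative sketch (not from the text), the sufficient statistic is a sequence's first element together with its matrix of transition counts; similar sequences come out equiprobable. The initial probability of 2/5 for starting with white is an assumption made here for completeness — the text's examples only use the initial probability 3/5 for black:

```python
from fractions import Fraction

# Transition probabilities for the Markov urn (0 = black, 1 = white):
# P(black|black) = 3/5, P(white|black) = 2/5, P(black|white) = 1/5, P(white|white) = 4/5.
T = {(0, 0): Fraction(3, 5), (0, 1): Fraction(2, 5),
     (1, 0): Fraction(1, 5), (1, 1): Fraction(4, 5)}
INIT = {0: Fraction(3, 5), 1: Fraction(2, 5)}  # assumed initial distribution

def markov_prob(seq):
    """Probability of a sequence under the Markov urn process."""
    p = INIT[seq[0]]
    for a, b in zip(seq, seq[1:]):
        p *= T[(a, b)]
    return p

def statistic(seq):
    """Sufficient statistic for partial exchangeability:
    first element plus the transition counts."""
    counts = {k: 0 for k in T}
    for a, b in zip(seq, seq[1:]):
        counts[(a, b)] += 1
    return seq[0], tuple(sorted(counts.items()))

# Similar sequences (same first element, same transition counts) are equiprobable:
x, y = [0, 0, 1, 0, 1], [0, 1, 0, 0, 1]
assert statistic(x) == statistic(y)
print(markov_prob(x), markov_prob(y))  # both 36/3125
```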
For instance, knowing the frequencies of black and white balls that have appeared in the first N draws from a Polya urn suffices to establish a definite probability for the next draw even if you are ignorant of the order in which the colors appeared. If, say, you know that the urn was set up so that (b0 = 5, w0 = 3) and are told that 60% black balls and 40% white balls came up in the first 100 draws, then you can deduce (b100 = 65, w100 = 43) and conclude that the probability of the 101st draw being black and the 102nd draw being white is 65/108 · 43/109. The connection between exchangeable sequences and frequencies is quite fruitful. Consider the following question: given non-negative natural numbers n1, n2, . . ., nK that sum to N, how likely is it that the frequencies in the first N trials will be F(v1) = n1/N, F(v2) = n2/N, . . ., F(vK) = nK/N? When P is exchangeable there is an easy answer. Choose an x ∈ X^N with the desired frequency profile. x will have a definite probability, P(x), which it shares with all its reorderings. Since there are N!/(n1! · n2! · . . . · nK!) reorderings of x, and since these reorderings are exactly the sequences of length N with x's frequency profile, it follows that P({y ∈ X^N : Fy(vk) = nk/N for k = 1, 2, . . ., K}) = P(x) · N!/(n1! · n2! · . . . · nK!). One can think of this as a probability over frequency profiles for sequences of length N, so that P^N(n1/N, . . ., nK/N) = P(x) · N!/(n1! · n2! · . . . · nK!) is the
probability, according to P, of observing a sequence of length N with the specified profile.

Example. In five trials from a Polya urn with (b0 = 5, w0 = 3), what is the probability of observing 2/5 black and 3/5 white? Answer: the probability of any single sequence of two blacks and three whites is 5/264. Since there are 5!/(3! · 2!) = 10 reorderings of such a sequence, P^5(2/5, 3/5) = 50/264 ≈ 0.19.

Things get even more interesting when we consider P^N's cumulative distribution function. For each p ∈ [0, 1], this gives the probability of obtaining a sequence of length N with a frequency of black balls that does not exceed p, P^N(F(black) ≤ p). Here are some values:

P^N(F(black) ≤ p)

p      N = 5   N = 10   N = 20    N = 50    N = 100   N = ω
0.0    0.026   0.003    3·10^−4   5·10^−6   2·10^−7   0
0.1    0.026   0.018    0.005     0.001     5·10^−4   1.77·10^−4
0.2    0.121   0.052    0.023     0.011     0.007     4.67·10^−3
0.3    0.121   0.117    0.071     0.045     0.037     0.0288
0.4    0.310   0.218    0.160     0.123     0.110     0.0963
0.5    0.310   0.354    0.298     0.255     0.242     0.2266
0.6    0.576   0.516    0.475     0.444     0.432     0.420
0.7    0.576   0.686    0.668     0.656     0.652     0.647
0.8    0.841   0.838    0.841     0.846     0.849     0.852
0.9    0.841   0.949    0.958     0.967     0.970     0.974
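The example and the finite-N columns of the table can be reproduced exactly. Here is a sketch (not from the text; standard Python only) that exploits exchangeability: every ordering with the same color counts has the same probability, so P^N is that probability times a multinomial coefficient:

```python
from fractions import Fraction
from math import comb

def polya_seq_prob(n_black, n_white, b=5, w=3):
    """Probability of any one ordering with the given color counts, drawn from
    a standard Polya urn started at (b, w); by exchangeability every ordering
    with these counts is equiprobable."""
    num = 1
    for k in range(n_black):
        num *= b + k
    for k in range(n_white):
        num *= w + k
    den = 1
    for k in range(n_black + n_white):
        den *= b + w + k
    return Fraction(num, den)

def freq_prob(n_black, N, b=5, w=3):
    """P^N: probability of exactly n_black black draws in N trials."""
    return comb(N, n_black) * polya_seq_prob(n_black, N - n_black, b, w)

def cdf(p, N, b=5, w=3):
    """P^N(F(black) <= p), the cumulative distribution tabulated above."""
    return float(sum(freq_prob(n, N, b, w) for n in range(N + 1)
                     if Fraction(n, N) <= p))

print(freq_prob(2, 5))                   # 25/132 (= 50/264, about 0.19)
print(round(cdf(Fraction(1, 5), 5), 3))  # 0.121, the N = 5, p = 0.2 table entry
```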
A clear pattern of convergence can be discerned. Indeed, one can show that, as the length of the sequence increases, the cumulative distribution for P^N converges to the bottom row. This is the cumulative distribution of the beta density β(5,3)(θ) = θ^4 · (1 − θ)^2 / ∫0^1 t^4 · (1 − t)^2 dt. To appreciate what this means, think of θ as the limiting frequency of an infinitely long sequence of draws from a (b0 = 5, w0 = 3) Polya urn. P^ω(θ) = β(5,3)(θ) is then the probability that, in the infinitely long run, sampling from the urn will yield a sequence that has θ as its limiting frequency for black. This result generalizes to Polya urns with arbitrary initial contents (b0 = b, w0 = w). Here taking a limit of P^N produces the beta density β(b,w)(θ) = θ^(b−1) · (1 − θ)^(w−1) / ∫0^1 t^(b−1) · (1 − t)^(w−1) dt as the probability for obtaining a countably infinite sequence of draws that has θ as its limiting frequency for black. De Finetti [1937] proved that something similar holds for exchangeable probabilities on binary random variables, and Hewitt and Savage [1955] extended this result to the more general setting considered here.

De Finetti's Representation Theorem. Suppose that P is exchangeable over the sequence of random variables X1, X2, . . ., each of which has values in v1, . . ., vK, and let Θ = {⟨θ1, θ2, . . ., θK⟩ ∈ [0, 1]^K : θ1 + . . . + θK = 1} be the set of all probability distributions over v1, v2, . . ., vK. Then,

(A) There is a unique probability P^ω over Θ that can be obtained as the limit of probabilities for frequencies, P^ω(θ1, . . ., θK) = lim N→∞ P^N(n1/N, . . ., nK/N), where P^N is the probability over frequency profiles for sequences of length N.

(B) If P(x | θ) is the probability of obtaining x ∈ X^N as the initial segment of an infinite sequence with frequency profile θ ∈ Θ, then P(x | θ) = θ1^n1 · θ2^n2 · . . . · θK^nK, where n1, . . ., nK are the number of times, respectively, that v1, . . ., vK appear in x.

(C) In light of (B), the unconditional probability of any x ∈ X^N can be expressed as a mixture of the IID probabilities, so that P(x) = ∫Θ θ1^n1 · θ2^n2 · . . . · θK^nK · P^ω(θ) dθ.

This highlights the deep connection between exchangeability and limiting frequencies. It says that having an exchangeable subjective probability over X1, X2, . . . involves having determinate expectations about the probabilities with which various limiting frequencies are likely to occur (assuming sampling to infinity), and that, conditional on one of these expectations, the subjective probability for each finite sequence is as it would be if that sequence were generated by an IID process. Partial exchangeability relates to Markov processes in a similar way. The sufficient statistic for partial exchangeability is the pair ⟨x1, [τ(vi, vj)]⟩ consisting of a sequence's first entry and the K × K matrix of its transition numbers. Just as there is a limiting cumulative distribution over long-run frequencies for exchangeable random variables, there is likewise a limiting cumulative distribution over ⟨x1, [τ(vi, vj)]⟩ in the partially exchangeable case. This cumulative distribution yields a probability density over the set of all Markov processes that specify an initial probability distribution θ = ⟨θ1, θ2, . . ., θK⟩ for X1 and a matrix θT = [θ(vi, vj)] of transition probabilities. Then, if Θ* is the set of all pairs ⟨θ, θT⟩, we obtain the following generalization of de Finetti's theorem:

Theorem (Diaconis and Freedman [1980]). Let P be partially exchangeable over a sequence of random variables X1, X2, . . . that is recurrent in the sense that each vk in v1, . . ., vK appears infinitely often among the Xn. Then,

(A) There is a unique probability P^ω over Θ* that can be obtained as the limit of P^N's probabilities for initial frequencies and transition numbers.

(B) If P(x | θ, θT) is the probability that x ∈ X^N is the initial segment of an infinite sequence with Markov profile ⟨θ, θT⟩, then P(x | θ, θT) = θ(x1) · θ(x1, x2) · θ(x2, x3) · . . . · θ(xN−1, xN).

(C) In light of (B), the unconditional probability of any x ∈ X^N can be expressed as a mixture of the Markov probabilities, so that P(x) = ∫Θ* θ(x1) · θ(x1, x2) · . . . · θ(xN−1, xN) · P^ω(θ, θT) d⟨θ, θT⟩.
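For the Polya urn, the mixture in (C) can be computed in closed form: with the beta mixing density β(b,w), the integral ∫ θ^n1 · (1 − θ)^n2 · β(b,w)(θ) dθ equals B(n1 + b, n2 + w)/B(b, w), and this coincides exactly with the urn's sequence probability. The following sketch (not from the text; integer-argument beta function via factorials) checks the identity:

```python
from fractions import Fraction
from math import factorial

def beta_fn(a, b):
    """Beta function B(a, b) for positive integer arguments."""
    return Fraction(factorial(a - 1) * factorial(b - 1), factorial(a + b - 1))

def polya_prob(n_black, n_white, b=5, w=3):
    """Probability of one particular Polya-urn sequence with the given counts."""
    num = 1
    for k in range(n_black):
        num *= b + k
    for k in range(n_white):
        num *= w + k
    den = 1
    for k in range(n_black + n_white):
        den *= b + w + k
    return Fraction(num, den)

def mixture_prob(n_black, n_white, b=5, w=3):
    """The (C)-style mixture of IID probabilities under the beta density:
    integral of theta^n1 * (1-theta)^n2 * beta(b,w)(theta) = B(n1+b, n2+w)/B(b, w)."""
    return beta_fn(n_black + b, n_white + w) / beta_fn(b, w)

# The urn probability and the beta mixture of IID probabilities coincide exactly:
for n1, n2 in [(2, 3), (4, 1), (0, 6), (10, 0)]:
    assert polya_prob(n1, n2) == mixture_prob(n1, n2)
print(polya_prob(2, 3))  # 5/264, the sequence probability from the earlier example
```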
The Development of Subjective Bayesianism
465
This highlights the deep connection between partial exchangeability and Markov processes. It says that having a partially exchangeable subjective probability over X1, X2, . . . involves having determinate expectations about initial elements and transition frequencies (assuming sampling continues to infinity with recurrence), and that conditional on these expectations the subjective probability for a finite sequence is as it would be if that sequence were generated by a Markov process. Friends of objective probability might take comfort in these results. Those favorably inclined toward limit-of-frequency interpretations will construe P^ω(θ) as a subject's estimate of the probability that, in the infinitely long run, sampling from X1, X2, . . . will yield a sequence that has θ1 as its limiting frequency for v1, θ2 as its limiting frequency for v2, and so on. Similarly, under conditions of partial exchangeability and recurrence, P^ω(θ, θT) will be her subjective probability that, in the infinitely long run, sampling from X1, X2, . . . will yield a sequence with the given initial element and transition frequencies. Proponents of propensity views will view P(x | θ) and P(x | θ, θT) as single-case probabilities produced by an underlying IID or Markov causal process. Either way, it might seem, de Finetti's theorem can be read as saying that, under conditions of exchangeability or partial exchangeability, a person's subjective probabilities are her subjective expectations of objective chances. This makes it appear as if objective chances are required to explain why subjective probabilities have the exchangeability properties they do. De Finetti draws the opposite moral. Instead of seeing exchangeability judgments as requiring explanation in terms of objective chances, he regards them as the bedrock phenomena.
It can make sense, he maintains, for a person to view a process as exchangeable or partially exchangeable even when she denies that it is governed by objective chances. The probabilities P(x | θ) and P(x | θ, θT) will still exist, of course, but instead of reflecting objective chances they display the subject's personal opinion that facts about the past ordering of outcomes, or about the positions at which transitions among outcomes occur, are irrelevant to questions about future outcomes. For example, if you treat a coin tossing process as exchangeable – i.e., if you judge that the past order of heads and tails provides no relevant information about future heads or tails – then de Finetti's theorem shows that your inferences from the data will be identical to those of a person who thinks the coin has a determinate, perhaps unknown, objective chance of landing heads. Even so, there is nothing in your inductive practices that requires you to invoke objective chances: your exchangeability judgments do all the work. While you might act as if you believe that the coin has some objective chance of landing heads, the moral of de Finetti's theorem is that this attitude can be based on judgments of exchangeability alone. Indeed, de Finetti would stress, the value of P(x) on the left of the (C) equations is not derived from the chance estimate on the right. Rather, P(x) comes directly from the exchangeable probability on X^N, and P(x | θ) and P(x | θ, θT) are constructed, post hoc, to make the equations work out. The invocation of objective chance is a third wheel: all inductive inferences drawn on the basis of an exchangeable subjective probability can, in principle, be explained without chances.
466
James M. Joyce
De Finetti draws a radical conclusion. He maintains that the idea of objective chances is a kind of illusion that arises when we project our subjective beliefs onto the world. It seems like there are objective chances, he says, only because we so often reason as we would if chances existed, but this reasoning is based solely on exchangeability judgments. When we are faced with a situation in which various aspects of the order of past outcomes seem immaterial to the probability of future events, we illegitimately reify some chance property to explain our inductive tendencies, in much the same way that ancient peoples invoked "fairies and witches" to explain coincidences in nature. But the objective facts actually concern not chances but our personal tendencies to draw certain sorts of inductive conclusions from specified data. Instead of saying the weaker, true thing that the frequency of outcomes in past trials is a sufficient statistic for inferences about future trials, we say the stronger, false thing that the data is being generated by an IID process governed by objective chances. Objective chance, according to de Finetti, is a man-made chimera.
4.2 The Principal Principle

Some Bayesians are more tolerant of objective chances. David Lewis, for example, writes [1980, p. 83]:

We subjectivists conceive of probability as the measure of reasonable partial belief. But we need not make war against other conceptions of probability, declaring that where subjective credence leaves off, there nonsense begins. Along with subjective credence we should believe in objective chance. The practice and analysis of science requires both concepts. Neither can replace the other.

So, says Lewis, subjective probability and objective chance can peacefully coexist. The challenge for Bayesianism is not to show how chance can be eliminated, but to formulate chance/credence principles (CCPs) that explain how information about chances should affect credences. Lewis proposes a simple CCP, the Principal Principle, which he hoped would characterize the conditions under which credences and chances should align, but things turned out to be far more complicated than he first thought. To appreciate the complexities, imagine an agent who has a subjective probability Pt at each time t, and who assigns credences to propositions, Cht(A) = p, which say that the chance at t of event A occurring is p.23 On a first pass, we might try to formulate a CCP as follows:

CCP (Incorrect). If Pt(Cht(A) = p) > 0, then Pt(A | Cht(A) = p) = p. More generally, a subject's unconditional probability for A should be her expectation of its objective chance, Pt(A) = ∫0^1 Pt(Cht(A) = p) · p dp.

23 Here "Cht(A)" is a non-rigid designator — "the chance of A at t, whatever it is" — and p is a determinate real number in [0, 1].
The trouble with this, as Lewis recognized, is that evidence about chances can be rendered moot by facts about later chances or by direct evidence of A's truth-value.

Example. A coin was tossed ten times yesterday. You know that it was either biased 2:1 for heads or 3:1 for tails, and you think these are equally likely. You need to estimate the probability of the proposition, A, that the sixth toss was a head. If this is all you know, then it is reasonable to set your credence in A at your expectation of yesterday's chances, in which case Pt(A) = 11/24. But if you then learn that nine of the ten tosses were heads, it is crazy to set Pt+1(A) = 11/24. The right answer is near 9/10, an answer you cannot attain as a mixture of the chances.

Lewis [1980, p. 92] calls a proposition B admissible with respect to Cht(A) = p when it is "the sort of information whose impact on credences. . . comes entirely by way of chances." B is inadmissible when it conveys information about A that cannot be accounted for by changes in A's chance distribution conditional on B, in which case Pt(A | Cht(A) = p ∧ B) ≠ p for some p. To repair our CCP we need to explain the substantive conditions under which information is and is not admissible. There are some clear first steps:

• Current chances screen off the past. If B is entirely about the state of the world prior to t, then B is admissible with respect to Cht(A) = p.

• Later chances screen off earlier chances. If B entails that A's chance is anything other than p at time t or later, then B is inadmissible with respect to Cht(A) = p. So, any proposition Chs(A) = q with s > t and q ≠ p is inadmissible with respect to Cht(A) = p, and P(A | Cht(A) = p ∧ Chs(A) = q) = q.

• Truth screens off everything. If B entails A, then B is inadmissible with respect to Cht(A) = p < 1, and P(A | Cht(A) = p ∧ B) = 1.

In his initial paper Lewis made these basic observations and proposed the following:

Principal Principle.
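The arithmetic of the coin example can be checked by brute force. The sketch below (not from the text) computes the prior expectation of chance, 11/24, and then conditions on the inadmissible evidence that nine of the ten tosses were heads; by symmetry the posterior probability that the sixth toss was a head is exactly 9/10, above any mixture of the two possible chances:

```python
from fractions import Fraction
from itertools import product

biases = [Fraction(2, 3), Fraction(1, 4)]   # chance of heads under each hypothesis
prior = Fraction(1, 2)                      # each bias hypothesis equally likely

def seq_prob(seq, p):
    """IID probability of a heads(1)/tails(0) sequence given bias p."""
    q = Fraction(1)
    for t in seq:
        q *= p if t == 1 else 1 - p
    return q

# Unconditional credence that toss 6 is heads: expectation of chance.
print(sum(prior * p for p in biases))  # 11/24

# Condition on the evidence that exactly nine of the ten tosses were heads.
num = den = Fraction(0)
for seq in product([0, 1], repeat=10):
    if sum(seq) == 9:
        pr = sum(prior * seq_prob(seq, p) for p in biases)
        den += pr
        if seq[5] == 1:    # toss 6 (index 5) is heads
            num += pr
print(num / den)  # 9/10
```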
Let Pt be any "reasonable" initial credence function with Pt(Cht(A) = p) > 0. If B is admissible for Cht(A) = p, then Pt(A | Cht(A) = p ∧ B) = p and, more generally, Pt(A | B) = ∫0^1 Pt(Cht(A) = p | B) · p dp.

Of course, this merely gives a name to the problem since, absent any substantive theory of admissibility, the Principle says little. Lewis initially thought that the incomplete theory of inadmissibility just sketched would suffice for most purposes, but it turned out that, on Lewis's favored theory of chance, many statements about current chances are inadmissible relative to the current chances. The reasons for
this are unimportant, but subsequent struggles with the notion of admissibility made it clear that much remained to be done before an acceptable CCP would be forthcoming. One sticking point is Lewis's focus on admissibility for propositions rather than for credence functions. He sometimes speaks as if the Principal Principle can be applied when Pt is any reasonable (= probabilistically coherent) credence function, but this cannot be right. If B contains inadmissible information, then any Pt that assigns it a high probability, even one short of certainty, would need to be ruled out. In light of this, it can be tempting to think that Pt should be restricted to "pure" priors that either contain no information about A or can be derived from such priors via learning experiences that do not bring in inadmissible information.24 Unfortunately, there is no such thing as a pure prior in this context. Any prior that assigns unconditional probabilities to statements about A's chances (a prerequisite for this whole discussion) is thereby taking some stand on A's truth-value, and it is not clear what it could mean to say that one of these stands is evidentially neutral. So, we need a distinction between those credence functions that encode information, either as certainties or not, that makes it reasonable to align credences with chances and those that do not. With this in mind, we can reformulate Lewis's principle thus:

Principal Principle. If Pt(Cht(A) = p) > 0 and if Pt does not encode any information that is inadmissible with respect to Cht(A) = p, then Pt(A | Cht(A) = p) = p and Pt(A) = ∫0^1 Pt(Cht(A) = p) · p dp.

We still need a substantive admissible/inadmissible distinction if this is to be useful, but at least we are now distinguishing the right items. One might wonder, however, whether a substantive distinction is really necessary. Perhaps we can do without it if we adjust our views about the relationship between credence and chance.
Such an approach has been suggested by Lewis [1994], Hall [1994] and Thau [1994]; Strevens [1995] and Meacham [2005] make similar proposals. The basic idea is that we can eliminate the need for any substantive theory of admissibility by recognizing that a believer should align her credences with known chances only when these chances incorporate all the information the believer possesses. When the Principal Principle fails, it is always because the credence function encodes information not found in the chance distribution. A vivid, if unrealistic, example is provided by the "crystal ball" cases of Hall [1994].

24 Meacham [2005], for example, speaks of a "hypothetical prior" that gives the right credences to hold before the receipt of any evidence about A.

Example. Assume the same set-up as in the previous example, except that the coin will be tossed tomorrow. Before making your estimate you consult a soothsayer you believe to be reliable. She tells you that nine of the ten tosses will land heads, an unlikely event whichever way the coin is biased. Taking her word, you end up assigning Pt(90% heads)
= 1, but Cht(90% heads) is either 10 · (2/3)^9 · (1/3) or 10 · (1/4)^9 · (3/4), both very small numbers. So, you possess information that the chance distribution lacks, and you would be unwise to align your credences with expected chances, since doing so prevents you from believing A to degree 0.9 or indeed to any degree greater than 10 · (2/3)^9 · (1/3) ≈ 0.087. Lewis, Hall and Thau suggest that one ought to align one's credences with the chances conditional on the extra information one possesses. So, the chance/credence principle should be expressed thus:

New Principle. If E expresses every relevant item of data that a believer has concerning A's truth or falsity, and if Pt(Cht(A | E) = p) > 0, then Pt(A | Cht(A | E) = p ∧ E) = p and, more generally, Pt(A | E) = ∫0^1 Pt(Cht(A | E) = p | E) · p dp.

We no longer need a substantive theory of admissibility, since any "inadmissible" information B will be incorporated into E and so into the chance distribution: B is always admissible for Cht(A | E) = p when E entails B. Alternatively, as Strevens [1995] observes, one can simply define B as inadmissible when it alters the chances, i.e., when Cht(A | B) ≠ Cht(A). The New Principle and the Principal Principle then end up being equivalent for admissible evidence, and only the New Principle applies when a believer has inadmissible evidence. It is instructive to think about the Principal Principle and the New Principle using the theory of "expert" probabilities developed in Gaifman [1986]. An epistemic expert is a probabilistic information source to which a believer defers by aligning her credences with the source's probabilities, to the extent that she can discern what those probabilities are.
More exactly, say that a believer with a subjective probability P treats another probability Q as an epistemic expert with respect to the propositions in A exactly when P(A | Q(A) = a) = a for all a ∈ [0, 1] and A ∈ A.25

25 Typically, a person's deferential attitudes toward an information source depend on both the content of the proposition in question and the probability assigned.

Truth, the probability that assigns one to all truths and zero to all falsehoods, must be accorded expert status by any coherent credence function. Other commonly alleged experts include Chance, Epistemic Probability and Physical Probability (derived, e.g., from quantum mechanics or statistical mechanics). There is a pecking order among experts. While a coherent believer will always expect all her experts to agree with one another, in the technical sense that Exp(Q(A) − Q*(A)) = 0, this allows for the possibility that experts sometimes disagree in fact. When this happens, the pronouncements of some experts nullify those of others. Truth sits atop the heap. As a consequence of the laws of probability, it must receive complete deference in all circumstances from all other experts. As we have seen, later chances merit more deference than earlier ones, so later chances are higher in the pecking order. Likewise, current chances trump any expert whose probabilities are based solely on information about the past. In this guise, the problem
of admissibility reemerges as the problem of determining which experts trump Chance. In considering this, it is useful to follow Hall [2004] in distinguishing two sorts of epistemic experts. A data-base expert deserves deference owing to its superior knowledge. P defers to Q (solely) as a data-base expert when there is a random variable X such that (a) P is uncertain about X's value, (b) P knows that Q is certain about X's value, and (c) knowing X's value would make Q's views irrelevant to P. Here, P's deference to Q does not require P to admire Q's reasoning or insight: she defers to Q simply because she believes Q knows more than she does. In other scenarios, P's lack of deference toward Q might be based entirely on her perception that Q is missing some key datum. Here, (a*) P is certain about X's value, (b*) P knows that Q is uncertain about X's value, but (c*) P would defer to Q if Q knew the truth about X. Hall calls such a Q an analyst expert. Here P respects Q's general reasoning ability rather than her information. When P's total evidence can be expressed by a proposition Etot, Q is an analyst expert in re A for P when P(A | Q(A | Etot) = a) = a for all a. An epistemic expert might be a data-base expert or an analyst expert or some mixture of the two. Truth is the archetypal data-base expert. No great reasoner, Truth merely consults the premises at its disposal (i.e., every truth) and asserts the conclusions it finds. Our deference is due entirely to Truth's wonderful premise set. In contrast, some authors, e.g., Williamson [2000, ch. 10], believe in a kind of "epistemic or evidential" probability that serves as a pure analyst expert. Some versions of objective Bayesianism can be interpreted as attempts to cash out this notion.
Jaynes, for example, is committed to the following: if QME is the prior that results from applying MaxEnt to the evidential constraints of an inductive problem, then the right credences to adopt in that problem are always the MaxEnt prior conditioned on subsequent evidence, so P(A | QME(A | Etot) = a) = a. This is just to see the MaxEnt prior as a flawless inductive reasoner. It should be clear that the New Principle captures the way in which we would defer to Chance as an analyst expert, whereas the Principal Principle is appropriate if our deference is grounded in "data-base" considerations. The puzzle is where to place Chance on the analyst/data-base spectrum. Should we seek to align credences with known chances because Chance encodes information we lack, or because it attains a level of inductive reasoning we could not hope to match, or is it a little bit of both? One's answer to these questions will reflect one's views about how to approach the problem of admissibility, and how to formulate the chance/credence principle. At one end, we have Hall, who writes [2004, p. 101] that "chance is an analyst-expert. . . this claim holds for chances at any time, and without qualification" (i.e., for any proposition A for which Ch(A | E) is defined). Hall's proposal, then, is this:

Chance as Analyst Expert. Pt(A | Cht(A | E) = p ∧ E) = p for any body of evidence E with Pt(Cht(A | E) = p ∧ E) > 0.
This portrays chance as an ideal inductive reasoner: she might not know much, but give her your data and she will produce a probability to which you should defer. This, as Hall sees things, provides a rationale for the New Principle. There is no doubt that "analyst expert" considerations play a crucial role in explaining our deference to Chance. However, as Joyce [2007] emphasizes, this cannot be the whole story, since there are "data-base" considerations in play as well. The really remarkable thing about Chance is that current chances screen off past facts. When one knows that an argon-41 atom has a half-life of 109.6 minutes, one should assign a credence of one-half to its decaying in the next 109.6 minutes whatever one knows about the past history of the world. One can best explain this by supposing that the probabilities that actually realize Chance encode far more information than any believer could have about the past and present. Chances have this property because they are realized by physical probabilities, like those found in quantum mechanics or statistical mechanics, which serve as "summary statistics" that measure the causal tendency of the current state of the world to produce future effects. According to Joyce, our deference to such physical probabilities rests ultimately on our views about the causal structure of the world, and about the restrictions that this structure places on our ability to acquire evidence.
In particular, given the (contingent) facts that (a) the past only causally influences the future by influencing the present, (b) our ability to predict (nearly) all contingent future events is based entirely on evidence about the present causes of those events, and (c) physical probabilities encode all the evidence about the present causes of future events that any human being could possibly have at present (and lots more), it follows that believing at any time t that the time-t physical probability of some event is p involves thinking that no additional information that could be acquired by a believer at t can undermine the chance assignment.26 The physical probabilities that realize Chance are thus data-base experts for us with respect to all questions about future events with current causes. This alters our view of the Principal Principle. While there are still inadmissible propositions and credence functions, and while it is still true that believers who knew these propositions or had these credences would not defer to Chance as an expert, believers who understand their epistemic situation will recognize that they cannot be warranted in believing inadmissible propositions or having inadmissible credence functions. Predictions about the future that are not expectations of current objective chances cannot be justified on the basis of the sorts of evidence that, as a matter of physical possibility, human believers can possess.

26 There is a subtlety here. When Chance is realized by genuinely indeterministic physical probabilities, like those in quantum mechanics, a believer cannot acquire information about the past or present that undermines Chance simply because there is no such information. Chance already knows everything relevant to the causes of future events – it's all in the world's quantum state. In contrast, when Chance is realized by statistical probabilities that average over "hidden variables", like those of statistical mechanics, there is causally relevant information about the past and present that Chance lacks. However, when we defer to such probabilities as experts, part of what we are doing is admitting to ourselves that we lack the ability, perhaps for contingent reasons, to acquire evidence about the hidden variables that is not already reflected in the chance distribution.

Thus, if
Chance is realized by physical probabilities that encode all present information that is relevant to future events, then we have a practically sufficient theory of admissibility for use with the Principal Principle. In practice, we defer to current chances more or less unconditionally because we know that no evidence we can actually obtain will undermine them. Go back to the "crystal ball" example. When you are certain the coin is biased 2:1 for heads, part of what you believe is that acquiring additional information about the past, or more detailed information about the present, will provide no further insight into the causal processes that lead the coin to come up heads or tails. And, insofar as you believe that the basic evidence available to human beings at a time is restricted to facts that can be learned at that or previous times, you will never defer to a soothsayer who pretends to know more than the current chances. It is not that reliable crystal balls are metaphysically impossible. The point, rather, is that it is physically impossible for us, or other physical creatures, to have evidence that would warrant taking any information source whose pronouncements we can know to be a more reliable guide to the future than the current chances. In the end, both the Principal Principle and the New Principle capture central aspects of the relationship between credence and chance. The New Principle expresses our deference to chance's prowess at drawing inductive conclusions, while the Principal Principle captures the idea that chance is a data-base expert whose access to causally relevant information about the world, though not perfect, far outstrips anything to which humans can aspire.

5 CONCLUSION

Though not monolithic, Bayesianism offers a powerful and compelling set of methods for drawing inductive inferences.
Its unifying ideas are (a) Pascal’s recognition that uncertainty is best expressed probabilistically and that values of unknown quantities are best estimated using the principle of mathematical expectation, and (b) Bayes’s insight that learning and inductive inference can be fruitfully modeled using conditional probabilities and Bayes’s theorem. The two central challenges for Bayesianism are the problem of the priors and the development of general methods for Bayesian conditioning. Bayesians have responded to the problem of the priors by proposing ignorance priors that are justified a priori, by embracing a radical subjectivism in which probabilities are mere degrees of coherent credence, or by seeking refuge in the idea that subjective prejudices will wash out as evidence increases. On the conditioning front, Jeffrey has extended Bayes’s basic approach to account for non-dogmatic learning experiences, and further developments based on measures of divergence among probabilities seem promising. Bayesians have a vexed relationship with objective chance. Some reject the notion outright and portray chances as projections of personal inductive tastes onto the world. Others hope to make room for chances by developing chance/credence principles that clarify and explain the evidential relationships between the two kinds of probability. At bottom, however, all Bayesians agree that inductive reasoning
involves drawing conclusions from new data on the basis of prior information using update rules that require conditioning on the evidence.

BIBLIOGRAPHY

[Aczél, 1966] J. Aczél. Lectures on Functional Equations and Their Applications. New York: Academic Press, 1966.
[Bayes, 1763] T. Bayes. “An Essay Toward Solving a Problem in the Doctrine of Chances,” Philosophical Transactions of the Royal Society of London 53: 370-418, 1763.
[Brier, 1950] G. W. Brier. “Verification of Forecasts Expressed in Terms of Probability,” Monthly Weather Review 78: 1-3, 1950.
[Chalmers, 1999] A. F. Chalmers. What is This Thing Called Science?, 3rd ed. Indianapolis: Hackett, 1999.
[Christensen, 1996] D. Christensen. “Dutch-Book Arguments De-Pragmatized,” Journal of Philosophy 93: 450-479, 1996.
[Cox, 1961] R. T. Cox. The Algebra of Probable Inference. Baltimore: Johns Hopkins Press, 1961.
[Döring, 1999] F. Döring. “Why Bayesian Psychology is Incomplete,” Philosophy of Science 66 (Proceedings): S379-389, 1999.
[de Finetti, 1937] B. de Finetti. “La prévision: ses lois logiques, ses sources subjectives,” Annales de l’Institut Henri Poincaré 7: 1-68, 1937. Translation reprinted in H. E. Kyburg and H. E. Smokler (eds.), Studies in Subjective Probability, 2nd ed.: 53-118. New York: Robert Krieger, 1980.
[de Finetti, 1974] B. de Finetti. Theory of Probability, Vol. 1. New York: John Wiley and Sons, 1974.
[Diaconis and Freedman, 1980] P. Diaconis and D. Freedman. “De Finetti’s Theorem for Markov Chains,” Annals of Probability 8: 115-130, 1980.
[Diaconis and Zabell, 1982] P. Diaconis and S. Zabell. “Updating Subjective Probability,” Journal of the American Statistical Association 77: 822-830, 1982.
[Doob, 1971] J. Doob. “What Is a Martingale?,” American Mathematical Monthly 78: 451-462, 1971.
[Edwards et al., 1963] W. Edwards, H. Lindman, and L. Savage. “Bayesian Statistical Inference for Psychological Research,” Psychological Review 70: 193-242, 1963.
[Fermat and Pascal, 1679] P. Fermat and B. Pascal.
“Correspondence,” Varia Opera Mathematica D. Petri de Fermat: 179-188. Toulouse, 1679. Available in English translation on the web as “Fermat and Pascal on Probability,” http://www.york.ac.uk/depts/maths/histstat/pascal.pdf.
[Field, 1978] H. Field. “A Note on Jeffrey Conditionalization,” Philosophy of Science 45: 361-367, 1978.
[Fisher, 1922] R. A. Fisher. “On the Mathematical Foundations of Theoretical Statistics,” Philosophical Transactions of the Royal Society, Series A 222: 309-368, 1922.
[Fisher, 1959] R. A. Fisher. Statistical Methods and Scientific Inference. Edinburgh: Oliver and Boyd, 1959.
[Gaifman, 1986] H. Gaifman. “A Theory of Higher Order Probabilities,” Proceedings of the 1986 Conference on Theoretical Aspects of Reasoning about Knowledge: 275-292. San Francisco: Morgan Kaufmann Publishers, 1986.
[Garber, 1980] D. Garber. “Field and Jeffrey Conditionalization,” Philosophy of Science 47: 142-145, 1980.
[Gibbard, 2008] A. Gibbard. “Rational Credence and the Value of Truth,” in T. Szabó Gendler and J. Hawthorne, eds., Oxford Studies in Epistemology, vol. 2: 143-164, 2008.
[Gillies, 2000] D. Gillies. Philosophical Theories of Probability. London: Routledge, 2000.
[Grove and Halpern, 1997] A. Grove and J. Y. Halpern. “Probability Update: Conditioning vs. Cross Entropy,” Proceedings of the Thirteenth Annual Conference on Uncertainty in Artificial Intelligence: 208-214, 1997.
[Hall, 1994] N. Hall. “Correcting the Guide to Objective Chance,” Mind 103: 504-517, 1994.
[Hall, 2004] N. Hall. “Two Mistakes About Credence and Chance,” Australasian Journal of Philosophy 82: 93-111, 2004.
[Halpern, 1999] J. Y. Halpern. “Cox’s Theorem Revisited,” Journal of Artificial Intelligence Research 11: 429-435, 1999.
[Hájek, 2008] A. Hájek. “Arguments For – Or Against – Probabilism?,” The British Journal for the Philosophy of Science 59: 793-819, 2008.
[Hewitt and Savage, 1955] E. Hewitt and L. J. Savage. “Symmetric Measures on Cartesian Products,” Transactions of the American Mathematical Society 80: 470-501, 1955.
[Howson, 2008] C. Howson. “De Finetti, Countable Additivity, Consistency and Coherence,” British Journal for the Philosophy of Science 59: 1-23, 2008.
[Howson and Urbach, 1989] C. Howson and P. Urbach. Scientific Reasoning: The Bayesian Approach. La Salle: Open Court, 1989.
[Jaynes, 1968] E. Jaynes. “Prior Probabilities,” IEEE Transactions on Systems Science and Cybernetics SSC-4: 227-241, 1968.
[Jaynes, 1973] E. Jaynes. “The Well-Posed Problem,” Foundations of Physics 3: 477-493, 1973.
[Jaynes, 2003] E. Jaynes. Probability Theory: The Logic of Science. Cambridge, U.K.: Cambridge University Press, 2003.
[Jeffrey, 1983] R. Jeffrey. The Logic of Decision, revised 2nd edition. Chicago: University of Chicago Press, 1983.
[Jeffrey, 1983a] R. Jeffrey. “Bayesianism with a Human Face,” in J. Earman, ed., Testing Scientific Theories, Minnesota Studies in the Philosophy of Science 10. Minneapolis: University of Minnesota Press, 1983.
[Jeffrey, 1987] R. Jeffrey. “Indefinite Probability Judgment: A Reply to Levi,” Philosophy of Science 54: 586-591, 1987.
[Jeffreys, 1939] H. Jeffreys. Theory of Probability. Oxford: Clarendon Press, 1939.
[Joyce, 1998] J. M. Joyce. “A Nonpragmatic Vindication of Probabilism,” Philosophy of Science 65: 575-603, 1998.
[Joyce, 1999] J. M. Joyce. The Foundations of Causal Decision Theory. New York: Cambridge University Press, 1999.
[Joyce, 2005] J. M. Joyce. “How Degrees of Belief Reflect Evidence,” Philosophical Perspectives 19: 153-179, 2005.
[Joyce, 2007] J. M. Joyce.
“Epistemic Deference: The Case of Chance,” Proceedings of the Aristotelian Society 107: 1-20, 2007.
[Joyce, 2009] J. M. Joyce. “Accuracy and Coherence: Prospects for an Alethic Epistemology of Partial Belief,” in F. Huber and C. Schmidt-Petri, eds., Degrees of Belief: 263-300. Berlin: Springer, 2009.
[Kaplan, 1996] M. Kaplan. Decision Theory as Philosophy. Cambridge: Cambridge University Press, 1996.
[Kelly, 2008] T. Kelly. “Disagreement, Dogmatism, and Belief Polarization,” Journal of Philosophy 105: 611-633, 2008.
[Keynes, 1921] J. M. Keynes. A Treatise on Probability. London: Macmillan, 1921.
[Koopman, 1940] B. O. Koopman. “The Bases of Probability,” Bulletin of the American Mathematical Society 46: 763-774, 1940.
[Kraft et al., 1959] C. Kraft, J. Pratt, and A. Seidenberg. “Intuitive Probability on Finite Sets,” Annals of Mathematical Statistics 30: 408-419, 1959.
[von Kries, 1871] J. von Kries. Die Principien der Wahrscheinlichkeitsrechnung, 2nd ed. Tübingen, 1871.
[Lange, 2000] M. Lange. “Is Jeffrey Conditionalization Defective By Virtue of Being Non-Commutative? Remarks on the Sameness of Sensory Experience,” Synthese 123: 393-403, 2000.
[Laplace, 1774] P. Laplace. “Mémoire sur la probabilité des causes par les événements,” Mémoires de l’Académie royale des sciences présentés par divers savans 6: 621-656, 1774.
[Lee, 1997] P. M. Lee. Bayesian Statistics: An Introduction. New York: Wiley, 1997.
[Levi, 1980] I. Levi. The Enterprise of Knowledge. Cambridge, Mass.: MIT Press, 1980.
[Lewis, 1980] D. Lewis. “A Subjectivist’s Guide to Objective Chance,” 1980. Reprinted in Philosophical Papers: Volume II. New York: Oxford University Press, 1986. All page references are to the 1986 publication.
[Lewis, 1994] D. Lewis. “Humean Supervenience Debugged,” Mind 103: 473-490, 1994.
[Lieb et al., unpublished] E. H. Lieb, D. Osherson, J. Predd, V. Poor, S. Kulkarni, and R. Seiringer. “Probabilistic Coherence and Proper Scoring Rules,” unpublished.
[Lindley, 1982] D. Lindley. “Scoring Rules and the Inevitability of Probability,” International Statistical Review 50: 1-26, 1982.
[Maher, 2002] P. Maher. “Joyce’s Argument for Probabilism,” Philosophy of Science 69: 73-81, 2002.
[Martin-Löf, 1966] P. Martin-Löf. “On the Concept of a Random Sequence,” Information and Control 9: 602-619, 1966.
[Meacham, 2005] C. J. G. Meacham. “Three Proposals Regarding a Theory of Chance,” Philosophical Perspectives 19 (Epistemology): 281-307, 2005.
[Murphy, 1973] A. H. Murphy. “A New Vector Partition of the Probability Score,” Journal of Applied Meteorology 12: 595-600, 1973.
[Neyman, 1950] J. Neyman. First Course in Probability and Statistics. New York: Henry Holt, 1950.
[Paris, 1994] J. B. Paris. The Uncertain Reasoner’s Companion. Cambridge, U.K.: Cambridge University Press, 1994.
[Popper, 1959] K. Popper. The Logic of Scientific Discovery. London: Hutchinson, 1959.
[Ramsey, 1931] F. Ramsey. “Truth and Probability,” in The Foundations of Mathematics and Other Logical Essays. London: Kegan Paul, 1931.
[Reichenbach, 1948] H. Reichenbach. The Theory of Probability, an Inquiry into the Logical and Mathematical Foundations of the Calculus of Probability. Berkeley: University of California Press, 1948.
[Rényi, 1955] A. Rényi. “On a New Axiomatic Theory of Probability,” Acta Mathematica Academiae Scientiarum Hungaricae 6: 285-335, 1955.
[Savage, 1954] L. J. Savage. The Foundations of Statistics. New York: Wiley, 1954.
[Savage, 1971] L. J. Savage. “Elicitation of Personal Probabilities,” Journal of the American Statistical Association 66: 783-801, 1971.
[Scott, 1964] D. Scott. “Measurement Structures and Linear Inequalities,” Journal of Mathematical Psychology 1: 233-247, 1964.
[Seidenfeld, 1985] T. Seidenfeld. “Calibration, Coherence, and Scoring Rules,” Philosophy of Science 52: 274-294, 1985.
[Shimony, 1988] A. Shimony. “An Adamite Derivation of the Calculus of Probability,” in J. H.
Fetzer, ed., Probability and Causality: 151-161. Dordrecht: D. Reidel, 1988.
[Skyrms, 1980] B. Skyrms. “Higher Order Degrees of Belief,” in D. Mellor, ed., Prospects for Pragmatism. Cambridge: Cambridge University Press, 1980.
[Skyrms, 1984] B. Skyrms. Pragmatics and Empiricism. New Haven: Yale University Press, 1984.
[Strevens, 1995] M. Strevens. “A Closer Look at the ‘New’ Principle,” British Journal for the Philosophy of Science 46: 545-556, 1995.
[Suppes and Zanotti, 1976] P. Suppes and M. Zanotti. “Necessary and Sufficient Conditions for the Existence of a Unique Measure Strictly Agreeing with a Qualitative Probability Ordering,” Journal of Philosophical Logic 5: 431-438, 1976.
[Thau, 1994] M. Thau. “Undermining and Admissibility,” Mind 103: 491-503, 1994.
[van Fraassen, 1981] B. van Fraassen. “A Problem for Relative Information Minimizers in Probability Kinematics,” British Journal for the Philosophy of Science 32: 375-379, 1981.
[van Fraassen, 1983] B. van Fraassen. “Calibration: A Frequency Justification for Personal Probability,” in R. Cohen and L. Laudan, eds., Physics, Philosophy and Psychoanalysis: 295-319. Dordrecht: D. Reidel, 1983.
[Villegas, 1964] C. Villegas. “On Qualitative Probability σ-Algebras,” Annals of Mathematical Statistics 35: 1787-1796, 1964.
[Venn, 1866] J. Venn. The Logic of Chance. London: Macmillan, 1866.
[von Mises, 1957] R. von Mises. Probability, Statistics and Truth. New York: Macmillan, 1957.
[Wagner, 2002] C. Wagner. “Probability Kinematics and Commutativity,” Philosophy of Science 69: 266-278, 2002.
[Walley, 1991] P. Walley. Statistical Reasoning with Imprecise Probabilities. New York: Chapman and Hall, 1991.
[Williamson, 2000] T. Williamson. Knowledge and its Limits. Oxford: Oxford University Press, 2000.
VARIETIES OF BAYESIANISM

Jonathan Weisberg

1 INTRODUCTION
Loosely speaking, a Bayesian theory is any theory of non-deductive reasoning that uses the mathematical theory of probability to formulate its rules. Within this broad class of theories there is room for disagreement along several dimensions. There is much disagreement about exactly what the subject matter of such theories should be, i.e. about what the probabilities in these theories should be taken to represent. There is also the question of which probabilistic rules are the right ones. These two dimensions mark out the primary divides within the Bayesian viewpoint, and we will begin our Bayesian taxonomy by examining the most popular points along them. We’ll then turn to another important point of disagreement: the kinds of justification that can and should be given for the aforementioned rules. Applications of the theory — to account for rational decision making, scientific confirmation, and qualitative belief — provide other significant points of intra-Bayesian dispute, and we will consider these questions later on. We begin, however, with an outline of the mathematical machinery that lies at the heart of Bayesianism.
1.1
The Standard Bayesian Machinery
Probabilities are numbers assigned to possibilities according to a few simple rules. We may choose either sets or sentences as our mathematical representations of the possibilities to which probabilities are assigned. If we use sets, then we start with a set of objects called the outcome set, denoted Ω. The singleton subsets of Ω represent fully-specified possible outcomes, and the other subsets represent less-than-fully-specified possible outcomes. For example, if we’re considering a roll of a six-sided die, we might use the set Ω = {1, 2, 3, 4, 5, 6}, with {1} representing the possible outcome where the die comes up 1, {2, 4, 6} representing the possible outcome where an even-numbered face comes up, and so on. We then use a real-valued function on these subsets, p, to represent the assigned probabilities. A typical probability assignment might have p({1}) = 1/6, p({2, 4, 6}) = 1/2,

Handbook of the History of Logic. Volume 10: Inductive Logic. Volume editors: Dov M. Gabbay, Stephan Hartmann and John Woods. General editors: Dov M. Gabbay and John Woods. © 2011 Elsevier B.V. All rights reserved.
and so on. In simple cases like the die example, every subset of Ω is assigned a probability. But this won’t always be the case, for reasons explained below (fn. 21). What is always the case is that there is a σ-algebra on Ω that has probabilities assigned throughout: σ-Algebra Given a set Ω, a σ-algebra on Ω is a set of subsets of Ω, σ, that contains Ω and is closed under complementation and countable union: (σ1) σ ⊆ P(Ω). (σ2) Ω ∈ σ. (σ3) If A ∈ σ, then Ā ∈ σ. (σ4) If {Ai} is a countable collection of elements of σ, then ⋃i Ai ∈ σ.
Clearly, P(Ω) is a σ-algebra on Ω, and we typically assume that it is the σ-algebra under discussion unless specified otherwise. Given an outcome set, Ω, and a σ-algebra on it, σ, we call the ordered pair (Ω, σ) an outcome space, and we define a probability function relative to it. Probability Function Given the outcome space (Ω, σ), a probability function on (Ω, σ) is a total, real-valued function on σ, p, satisfying the following three axioms: (P1) p(A) ∈ [0, 1] for each A ∈ σ. (P2) p(Ω) = 1. (P3) p(A ∪ B) = p(A) + p(B) whenever A ∩ B = ∅. The ordered triple (Ω, σ, p) is called a probability space, and jointly represents the bearers of probability as well as the probabilities they bear. What if we use sentences rather than sets to represent the possibilities to which probabilities are assigned? In that case, we start with a standard propositional language, L, built out of a countable set of atomic propositions, {Ai}, and the connectives ¬ and ∨. We define the relation of logical entailment on L, written ⊢, in the usual way, and the other standard logical connectives, ∧, ⊃, and ≡, are defined out of ¬ and ∨ in the usual way as well. For some applications the language needs to be enriched into first-order logic, but for most purposes, and for us here, L need only be standard propositional logic. We call the ordered pair (L, ⊢) a logic, and we define a probability function relative to it: Probability Function A probability function on the logic (L, ⊢) is a real-valued, total function on L satisfying: (P1′) p(A) ∈ [0, 1] for each A ∈ L. (P2′) p(A) = 1 if ⊢ A.
(P3′) p(A ∨ B) = p(A) + p(B) when ⊢ ¬(A ∧ B). Call the triple (L, ⊢, p) a logical probability space. The three axioms governing probability functions on logics, (P1′)–(P3′), deliberately mirror those governing probability functions on outcome spaces, (P1)–(P3). In fact, the set-based and sentence-based formulations are pretty much interchangeable in most contexts. For example, we have the following elementary consequences for probability spaces:

PROPOSITION 1. p(∅) = 0.
PROPOSITION 2. p(Ā) = 1 − p(A) for any A.
PROPOSITION 3. If {Ai} is a finite set of mutually disjoint sets, then p(⋃i Ai) = Σi p(Ai).

And analogously, for logical probability spaces, we have:

PROPOSITION 4. If ⊢ ¬A then p(A) = 0.
PROPOSITION 5. p(¬A) = 1 − p(A) for any A.
PROPOSITION 6. If {Ai} is a finite set of mutually logically incompatible sentences, then p(⋁i Ai) = Σi p(Ai).
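The die example and the elementary propositions above can be illustrated with a short sketch (my own illustration, not from the text), taking P(Ω) as the σ-algebra and the uniform probability function on it:

```python
from fractions import Fraction

# The outcome set for the die example: Ω = {1, ..., 6}.
omega = frozenset(range(1, 7))

def p(event):
    """Uniform probability function: p(A) = |A| / |Ω|, satisfying (P1)-(P3)."""
    assert event <= omega  # events are subsets of Ω
    return Fraction(len(event), len(omega))

evens = frozenset({2, 4, 6})

# (P2): p(Ω) = 1, and Proposition 1: p(∅) = 0.
assert p(omega) == 1
assert p(frozenset()) == 0

# Proposition 2: the complement of A gets 1 - p(A).
assert p(omega - evens) == 1 - p(evens)

# Proposition 3 (finite additivity): {1} and {2,4,6} are disjoint.
assert p(frozenset({1}) | evens) == p(frozenset({1})) + p(evens)

print(p(evens))  # 1/2
```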
If the set-based and sentence-based approaches are so similar, why have both? In some ways the set-based approach is tidier, and hence preferable for certain applications. For example, in the set-based approach each way things could turn out is represented by exactly one thing, a set. On the sentence-based approach, however, there may be infinitely many logically equivalent sentences that represent the same eventuality. We can prove that they will all have the same probability, so that our choice of representation does not matter, but the fact that we have to verify this adds a minor complication to our machinery. On the other hand, philosophical considerations can make the sentence-based approach more perspicuous. Consider, for example, the sentences “Superman will save the world today” and “Clark Kent will save the world today”. To Lois Lane, these two sentences represent very distinct eventualities, though in fact they correspond to the same possible outcome. On the set-based approach, these two eventualities are represented by the same set, and hence must have the same probability. It seems reasonable, however, for Lois to think that they have different probabilities, which is allowed if we use different atomic propositions, A and B, to represent them. Then it is only once Lois realizes that A and B are equivalent that she must assign them the same probability: PROPOSITION 7. For any A, B in a logical probability space, if p(A ≡ B) = 1, then p(A) = p(B).
Of course, we can get something similar out of the set-based approach, if we just interpret the elements of Ω so that they correspond more closely to what Lois regards as possible. But the sentence-based approach seems (to many, at least) to provide a more natural description of the situation. Also, for many philosophical applications where the objects of probability are evidence and hypotheses, it is often more natural to talk about sentences or propositions, in which case the sentence-based approach is again more natural. Furthermore, some theorists seek to derive probabilistic facts from logical features of the hypotheses in question, so that a logical foundation is needed (see the discussion of Carnap below, p. 485). For the most part our discussion will be ambiguous between the two approaches, slipping occasionally into one or the other approach for convenience. Our notation will often reflect this ambiguity. I will often use Ā to mean ¬A, and AB as ambiguous between A ∧ B and A ∩ B. So, for example, I will write p(A) = p(AB) + p(AB̄) instead of the more cumbersome p(A) = p(A ∧ B) + p(A ∧ ¬B). These conventions are common and help to reduce clutter. We need one final definition to complete our basic mathematical arsenal: Conditional Probability The conditional probability of B given A is written p(B|A), and is defined by

p(B|A) = p(BA) / p(A)

when p(A) ≠ 0. As we will see below (p. 481), some Bayesians prefer to take conditional probability as the basic notion, and treat unconditional probability as the defined one. Treating conditional probabilities as the derived notion is the more traditional approach, however, so we will use unconditional probability as our basic notion until we have reason to do otherwise.
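As a concrete sketch (my own illustration, not from the text), here is the ratio definition of conditional probability on the die space, left undefined when the condition has probability 0:

```python
from fractions import Fraction

omega = frozenset(range(1, 7))

def p(event):
    """Uniform probability on subsets of Ω."""
    return Fraction(len(event), len(omega))

def cond(b, a):
    """Conditional probability p(B|A) = p(BA)/p(A); None when p(A) = 0."""
    if p(a) == 0:
        return None  # the zero-denominator problem discussed in section 1.2
    return p(b & a) / p(a)

evens = frozenset({2, 4, 6})
high = frozenset({4, 5, 6})

print(cond(high, evens))       # p({4,6}) / p({2,4,6}) = 2/3
print(cond(high, frozenset())) # None: the condition has probability 0
```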
1.2

Alternatives to the Standard Machinery

The mathematical system of probability that we just outlined is often thought to be inappropriately restrictive. For one thing, our definition of conditional probability leaves p(B|A) undefined when p(A) = 0, though there are cases where it seems it should still be defined. For example, Hájek [2003] asks us what the probability is that a randomly chosen point on the Earth’s surface will be in the Western hemisphere, given that it is on the equator. The intuitive answer is 1/2. And yet
the probability that the point will be on the equator is 0. So our framework does not allow us to have conditional probabilities where it seems we should have them. Our framework also demands a level of precision that may not always be appropriate. Every probability that is defined is perfectly determinate, picking out a precise point in [0, 1]. But couldn’t probabilities be less precise in some cases? Consider the probability that it will snow tomorrow. Where I am right now, Toronto in February, I can only say that this probability is greater than .05 but no more than .5. Maybe that lack of precision just reflects my ignorance about the true probability. But if we want to use probabilities to represent degrees of belief, as many do (see section 2.2), then those estimations of mine may just reflect indeterminacy in my opinions. Being as inexpert about the weather as I am, I simply may not be able to cite any reasons that motivate a more precise opinion than “greater than .05 but less than .5.” We’ll now consider modifications to the traditional probabilistic framework that are designed to fix these and other problems. Our first two variations on the traditional machinery address the problem of defining conditional probabilities when the condition has probability 0. The third variation addresses indeterminacy in probabilities. Primitive Conditional Probabilities The classic way to get around the zero-denominator problem for our definition of conditional probability is to reverse the order of definition, taking conditional probabilities as primitive and defining unconditional probabilities in terms of the conditional ones. This allows us to have conditional probabilities defined in all cases, but still be able to talk about probabilities unconditionally.
There are several competing axiomatizations of conditional probability in the literature; ours is a minimal characterization in the spirit of the approach devised by Popper [1959] and refined by Rényi [1970]: Popper–Rényi Function Given an outcome space (Ω, σ), a Popper–Rényi function is a real-valued, two-place function on σ × (σ − {∅}) satisfying the following axioms for any A, B, C ∈ σ: (P1″) p(·|A) is a probability function. (P2″) p(A|A) = 1. (P3″) p(BC|A) = p(B|AC)p(C|A). We then define the unconditional probability of A to be p(A|Ω). Then, from these axioms, we can recover the ratio formula for conditional probabilities:

PROPOSITION 8. If p is a Popper–Rényi function and p(A) > 0, then p(B|A) = p(BA) / p(A).
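As a finite sanity check (my own sketch, not from the text; a genuinely primitive conditional probability only earns its keep on probability-0 conditions, which a finite uniform space cannot exhibit), the ratio formula yields a two-place function satisfying the axioms above:

```python
from fractions import Fraction

omega = frozenset(range(1, 7))

def p(a):
    """Unconditional probability, recovered as p(A) = c(A, Ω)."""
    return Fraction(len(a), len(omega))

def c(b, a):
    """Two-place conditional probability; here realized by the ratio formula.
    Defined only for nonempty conditions A (the domain σ × (σ − {∅}))."""
    return Fraction(len(b & a), len(a))

A = frozenset({1, 2, 3, 4})
B = frozenset({2, 3, 5})
C = frozenset({3, 4, 6})

assert c(A, A) == 1                        # (P2''): p(A|A) = 1
assert c(B & C, A) == c(B, A & C) * c(C, A)  # (P3''): the product axiom
assert c(B, A) == p(B & A) / p(A)          # Proposition 8, since p(A) > 0
assert p(A) == c(A, omega)                 # unconditional p from c
```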
So what was a definition of conditional probability is now a theorem about a basic concept. Since this ratio formula is only a constraint now, and not an analysis, Popper–Rényi probabilities allow us to have conditional probabilities even when the condition has probability 0. Infinitesimal Probabilities Another way around the zero-denominator problem is to ensure that the denominator is never 0. This looks impossible in cases like the Equator example. Consider the longitudes: there are uncountably many and it seems they should all be equally probable. But if we assign one of them a positive real number, then we must assign it to them all, in which case our probabilities will add up to more than 1, violating (P1). So if we cannot assign 0, and we cannot assign a positive real, we must assign something in between: an infinitesimal. Robinson [1966] showed that there must be such things as infinitesimals. Start with a first-order language containing a name for every real number, and state the theory of real analysis. Now state the theory containing all the sentences of the form a < ω, where a is the name of a real number and ω is a new name of our choosing. Take the union of these two theories. Each finite subset of the unified theory has a model and hence, by compactness, the whole theory does. ω denotes an infinite element of the model, one that is greater than all the reals. Better yet, ε = 1/ω denotes an infinitesimal element of the model, one that is smaller than all the reals but greater than 0. Best of all, if we allow our probability functions to take on such infinitesimal values we can avoid assigning 0 to any of the latitudes, longitudes, or any set of points on the Earth’s surface. We can even arrange things so that the conditional probability of a randomly chosen point on the Earth’s surface lying in the Western hemisphere, given that it is on the equator, is 1/2 [Skyrms, 1995].
To say much more than this, however, we have to say a lot more about the nature of infinitesimals, for which Skyrms refers us to [Parikh and Parnes, 1974] and [Hoover, 1980]. Infinitesimals take us into tricky and unfamiliar territory, so most Bayesians who want to solve the zero-denominator problem prefer to go with a Popper–Rényi-style solution. A result of McGee’s [1994] shows that the two approaches are to some extent equivalent, and [Halpern, 2001] explores the equivalence further. But infinitesimals may have another use, as we’ll see in section 3.2. Indeterminate Probabilities Suppose that, for whatever reason, we want to allow probabilities to be indeterminate. There are two obvious ways to go about this. The first obvious way is to allow our probability functions to take on sets of values instead of single values, and the second is to talk about sets of probability functions rather than single functions. Let’s take the first one first. If we have an indeterminate probability function returning sets of points instead of points, what constraints are reasonable? The
most natural one is that our indeterminate probability function be, in some sense, resolvable into a determinate one. Resolvable how? Precisification Given a function p̃ : σ → P([0, 1]), a probability function p is a precisification of p̃ if and only if p(A) ∈ p̃(A) for each A ∈ σ. We then define indeterminate probability functions as those that are ambiguous between their various precisifications: Indeterminate Probability Function An indeterminate probability function on (Ω, σ) is a function p̃ : σ → P([0, 1]) such that, whenever x ∈ p̃(A), there is some precisification of p̃, p, for which p(A) = x. Notice, we don’t just require that p̃ be precisifiable, but also that every value it encompasses be part of a potential precisification. If we didn’t require this then p̃ could include “junk” values in its outputs, ones that could never be the “true” probability, or a probability into which p̃ gets resolved. We can define conditional indeterminate probabilities in the obvious way: Conditional Indeterminate Probability If p̃ is an indeterminate probability function then the indeterminate probability of B given A, p̃(B|A), is defined by p̃(B|A) = {x : p(B|A) = x for some precisification p of p̃}. Using the same idea, we can talk about other operations on indeterminate probability functions, like sums and mixtures. Whatever the operation normally is, just apply it to each precisification and gather up the results. The other obvious way to think about indeterminate probabilities was to work with classes of probability functions, rather than functions that returned classes of values. How do the two approaches compare? Clearly any class of probability functions picks out a unique indeterminate probability function, namely the one whose outputs include exactly those values assigned by some function in the class.
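A minimal sketch (my own toy example, with made-up numbers) of a class of probability functions over a single proposition, and the set of values, lower bound, and upper bound it induces for that proposition:

```python
from fractions import Fraction

# A probability function on {snow, ¬snow} is fixed by its value for "snow".
# This class of three functions represents one indeterminate state of opinion.
P = [Fraction(1, 10), Fraction(1, 4), Fraction(2, 5)]

# The value set assigned to "snow": exactly the values some member of P gives.
values_snow = {x for x in P}
# By Proposition 2, each member assigns 1 - x to "¬snow".
values_not_snow = {1 - x for x in P}

# Lower and upper probabilities for "snow".
lower = min(values_snow)
upper = max(values_snow)
assert lower == Fraction(1, 10) and upper == Fraction(2, 5)

# Every value in the output comes from some function in the class, as the
# definition of an indeterminate probability function requires.
assert all(any(x == q for q in P) for x in values_snow)
```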
Ambiguation If P is a set of probability functions, the ambiguation of P is the indeterminate probability function that assigns to each A, p̃(A) = {x : p(A) = x for some p ∈ P}. The map that takes an indeterminate probability function to the class of its precisifications is clearly 1-1. However, the ambiguation of a set of probability functions can have precisifications not in the ambiguated set. A natural extra constraint on indeterminate probability functions is to restrict the outputs to intervals. This is especially natural when we regard the indeterminacy as resulting from our ignorance about some objective but unknown probability function, where we have established lower and upper bounds but can’t pin down the exact values. When thinking in terms of classes of probability functions, the constraint most often added is convexity:
Convexity A class of probability functions, P, is convex if and only if whenever p, q ∈ P, every mixture of p and q is in P as well. That is, if and only if whenever p, q ∈ P we also have αp + (1 − α)q ∈ P for every α ∈ (0, 1). Convexity is a bit harder to motivate than the interval requirement. Intuitively what it says is that the class of functions contains all the functions on the straight line between any two of its members. Why should we demand that? Levi [1980] endorses convexity on the grounds that a mixture of p and q can be seen as a sort of resolution of the conflict between the two states of opinion that p and q represent. Thus if we are so torn between the state of opinion p and the state of opinion q that we suspend judgment, we should not rule out any compromise or resolution of that conflict. Convexity is closely related to the interval requirement: PROPOSITION 9. If P is convex with p̃ its ambiguation, then p̃(A) is an interval for each A. However, there are also non-convex sets of probability functions whose ambiguations are interval-valued. If our indeterminate probabilities obey the interval constraint, it becomes natural to think about them in terms of the upper and lower bounds they place on the probability of each proposition. Upper and Lower Probabilities If p̃ is an indeterminate probability function, define

p̃∗(A) = inf{x : x ∈ p̃(A)}
p̃*(A) = sup{x : x ∈ p̃(A)}.

p̃∗ and p̃* are called the lower and upper probabilities of p̃. Upper and lower probabilities are a quite general and well-studied way of representing uncertainty. In fact, they lead fairly naturally to one of Bayesianism’s closest competitors, the Dempster-Shafer theory of belief functions [Shafer, 1976]. If we impose the requirement of n-monotonicity on lower probabilities: n-Monotonicity A lower probability function p̃∗ is n-monotone if and only if for any A1, . . . , An in σ,

p̃∗(A1 ∪ . . . ∪ An) ≥ Σ_{∅ ≠ I ⊆ {1,...,n}} (−1)^{|I|+1} p̃∗(⋂_{i∈I} Ai),
then we have the class of belief functions [Kyburg, 1987], which are the topic of Dempster-Shafer theory (see the entry on Dempster-Shafer theory in this volume). These are elementary remarks on indeterminate probabilities, and they serve only to make the reader aware of the possibility and to give a sense of the idea. For the remainder of our discussion, we will ignore indeterminate probabilities almost
Varieties of Bayesianism
entirely. For further discussion, the reader is referred to [Levi, 1974], [Jeffrey, 1983], [Kyburg, 1987; Kyburg, 1992], [van Fraassen, 1990], [Walley, 1991], and [Halpern, 2003]. Halpern and Kyburg are especially useful as surveys of, and inroads to, the literature on indeterminate probabilities.

2 INTERPRETATIONS OF PROBABILITY

Now that we have the mathematical framework of probability in hand, let’s survey the sorts of epistemic subject matter it might be used to represent. (For a more detailed and historically-oriented survey of our first two interpretations, the logical and degree of belief interpretations, see the entry “Logicism and Subjectivism” in this volume.)
2.1 The Logical Interpretation
One way to think about probability is as a logical property, one that generalizes the deductive notion of logical validity.1 Logical truths have probability 1 and logical falsehoods have probability 0, so it is natural to think of intermediate probabilities as “degrees” of logical truth. Similarly, the relationship of conditional probability generalizes the notion of logical entailment, with p(B|A) representing the degree to which A entails B. Since we have

PROPOSITION 10. If A ⊨ B then p(B|A) = 1; if A ⊨ ¬B then p(B|A) = 0,

it seems sensible to think of p(B|A) as representing the extent to which A entails B, with p(B|A) = 1 representing the extreme case given by deductive validity. On this interpretation, the probability axioms (P1)–(P3) are to be understood as constraints on this generalized concept of validity, extending the usual rules of logical consistency. But it’s hard to understand the idea of partial entailment, since no analysis of the concept is given, and an analogy with the usual analyses of deductive entailment is not forthcoming. Usually deductive entailment is explained semantically, modally, or syntactically:

    A ⊨ B ⇐⇒ In every world/model where A is true, B is true.
    A ⊨ B ⇐⇒ Necessarily, if A then B.
    A ⊢ B ⇐⇒ B is deducible from A by the rules of a given logic.

The obvious way to extend the semantic definition is to say that A entails B to degree x iff 100x% of the worlds/models where A is true are worlds where B is true. But then we need some way of counting or measuring sets of worlds, a notoriously problematic endeavor (see section 3.6). As for the modal and syntactic analyses, they don’t seem to admit of any obvious extension to degrees.

1 Keynes [1921] and Carnap [1950] were leading proponents of this interpretation in the 20th Century; Keynes attributes the origins of his views to Leibniz.
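In the finite case, where counting worlds is unproblematic, the proposed semantic extension can be sketched directly; the toy model and propositions below are illustrative:

```python
from fractions import Fraction
from itertools import product

# Worlds are truth-value assignments to two atomic sentences, A and B.
worlds = list(product([True, False], repeat=2))

def entails_to_degree(antecedent, consequent):
    """Fraction of antecedent-worlds that are also consequent-worlds:
    the naive semantic generalization discussed in the text."""
    a_worlds = [w for w in worlds if antecedent(w)]
    return Fraction(sum(1 for w in a_worlds if consequent(w)), len(a_worlds))

A = lambda w: w[0]
B = lambda w: w[1]

print(entails_to_degree(lambda w: A(w) and B(w), B))  # deductive case: 1
print(entails_to_degree(A, B))  # partial: half the A-worlds are B-worlds
```

This works only because the set of worlds is finite; the difficulty flagged in the text is precisely that no such counting measure is available in general.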
Etchemendy [1990] argues that none of the usual analyses of logical validity offer a proper analysis anyway. Instead they should be understood as providing converging approximations to a basic concept. Exploring semantic and syntactic validity and the relations between them helps us to refine our grip on our basic concept of logical validity, but does not tell us what logical validity is. But it wouldn’t do for the proponent of the logical interpretation to say that, likewise, no analysis of partial logical entailment should be expected. Even if analyses of core logical notions cannot be demanded, we should expect at least some help in getting a grip on them — say, for example, by their approximate relationships to semantics, syntax, and modality. If similar approximating paths of convergence to partial entailment were given, then it would be on a par with deductive entailment, but there don’t seem to be any such paths to help us converge on the postulated concept of partial entailment. That’s not to say that this sort of thing hasn’t been tried. Carnap, one of the chief proponents of the logical interpretation, made a famous and heroic effort to characterize the concept of partial entailment syntactically. In his [1950], and later [1952], Carnap outlined basic principles from which a particular partial entailment relation on a simple language could be derived. This relation had some nice, intuitive properties that we might expect from a relation of partial entailment. For example, a larger sample of green emeralds entails “All emeralds are green” more strongly than a small sample does. But because Carnap’s method for specifying the partial entailment relation is bound to the syntax of the underlying language, it falls victim to Goodman’s [1954] new riddle of induction. Define grue to mean “green and observed or blue and not observed”. 
If we formulate the underlying language in terms of ‘grue’ rather than ‘green’, Carnap’s partial entailment relation will tell us that our observation of a large sample of green emeralds highly entails “All emeralds are grue,” since the green emeralds we have observed are all grue. But “All emeralds are grue” says that all emeralds not yet observed are blue, and it seems that this should not be highly entailed by our observation of many green emeralds. Insofar as we understand partial entailment, it does not seem it should be a purely syntactic matter, as Carnap’s approach made it.2 Carnap’s approach is also regarded as problematic because it uncovered many candidates for the relationship of partial entailment (the famous “λ-continuum”), and none of them seems to be especially self-recommending, not to mention logically required.
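The syntactic symmetry Goodman exploits can be made concrete in a small sketch; the observation record here is hypothetical:

```python
# A hypothetical record of emerald observations: all green, all observed.
observations = [{"color": "green", "observed": True} for _ in range(50)]

def green(e):
    return e["color"] == "green"

def grue(e):
    """Green and observed, or blue and not observed."""
    return (green(e) and e["observed"]) or \
           (e["color"] == "blue" and not e["observed"])

# The evidence fits both generalizations equally well...
print(all(green(e) for e in observations))  # True
print(all(grue(e) for e in observations))   # True
# ...yet "all emeralds are grue" predicts that unobserved emeralds are blue.
```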
2.2 The Degree of Belief Interpretation

Carnap’s difficulties led many Bayesians to embrace the (already existing) degree-of-belief interpretation of probability, according to which probabilities represent a subject’s levels of certainty. On this view, the force of the probability axioms (P1)–(P3) is that of constraints on rational belief. Someone whose degrees of belief

2 For more on Goodman’s riddle, see the entry “Goodman on the Demise of the Syntactic Approach” in this volume.
violated those axioms, say by thinking rain tomorrow 70% likely and no rain 50% likely, would violate a canon of rationality. Many notable thinkers in this tradition have thought that this rational force is on a par with the rational force of the rules of deductive logic — that the probability axioms provide the rules for degrees of belief just as the laws of deductive logic provide the rules of full belief. Some even claim that one whose degrees of belief violate the probability axioms is logically inconsistent in the deductive sense. But their view is still very different from the logical interpretation. Even if general rules of probability like (P1)–(P3) express logical facts, statements of particular probabilities represent psychological facts. For them, to say that p(B|A) = 1/3 is to say that the ratio of my degree of belief in AB to my degree of belief in A is 1/3, not that there is any logical relationship between A and B that achieves a degree of 1/3. In fact, many in the degree-of-belief tradition have thought that there are few or no rules of rationality for degrees of belief beyond (P1)–(P3). According to them, there are many degrees of belief one can have for p(B|A), any of which would be reasonable to hold (except in the special case where A deductively entails either B or its negation). Thus there is no such thing as the degree to which A entails B, merely the degrees of belief one can hold without violating the canons of logic. The degree-of-belief interpretation is sometimes called the “subjective” interpretation, because probabilities are used to describe a subjective state, degree of belief. But ‘subjectivism’ has other connotations: specifically, the view just mentioned, according to which the probability axioms capture (nearly) all the rules of rationality for degrees of belief, thus leaving a great deal of room for reasonable inter-subjective variation.
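The incoherence of the rain example can be checked mechanically. A minimal sketch (the floating-point tolerance is an implementation convenience, not part of the axioms):

```python
def coherent_partition(probs, tol=1e-9):
    """Check the probability axioms for an assignment over a partition:
    every value lies in [0, 1] and the values sum to 1."""
    return all(0 <= x <= 1 for x in probs) and abs(sum(probs) - 1) <= tol

print(coherent_partition([0.7, 0.5]))  # rain 70%, no rain 50%: incoherent
print(coherent_partition([0.7, 0.3]))  # coherent
```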
Because the degree-of-belief interpretation is characterized by the particular subjective state it concerns, degree of belief, I prefer the more explicit title ‘the degree of belief interpretation’, and I save ‘subjectivism’ for the view that the rules of rationality leave much room for inter-subjective variation. We will discuss subjectivism in this sense later (section 3). Different Notions of Degree of Belief Bayesians differ on exactly what degrees of belief are. The first precise accounts were given by Ramsey [[1926] 1990] and de Finetti [1937]. Their definitions were heavily operationalist, defining degrees of belief in terms of what a person prefers, or would choose if given the option. This operationalist tendency may reflect the heavy influence of logical positivism at the time, motivating those using a psychological interpretation to offer empirically respectable analyses of the psychological state they were postulating.3 3 Ramsey actually prefaces his definition saying, “It is a common view that belief and other psychological variables are not measurable, and if this is true our inquiry will be vain [. . . ] for if the phrase ‘a belief two-thirds of certainty’ is meaningless, a calculus whose sole object is to enjoin such beliefs will be meaningless also.” [Ramsey, [1926] 1990, p. 166] The worry that we can move from ‘not measurable’ to ‘meaningless’ suggests a logical positivist perspective.
Let’s start with de Finetti’s definition, since it is simpler. According to de Finetti, your degree of belief in a proposition A is the odds at which you would regard a bet on A that pays $1 as fair. For example, if you are willing to pay $.50 but no more to play a game that pays $1 if A comes true and nothing otherwise, that must be because you think it 50% likely that A is true. For then you would stand a 50% chance of gaining $.50, but a 50% chance at losing $.50. Were your degree of belief in A lower, you would think it more likely that you would lose the $.50 than gain it, and would not regard the bet as fair. And if your degree of belief were higher, you would think yourself more likely to gain it than lose it, making the bet unfair (if advantageous). This sort of definition has been criticized for failing to take into account factors other than the subject’s degrees of belief that might affect her betting behavior (see, for example, [Earman, 1992; Weatherson, 1999; Christensen, 2001]). For example, she might be risk averse, in which case her betting behavior would underrate her degrees of belief. She might also have other motives or interests besides money, like entertainment or showing off. She could even be misled by the format in which the deal is offered, failing to appreciate its consequences or falling prey to a framing effect. The standard response to such worries is to specify an idealized setup in which all such factors have been ruled out, and to say that your degrees of belief are the odds you would regard as fair in that idealized scenario [Goldstick, 2000]. There is, of course, the worry that such conditions cannot be spelled out non-circularly. There is also the worry that comes with any counterfactual analysis, namely that moving to such a remote, hypothetical situation may alter the subject’s epistemic state so that the degrees of belief that guide her there are not the actual ones that we were trying to mine. 
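De Finetti’s definition can be illustrated by computing the expected net gain of the bet at various prices. The numbers are illustrative, and the subject is assumed to value only money, per the idealization just discussed:

```python
def expected_gain(price, belief, stake=1.0):
    """Expected net gain of paying `price` for a bet paying `stake` if A,
    for a subject whose degree of belief in A is `belief`."""
    return belief * (stake - price) + (1 - belief) * (0 - price)

belief = 0.5
print(expected_gain(0.50, belief))       # fair bet: expected gain is 0
print(expected_gain(0.60, belief) < 0)   # overpriced: disadvantageous
print(expected_gain(0.40, belief) > 0)   # underpriced: advantageous
```

The price at which the expected gain vanishes is exactly belief × stake, which is why the fair price is taken to reveal the degree of belief.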
Ramsey’s approach is similar in spirit but more sophisticated and technical. Rather than assume that our subject is interested in money, Ramsey tries to extract her degrees of belief from her preferences, whatever they may be. To do this, Ramsey proved his famous representation theorem, which says roughly this:

Representation Theorem Suppose our subject’s preferences are rational in that they obey a set of constraints C (not specified here). Then there is exactly one probability function-utility function pair, (p, u), that represents the agent’s preferences in this sense: she prefers A to B if and only if the expected utility of A relative to p and u is greater than that of B.4

The idea is to use the theorem to reverse-engineer the agent’s beliefs and desires from her preferences. Her beliefs are given by p and her desires by u, since these are the only functions that coincide with her preferences, and hence must be the beliefs and desires that generated her preferences. Since Ramsey proved his theorem, variations to the same effect have been proved by Savage [1954], Jeffrey [1965], and others.5

4 See section 5 for a definition of ‘expected utility’.
5 There are also non-Bayesian representation theorems, which allow for preferences that violate
A classic criticism of the Ramsey-Savage-Jeffrey approach is that actual people do not obey the constraints on preferences, C, assumed by the theorem. For example, Kahneman and Tversky [1979] showed that the famous Allais paradox [Allais, 1979] habitually leads people into violations of Savage’s key Axiom of Independence (section 4.2). Even the elementary constraint of transitivity — that you should prefer A to C if you prefer A to B and B to C — has been claimed to be habitually violated by actual people [Lichtenstein and Slovic, 1971; Lichtenstein and Slovic, 1973]. Savage and others typically regard the constraints in C as normative, not descriptive. Hence they see no problem if people do not obey the constraints assumed by the theorem, so long as it is still true that they should obey them. But even if the constraints in C are normatively correct, this does nothing to ameliorate the problem for Ramsey and others who want to use representation theorems to define ‘degree of belief’. For if nobody satisfies the constraints needed by the theorem, then nobody’s degrees of belief can be defined via the theorem. One might respond that the theorem still offers a definition of degree of belief for ideal agents who do satisfy the constraints in C. But it is questionable whether one can (without begging the question) show that ideal agents really would have the degrees of belief attributed to them by the theorem. And it is also unclear how that would help us understand what degrees of belief are for actual people. For further development and discussion of these worries, see Zynda [2000], Christensen [2001], and Meacham and Weisberg [unpublished]. If the standard, operationalist definitions fail, what can be put in their place? We might take degrees of belief as primitive on the grounds that they are theoretically fruitful [Eriksson and Hájek, 2007].
We might also hope that a precise characterization will ultimately be provided implicitly in our psychological theorizing. Psychological theorizing about uncertain reasoning has boomed in the last 25 years, and we might take the naturalist attitude that degrees of belief are roughly characterized by our folk-psychological theory of confidence/certainty, and that the concept will be refined as that theory is refined by the empirical work in psychology. Of course, there is the possibility that, at the end of the day, nothing remotely like our folk concept will show up in the finished psychological theory, in which case the degree of belief interpretation will be out of luck. But the meaningfulness of our normative theorizing about a state always hangs on that state actually existing, and we should always be prepared for the eventuality that our empirical theorizing will eliminate it, rendering our normative theorizing otiose.6
the usual constraint-set C in certain ways, and which deliver non-Bayesian representations of those preferences. See, for example, [Kahneman and Tversky, 1979] and [Wakker and Tversky, 1993]. 6 Compare, for example, the argument that virtue ethics is undermined by empirical research showing that the character traits it presupposes do not exist [Doris, 2002].
2.3 Primitivism

Another take on the interpretation of probability is one we might call primitivism, according to which Bayesianism can be understood as a theory about an important epistemological notion of probability that need not be given any analysis (and maybe can’t be given one). Timothy Williamson [2000] frankly adopts this stance. Williamson doesn’t necessarily reject the degree of belief interpretation of probability, but he thinks that there is an important concept of probability that is not captured by that interpretation, and cannot be captured in any analysis. If someone asks, “how likely is the theory of evolution given our evidence?”, Williamson thinks that there is a definite, and objective answer, one that needn’t correspond to any actual (or ideal) person’s degrees of belief.7 Such questions are about a concept Williamson dubs evidential probability. Evidential probability is something like the confirmational notion of probability pursued by Carnap, except that Williamson does not endorse the analogy with deductive entailment, and rejects any attempt to specify probabilities syntactically. Any attempt to analyze evidential probability would, on Williamson’s view, be a mistake on a par with trying to analyze modality or sets. Instead, Williamson thinks our approach should be to go ahead and use the concept in our epistemological theorizing. Our grip on the concept will strengthen as our theory develops and evidential probability’s role is outlined, as happens with possibility in modal logic and sets in set theory.
2.4 Non-Epistemic Interpretations: Frequency and Chance

There are other classic interpretations of probability that may be legitimate but which do not yield an epistemological theory, and so do not enter into our taxonomy of Bayesian theories. Still, they deserve mention since they are widely discussed and are important to Bayesian theory in other ways. The first is the frequency interpretation, according to which the probability of an event is the frequency with which such things occur. For example, the probability that the coin I am about to flip will come up heads is 1/2 because this coin and coins like it come up heads half the times they are flipped. As stated, this interpretation is ambiguous in two respects. First, we need to know which coins count as “coins like this one”. If trick coins that are asymmetrically weighted or have heads on both sides count as “like this one”, then the frequency of heads may be something other than 1/2. This is the famous problem of the reference class: with reference to which class of coin-flips should we calculate the frequency of heads to determine this coin’s probability of coming up heads? There is an accompanying ambiguity that is less often stressed but is also crucial: what is it we are supposed to count the frequency of? If we want to know the probability of the sentence “the coin will come up heads”, we might count the frequency with

7 Maher [1996] takes something of an intermediate stance between the degree of belief interpretation and Williamson’s primitivism. On Maher’s view there is a (fairly) definite and objective answer to such questions, but the answer is the degree of belief one ought to have (not the degree of belief one actually does have).
which such coins come up heads, the frequency with which they come up at all (instead of disappearing or landing on a side), or even the frequency with which such coins are coins. Which feature of the sentence is it that we must count the frequency of to determine the sentence’s probability? These two ambiguities correspond to the two variables F and G in the question, “how many F s are Gs?”. Given a sentence whose probability we want to know, we must specify what F and G are before we can say what the frequency is that determines the probability. For a sentence like, “the coin-flip will come up heads,” what F and G are supposed to be may be obvious. But it isn’t always obvious, as is shown by sentences like “Louis XIV wore brown trousers on Jan. 14, 1713.” So a recipe for specifying F and G must be given for this interpretation to have much content. Setting these problems aside, why isn’t the frequency interpretation part of our taxonomy of Bayesian theories? Because frequency is not an epistemic subject, strictly speaking. There are no facts about what the frequencies ought to be, nor do frequencies represent some epistemic relationship like confirmation, partial entailment, or evidential probability. That is not to say that frequencies are epistemically irrelevant. Far from it. Frequencies are one of the most useful kinds of data we can have for determining a theory’s degree of logical truth/degree of belief/evidential probability. But every Bayesian theory acknowledges that probabilities can be used to represent frequencies8 and that frequencies are a crucial kind of data. The interesting question is what we should do with our knowledge of frequencies. Should they determine our degrees of belief, our judgments about partial entailment, or our judgments about evidential probability? And what are the rules by which they determine these judgments? 
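The footnoted claim that frequencies obey the probability axioms can be checked on a toy reference class; the data below are invented:

```python
from fractions import Fraction

flips = ["H", "H", "T", "H", "T", "T", "H", "T", "H", "T"]  # hypothetical data

def freq(pred):
    """Relative frequency with which members of the class satisfy pred."""
    return Fraction(sum(1 for x in flips if pred(x)), len(flips))

heads = lambda x: x == "H"
tails = lambda x: x == "T"

assert 0 <= freq(heads) <= 1                       # analogue of (P1)
assert freq(lambda x: heads(x) or tails(x)) == 1   # analogue of (P2)
# analogue of (P3): disjoint predicates add
assert freq(lambda x: heads(x) or tails(x)) == freq(heads) + freq(tails)
print(freq(heads))  # 1/2 in this reference class
```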
Why is the frequency with which we observe green emeralds relevant in a way that the frequency of grue emeralds is not? Frequencies are just one kind of data out of many, and the fact that these data are important is not controversial. What is controversial is what kind of epistemic facts or states these data determine, and what the exact determination relation is. The same is true for another common interpretation known as the physical chance interpretation. On this interpretation, probabilities represent objective physical properties of objects or events. The classic exemplars of physical probability are the probabilities in quantum mechanics. For example, the probability that a given atom of U238 will decay in the next 10 minutes is supposed to be an objective, physical fact that is entirely independent of the epistemic status of any agent. It is a property of that atom, or of the events it is a part of, and does not merely reflect our ignorance about the state of some hidden variables since, ostensibly, there are none. Another, more contentious example of physical chance comes from statistical

8 It is an elementary truth of mathematics that frequencies obey the probability axioms. Given a set of Fs, the ratio that are Gs in the set is always a (rational) real number in [0, 1], the ratio that are Fs is always 1, and the ratio that are either Gs or G′s is the ratio that are Gs plus the ratio that are G′s, if G and G′ are incompatible.
mechanics. Because classical statistical mechanics is compatible with a deterministic underlying physics, some think that its probabilities cannot be physical chances. Lewis [1980], for example, holds that the chances in a deterministic universe are always 0 or 1. Others, however, hold that statistical mechanical chances must be objective and physical since they would otherwise be incapable of doing the explanatory work demanded of them [Albert, 2001; Loewer, 2001]. Statistical mechanical probabilities are supposed to explain why ice cubes always melt when we put them in water, but if the chances were not something physical — and especially if they are merely something epistemic — then they could not explain why this occurs. They would, at best, explain why we expect it to occur, which is not the same as explaining why our expectations are repeatedly met. The probabilities in special sciences like biology and sociology are also held by some to be physical chances, but these cases are even more contentious. At any rate though, most agree that quantum mechanical probabilities are examples of physical chances. The physical chance interpretation is sometimes called the “propensity” interpretation, but this label is not entirely neutral, conveying a more specific view about what physical chances are like. On the propensity interpretation, physical chances are likened to dispositions like fragility and solubility. Just as a thing can be more or less fragile, a thing can have a higher or lower propensity to do a certain thing. On the propensity view, physical chances are just the measures of these propensities. Not all authors who believe in physical chances endorse the analogy with propensities, however. With respect to our taxonomy of Bayesian theories, physical chances are in the same boat as frequencies. Some think there are such chances and some think there are not, but this debate is a matter of physics or metaphysics, not epistemology. 
But, like frequencies, chances are very epistemically important if they exist, since they would guide our degrees of belief, evidential probabilities, or whatever. Chances will enter into our taxonomy when we consider the particular epistemic norms endorsed by various Bayesian theories, since the exact way in which chances should guide degrees of belief is a point of disagreement amongst Bayesians. But as far as the subject matter of our epistemic theorizing goes — the question what epistemic properties, states, or relations are being subjected to normative rules by our theory — chances are not a point of decision.
2.5 Summary So Far

Let’s summarize our taxonomy so far. We began with the question: what epistemic states or relations can probability be used to model and formulate norms about? We considered three candidate answers. First there was the logical interpretation, according to which probability generalizes logical validity and entailment to allow for intermediate degrees. Problems with understanding how logical entailment could be a matter of degree led us instead to the degree of belief interpretation, according to which probabilities represent levels of confidence and the rules of
probability theory are normative rules governing them. We also considered various ways of explicating the notion of degree of belief and their drawbacks. Finally, we considered the primitivist stance, which treats probability as its own, basic epistemic notion, for which an analysis is neither demanded nor given. These interpretations are not necessarily incompatible. We might think that all these epistemic phenomena exist, and that they are all appropriately modeled using probabilities. For example, a pluralist might think that we can have a degree of belief about the evidential probability of the level of partial entailment between A and B. Whichever interpretations we accept as appropriate, the next question we face is what precise rules they ought to follow. When does A entail B to degree x? Given that I am x confident in A, how confident should I be in B? What is the evidential probability of B given my evidence? These kinds of questions bring us to the second dimension of our taxonomy. In addition to the three basic rules that define Bayesianism, the probability axioms (P1)–(P3), what other rules should we include in our theory?
3 THE SUBJECTIVE-OBJECTIVE CONTINUUM

Probabilism is the view that degrees of belief (or degrees of entailment or evidential probabilities) should obey the probability axioms, (P1)–(P3). But Probabilism leaves a tremendous amount unsettled, since many wildly different probability assignments can satisfy (P1)–(P3). Take a coin-flip for example. An assignment of 1 and 0 to heads and tails satisfies (P1)–(P3), but so would an assignment of 0 and 1, 1/2 and 1/2, 1/3 and 2/3, etc. This means that our Bayesian theory so far allows for a wide range of inter-subjective variation, since you may choose one assignment and I another without either of us violating any rule. Thus our theory so far is heavily “subjective”.9 If we were to add more rules beyond just (P1)–(P3), we would move more and more towards objectivity as we ruled out more and more possible assignments. At the objective end of this spectrum is a theory that specifies a single probability distribution for each individual or situation. On the degree-of-belief interpretation, this spectrum of subjectivity is just a matter of how much room there is for people’s beliefs to reasonably vary. On the

9 The “subjective vs. objective” terminology is deeply entrenched, but can be very misleading. Presumably, any correct theory of non-deductive reasoning must allow for massive intersubjective disagreement, since different people have different evidence, conceptual frameworks, interests, etc. The idea typically had in mind, I think, is that more subjective brands of Bayesianism allow for inter-subjective disagreement even between agents with the same evidence. But one might want to acknowledge other epistemological factors besides evidence that determine the probabilities an agent should assign, e.g. attention and epistemic interests. A scale from “subjective” to “objective” thus obscures a central question: which variables should determine a subject’s probabilities?
Just evidence, or are there other properly epistemic factors to consider? Once that question is settled, we can place views on a subjective-objective continuum according to how much they think these variables do to determine a subject’s probabilities. Without having settled which variables are relevant though, there is a danger of cross-talk.
other interpretations it’s less clear what a more subjective theory amounts to. If our theory of partial logical entailment demands only (P1)–(P3), does that mean that degrees of entailment are generally indeterminate? What about for primitive or evidential probability? On these interpretations a theory at the subjective end of the spectrum looks dangerously at odds with the very notion of probability offered. But even on the degree of belief interpretation a heavily subjective theory looks problematic. Someone who is convinced that heads will come up but can offer no epistemological reason for their conviction seems completely unreasonable. The mere fact that he satisfies (P1)–(P3) because he simultaneously has no confidence in tails does little to raise our estimation of his rationality. So all parties have an interest in filling out their theory to push it more towards the objective end of the spectrum. This is usually done by formulating specific rules, over and above (P1)–(P3), to eliminate possible probability distributions. Which such rules should Bayesians endorse? Here we consider the most common proposals.
3.1 Countable Additivity

According to (P3), the probability of two disjoint possibilities is the sum of their individual probabilities. We can derive from there that the same sort of additivity applies to any finite number of disjoint possibilities (Proposition 1.3). But what about a countable infinity of disjoint possibilities: does additivity apply there too? (P1)–(P3) do not entail that it does. There are probability distributions satisfying (P1)–(P3) but violating

Countable Additivity If {Ai} is a countably infinite set of mutually disjoint sets, then

    p(⋃_i Ai) = Σ_i p(Ai).
The existence of probability distributions satisfying (P3) but not Countable Additivity can be proved using the ultrafilter lemma.10 So we have the option of adding Countable Additivity to our theory or not. Should we add it?11 De Finetti [1970; 1972] famously argued that we should reject Countable Additivity, since it rules out fair, countable lotteries. Couldn’t God pick out a natural number at random? Not if Countable Additivity holds. Any probability function that assigned the same value to each natural number would have to assign them all 0. If we assigned some a > 0 to each number, we would violate (P3), since some finite collection of numbers would have individual probabilities adding up to more than 1. But if we assign 0 everywhere, we violate Countable Additivity:

p(⋃i∈N {i}) = 1 ≠ Σi∈N p({i}) = 0.

Varieties of Bayesianism
495

10 The ultrafilter lemma is a consequence of ZFC set theory, though not of ZF, so this way of proving the existence of such distributions is non-constructive. I do not know whether a construction is possible.

11 Many authors contemplate adding a condition called ‘continuity’ which, in the context of (P1)–(P3), is equivalent to Countable Additivity. But the content and point of Countable Additivity is much more readily apparent, so our discussion is cast in terms of it.
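This impossibility can be checked concretely with a short Python sketch (ours, not from the text): exact rational arithmetic exhibits, for any constant assignment a > 0, a finite set of numbers whose probabilities already sum past 1.

```python
# Sketch of the impossibility argument: if every natural number receives the
# same probability a > 0, some finite set of numbers has total probability
# exceeding 1, violating finite additivity (P3).
from fractions import Fraction

def witness_set_size(a):
    """Smallest n such that n numbers, each with probability a, sum past 1."""
    n = int(1 / a) + 1
    assert n * a > 1
    return n

print(witness_set_size(Fraction(1, 1000000)))  # -> 1000001
# The only remaining option, a = 0 for every number, makes every countable
# sum 0, clashing with p(N) = 1 once Countable Additivity is assumed.
```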
But, de Finetti thought, the possibility of a fair, countable lottery is a real one, so Countable Additivity cannot be true. On the other hand, Countable Additivity is crucial to the proof of some allegedly important theorems. There is a family of theorems known as Bayesian convergence theorems which say, in various more precise forms, that a Bayesian agent’s conditional probabilities on true data are certain to converge to certainty in the truth. For example, consider the following gloss of a theorem from Gaifman and Snir [1982]:

Gaifman-Snir Theorem Let L be a first-order language with finitely many “empirical terms”, and {Ei} a sequence of sentences that “separates” the models of L. For each model ω of L, let ω(A) = 1 if A is true on ω, 0 otherwise. Finally, let Ei^ω denote Ei if ω(Ei) = 1, ¬Ei otherwise. Then for any H in L,

lim_{n→∞} p(H | ⋀_{0≤i<n} Ei^ω) = ω(H).

[. . . ] for any E with p(E) > 0,

p(H|E) = p(H) p(E|H) / p(E).
500
Jonathan Weisberg

If H predicts E, then p(E|H) is high, typically making the ratio on the right high, and so p(H) gets multiplied by a number greater than 1, making p(H|E) > p(H). Thus, when we conditionalize on E, the probability of H goes up. Similarly, if H predicts ¬E but we find E, H’s probability will typically go down, since p(E|H) will be low, and thus p(H) will typically be multiplied by a number near 0. And if H predicts E where E would be surprising otherwise, then the fraction on the right is especially large, raising the probability of H significantly when we conditionalize on E. It is important to note that discussions of Conditionalization are often ambiguous between two importantly different interpretations of the same formal idea. Suppose we are understanding probabilities as degrees of belief. Then what Conditionalization says is that, when you learn new information E, you should adopt as your new degree of belief in H your previous conditional degree of belief in H given E, whatever your degrees of belief may happen to have been just before you learned E. On this understanding of the rule, the probabilities you plug into the rule are just whatever degrees of belief you happen to have when you learn E. But what if your degrees of belief right before you learned E were badly misguided, treating E and H as mutually irrelevant when they ought to be regarded as importantly related? This thought prompts a different reading of Conditionalization, one where the probabilities plugged into the rule are not the degrees of belief you happen to have at the moment but, rather, something more objective. An obvious candidate is the probability function that you ought to have had — the probabilities that would have been reasonable given your available evidence up to that time. This reading of Conditionalization, however, requires objective constraints strong enough to specify what degrees of belief would be “reasonable” given your evidence up to that time. Thus which reading of Conditionalization is used usually depends on the author’s leanings on the subjective-objective continuum.
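These confirmation dynamics are easy to verify in a toy model. The sketch below is our own illustration (the four-world distribution and the helper names are invented): conditionalizing on evidence that H predicts raises p(H), and the evidence itself becomes certain.

```python
# Toy model over worlds (H?, E?): conditionalizing on predicted evidence
# raises p(H); the evidence itself goes to probability 1.
p = {('H', 'E'): 0.45, ('H', 'notE'): 0.05,   # H strongly predicts E
     ('notH', 'E'): 0.25, ('notH', 'notE'): 0.25}

def prob(dist, pred):
    return sum(v for w, v in dist.items() if pred(w))

def conditionalize(dist, pred):
    """Return dist(. | the worlds satisfying pred)."""
    z = prob(dist, pred)
    return {w: (v / z if pred(w) else 0.0) for w, v in dist.items()}

is_H = lambda w: w[0] == 'H'
is_E = lambda w: w[1] == 'E'

q = conditionalize(p, is_E)
print(prob(p, is_H))   # prior p(H) = 0.5
print(prob(q, is_H))   # posterior ~0.643: H's probability goes up
print(prob(q, is_E))   # 1.0: conditionalizing makes the evidence certain
```

The posterior matches Bayes’ theorem directly: p(H) · p(E|H)/p(E) = 0.5 · 0.9/0.7 ≈ 0.643.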
These remarks, it should be noted, apply quite generally to pretty much any update rule, including the ones about to be discussed in the next few sections. Such rules always take in “prior” probabilities, modifying them in light of the new evidence. Which probabilities should be regarded as the appropriate prior ones — e.g. the degrees of belief you happened to have vs. the objectively correct evidential probabilities — is a major point of disagreement between objectivists and subjectivists. That said, this ambiguity in the interpretation of diachronic rules will be suppressed for the remainder of the section. The downside to Conditionalization is that it flouts Continuing Regularity (section 3.2) by giving probability 1 to all evidence: PROPOSITION 14. If q comes from p by conditionalizing on E, then q(E) = 1. But we are rarely if ever entitled to be absolutely certain that our evidence is true. Notice that, using Conditionalization, we could not distinguish the different grades of plausibility amongst various pieces of our evidence. Another common complaint about Conditionalization is that it guarantees that evidence not only becomes certain, but stays certain forever. If all changes in our probabilities happen by Conditionalization, then certainties stay certainties. Conditionalizing on anything consistent with a certainty leaves it as a certainty:
PROPOSITION 15. If p(E1) = 1 then p(E1|E2) = 1 for any E2 consistent with E1. And we can’t conditionalize on anything inconsistent with past evidence, since we can’t conditionalize on anything inconsistent with a certainty: PROPOSITION 16. If p(E1) = 1 then p(A|E2) is undefined whenever E2 is inconsistent with E1, since p(E2) = 0. So evidence stays certain once certain, and evidence to the contrary cannot even be assimilated (unless we turn to Popper-Rényi functions (section 1.2)). These problems led Jeffrey [1965] to offer a less demanding extension of Conditionalization, and many have followed him.
3.4 Jeffrey Conditionalization
Jeffrey motivated his rule by considering cases of unclear observation, like the observation of a cloth in dim candlelight, where the color appears red but might also be green or blue. Suppose the probabilities the experience licenses in the red, green, and blue hypotheses are q(R), q(G), and q(B). What, then, is the probability of H, that this is your sock and not your roommate’s? Well, the appearance of the cloth tells you something about its color, but nothing about how its color bears on the probability of it being your sock as opposed to your roommate’s. So we can set q(H|R) = p(H|R), and similarly for G and B. We are then in a position to calculate q(H):

q(H) = q(H|R)q(R) + q(H|G)q(G) + q(H|B)q(B)
     = p(H|R)q(R) + p(H|G)q(G) + p(H|B)q(B).
The example suggests the following as a general rule:

Jeffrey Conditionalization When an observation bears directly on the probabilities over a partition {Ei}, changing them from p(Ei) to q(Ei), the new probability for any proposition H should be

q(H) = Σi p(H|Ei) q(Ei).
Notice that Jeffrey Conditionalization gets Conditionalization as a special case when the partition is {E, ¬E} and q(E) = 1. PROPOSITION 17. If q is obtained from p by Jeffrey Conditionalization on the partition {E, ¬E} with q(E) = 1, then q(·) = p(·|E). The driving assumption behind Jeffrey Conditionalization is that the bearing of the observation is captured entirely by the new probabilities on the partition, so that the conditional probabilities on the elements of the partition need not be changed. It is the probabilities of the Ei that we have been informed about,
not their evidential bearing on other questions. This key assumption, that the conditional probabilities on {Ei} should remain unchanged, is called rigidity. Historically, the most common complaint about Jeffrey Conditionalization has been that it is not indifferent to the order of evidence in the way that Conditionalization is (Proposition 12). (See, for example, [Levi, 1967b], [Domotor, 1980] and [van Fraassen, 1989] for this complaint.) For example, suppose we use Jeffrey Conditionalization to get q from p and then r from q, using the partition {E, ¬E} both times, with q(E) = x and r(E) = y, x ≠ y. Now reverse the order, using y first and then x. In the first case the final probability of E is y, in the second case it will be x. So order matters. Lange [2000] counters that this kind of order-dependence is not problematic, since it does not correspond to an order-reversal of anything whose order should not matter. For example, the quality of the observations yielding the x and y values in the first scenario will not be the same as in the second. To illustrate, let E be the statement that the raven outside is black, let x = 1/10, and let y = 9/10. Supposing I start with p(E) = 1/2, then the first scenario must be one in which the raven appears pretty clearly non-black at first glance (a shift from 1/2 to 1/10), but then looks very clearly black at a second glance (a shift from 1/10 to 9/10). If we reverse the order of the numbers, however, then the raven must have seemed pretty clearly black on the first glance (1/2 to 9/10), but very clearly non-black on the second (9/10 to 1/10). So reversing the order of the input values does not amount to having the same qualitative experiences in reverse order. Why, then, should we expect order-invariance when we reverse the order of the input probabilities? Another concern about Jeffrey Conditionalization is that it is incomplete in a very important way.
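Both the reduction to Conditionalization (Proposition 17) and the order-dependence just described can be replayed in a small sketch (ours, not from the text; `jeffrey` is an invented helper):

```python
# Jeffrey Conditionalization over the partition {E, notE}: rescale each cell
# so the partition gets the mandated new probabilities, keeping conditional
# probabilities within each cell rigid.
def jeffrey(p_joint, new_E):
    """p_joint maps (hypothesis, evidence) pairs to probabilities."""
    pE = sum(v for (h, e), v in p_joint.items() if e == 'E')
    scale = {'E': new_E / pE, 'notE': (1 - new_E) / (1 - pE)}
    return {w: v * scale[w[1]] for w, v in p_joint.items()}

p = {('H', 'E'): 0.3, ('H', 'notE'): 0.2,
     ('notH', 'E'): 0.2, ('notH', 'notE'): 0.3}

# Proposition 17: with new_E = 1 this is ordinary conditionalization.
q = jeffrey(p, 1.0)
print(q[('H', 'E')])  # 0.3 / 0.5 = 0.6

# Order dependence: x = 1/10 then y = 9/10 ends with p(E) = 9/10;
# the reverse order ends with p(E) = 1/10.
r1 = jeffrey(jeffrey(p, 0.1), 0.9)
r2 = jeffrey(jeffrey(p, 0.9), 0.1)
print(sum(v for (h, e), v in r1.items() if e == 'E'))  # ~0.9
print(sum(v for (h, e), v in r2.items() if e == 'E'))  # ~0.1
```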
Without some supplementary rule telling us which partition an experience bears on, and what probabilities on that partition the observation warrants, we cannot apply the rule. Jeffrey Conditionalization needs a partition and a distribution over it as inputs, and we haven’t been told how to select these inputs. It’s worth noting that a similar problem afflicts Conditionalization, since it does not specify when a proposition counts as new evidence, and should thus be conditionalized on. The problem is especially acute for Jeffrey Conditionalization, however, since it is nearly vacuous without any constraints on what the inputs should be. If we may select any inputs we like, then we can get any q from a given p. We just use the set of singletons in Ω as the partition and use as input probabilities whatever probabilities we want to end up with for q.16 Field [1978] approached this problem, sketching a rule for assigning input probabilities to experiences. Garber [1980] showed that Field’s proposal had the unwelcome consequence that repetition of the same experience could boost a hypothesis’s probability without bound, even though it seems no real information should be gained from redundant observations. Wagner [2002] shows that Field’s proposal is actually the only one that will make Jeffrey Conditionalization order-invariant on experiences (though Wagner actually takes his result to vindicate Jeffrey Conditionalization in the face of the order-invariance objection). Christensen [1992] worried that Jeffrey Conditionalization might be in tension with the epistemological doctrine of holism, according to which a belief’s empirical justification is always sensitive to background assumptions. He points out that the tension becomes an outright clash when Jeffrey Conditionalization is supplemented by Field’s rule. Weisberg [2009] argues that, in light of Wagner’s result, Christensen’s worry about holism becomes a dilemma for Jeffrey Conditionalization: Wagner shows that Field’s rule is the only way to make Jeffrey Conditionalization commutative, but Field’s rule is anti-holistic, so Jeffrey Conditionalization cannot satisfy both commutativity and holism. An entirely different sort of concern about Jeffrey Conditionalization is that it may not be general enough. Does the rule apply in every case where we need to change our probabilities? Let’s set aside cases where we lose information rather than gain new information, say through memory degradation, cognitive mishap, or loss of self-locating information as in the Sleeping Beauty puzzle [Elga, 2000]. Even setting aside all these kinds of probability changes, there are alleged cases where evidence is gained but cannot be treated by Jeffrey Conditionalization. Perhaps the most famous is van Fraassen’s Private Judy Benjamin Problem [van Fraassen, 1981]. Judy Benjamin is dropped from a plane in the middle of a field, parachuting to the ground. The field is divided up into four sectors of equal area — NW, NE, SW, and SE — and Judy thinks the probability that she is in a given quadrant is 1/4 in each case. She radios base to describe her surroundings and they tell her that they cannot say whether she is in the North or the South, but they can say that, if she is in the North, the odds are 3:1 that she is in the East.

16 For Ω’s larger than R, we may not be able to get just any q from a given p, but we will still have a tremendous amount of freedom.
What should Judy’s new probabilities be? The trouble is supposed to be that what Judy has learned is a conditional probability, p(NE|N) = 3/4, rather than a distribution over a partition. So Jeffrey Conditionalization does not apply. To handle this sort of problem we might look for a more general rule (see the next section). But that might be too hasty. Arguably, the problem is only apparent, arising from an overly simplistic representation of Judy’s probabilities. What Judy has really learned, after all, is that home base reported a conditional probability p(NE|N) = 3/4. However Judy conceives of those probabilities — as home base’s degrees of belief, as physical chances, or whatever — she is only disposed to accept them as her own in so far as she trusts home base’s opinions. But then her acceptance of a conditional probability of p(NE|N) = 3/4 is based on the evidence that home base reported that they thought 3/4 to be the correct conditional probability. So Judy should just conditionalize her prior probabilities on the fact that home base made the report they did. There is no need for a new rule that takes conditional probabilities as inputs [Grove and Halpern, 1997]. Richard Bradley [2005] offers other examples where the evidence is, allegedly, properly represented as a conditional probability, thereby stumping Jeffrey Conditionalization. Bradley’s examples may be amenable to similar treatment, but they may not be.
3.5 Infomin

If we are not satisfied with the Conditionalization-based treatment of the Judy Benjamin problem, or we think that other examples show a need to be able to take conditional probabilities as inputs, then we should look for a more general rule. The best-known proposal is that of information minimization, or Infomin:

Infomin If your initial probability function is p and your evidence mandates a probability function in the set S, adopt the probability function q ∈ S that minimizes the quantity

H(p, q) = Σi∈Ω pi log(pi / qi),
where pi = p({i}) and similarly for qi. Infomin has the virtue of being extremely general. Not only could we constrain S by insisting on a particular conditional probability, we could also constrain it by insisting on unconditional probabilities over a non-exclusive set of propositions, or even by insisting on a set of expectation values.17 Notice, though, that Infomin is incomplete in the same way that (Jeffrey) Conditionalization is, since it does not say how a particular experience, observation, or bit of evidence fixes a constraint on S. Just as those rules need to be supplemented with a rule determining the inputs, so does Infomin. Why adopt Infomin? The usual advertisement is that Infomin makes the minimal changes in p necessary to meet the constraints imposed by the evidence. H(p, q) is commonly regarded as a measure of the difference in information between p and q, and p being your existing opinions, you should seek to preserve them as much as possible while still respecting your new evidence. Also to its credit, Infomin has the nice property of agreeing with Conditionalization and Jeffrey Conditionalization when they apply [Williams, 1980]. PROPOSITION 18. If S contains just those probability distributions on (Ω, σ) that assign the values xi over the partition {Ei}, then H(p, q) is minimized when q(·) = Σi xi p(·|Ei). But Infomin is not the only generalization that agrees with (Jeffrey) Conditionalization, nor is H the only way to measure the distance between probability functions. As a number of authors have pointed out, H isn’t even symmetric — H(p, q) ≠ H(q, p) — so it’s not a proper metric. Variational distance (a.k.a. Kolmogorov distance), defined

δ(p, q) = (1/2) Σi∈Ω |pi − qi|,

is a more standard way of measuring the distance between functions, one that does satisfy the definition of a metric.

17 For a nice discussion of exactly how general Infomin is, see [Williams, 1980].
PROPOSITION 19. For any probability functions p, q, and r, δ(p, p) = 0, δ(p, q) = δ(q, p), and δ(p, q) + δ(q, r) ≥ δ(p, r). If we minimize δ(p, q) instead of H(p, q), we have a rule that still agrees with (Jeffrey) Conditionalization but sometimes disagrees with Infomin. Interestingly though, the posterior probability delivered by Jeffrey Conditionalization is not always the unique one that minimizes δ, whereas Jeffrey Conditionalization does uniquely minimize H. For discussion of these concerns and the options in this area, see [Diaconis and Zabell, 1982] and [Howson and Franklin, 1994].
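The asymmetry of H and the metric behavior of δ are easy to confirm numerically; the following is an illustrative sketch of ours, not from the text:

```python
# H(p, q) = sum_i p_i log(p_i / q_i) is not symmetric, while variational
# distance delta is: delta(p, q) = (1/2) sum_i |p_i - q_i|.
import math

def H(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def delta(p, q):
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

p = [0.7, 0.2, 0.1]
q = [0.3, 0.3, 0.4]

print(H(p, q), H(q, p))          # different values: H is asymmetric
print(delta(p, q), delta(q, p))  # equal (0.4): delta is symmetric
```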
3.6 Principles of Indifference
The rules suggested so far still leave tremendous room for inter-subjective variation. When faced with a coin about to be flipped, we can assign any probability to heads we want (except 1 and 0, thanks to Regularity), provided we just assign one minus that probability to tails. While Conditionalization and its successors determine your new probabilities given the ones you start with, what probabilities you may start with is still pretty much a free-for-all. Countable Additivity and Initial Regularity just don’t rule out very much. But our next proposal aims to rectify this problem, specifying a unique starting distribution.

Discrete Indifference

The classic principle for assigning initial probabilities is the Principle of Indifference,18 which says that you should assign the same probability to each possibility in the absence of any (relevant) evidence.

Principle of Indifference Given a finite outcome set Ω with cardinality N, if you have no (relevant) evidence then assign p(ωi) = 1/N to each singleton ωi ⊆ Ω.

When Ω is countably infinite, we can extend the principle to mandate p(ωi) = 0, assuming we are not endorsing Countable Additivity. But then the Principle of Indifference does not determine all of p, since we cannot determine the probabilities of infinite sets by additivity. And if we do accept Countable Additivity, then the principle has no application for a countable Ω. The qualifier “in the absence of any (relevant) evidence” is deliberately vague and ambiguous.19 There are at least two interpretations to be considered. First, the principle might be viewed as a way of assigning ur-probabilities, the probabilities one ought to assign absent any evidence whatsoever, and which ought to determine one’s probabilities when one does have evidence by updating on the total evidence. This interpretation is troubled by the fact that the full space of possibilities we are aware of may be too large for the Principle of Indifference to apply. There is a formulation for outcome spaces of cardinality |R|, which we will consider momentarily (section 3.6). Yet we know that there are more than |R| possible ways our universe could be. To every element of P(R) there corresponds the possibility that precisely its members describe the physical constants of the universe. Most of these possibilities are highly improbable, but they are certainly possible. To my knowledge, the Principle of Indifference has not been extended to larger-than-|R| outcome spaces. A second interpretation makes crucial use of the parenthetical “relevant” in “(relevant) evidence”. On this interpretation, the Principle of Indifference is not to be applied to the grand-world problem of assigning ur-probabilities to the total space of possibilities. Rather, it applies to agents who have evidence, but whose evidence is not relevant to a certain partition; that is, their evidence does not favor any one element of the partition over any other (or maybe doesn’t favor any of them at all). In that case, Ω represents the partition in question, and the Principle of Indifference tells us to assign the same probability to each ωi. A concern for this interpretation is that the principle becomes vacuous. It tells us to assign the same probability to ω1 and ω2 when our evidence doesn’t favor either one over the other, but what does it mean for our evidence not to favor either one, if not that they should have the same probability?20 Use of the Principle of Indifference in actual practice tends to be more in the spirit of the second interpretation. Statements of the principle commonly include the “relevant” qualifier or something similar, and the outcome spaces to which the principle is applied are typically finite or continuous, and do not represent the full space of possibilities in maximal detail. Such applications might still be consistent with the first interpretation: evidence that is not “relevant” to a finite or continuous partition might be understood as evidence which, once the ur-probabilities are conditionalized on it, yields uniform probabilities over the partition in question. Nevertheless, we will proceed to look at applications of the principle with the second interpretation in mind. For toy cases like the flip of a coin or the roll of a die, the Principle of Indifference looks quite sensible. If you have no information about the coin, the principle will tell you to assign a probability of 1/2 to both heads and tails. And for the die it will tell you to assign 1/6 to each face. But some cases yield terrible results. To illustrate, suppose that the coin is to be flipped ten times. The possible outcomes here are the 2^10 possible 10-bit sequences of the sort HTTTHHTTTH.

18 Originally it was called ‘The Principle of Insufficient Reason’, but Keynes renamed it in an attempt to make it sound less arbitrary.

19 It is also something of a modernism: early statements of the principle, most notoriously those of Laplace and Bernoulli, spoke of “equipossible” outcomes, rather than outcomes for which relevant evidence is lacking. See [Hacking, 1971] for an excellent historical discussion of the notion of equipossibility and the classic formulation’s connections to the modern one.

20 This concern goes back at least as far as [von Mises, [1928] 1981] and [Reichenbach, 1949].
The Principle of Indifference instructs you to regard each such sequence as equally likely. If you do, what will be the probability that the last toss will come up heads given that the first nine do? Let Hi mean that the i-th toss was a heads. Then

p(H10 | H1−9) = p(H1−10) / p(H1−9) = (1/2^10) / (1/2^9) = 1/2.
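This computation generalizes to any number of flips, as a brute-force check shows (our illustration, not from the text):

```python
# Under the uniform distribution over n-flip sequences, the probability of
# heads on the last flip, given heads on all earlier flips, is always 1/2.
from itertools import product

def p_last_heads_given_initial_heads(n):
    seqs = list(product('HT', repeat=n))   # each sequence has probability 1/2^n
    first = [s for s in seqs if all(c == 'H' for c in s[:-1])]
    both = [s for s in first if s[-1] == 'H']
    return len(both) / len(first)

for n in (2, 5, 10):
    print(n, p_last_heads_given_initial_heads(n))  # always 0.5
```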
In fact, that probability will be the same no matter how many times the coin is to be flipped and how many times it comes up heads. So the Principle of Indifference enforces a kind of stubborn anti-inductivism. This result led Carnap [1950] to prefer a different application of the principle, where we assign the same probability to each hypothesis of the form “m heads and n tails”, and then divide that probability equally amongst the possible sequences instantiating that hypothesis. This yields a much more sensible probability distribution, one that starts out assigning 1/2 to each Hi but increases that estimate as more and more heads are observed. Another approach to the same problem, proposed by Jon Williamson [2007], appeals to implicit background information encoded in our linguistic formulation of the problem. The Principle of Indifference is only supposed to apply when we have no relevant evidence, and Williamson’s idea is that we do have this kind of evidence in these cases, since we know that H10 would be another instance of the same property, heads, as is instantiated 9 times in H1−9. This background information rules out the distribution that assigns the same probability to each possible sequence of heads and tails. (Exactly what this knowledge rules out and how it does so takes us into details of Williamson’s proposal that I won’t go into here.) If our implicit, linguistic knowledge rules out the uniform distribution mandated by the Principle of Indifference, what probabilities should we assign? There is a generalization of the principle that we can apply in its stead, called the Principle of Maximum Entropy, or MaxEnt:

MaxEnt If your evidence is consistent with all and only those probability distributions in the set S, then use the probability distribution p that maximizes the quantity

−H(p) = −Σi∈Ω pi log(pi).
MaxEnt is strongly reminiscent of Infomin (section 3.5), and in fact employs the same basic idea. H was supposed to measure information, and entropy defined as −H is its opposite. Minimizing information is the same as maximizing entropy. The difference is that, whereas before we sought to minimize the change in information relative to our initial probabilities, now we are setting our initial
probabilities by minimizing the amount of absolute information. MaxEnt just applies Infomin to find the distribution that is informationally closest to assigning 1 to each {i}. (We have to take it on faith that minimizing information relative to this assignment amounts to minimizing information absolutely.) MaxEnt is a generalization of the Principle of Indifference, selecting the uniform distribution recommended by indifference when S includes it. PROPOSITION 20. Suppose p is the uniform distribution on (Ω, σ) and p ∈ S. Then p maximizes −H on S. But we can also apply MaxEnt when the uniform distribution isn’t in S. To solve the problem we are facing, that the uniform distribution can be stubbornly antiinductive, Williamson uses Bayes-nets and the linguistic structure of the problem to rule the uniform distribution out of S, and then applies MaxEnt. The end result is not the uniform distribution but, rather, the “straight-rule”. This distribution says that the conditional probability of Hi is the frequency with which heads has been observed so far. The straight-rule is actually not such a great result to end up with. It assigns p(H2 |H1 ) = 1, so after observing just a single heads we conclude with certainty that the next flip will be heads too. But we might be able to give a formula for narrowing S down that yields a more sensible distribution. There is a deeper problem that deserves our attention instead. Williamson appeals to knowledge implicit in the syntactic features of our language in order to shape S. It’s supposed to be the match between the predicates we use to formulate H10 and H1−9 that tells us that these propositions are inductively related. But grounding induction and probability in syntax is notoriously problematic, as Goodman showed and we saw in section 2.1. If the agent uses a gruesome language to describe her hypotheses, she will end up with a different S and hence a different distribution. 
Williamson might not be troubled by this, since he might think that an agent who speaks a gruesome language has different implicit knowledge and so should assign different probabilities. But, thanks to Goodman, we have both ‘green’ and ‘grue’ in our language, so we can’t use our language to shape S according to Williamson’s recipe. Clearly Williamson will want us to focus on the ‘green’ fragment of our language when we apply his recipe, but then it’s plain that it isn’t really our language that encodes the implicit knowledge that leads us to project ‘green’ instead of ‘grue’. Indeed, it’s hard to see what implicit knowledge Williamson could be appealing to, except maybe our knowledge that heads is very likely given lots of heads initially, that green emeralds are more likely than grue ones, and so on.

Continuous Indifference

What if Ω is uncountable? We can extend the Principle of Indifference to cover this case if we have a continuous parameterization of Ω, i.e. if we have a 1-1 map from Ω onto an interval of reals so that each element of the interval represents an
element of Ω. Supposing we have such a parameterization, indifference takes the following form:

Principle of Indifference (Continuous Version) Suppose you have no evidence relevant to Ω, and Ω is parameterized on the interval from a to b (open or closed). Then the probability of a subset is the area under the curve f(x) = 1 on that subset, divided by the total area on the interval, b − a. Formally,

p(S) = (1/(b − a)) ∫S 1 dx.

For sub-intervals, their probability will be their length divided by (b − a). For single points the probability will be 0, and likewise for any countable set of points.21 In general, when the probabilities over a real parameter are encoded by the areas under a curve like f(x) = 1, the encoding function f is known as a probability density. The Principle of Indifference tells us to use the uniform density function, f(x) = 1, on the interval in question, since it treats each possible value the same. Now we know how to apply the Principle of Indifference given a parameterization. But how do we pick a parameterization? Does it matter? A famous puzzle of Bertrand’s [[1888] 2007] shows that the choice of parameterization does matter to the probabilities we end up with. Here’s a nice illustration from van Fraassen [1989]. Suppose a factory makes cubes, always between 1 and 3 inches on a side. What is the probability that the next cube to come off the line will be between 2 and 3 inches on a side? If we apply the Principle of Indifference to the parameter of side-length, the answer is

(1/2) ∫_2^3 1 dx = 1/2.

But notice that the volume of the cube is another possible parameterization. If we use that parameterization instead, the Principle of Indifference gives us

(1/26) ∫_8^27 1 dx = 19/26.

So we get different probabilities if we use different parameterizations. To resolve this problem, Jaynes [1968; 1973], developing an idea due to Jeffreys [[1939] 2004], suggested a variation on the Principle of Indifference.
Rather than using a uniform density on the grounds that it treats each possible value the same, we should look for a density that treats each parameterization the same. That is, we want a density that is invariant across all parameterizations of the problem.

21 For some sets, known as Vitali sets, the probability is not defined at all, and so they can’t be included in the σ-algebra implicitly under discussion. To be explicit, the σ-algebra for which p is defined here is the family of Borel sets in the interval from a to b, i.e. those sets that can be obtained from subintervals by countable union and intersection. It is a basic consequence of measure theory that there must be Vitali sets, assuming the axiom of choice. For an explanation why, see any standard textbook in measure theory, such as [Halmos, 1974].
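The cube factory computations, and the invariance Jaynes is after, can be replayed numerically (a sketch of ours, not from the text):

```python
# The cube factory: uniform over side-length vs uniform over volume disagree.
import math

p_by_length = (3 - 2) / (3 - 1)      # side s uniform on [1, 3]: 1/2
p_by_volume = (27 - 8) / (27 - 1)    # volume s^3 uniform on [1, 27]: 19/26
print(p_by_length, p_by_volume)      # 0.5 vs ~0.7308: they disagree

# A density proportional to 1/x gives the same answer under both
# parameterizations, since its integrals are differences of logs:
p_inv_length = math.log(3 / 2) / math.log(3)
p_inv_volume = math.log(27 / 8) / math.log(27)
print(p_inv_length, p_inv_volume)    # both ~0.3691
```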
In the cube factory example, the only density that is invariant between the length and volume parameterizations is f(x) = 1/x.22 Recall that

∫_a^b (1/x) dx = log(b) − log(a).

So in terms of side-length, the probability of an interval [a, b] will be

[log(b) − log(a)] / [log(3) − log(1)].

Recall also that log(x^n) = n log(x). So when we convert from side-length to volume, the probability of that same range of possibilities will be

[log(b^3) − log(a^3)] / [log(3^3) − log(1^3)]

or

3[log(b) − log(a)] / 3[log(3) − log(1)].

So changing the parameter from length to volume doesn’t change the probabilities encoded by f(x) = 1/x. In general, because areas under f(x) = 1/x are given by log(b) − log(a), and log(x^n) = n log(x), any transformation of x to x^n will leave the probabilities invariant, since the n’s will divide out. Something similar happens if we transform x into nx, except that the n’s will subtract out, since log(nx) = log(x) + log(n).

But for Jaynes’s proposal to work in general, there must always be a unique density that leaves the probabilities invariant across all parameterizations. Unfortunately, it isn’t so. A famous Bertrand-style problem due to von Mises [[1928] 1981] gives an example where there is no density that will yield invariant probabilities across all plausible parameterizations. Suppose we have a container with 10cc of a mixture of water and wine; at least 1cc is water and at least 1cc is wine. What is the probability that at least 5cc is wine? Consider two parameterizations: the ratios of water-to-wine and of wine-to-water. As before, f(x) = 1/x is the only density that is invariant between these two parameterizations, and the probability it gives is

p([1, 9]) = [log(9) − log(1)] / [log(9) − log(1/9)] = 1/2.
But what if we move to a third parameterization, wine-to-total-liquid? In this case we get a different answer:

p([1/2, 9/10]) = [log(9/10) − log(1/2)] / [log(9/10) − log(1/10)] ≈ .268.

22 Actually, any constant multiple of f(x) = 1/x will be invariant, but they will all yield the same probability distribution, since the constant will divide out.
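The conflict can be checked numerically. The following sketch (our own illustration; the helper name log_uniform_prob is ours) computes the probabilities that the invariant density f(x) = 1/x assigns under different parameterizations:

```python
import math

def log_uniform_prob(lo, hi, a, b):
    """Probability of [lo, hi] under the density f(x) = 1/x,
    normalized to the parameter range [a, b]."""
    return (math.log(hi) - math.log(lo)) / (math.log(b) - math.log(a))

# Cube factory: the interval [1.5, 2] in side-length corresponds to
# [1.5**3, 2**3] in volume, and f(x) = 1/x gives both the same probability.
p_length = log_uniform_prob(1.5, 2, 1, 3)
p_volume = log_uniform_prob(1.5**3, 2**3, 1**3, 3**3)
assert abs(p_length - p_volume) < 1e-9

# von Mises: wine-to-water ratio runs over [1/9, 9]; "at least 5cc wine" is [1, 9].
p_ratio = log_uniform_prob(1, 9, 1/9, 9)

# Wine-to-total-liquid runs over [1/10, 9/10]; "at least 5cc wine" is [1/2, 9/10].
p_fraction = log_uniform_prob(1/2, 9/10, 1/10, 9/10)

print(round(p_ratio, 3), round(p_fraction, 3))  # 0.5 0.268
```

The cube-factory invariance holds for any interval and any power transformation, but the two answers for the wine puzzle disagree, which is the point of the example.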
Varieties of Bayesianism
Since f(x) = 1/x is the only density invariant across the first two parameterizations, there cannot be another density that would be invariant across all three.

Jaynes was well aware of the water-wine puzzle, and even mentions it towards the end of his seminal 1973 essay on the invariantist approach, The Well-Posed Problem:

There remains the interesting, and still unanswered, question of how to define precisely the class of problems which can be solved by the method illustrated here. There are many problems in which we do not see how to apply it unambiguously; von Mises’ water-and-wine problem is a good example [. . . ] On the usual viewpoint this problem is underdetermined; nothing tells us which quantity should be regarded as uniformly distributed. However, from the standpoint of the invariance [approach], it may be more useful to regard such problems as overdetermined; so many things are left unspecified that the invariance group is too large, and no solution can conform to it [Jaynes, 1973, p. 10].

Jaynes’s view seems to be that there is no answer to the problem as posed, but there might be a solution if the problem were posed in such a way as to specify a class of parameterizations for which there is an invariant distribution. But what the correct probabilities are cannot depend on the way the problem is posed. Whether we’re talking about the degrees of belief we ought to have, or what the logical or evidential probabilities are, how could they be relative to how the problem is phrased? Maybe what Jaynes has in mind is not the way the problem is phrased, but what assumptions we bring to the table in an actual case, and thus which parameterizations we are entitled to regard as on a par. But the whole point of the von Mises problem is that we don’t bring any assumptions to the table except the ones stated. So the three parameterizations discussed are all legitimate.
It simply won’t do to say that there is no fact of the matter what the probabilities are, or that the probabilities are assumption-relative. The whole point of the Principle of Indifference is to tell us what the correct probabilities are given no (relevant) assumptions, and either of these responses is simply a non-answer.

Summary of Principles of Indifference

In the discrete case, we saw that the probabilities assigned by the Principle of Indifference were badly anti-inductive unless we fiddled with the representation, first partitioning the space of possibilities coarsely according to frequencies. We also saw that the probabilities we got from Williamson’s recipe depended on the language in which the space of possibilities was characterized, making the principle susceptible to grueification. And a similar representation-sensitivity troubled us in the continuous case. The variable by which the possibilities are parameterized affects what probabilities the Principle of Indifference assigns.

The bottom line looks to be that the Principle of Indifference is inappropriately sensitive to the way the space of possibilities is represented. To get unique answers
from the principle, we need a recipe for choosing a partition or parameterization. But there is no obvious way of doing this, except to pick the one that respects the symmetry of our evidence, e.g. the finite partition such that our evidence favors no one element over any other. And such recipes threaten to rob the Principle of Indifference of its content (but see [White, 2009] for a contemporary defense).
3.7 Principles of Reflection

The Principle of Indifference is ambitious but checkered. Let’s turn our attention to a more modest principle, one with a much narrower focus: probabilities of future probabilities.

Suppose you are at the track just before a race. Assuming you will be sure that your horse will lose the race as it comes around the final bend, what should you think about your horse’s chances now, before the race starts? It would be odd to be confident that your horse will win, given that you will have the opposite attitude as the race draws to a close. So we might adopt the following general principle:

Reflection Principle Let p represent your current probabilities, and pt your probabilities at some future time, t. Then for each A you should have p(A|pt(A) = x) = x.

An immediate and important consequence of Reflection is

PROPOSITION 21. If p satisfies the Reflection Principle then, for each A,

p(A) = Σi xi p(pt(A) = xi).
Thus, if you obey Reflection, your current probability for A must be the expected value of the probabilities you may come to have at future time t. So to speak, your current probability must match the probability you expect to have in the future. I’ve cast Reflection’s motivation and formulation in terms very close to the degree of belief interpretation, since that’s the context in which it was first advanced by van Fraassen [1984], and in which it is usually discussed. It is also the context in which it is most controversial. As many have pointed out, Reflection forces you to follow your future degrees of belief now, even if you are well aware that they will be unreasonable or misinformed. Even if you know that you will be drunk [Christensen, 1991], or that you will have forgotten things you know now [Talbott, 1991], Reflection still insists that you adopt those unreasonable/uninformed opinions. Reflection demands tomorrow’s failures today. On other interpretations of probability, Reflection may be less troubled. If the future probability in question is logical or evidential, then it may not be subject to the kind of cognitive imperfections that your future degrees of belief are subject to, and so it might be appropriate for today’s probabilities to follow them. Still, future evidential or logical probabilities might depend on what evidence you will have in
the future, and evidence can be lost. Williamson points out that his evidential probabilities can in fact lead to violations of Reflection [Williamson, 2000, pp. 230-237].

Assuming the degree of belief interpretation though, why would anyone endorse such an implausible principle? Van Fraassen pointed out that the same Dutch book argument used to support Conditionalization can be used in support of Reflection (see section 4.1). Some take this as a sign that there is something amiss with the argument, but van Fraassen takes the opposite view, that Reflection is a requirement of rationality. He later defended Reflection against the Talbott and Christensen objections in his [1995] and, perhaps more importantly, offered a more general version of Reflection with a new justification:

General Reflection Your current probability for A must be a mixture of your foreseeable probabilities at any future time t.

General Reflection is essentially a weakening of the idea captured in Proposition 21. Proposition 21 says that your current probability must be a specific mixture of your future probabilities, namely the expected value. General Reflection says that it need only be some mixture — any mixture. That is, p(A) must equal Σi xi yi, where the xi are your foreseeable future probabilities, and the yi are any sequence of positive reals adding up to 1. In fact, this requirement is equivalent to p(A) lying between the greatest and smallest foreseeable values for pt(A).

So why does van Fraassen phrase General Reflection in terms of mixtures? Because he allows for indeterminate probabilities (section 1.2), in which case mixtures do not have this trivial formulation. For simplicity, our discussion will continue in terms of point-probabilities. Because point-probabilities are a special case of indeterminate probabilities, the critical points raised will apply mutatis mutandis to indeterminate probabilities.
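Proposition 21 and the weaker mixture requirement can be illustrated with a small numerical sketch (the numbers are our own invention):

```python
# Reflection via expected future probabilities (illustrative numbers only).

# Suppose at time t you might end up with one of these probabilities for A,
# and you currently assign these probabilities to ending up in each state:
future_probs = [0.2, 0.5, 0.9]   # the x_i: foreseeable values of p_t(A)
weights      = [0.3, 0.5, 0.2]   # your current p(p_t(A) = x_i)

# Proposition 21: Reflection forces p(A) to be the expectation of p_t(A).
p_A = sum(x * w for x, w in zip(future_probs, weights))
print(round(p_A, 2))  # 0.49

# General Reflection only requires p(A) to be SOME mixture of the x_i,
# which (for point probabilities) just means lying in their range:
assert min(future_probs) <= p_A <= max(future_probs)
```

The expectation is one mixture among many; any value between 0.2 and 0.9 would satisfy General Reflection here.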
In support of General Reflection, van Fraassen argues that it is entailed by Conditionalization [van Fraassen, 1995, p. 17]. He claims that any conditionalizer will satisfy General Reflection automatically, so we cannot violate General Reflection without violating Conditionalization. Weisberg [2007] argues that this is in fact not so. Rather, it is that someone who thinks she will Conditionalize (and also thinks that her future evidence forms a partition, and also knows her current conditional probabilities) will satisfy Reflection. But we are not rationally required to satisfy these conditions — indeed, it seems we should not satisfy them — so violating General Reflection does not entail a violation of anything required of us. Van Fraassen also argues that General Reflection, under certain assumptions, entails the original Reflection Principle, which he redubs Special Reflection [1995, pp. 18-19]. And van Fraassen [1999] argues that General Reflection also entails Conditionalization when the foreseeable evidence forms a partition. But Weisberg [2007] argues that this latter claim makes a mistake similar to the one in the alleged entailment from Conditionalization to General Reflection.
More tempered views on the connection between current and future degrees of belief can be found in [Elga, 2007] and [Briggs, forthcoming]. Elga defends the view that your probability for A should match the probability your future self or a third party would have if their judgment were absolutely trustworthy (a “guru”) and they had all the relevant information you have. (This is a rough characterization; see Elga’s paper for an exact statement of the view.) Elga goes on to ask how you should respond to the judgment of those whose judgment you do not trust completely, and addresses the special case where you disagree with an epistemic peer — someone whose judgment you regard as just as good as your own, and who has the same information you do. In this case, Elga argues that you should average your opinion with theirs. Briggs proposes a generalization of the Reflection principle designed to handle cases where your future degrees of belief are of varying trustworthiness and informedness. For other views on these and similar questions, see also [Plantinga, 2000], [Kelly, 2005], and [Christensen, 2007].
3.8 Chance Principles

In section 2.4 we briefly discussed the physical chance interpretation of probability. There I said that, because physical chance is a (meta)physical notion, this interpretation did not yield its own brand of Bayesianism. But Bayesians who do think that there is such a (meta)physical thing as physical chance tend to think that it has important epistemic connections, since what the chances are should influence what the logical probabilities, correct degrees of belief, or evidential probabilities are. What is the exact nature of this connection? The obvious, naive proposal is that your probability for A, given that the physical chance of A is x, should be x:

Miller’s Principle Let p represent a reasonable probability function, and let X be the proposition that the chance of A is x. Then p(A|X) = x.

The principle is jokingly called ‘Miller’s Principle’, since Miller [1966] argued that it is inconsistent. Jeffrey [1970b] argued that Miller’s argument is invalid, and this is now the accepted view. The principle is still problematic though, since it forbids omniscience and foreknowledge. Assuming Conditionalization, someone who learns X must assign probability x to A, even if they have already consulted a perfectly reliable crystal ball that told them whether A will turn out true.

A more canonical view, designed to take account of such problems, comes from Lewis [1980]. Lewis endorses both the degree of belief and physical chance interpretations of ‘probability’, and proposes the following rule connecting them:

Principal Principle Let p be any reasonable initial credence function.23 Let t be any time and let x be any real number in the unit interval. Let X be the proposition that the chance at time t of A’s holding equals x. Let E be any proposition compatible with X that is admissible at time t. Then p(A|XE) = x.

The idea is that your degree of belief in A, on the assumption that the chance at t of A is x, should be x. Unless, that is, you have “inadmissible” information, information that “circumvents” the chances, so to speak. The kind of thing Lewis has in mind as inadmissible is the consultation of a reliable crystal ball, time-traveller, oracle, or any other source of information from the future after t. Of course, admissibility is everything here. As Lewis says, “if nothing is admissible, [the principle] is vacuous. If everything is admissible, it is inconsistent.” [1980, p. 272] Lewis did not give a full account of admissibility, but did identify two very broad kinds of admissible information: (i) information about the history of the world before t, and (ii) information about how chances depend on the history.

These views famously ran into trouble when paired with his metaphysical view that what the true chances are depends only on how the history of the world turns out [Thau, 1994; Lewis, 1994; Hall, 1994; Arntzenius and Hall, 2003], though Vranas [1998] argues that the problem is solvable. Of more general concern are worries raised by Meacham [2005] about the compatibility of Lewis’s view with the chance theories actually used in physics. Meacham argues that Lewis’s view is incompatible with the Aharonov-Bergman-Lebowitz theory of quantum mechanics, as well as classical statistical mechanics. The trouble, in short, is that Lewis’s view makes chances time-relative, and assigns only chances of 0 and 1 if the true physics is deterministic. The Aharonov-Bergman-Lebowitz theory falsifies the first view, and classical statistical mechanics falsifies both. Meacham also argues that the concept of admissibility is actually dispensable if we formulate the principle correctly, which Meacham claims to do. This is good news if true, since no account of admissibility has received general acceptance.

23 What do we mean by “reasonable initial credence function”? Lewis is thinking of p as a credence function which, “if you started out with it as your initial credence function, and if you always learned from experience by conditionalizing on your total evidence, then no matter what course of experience you might undergo your beliefs would be reasonable for one who had undergone that course of experience.” [Lewis, 1980, p. 268]. Thus the Principal Principle should be thought of as a constraint on reasonable “ur-credences” (p. 506), rather than as a constraint on reasonable credences at any given time.

3.9 Summary So Far

We have now surveyed the best-known proposals for rules constraining the range of permissible probability functions. Where do things stand today? Most Bayesians accept that something close to (Jeffrey) Conditionalization is correct, as well as something close to the Principal Principle, and maybe, in certain limited cases, a version of the Principle of Indifference. Because of worries arising from infinitary scenarios, Initial Regularity is less widely accepted, though Countable Additivity seems to do better. Reflection, on the other hand, is almost universally rejected.
A disclaimer: these are my own, informal estimations, and may reflect my own biases and selective exposure to a certain kind of Bayesian. Assuming that these estimations are not completely off though, the picture that emerges is of a fairly subjective trend. Diachronic rules like Conditionalization do a fair bit to fix what your new probabilities should be given what they were before, but there is still lots of room for disagreement about what probabilities one may start with. The Principle of Indifference would eliminate much or all of that latitude, but it is only accepted by a few Bayesians and often in only a limited way. Countable Additivity and Initial Regularity, as mentioned earlier (p. 505), do very little to constrain the range of permissible probability functions, and the Principal Principle, though more substantial, is still quite permissive.

Some Bayesians probably think that all that leeway is actually not allowed, and we just haven’t yet found the correct formulation of the principles that disallow it. Others, however, think that such disagreement is just an inevitable fact of life, and does not demonstrate the incompleteness of the Bayesian theory so far developed. The latter view is certainly counterintuitive at first blush, and whether it can be made more palatable depends on questions that go well beyond the scope of our discussion here. What one thinks about the nature of rationality, and about the aims of logic and epistemology, will heavily inform one’s conceptions about objectivity. So we must leave off the subject of the subjective-objective continuum here, and move on to survey justificatory arguments for the foregoing rules.
4 JUSTIFICATIONS

We will look at four kinds of justificatory argument for the various rules we have considered. The first kind, Dutch book arguments (DBAs), are supposed to justify Probabilism, (Jeffrey) Conditionalization, Reflection, and Regularity. The second and third kinds, representation theorem arguments and axiomatic arguments, apply only to Probabilism. The fourth and final kind, accuracy/calibration/cognitive utility arguments, have been used to support both Probabilism and Conditionalization.
4.1 Dutch Book Arguments
Dutch book arguments (DBAs) have been given for Probabilism, Countable Additivity, Conditionalization, Jeffrey Conditionalization, Reflection, and Regularity. The crux of these arguments is that, if the probabilities violate the rule in question, they will regard as fair a collection of deals that, as a matter of deductive logic, will lead to a sure loss. Since it is supposed to be unreasonable to regard as fair an obviously necessarily losing deal, it would be irrational to violate the rule in question. Let’s look at the individual arguments.
The DBA for Probabilism

To get the argument going we need to establish a connection between probabilities and fairness. The standard connection assumed is

Probabilities Are Fair Odds If the probability of A is x, then a bet that pays $S if A is true, pays $0 otherwise, and costs $xS, is fair.

Usually the interpretation of probability assumed for DBAs is the degree of belief interpretation. Indeed, as we saw above (section 2.2), de Finetti took Probabilities Are Fair Odds to be a definitional truth about degrees of belief, since your willingness to pay at most $x for a $1 bet on A is what makes it the case that your degree of belief in A is x. For others, Probabilities Are Fair Odds may be a law of psychology, or a conceptual truth about the commitments you implicitly make when adopting a probability. As a matter of convenience and tradition, our discussion of DBAs will occasionally use the lingo of the degree of belief interpretation, though this is not essential to the formal properties of the arguments.

To represent a bet that pays $S if A is true, pays nothing otherwise, and costs $x, we’ll use the notation [A : S, x]. Define the possible worlds of an outcome space (Ω, σ) as the functions on σ assigning

ωi(A) = 1 if ωi ∈ A, and ωi(A) = 0 if ωi ∉ A,

for each ωi belonging to some member of σ. Then we can say that your net in possible world ωi for the bet Bj = [Aj : Sj, xj] is Sj ωi(Aj) − xj. Your total net on the set of bets {Bj} in world ωi is

Σj [Sj ωi(Aj) − xj].

Finally, we define a finite Dutch book for a real-valued function p on σ as a set of bets {Bj} such that p regards each Bj as fair and yet

∀i : Σj [Sj ωi(Aj) − xj] < 0.

That is, a Dutch book is a set of seemingly fair bets that leads to a net loss no matter what the truth turns out to be. The DBA for Probabilism then turns on the following theorem:

The Dutch Book Theorem Let p be a real-valued function on the outcome space (Ω, σ). If p is not a probability function, then there is a finite Dutch book for p.

To illustrate, suppose p violates (P3) because A ∩ B = ∅ and p(A) = p(B) = .4, but p(A ∪ B) = .7. Then these bets make a Dutch book:

B1 = [A : 1, .4]
B2 = [B : 1, .4]
B3 = [A ∪ B : −1, −.7]
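The sure loss can be verified by brute force. Here is a quick check (our own illustration) that these three bets lose money in every possible world:

```python
# A bet [A : S, x] nets S*w(A) - x, where w(A) is 1 if A is true and 0 otherwise.

bets = [
    ("A",    1.0,  0.40),   # B1 = [A : 1, .4]
    ("B",    1.0,  0.40),   # B2 = [B : 1, .4]
    ("AorB", -1.0, -0.70),  # B3 = [A ∪ B : −1, −.7] (you play bookie here)
]

# A ∩ B = ∅, so there are three relevant worlds.
worlds = {
    "A true":  {"A": 1, "B": 0, "AorB": 1},
    "B true":  {"A": 0, "B": 1, "AorB": 1},
    "neither": {"A": 0, "B": 0, "AorB": 0},
}

nets = {name: sum(S * w[prop] - x for prop, S, x in bets)
        for name, w in worlds.items()}

for name, net in nets.items():
    print(f"{name}: {net:+.3f}")  # -0.100 in every world
```

Whatever happens, the bettor is down exactly $.10, which is just the gap between p(A) + p(B) = .8 and p(A ∪ B) = .7.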
Notice the negative values in B3. These correspond to playing the role of the bookie on that bet instead of the bettor, since you now “pay” −$.70 for the opportunity to “win” −$1. I.e. you accept $.70 as payment and agree to pay out $1 if A ∪ B comes true. Now, when the bets are made, you will pay out $.80 and collect $.70, for a net loss of $.10. But you cannot win your money back. If either of A or B comes true, you will collect a dollar and pay out a dollar. If neither does, no more money changes hands. And because A ∩ B = ∅, they cannot both come true.

It would be unfortunate to get suckered like that, but is obeying (P1)–(P3) any guarantee against it? In fact, it is:

The Converse Dutch Book Theorem If (Ω, σ, p) is a probability space then there is no finite Dutch book for p.

The reason you are immune is that each Bj must have 0 expected value to be fair and so, p being a probability function, the expected value of the whole set must be 0 as well. But that couldn’t be if the net payoff were always negative — then the expected value would have to be negative. So obeying (P1)–(P3) is not only necessary, but is also sufficient, for avoiding a finite Dutch book.

The DBA for Countable Additivity

The DBA for Countable Additivity is much the same, except that now there is an infinite set of bets that leads to a sure loss if you violate the principle, but no such set if you obey. We can even construct the bets in such a way that someone with a finite bankroll could place them all. So there is no worry that you would have to have an infinite amount of money to be suckered, in which case you wouldn’t much care since you could arrange to have an infinite balance left after suffering an infinite loss [Adams, 1964; Williamson, 1999].

The DBA for Conditionalization

The DBA for Conditionalization is a bit different, in that there are two stages to the betting.
To ensure that you incur a net loss, the bookie has to wait and see what happens, possibly offering you a second round of bets depending on how things unfold. To model the situation, fix an outcome space (Ω, σ) and let P be the set of all probability distributions on it. Define an update rule as a function r : P × σ → P. Conditionalization is one such rule. Now define a strategy for the distribution p and the rule r on the partition {Ei} ⊆ σ as a function from {Ei} to ordered pairs of books (i.e. ordered pairs of sets of bets) where the first book is always the same. A strategy is fair for p and r if, whenever it assigns to Ei the books {Bj} and {Cj}, {Bj} is fair according to p and {Cj} is fair according to r(p, Ei). A Dutch strategy for r and p is a fair strategy for r and p that yields a net loss for each Ei.

It is important that strategies are fixed, in that they commit to their second-round bets for each eventuality beforehand. Thus the strategy is as much in the dark about which of its commitments it will end up having to follow through on as
the update rule is. If the strategy is Dutch, then it bilks the update rule without taking advantage of any information that is not available to both parties. Thus it’s especially interesting that we have

The Conditionalization Dutch Book Theorem Suppose p is a probability distribution and r an update rule such that for some E, r(p, E) ≠ p(·|E). Then there is a Dutch strategy for p and r.

The Dutch strategy that vindicates the theorem was concocted by David Lewis and reported in [Teller, 1973]. Let p be your probability function, E some element of σ for which r(p, E) ≠ p(·|E), and q = r(p, E). Define y = p(H|E) − q(H) for some H for which p(H|E) > q(H). We then make {Bj} the following trio of bets:

B1 = [HE : 1, p(HE)]
B2 = [¬E : p(H|E), p(H|E)p(¬E)]
B3 = [E : y, yp(E)]
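The guaranteed loss can be checked numerically. Here is a sketch with made-up values for p and q (our own illustration), writing each bet so that it is fair by the probabilities in force when it is offered:

```python
# Numerical sketch of Lewis's Dutch strategy (illustrative numbers only).
# A bet [A : S, x] nets S*w(A) - x, where w(A) is 1 if A is true, else 0.

pE   = 0.5    # p(E)
pH_E = 0.7    # p(H|E), your conditional probability
qH   = 0.4    # q(H) = r(p, E)(H), your planned post-E probability, != p(H|E)
y    = pH_E - qH                   # y = p(H|E) - q(H) > 0

def net(bets, truths):
    """Total net of a list of bets in a world given by a truth assignment."""
    return sum(S * truths[prop] - x for prop, S, x in bets)

first_round = [
    ("HE",   1.0,  pH_E * pE),         # B1 = [HE : 1, p(HE)]
    ("notE", pH_E, pH_E * (1 - pE)),   # B2 = [notE : p(H|E), p(H|E)p(notE)]
    ("E",    y,    y * pE),            # B3 = [E : y, yp(E)]
]
second_round = [("H", -1.0, -qH)]      # C1 = [H : -1, -q(H)], fair by q

# World 1: E false -- no second round of bets.
w1 = {"HE": 0, "notE": 1, "E": 0}
loss1 = net(first_round, w1)

# Worlds 2 and 3: E true, so C1 is sold; H may be true or false.
w2 = {"HE": 1, "notE": 0, "E": 1, "H": 1}
w3 = {"HE": 0, "notE": 0, "E": 1, "H": 0}
loss2 = net(first_round, w2) + net(second_round, w2)
loss3 = net(first_round, w3) + net(second_round, w3)

print(loss1, loss2, loss3)  # all three come to -y*p(E), here -0.15 (up to float rounding)
```

However things turn out, the bettor is down $yp(E), exactly as the text below claims.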
If E turns out to be false, you lose $yp(E), and the game is over, {Cj} = ∅. If it turns out true, we sell you a fourth bet:

C1 = [H : −1, −q(H)]

Whether or not H turns out true, you’ll find you’ve still lost $yp(E). What Lewis’s strategy does, in effect, is get you to make the same conditional bet on H given E twice, but at different odds.

Lewis’s Dutch strategy is important, but the all-important question whether Conditionalization also immunizes you against Dutch strategies is settled by the following companion theorem [Skyrms, 1987]:

The Converse Conditionalization Dutch Book Theorem Let p be a probability distribution and suppose r(p, E) = p(·|E) for every E. Then there is no Dutch strategy for p and r.

The reason you are immunized by Conditionalization is that any fair strategy for a conditionalizer amounts to a set of bets that is fair relative to p. Whatever bets are in {Cj} can be made into conditional bets on E that will be fair at the outset. But, since p is immune to Dutch books in virtue of being a probability function, no fair strategy can be Dutch. So, as with Probabilism and Countable Additivity, Conditionalization is both necessary and sufficient to immunize you against a certain kind of bilking.

The DBA for Jeffrey Conditionalization

When it comes to Jeffrey Conditionalization things get a bit trickier, since it is no longer clear just how to model an update rule or how a strategy for playing against an update rule should be conceived. Before, a rule just mapped old distributions
and new evidence to new distributions. But now no member of σ can be identified as new evidence to which the rule can respond and in response to which a strategy can offer second-round bets. An analysis of rules and strategies for bilking them is not intractable, but it requires a level of detail that would be inappropriate here. Suffice to say that we can prove, for Jeffrey Conditionalization, theorems respectably analogous to those for Conditionalization. For the details, the reader is referred to Skyrms’s [1987] excellent and careful discussion.

The DBA for Reflection

The Dutch book argument for Reflection is very closely related to the one for Conditionalization. We have

The Reflection Dutch Book Theorem If p(H|pt(H) = x) ≠ x for some H and x, there is a Dutch strategy for p (no matter what r).

At first this result appears to contradict the Converse Conditionalization Dutch Book Theorem. How can you be Dutch booked no matter what r if, supposedly, you can use Conditionalization to immunize yourself against any Dutch strategy? To resolve the apparent conflict, notice that you cannot conditionalize in this scenario if E is pt(H) = x and E turns out true. If you did conditionalize on E when it turned out to be true, you would have pt(H) ≠ x, since pt(H) = p(H|E) = p(H|pt(H) = x) ≠ x, which contradicts E. So if you violate Reflection then you cannot Conditionalize when E turns out true, making you vulnerable to Lewis’s Dutch strategy. This is why the Reflection Dutch Book Theorem holds.

Is there a converse theorem for Reflection? There cannot be. We know from the Conditionalization Dutch Book Theorem that, even if you satisfy Reflection, you can still fall victim to a Dutch strategy by failing to Conditionalize. So obeying Reflection cannot protect you from the possibility of a Dutch strategy. The absence of a converse theorem may not undermine the DBA for Reflection, however.
Obeying Reflection is a necessary part of any overall plan to avoid Dutch strategies by Conditionalizing. If you violate Reflection, you put yourself in a position where you may not be able to execute this strategy. So while obeying Reflection may not be sufficient to immunize yourself, it is a necessary part of a sufficient plan. But for those who, like van Fraassen [1995], want Reflection to supplant Conditionalization, the force of the DBA is undermined. To avoid the Dutch strategy that bilks non-Reflective distributions in general, you would have to adopt Conditionalization as a general policy.

The DBA for Regularity

The DBA for Initial and Continuing Regularity is of a slightly different character from the preceding ones. If you violate either kind of Regularity, there is no Dutch book strictly speaking. But there is a weak Dutch book, one that does not lead you to a sure loss but may lead you to a loss and cannot lead you to a gain. If p(A) = 0, then a bet on A that pays −$1,000,000 if A and costs $0 is fair, since
there is no chance that you will have to pay out the $1,000,000. But unless A is impossible, it is possible that you will end up losing $1,000,000, and yet you stand to gain nothing, so how can the deal be fair? [Shimony, 1955]

Critical Discussion of DBAs

There are numerous points on which to criticize DBAs, and everyone has their favorite criticism. In the interest of brevity I will mention some of the more popular criticisms with references to fuller discussions, and briefly discuss those criticisms that I think are especially instructive.

First, we may contest Probabilities Are Fair Odds for presupposing a false connection between money and value. If I regard losing $.50 as worse than gaining $.50 would be good, the bet [A : 1, .5] will not be fair for me even if p(A) = 1/2. Probabilities Are Fair Odds appears to be false because it assumes that the value of money must be linear, which it needn’t be. But we may just be able to regard money as a stand-in for whatever goods you do value.

Another complaint is that the threat of being Dutch booked is avoidable by simply not accepting the bets, or by betting in a way not in accordance with the probabilities. So probabilities don’t have to obey the relevant axiom for you to avoid being suckered; it’s just the odds at which you bet that have to obey those rules. The usual response to this point is that it’s not the actual threat of being suckered that’s the point. The point is that you would be suckered if you took the bets even though they’re supposed to be fair. How can they be fair if they lead to a sure loss?
But perhaps the argument shows the wrong thing, since it makes an appeal to pragmatic rather than epistemic concerns. While it might be interesting that, as a matter of pragmatic rationality, probabilities must obey the axiom in question, this says nothing about the rules of probability qua rules of epistemic rationality. One response would be to insist that there is no difference between epistemic and pragmatic rationality; or, at least, that epistemic rationality is subservient to pragmatic rationality in such a way that any epistemic rule which, if violated, leads to a violation of pragmatic rationality, is a rule of epistemic rationality. Another response is that being Dutch bookable actually demonstrates an epistemic defect, not a pragmatic one. Specifically, the probabilities that lead to the Dutch book regard the bets as fair even though they lead to a certain loss, which is logically inconsistent. A deal cannot be fair if one party has an advantage, which the bookie surely does since he necessarily walks away with a profit. So probabilities violating the axiom in question are actually logically inconsistent [Howson and Urbach, 1993].

Finally, many of the DBAs make the implicit assumption that, because each bet in a collection is fair individually, the collection of them all together is fair too [Schick, 1986]. This is known as “the packaging assumption”. The packaging assumption is, arguably, false in some cases, so its use here may be illicit. Perhaps more importantly, its use here may beg the question. Consider the example above
(p. 517) where we gave a Dutch book for a violation of (P3). We implicitly assumed that, because the bets

B1 = [A : 1, .4]
B2 = [B : 1, .4]
B3 = [A ∪ B : −1, −.7]
were fair individually, they were fair as a package. But treating B1 and B2 as fair together because they are fair individually is very close to insisting that the bet [A ∪ B : 1, .8] is fair because p(A) = .4 and p(B) = .4, i.e. because (P3) is true. In fact, it seems so close as to rob the argument of any dialectical weight. In the case of Dutch strategies, where the package of bets is distributed between two times, it seems even less reasonable to insist that they are all fair together because they are fair individually. After all, some of the bets are evaluated as fair according to one set of probabilities, and the others as fair relative to a different set of probabilities. There is no single epistemic state or point of view relative to which all the bets are fair, so why think that any epistemic defect has been exposed? [Christensen, 1991] For a fuller discussion of this and other concerns about Dutch strategies, see [Earman, 1992]. Moreover, paradoxical cases involving countable sets of deals seem to tell against the packaging assumption, at least for infinite sets of decisions. Consider the following example from Arntzenius, Elga, and Hawthorne [2004]. God has divided an apple into a countable number of pieces, and tells Eve that she may eat any of the pieces she likes, though she will be expelled from the Garden of Eden if she eats an infinite number of them. Above all else Eve wants to stay in the garden, though she would also like to eat as much of the apple as she can without getting expelled. Now, for any given piece individually, the prospect of taking it is advantageous; no matter which of the other pieces she takes, taking this one too can't worsen her situation and surely improves it. Of course, Eve cannot thereby conclude that taking all the pieces is advantageous. See [Arntzenius et al., 2004] for further discussion of packaging-type reasoning in infinitary scenarios. For a fuller survey of concerns about DBAs, see [Hájek, 2008].
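The sure loss in the (P3) Dutch book above can be checked by brute force. A minimal sketch, assuming the bracket convention used throughout (a bet [E : S, x] costs x up front and pays S if E obtains) and that A and B are disjoint:

```python
# Sure-loss check for the (P3) Dutch book.
# Assumed convention: the bet [E : stake, price] costs `price` and pays
# `stake` if E obtains, so its net payoff is stake - price if E is true
# and -price otherwise. A negative stake and price amount to selling a bet.

def payoff(stake, price, wins):
    """Net payoff of the bet [E : stake, price] given whether E obtains."""
    return (stake if wins else 0) - price

# Possible worlds, described by (A obtains?, B obtains?), with A, B disjoint.
worlds = [(True, False), (False, True), (False, False)]

for a, b in worlds:
    total = (payoff(1, .4, a)            # B1 = [A : 1, .4]
             + payoff(1, .4, b)          # B2 = [B : 1, .4]
             + payoff(-1, -.7, a or b))  # B3 = [A u B : -1, -.7]
    print((a, b), round(total, 2))
```

Whichever world obtains, the package pays out −$0.10, which is exactly the gap between the agent's p(A ∪ B) = .7 and the additive value .8.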
4.2 Representation Theorem Arguments Earlier (p. 488) we discussed Ramsey's use of The Representation Theorem to analyze 'degree of belief'. The same theorem has also been put into service in support of Probabilism. Recall the theorem: Representation Theorem Suppose a subject's preferences obey a set of constraints C (not specified here). Then there is exactly one probability function-utility function pair, (p, u), that represents the agent's preferences, in the sense that she prefers A to B if and only if EU(A) > EU(B). To turn this into an argument for Probabilism, assume that the probabilities and utilities whose existence is guaranteed by the theorem describe the agent's actual
beliefs and desires. Assume also that satisfying the constraints in C is rationally required. Then anyone who satisfies those constraints, as is required, has probabilistic degrees of belief. Thus we have an argument for Probabilism, at least under the degree of belief interpretation of probability. There are essentially two main threads of criticism here. Traditionally, the target has been the claim that the constraints in C are rationally required. Allais [1979] famously devised a decision problem where it seems intuitive to violate Savage's Independence Axiom: Independence Axiom Suppose that acts A1 and A2 yield the same outcome, O, in the event that E is false. Then A1 is preferable to A2 if and only if an act that has the same outcomes as A1, except that some other outcome O′ happens if E is false, is preferable to an act that has the same outcomes as A2 except that it too leads to O′ if E is false. Other, more technical axioms needed to complete the theorem are also a common point of attack, such as Jeffrey's [1965] assumption that the objects of one's preferences form an atomless boolean algebra. In response to these complaints, proponents of the RTA tend to insist that axioms like Independence are correct even if unintuitive in some cases, and that the more technical axioms are in-principle dispensable. Joyce [1999], for example, suggests that we should only expect a rational agent's preferences to be embeddable in the fine-grained sort of preference structure outlined by the axioms. Then the theorem shows that one's state of opinion must agree with some probabilistic state of opinion, though it may not be as thorough and fine-grained as an actual probability function. Perhaps a more damning problem is the one raised in our earlier discussion of The Representation Theorem (p. 488), namely that we cannot say that an agent's actual degrees of belief are the ones the theorem provides.
Actual people do not form their preferences in accordance with expected utility maximization [Kahneman and Tversky, 1979], so why think that the p assigned by the theorem, which is only unique in the sense that it alone is part of an expected utility representation of those preferences, represents anyone's actual probabilities? Christensen [2001] has responded that, while we cannot assume that the p guaranteed by the theorem is your actual degree of belief function, we can say that p should be your actual degree of belief function. If it weren't, your degrees of belief would violate an intuitively plausible normative constraint connecting degrees of belief and preferences: Informed Preference An ideally rational agent prefers the option of getting a desirable prize if B obtains to the option of getting the same prize if A obtains, just in case B is more probable for that agent than A. Meacham and Weisberg [unpublished] show, however, that this is not true. One can satisfy the constraints in C, together with Informed Preference, and not have the degrees of belief assigned by the Representation Theorem. Christensen's reason
for thinking that this is not possible turns on an ambiguity in standard statements of the theorem, but counterexamples to his claim are easy to generate. If p represents the degrees of belief assigned by the theorem, then the function p² will also satisfy Informed Preference, since it preserves the ordinal properties of p. A third criticism often leveled at the Representation Theorem approach is that this argument, like DBAs, conflates pragmatic and epistemic normativity. The constraints on preferences given in C appear to be pragmatic norms, so the argument's conclusion, if it worked, would seem to be that Probabilism is a requirement of pragmatic rationality, whereas epistemic rationality is the topic of interest. As with DBAs, one might respond that there is no difference, or that epistemic rationality is subservient to pragmatic rationality.
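The Meacham and Weisberg counterexample turns on a simple fact: squaring is strictly increasing on [0, 1], so p and p² rank any two events the same way, and Informed Preference only constrains comparative rankings. A quick sanity check of that fact (the event probabilities are hypothetical; p² is of course not additive, which is the point):

```python
import itertools
import random

random.seed(0)
# Hypothetical event probabilities in [0, 1].
probs = [random.random() for _ in range(100)]

# p ranks one event above another exactly when p**2 does, so a constraint
# that only looks at comparative probability cannot tell them apart.
for x, y in itertools.combinations(probs, 2):
    assert (x > y) == (x ** 2 > y ** 2)
print("p and p**2 induce the same comparative ordering")
```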
4.3 Axiomatic Arguments Another approach to justifying Probabilism is to try to derive it from unassailable axioms. The classic result here is due to Cox [1946], who started with a few elementary assumptions that any measure of plausibility should obey, and then proved that any such measure would be isomorphic to a probability function. Here is the full statement of the theorem: Cox's Theorem Let b(·|·) be a two-place, real-valued function on σ × (σ − {∅}) satisfying the following two conditions for any A, B, and C:

1. b(¬A|B) is a function f of b(A|B), and
2. b(AB|C) is a function g of b(A|BC) and b(B|C).

If f and g exist and f is continuous, then there is a continuous 1-1 function i : R → R such that p = i ◦ b is a probability function in the sense that p(·|Ω) satisfies (P1)–(P3), and p(A|B)p(B|Ω) = p(AB|Ω) for any A, B. The idea is that any reasonable numerical representation of conditional plausibility will be a notational variant of a probabilistic representation. It turns out that there are technical difficulties with Cox's theorem. Halpern [1999], for example, argues that there are counterexamples to the theorem, and that the hole can be plugged only at the price of appealing to unmotivated assumptions. There are other theorems to similar effect, e.g. those of Aczél [1966] and Fine [1973]. A common concern with such results, and with the axiomatic approach in general, is that the assumptions needed to nail down the theorems tend to be less plausible or intuitive than the rules of probability are on their own terms. (P1) simply establishes a scale, and (P2) says that logical truths are maximally probable, which seems almost self-evident. The only really substantial and controversial axiom is thus additivity, (P3). And it's hard to imagine that any formal argument deriving it could be based on intuitively obvious assumptions about the way probability should work that are more obvious than additivity itself.
It is also hard to see why the probability function that b is shown to be isomorphic to in these theorems should be regarded as a notational variant. Why, in Cox’s
theorem for example, should p = i ◦ b be regarded as just a re-scaled statement of b?
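The "re-scaled" idea can at least be made concrete. In this toy sketch (all numbers hypothetical), b = p² is a non-additive plausibility measure, yet the continuous 1-1 map i(x) = √x turns it back into an additive probability, making b a notational variant of p in roughly Cox's sense:

```python
# Toy illustration of the rescaling idea: b = p**2 is not additive,
# but i o b is, where i(x) = sqrt(x) is continuous and 1-1 on [0, 1].
import math

p = {"A": .2, "B": .3, "AorB": .5}  # A, B disjoint, so p(AorB) = p(A) + p(B)
b = {e: v ** 2 for e, v in p.items()}  # a rescaled "plausibility" measure

# b itself violates additivity...
assert b["AorB"] != b["A"] + b["B"]

# ...but composing with i recovers an additive probability.
i = math.sqrt
assert abs(i(b["AorB"]) - (i(b["A"]) + i(b["B"]))) < 1e-9
print("i o b is additive, so b is a rescaled probability")
```

The worry in the text is precisely whether such an i-rescaled function deserves to be called a mere notational variant of b.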
4.4 Calibration and Accuracy Arguments
Another approach to justifying Probabilism is to try to show that probability functions are specially situated, veridically speaking — that they do better than any alternative at getting at the truth. This approach has been taken by Rosenkrantz [1981], van Fraassen [1983], Shimony [1988], and Joyce [1998], all of whom aim to show that probabilistic beliefs are in some sense more accurate than non-probabilistic beliefs. Suppose, for example, that we adopt as a measure of inaccuracy the quadratic loss function,

Q(b, ω) = Σ_{A∈σ} (ω(A) − b(A))²,

where b is some real-valued function on σ and ω is, as earlier, a singleton of Ω, with ω(A) = 1 if ω ⊆ A and 0 otherwise. De Finetti then showed the following: PROPOSITION 22. If b is a non-probabilistic function on σ then there is a probability function, p, for which Q(p, ω) < Q(b, ω) for every ω ∈ Ω. Thus non-probabilistic beliefs can always be brought closer to the truth by going probabilistic no matter what the truth may turn out to be. An argument for Probabilism based on De Finetti's result must, however, motivate the use of Q as a measure of inaccuracy. The result can be generalized to any quadratic loss function, where the terms in the sum are given varying weights, but we still want some argument that there are no acceptable measures of inaccuracy that deviate from this general form. Alternatively, we could try to generalize the result even further, which is what Joyce [1998] does. He offers six axioms constraining any reasonable measure of inaccuracy, and then shows that a result analogous to De Finetti's holds for any such measure. Joyce's axioms not only entail that non-probabilistic functions are dominated by probabilistic functions in this way, but also that probabilistic functions are not dominated by non-probabilistic functions.24 Maher [2002] criticizes Joyce's axioms on the grounds that (i) his arguments motivating them are unsound, and (ii) they are too strong, ruling out reasonable measures of inaccuracy. In particular, the simple measure

ρ(b, ω) = Σ_{A∈σ} |b(A) − ω(A)|
is incompatible with two of Joyce’s axioms, and without them the theorem does not follow. Joyce [2009] shows that analogous dominance results can be proved from alternative constraints on inaccuracy measures, and defends these alternatives as reasonable constraints. 24 This is not explicitly shown in Joyce’s [1998], but Joyce has verified this additional result in personal correspondence.
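A concrete instance of Proposition 22 can be computed directly. Here Ω has two worlds, b is a hypothetical non-probabilistic credence function, and the probability p obtained by splitting the difference dominates b by quadratic loss in both worlds:

```python
# De Finetti-style dominance, with assumed numbers: Omega = {w1, w2},
# sigma = all four subsets, b({w1}) = b({w2}) = .3 (non-additive),
# and the probabilistic alternative p({w1}) = p({w2}) = .5.

def Q(f, truth):
    """Quadratic loss of credences f against the truth-values over sigma."""
    return sum((truth[A] - f[A]) ** 2 for A in f)

b = {"empty": 0, "w1": .3, "w2": .3, "omega": 1}  # violates additivity
p = {"empty": 0, "w1": .5, "w2": .5, "omega": 1}  # probabilistic

# Truth-values of each member of sigma at each world.
truth_at = {
    "w1": {"empty": 0, "w1": 1, "w2": 0, "omega": 1},
    "w2": {"empty": 0, "w1": 0, "w2": 1, "omega": 1},
}

for w, truth in truth_at.items():
    print(w, Q(b, truth), Q(p, truth))
    assert Q(p, truth) < Q(b, truth)  # p is strictly closer to the truth
```

Whichever world is actual, p's loss (.5) beats b's (.58), just as the proposition promises.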
4.5 Expected Cognitive Utility and Conditionalization Greaves and Wallace [2006] have given an argument for Conditionalization that is similar in spirit to the preceding one, joining those ideas with Levi's idea of maximizing expected epistemic utility.25 The rule of Conditionalization, they argue, maximizes expected epistemic utility. To make such an argument, of course, we need to say something about the structure of epistemic utility; which actions are likely to lead to epistemically good outcomes depends in large part on what counts as an epistemically good outcome. A particularly interesting class of utility functions is characterized by the following definition: Strong Stability A utility function is everywhere strongly stable if and only if the expected utility of holding any probability function, calculated relative to that same probability function, is higher than the expected utility of holding any other probability function. I.e., u is such that for any p and q,

Σ_i p(Oi | Hold p) u(Oi ∧ Hold p) > Σ_i p(Oi | Hold q) u(Oi ∧ Hold q).

The argument then proceeds from the following theorem: Greaves-Wallace Theorem If u is everywhere strongly stable, then updating by Conditionalization uniquely maximizes expected utility. If we replace the > in the definition of Strong Stability by ≥, we get the set of everywhere stable utility functions, in which case the corresponding theorem is that Conditionalization maximizes expected utility, though not necessarily uniquely — some other update rule might be equally promising. Greaves and Wallace's argument turns on the assumption that epistemic utility is, or should be, everywhere (strongly) stable. On this subject they remark, "we find this rationality constraint plausible, but we offer no argument for it here." [Greaves and Wallace, 2006, p. 626] Admittedly, the constraint has strong pull. An unstable utility function would be one that was empirically biased, in a sense, valuing one state of opinion over another even assuming that the latter one was correct. But I do not know how to turn this fuzzy thought into a proper argument.
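Strong stability is closely related to what the scoring-rule literature calls strict propriety. As an illustrative sketch (the Brier-style utility and the independence simplification are mine, not Greaves and Wallace's: I assume the outcomes are probabilistically independent of which credence function you hold), the negative Brier score has the property that each p expects holding p itself to do better than holding any rival q:

```python
# Strict propriety of the negative Brier score: p maximizes its own
# expected epistemic utility over a sample of rival credence functions.
import random

random.seed(1)

def brier_utility(q, actual):
    """Negative Brier score of credences q when outcome `actual` obtains."""
    return -sum(((1 if i == actual else 0) - q[i]) ** 2 for i in range(len(q)))

def expected_utility_of_holding(q, p):
    """p's expectation of the epistemic utility of holding q."""
    return sum(p[i] * brier_utility(q, i) for i in range(len(p)))

p = [.2, .3, .5]
for _ in range(1000):
    weights = [random.random() for _ in p]
    q = [w / sum(weights) for w in weights]  # a random rival credence function
    if q != p:
        assert expected_utility_of_holding(p, p) > expected_utility_of_holding(q, p)
print("by p's own lights, holding p beats holding any sampled rival")
```

This is only an existence illustration: it shows that everywhere strongly stable utility functions are not empty as a class, not that epistemic utility must be one of them.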
4.6 Summary
Our discussion has touched on at least one argument for each of Probabilism, Conditionalization, Jeffrey Conditionalization, Reflection, and Regularity, with Probabilism and Conditionalization receiving most of the attention. We did not discuss any arguments for Infomin, the Principle of Indifference, and the Principal Principle. This treatment does represent the literature roughly, but not exactly. We mentioned some arguments for Infomin back in section 3.5; and the Principle of Indifference and the Principal Principle can be motivated intuitively. The Principle of Indifference says that, when your evidence is symmetric over a partition, your probabilities should be too. And the Principal Principle says that your credences should match the physical chances, absent any reason to think you "know better". Both are intuitively compelling suggestions that yield plausible results in many cases (indeed, Lewis [1980] motivates the Principal Principle by appealing to a list of its applications).

25 See section 5 for a definition of 'expected utility' and the role of expected utility in decision making; see section 7 for Levi's use of expected epistemic utility.
5 DECISION THEORY
So far we have been looking just at the logical and epistemological faces of Bayesianism. But one of Bayesianism's great virtues is that it lends itself quite readily to theorizing about pragmatic norms — norms of decision making — as well. Assuming that we have a numerical notion of value to work with, Bayesianism gives us an algorithm for deciding what to do when faced with a choice. Let u be a utility function, i.e. a function that assigns to each possible outcome of an act a number reflecting how good or bad that outcome is. Then act A1 is better than act A2 just in case

Σ_i pi ui > Σ_j pj uj,

where the pi are the probabilities of the possible results of A1 and the ui their utilities, and similarly for the pj and uj with respect to the results of act A2. The quantity

Σ_i pi ui

is called the expected utility (EU) of an act, and it serves to weigh the value of each possible outcome of the act against the likelihood that it will occur, summing the results to capture the overall expected "goodness" promised by that act. The decision rule to choose the act with the highest expected utility is called expected utility maximization, or EU-max. The rule to maximize expected utility has near-universal acceptance amongst Bayesians. (Indeed, departures from the EU-max rule tend to be regarded as non-Bayesian almost as a matter of definition.) Disagreements usually arise over the details. There are questions about how to understand and represent acts and outcomes, and there is disagreement about how we should interpret the pi, the probabilities of various eventualities of the act in question. The best way to appreciate the subtleties here is just to follow the historical dialectic that brought them to light. We start with Savage's classic formulation of Bayesian decision theory, move on to the motivations and amendments of Jeffrey's theory, and finish with a look at causal decision theories.
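In code, the EU-max rule is little more than a weighted sum over (probability, utility) pairs. The acts and numbers below are purely hypothetical:

```python
# A minimal expected-utility calculator and the EU-max rule.
# Each act is a list of (probability, utility) pairs over its possible results.

def expected_utility(act):
    return sum(p * u for p, u in act)

acts = {
    "take umbrella": [(.3, 5), (.7, -1)],  # dry if it rains, mild nuisance if not
    "leave it": [(.3, -10), (.7, 1)],      # soaked if it rains
}

best = max(acts, key=lambda a: expected_utility(acts[a]))
for name, act in acts.items():
    print(name, expected_utility(act))
print("EU-max chooses:", best)
```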
5.1 Savage’s Decision Theory Savage [1954] represented a decision problem in terms of three components: states, acts, and outcomes. To illustrate, consider a simple gambling problem, such as deciding whether to bet on red or black in a game of roulette. There are two possible states of the world: the wheel will land on red, or the wheel will land on black. The outcome of each act depends on the true state: if you bet red, landing on red leads to a win and landing on black leads to a loss. Vice versa for betting on black. Thus we may think of acts as functions from states to outcomes. Since the outcomes are what we care about, utilities attach to them. But our ignorance is about which state will actually obtain, so states are the bearers of probability. Thus a decision problem is represented, in Savage’s framework, as a set of states, acts, and outcomes, with probabilities distributed over the states and utilities distributed over outcomes. How do we calculate expected utility in this framework? Given act A, we want to multiply the probability of each state by the utility of the outcome it will result in given act A, p(Si )u(A(Si )), where A(Si ) is the outcome that act A maps state Si to. Thus expected utility in Savage’s framework is EU (A) = p(Si )u(A(Si )). i
In the roulette example, supposing that utility equals money and that our two options are to bet $1 on black or bet $1 on red, the expected utility of betting on red is p(R)u(+$1) + p(B)u(−$1) = (1/2) × 1 + (1/2) × −1 = 0, as expected, and similarly for betting on black. So neither act is superior, as seems right. (In the case of a tie for best, EU-max says that it does not matter which act we choose.) Savage's framework is simple, elegant, and in many cases perfectly adequate. But there are decision problems for which it seems to be inappropriate, as Jeffrey [1965] pointed out with examples like this one.26 Suppose that I park my car in a sketchy neighborhood and am approached by a suspicious character who offers to "protect" it for me while I am about my business; his price is $10. Now suppose I reason as follows. There are two possible states of the world, one in which this suspicious character will smash my windshield and one in which he will not. If I pay him the $10 then the possible outcomes here are one in which my windshield is not smashed and I am out $10, and one in which my windshield is smashed and I am still out $10. If I don't pay him, the possible outcomes are smashed vs. not-smashed, though I am not out $10 in either case. Whatever the probabilities of the states are here, the expected utility of not paying him will clearly be higher on Savage's account, since the outcomes determined by that act are better in each possible state. But this is surely mistaken. Clearly I will end up with a smashed windshield if I don't pay, and may well not if I do. Given the low price of $10 to avoid the extreme badness of having my windshield smashed, I should pay.

26 This particular example is from Joyce [1999].
5.2 Evidential Decision Theory
What Savage’s framework overlooks, and this problem exploits, is that my action is relevant to which state of the world will obtain, smashed or not-smashed. To remedy this problem, Jeffrey proposed that we allow the probabilities of the possible states to vary depending on which act we are considering. In particular, Jeffrey thought that the relevant probabilities were the probabilities conditional on the assumption that we take that act. Once we notice that there is an important probabilistic relationship between actions and states, we also notice that the connection between action-state pairs and outcomes can be probabilistic as well. Even once we fix my act (bet on red) and the state of the world (land on black), there is some chance that I will win $1 anyway, through some oversight on the part of the casino, for example. Thus it looks as though acts, states, and outcomes should all be propositions under the domain of the probability function. In Savage’s framework, states are basic elements in the domain of the probability function, outcomes are basic elements in the domain of the utility function, and acts are deterministic functions connecting these two otherwise distinct domains. In contrast to this, Jeffrey’s system treats acts, states, and outcomes as all the same kind of thing, propositions, which can be arguments for both the probability and utility functions. We can talk about the probability that we will take a given act, the probability that a certain state obtains, and the probability that a certain outcome will result. Similarly for utilities. In fact, Jeffrey’s system doesn’t officially distinguish between acts, states, and outcomes at all. On Jeffrey’s system, a decision problem is just composed of a propositional language, with a probability function and a utility function over it. How do we figure the expected utility of an act in Jeffrey’s system? 
We consider all the ways things could turn out, all the singletons of Ω, the ωi,27 and multiply the probability of each one conditional on doing the act by the utility of that act and that outcome obtaining, then sum up. That is, we calculate the evidential expected utility of A,

EEU(A) = Σ_i p(ωi|A) u(ωi ∧ A)
       = Σ_i p(ωi|A) u(ωi).
Applying this new system to the car-park example, we find that a plausible assignment of utilities and probabilities leads to a higher EEU for paying the $10, since this significantly increases the probability of the vastly superior outcome where my windshield is not smashed. Unlike in Savage's system, the p(ωi|A) vary depending on what A is.

27 Actually, for technical reasons, Jeffrey's system works with an atomless algebra, so there is no Ω, just an ever more finely divisible set of propositions. To keep things simple and more in harmony with our formalism, we'll ignore this feature of Jeffrey's system. We also assume that Ω is countable, so that we can do everything in terms of sums.

Savage was aware of cases like the car-park example, where the probability of the outcomes depends on the act being considered. He thought they should be handled by individuating the possible states in such a way that the states would be independent of the acts. So, for example, we might distinguish:

S1: The windshield will be broken whatever I do.
S2: The windshield will be broken if I don't pay, but not if I do.
S3: The windshield won't be broken if I don't pay, but will be if I do.
S4: The windshield will not be broken regardless.

Then the expected utility calculation comes out as desired on Savage's formula. So if we partition the space of possible states appropriately, we can find a formulation of the problem that gets the right answer. The problem, of course, is that we then need a general theory of how to partition the states. The obvious answer is to use a partition where the states are probabilistically independent of the acts, but acts do not have probabilities on Savage's framework; there is no way to express the thought that p(AS) = p(A)p(S), since the first two terms are undefined. This is why Jeffrey's theory is preferable. Not only does evidential decision theory allow us to express probabilistic relationships between acts and outcomes, but it also gives us the same answer no matter how we set up the problem. The expected utility of a proposition is equal to the expected value of the expected utility over any partition:

PROPOSITION 23. If {Oj} is a partition of Ω, then

EEU(A) = Σ_i p(ωi|A) u(ωi)
       = Σ_j p(Oj|A) EEU(Oj ∧ A).
Thus it doesn’t matter how we partition a problem, contra Savage, provided we plug the right values in for the utilities, namely the evidential expected utilities. This partition-invariance is important because it shows that Jeffrey’s theory is just as applicable to “small-world” problems as it is to the “grand-world” problem of managing your decisions over the most fine-grained possible representation of the problem. Whether we look at a decision problem as we normally do, only considering very coarse outcomes like “windshield gets smashed” and “windshield doesn’t get smashed”, or we look at the possible outcomes in maximal detail, everything coordinates. The expected utility of an act will come out the same either way. So, while Savage’s framework depended on finding the “right” partition, Jeffrey’s evidential decision theory does not.
5.3 Causal Decision Theory
Still, there are problem cases for evidential decision theory. Suppose, for example, that new research shows us that smoking does not actually cause lung cancer. As it turns out, the correlation between smoking and lung cancer is due to a common cause — there is a gene that disposes one to smoke, and also disposes one to develop lung cancer. In this case, would you have reason to stop smoking (assuming you do smoke)? On the one hand, the probability of you not having the gene, and thus not developing lung cancer, conditional on you quitting, is quite high. So the EEU of quitting is high. On the other hand, either you have the gene or you don't, and there's nothing you can do about it. So you might as well go ahead and enjoy your favorite habit.28 Most find the latter argument the persuasive one: while it might be good news if you can quit, since it decreases your chance of getting lung cancer, quitting is not a way of preventing lung cancer, and thus has no practical value, despite its evidential value. If you find this a plausible diagnosis of the problem, then you should be on the market for a new kind of decision theory, one that weights acts based on their efficacy, i.e. their ability to bring about good outcomes, rather than based on their value qua good news, which is what evidential decision theory was evaluating. How to formulate such a theory? The classic formulation is due to Skyrms [1980]. Skyrms's idea is to look for a special partition of Ω, one that separates Ω into what David Lewis [1981] calls dependency hypotheses, theories about how the various outcomes you care about depend causally on what you do. Each element of such a partition would fully specify how your potential actions sway the possible outcomes. Call such a partition a K-partition. We can then calculate the causal expected utility of an act as

CEU(A) = Σ_i p(Ki) Σ_j p(ωj|A ∧ Ki) u(ωj).

The idea is that we calculate the expected evidential value of the act on each possible assumption about what the causal connections might be,

Σ_j p(ωj|A ∧ Ki) u(ωj),
and then sum up, weighting by the probability of each assumption about what the causal connections may be, p(Ki). Causal decision theory agrees that evidential expected utility is what matters when the causal structure of the world is taken as a given. The thought is that, when the causal structure is given, the tendency of an action to bring about an outcome coincides with the conditional probability of the outcome given the action. But, since the causal relationships between your actions and possible outcomes are not known, we have to calculate the evidential expected utility for each possible dependency hypothesis, and then weight each one by its probability.

28 This is a version of the famous Newcomb Paradox.

There are other ways of trying to capture the same basic idea, that it is the causal efficacy of your actions, not their evidential value, that matters to decision making. The original proposal, due to Gibbard and Harper [1978], used subjunctive conditionals to capture the causal impact of your actions. Subjunctive conditionals have the form, "were A the case then B would be the case", and are standardly abbreviated A □→ B. Subjunctive conditionals are very closely related to causation; generally speaking, if B wouldn't have happened had A not happened, then A is a cause of B. Motivated by this thought, and by subjunctive analyses of causation like Lewis's [1973a], Gibbard and Harper proposed that we calculate the expected utility of act A by

Σ_i p(A □→ ωi) u(ωi).
The idea is that we weight the utilities of the ways things could turn out by the probabilities that they would turn out that way if we did A. Thus we use A's disposition to bring about good outcomes as our guide. Lewis [1981] argues that the Gibbard-Harper approach and others can be seen as alternate formulations of the same idea, being equivalent to CEU as we defined it in terms of K-partitions. For fuller discussion of various formulations of causal decision theory and their relationships, see [Lewis, 1981] and [Joyce, 1999, pp. 161-176]. Applying the K-partition approach gets the intuitively correct result in cases like the smoking example: that you need not quit. But the cost is that we have to go back to the Savage approach, where applying the theory requires fixing on the right kind of partition, a K-partition. Is this essential to causal decision theory? Joyce argues that it is not. He suggests that we formulate causal expected utility as

CEU(A) = Σ_i pA(ωi) u(ωi),

where pA is the probability function we get from p by a process called imaging. Imaging p on A is supposed to capture the idea that we suppose A causally, rather than evidentially. How does imaging work? We start with a notion of "closeness" or "similarity" between the ωi, so that we can pick out the closest singleton to a given ωi within a certain set, A. Let ωiA be the singleton closest to ωi that is in A. Then the image of p on A is given by

pA(O) = Σ_i p(ωi) p(O|ωiA),
which amounts to moving the probability from worlds outside of A to their nearest respective worlds inside of A. This is supposed to capture the kind of supposing we do when we ask how probable O would be if A were true. This definition of pA implicitly assumes that there is always a “closest” world within A, an implausible
assumption, as Lewis [1973b] argues. But Joyce adopts a more general definition, due to Gärdenfors [1988], that does not require this assumption. See [Joyce, 1999, pp. 198–9] for the details. The notion of imaging in hand, Joyce then defines a notion of conditional, causal expected value,

V(X\A) = Σ_i pA(ωi|X) u(ωi),

which expresses A's causal tendency to bring about good outcomes within the range of possibilities specified by X. We then have partition-invariance in the form,

PROPOSITION 24. If {Xj} is a partition of X, then

V(X\A) = Σ_j pA(Xj|X) V(Xj\A).
As a result, when we calculate the causal expected utility of an act, we can do it using any partition:

PROPOSITION 25. If {Xj} is a partition of A, then

CEU(A) = Σ_i pA(ωi) u(ωi)
       = Σ_j pA(Xj|A) V(Xj\A)
       = V(Ω\A).
This lends Joyce’s formulation a virtue similar to Jeffrey’s theory. The imagingbased formulation also has the nice effect of isolating the difference between causal and evidential expected utility, placing it squarely in the epistemic half of the formula. The difference between evidential and causal decision theory can then be seen as a difference in the way in which we suppose A: causally versus evidentially, by imaging as opposed to by conditioning. That this is the crucial difference between EEU and CEU is not apparent on a Skyrms-style, K-partition account. And yet, causal decision theory may itself be troubled. Here is a problem case recently raised by Egan [2007].29 Johnny has devised a button which, if pressed, will kill all psychopaths. Johnny believes himself not to be a psychopath and places a high value on eliminating psychopaths from the world. And yet, Johnny believes that only a psychopath would push the button, and he values his own preservation much more than he values eliminating all psychopaths from existence. Intuitively, it seems to many, Johnny should not push the button, since that would tell him that his action is very likely to cause his own death (note the mixture of evidential and causal considerations here). 29 Egan offers two illustrations of the problematic phenomenon; he attributes this one to David Braddon-Mitchell.
534
Jonathan Weisberg
But causal decision theory seems to say that he should press the button. Consider the question first in terms of the K-partition formulation. There are two relevant dependency hypotheses here:

K1: If Johnny presses the button he and all psychos will die, but if he doesn't press it nobody will.

K2: If Johnny presses the button all psychos will die yet he will survive. If he doesn't, nobody dies.

Given K1, pressing is a terrible idea, whereas it is a good idea given K2. As for not pressing, well it's pretty much neutral, maintaining the status quo either way. Since Johnny thinks it very unlikely that he is a psycho, however, K2 seems much more likely, and so pressing will come out on top, since the expected good on K2 will outweigh the very improbable badness that would result on K1. The trouble seems to stem from the fact that Johnny's action is evidentially relevant to which Ki obtains, but this factor is not accounted for by the causal account, since we use just p(Ki) to calculate CEU. How do things work out on the Joycean formulation? Consider how likely Johnny would be to die if he were to push the button. Of the button-pushing ω's, the ones where Johnny is a psycho and hence dies bear more of the probability initially. But those improbable ones in which Johnny is not a psycho, and hence survives, will acquire a good deal of probability once we image; all the ω's where Johnny is not a psycho and doesn't push the button will have their probability transferred to the worlds where Johnny is still not a psycho but pushes the button, since they are more similar, on any plausible reading of "similar". Viewing the probabilities over the ω's schematically may help here. Initially they might look something like this:

          Psycho   ¬Psycho
Press      .1       .01
¬Press     .09      .8
After imaging, they would look something like this:

          Psycho   ¬Psycho
Press      .19      .81
¬Press      0        0
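Filling in hypothetical utilities makes the trouble vivid. The probabilities below are from the tables; the utilities are my own invented stand-ins, chosen so that Johnny's death (-100) is far worse than eliminating the psychopaths is good (+30), matching the stipulation that he values his preservation much more.

```python
# Egan's button case with the probabilities from the tables above.
# Utilities are hypothetical: death -100, psychos eliminated +30,
# status quo 0.
U_DEATH, U_ELIMINATE, U_STATUS_QUO = -100, 30, 0

# Initial joint probabilities over (act, Johnny-is-a-psycho).
p = {("press", True): 0.10, ("press", False): 0.01,
     ("no-press", True): 0.09, ("no-press", False): 0.80}

# Imaging on "press": each no-press world's probability moves to the
# most similar press world, which plausibly keeps psycho-status fixed.
p_img = {psycho: p[("press", psycho)] + p[("no-press", psycho)]
         for psycho in (True, False)}

# If Johnny is a psycho he dies when he presses; otherwise he survives
# and the psychopaths are eliminated.
ceu_press = p_img[True] * U_DEATH + p_img[False] * U_ELIMINATE
ceu_no_press = U_STATUS_QUO
```

With these numbers `ceu_press` is positive while not pressing is neutral, so the imaging-based causal theory recommends pressing, against the intuitive verdict.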
So it's pretty likely that he would survive if he pressed the button and, assuming he wants to kill all psychopaths badly enough, he should press it. What seems to be missing is, again, that the act of pressing is not just causally relevant to what happens, but also evidentially relevant to what causal connections obtain.

6 CONFIRMATION THEORY
Not only in science, but in everyday reasoning too, we say that some bit of evidence confirms this or that theory, that this is evidence for or against that hypothesis,
Varieties of Bayesianism
535
or that one theory is better confirmed than another. Explicating such talk about the confirmation and evidential support of theories is a classic problem in the philosophy of science (see the entry on confirmation in this volume). Bayesianism seems especially well equipped to solve it, and indeed Bayesians have had a great deal to say about the concept of confirmation. Traditionally, three different questions about confirmation are distinguished. First, when does a given bit of evidence confirm or disconfirm a given hypothesis? Second, by how much does the evidence confirm the hypothesis? And third, how well confirmed is a given hypothesis, absolutely speaking? Corresponding to these three questions, we have three separate concepts of confirmation:

Qualitative Evidential Confirmation  A qualitative relation between evidence and hypothesis, specifying whether the evidence tells for or against the hypothesis (or is neutral).

Quantitative Evidential Confirmation  A quantitative relation between evidence and hypothesis, specifying how much the evidence tells for or against the hypothesis (or is neutral).

Absolute Confirmation  A quantitative property of a hypothesis, specifying its overall evidential standing to date.

Bayesians tend to equate Absolute Confirmation with probability, saying that a hypothesis is confirmed to degree x just in case it has probability x. The plausibility of this proposal immediately suggests that we try to give probabilistic analyses of the other two concepts of confirmation. This is where things get tricky, and where Bayesians diverge from one another.
6.1 The Naive Account and Old Evidence
The simplest and most obvious Bayesian account of both Qualitative and Quantitative Evidential Confirmation is this:

Naive Bayesian Confirmation  E confirms H when p(H|E) > p(H), and the degree to which E confirms H is p(H|E) − p(H).

The idea is intuitively appealing: evidence supports a hypothesis when it increases its probability, and the extent of the support is just the degree of the increase. How could something so obvious and simple possibly go wrong? Glymour [1980] posed a notorious problem for the naive account. One of the most crucial bits of evidence supporting Einstein's General Theory of Relativity comes from an anomalous advance in Mercury's perihelion. The anomaly was unexplained in classical physics but Einstein showed that it was a consequence of his theory, thus lending his theory crucial support. Let E be the fact of the anomaly and H Einstein's theory. Because E is evidence that we've had for some time, p(E) = 1. But then, Glymour points out, it follows that p(H|E) = p(H), so E does not support H on the naive account. Yet it surely does intuitively.
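Both the naive measure and Glymour's worry can be seen in a toy joint distribution over H and E (the numbers are invented for illustration):

```python
# Naive Bayesian confirmation from a joint distribution over (H, E):
# d = p(H|E) - p(H).
def naive_confirmation(joint):
    p_E = joint[(True, True)] + joint[(False, True)]
    p_H = joint[(True, True)] + joint[(True, False)]
    p_H_given_E = joint[(True, True)] / p_E
    return p_H_given_E - p_H

# While E is live evidence, learning it would raise H's probability.
fresh = {(True, True): 0.4, (True, False): 0.1,
         (False, True): 0.1, (False, False): 0.4}

# Once E is old evidence, p(E) = 1 and conditioning on E is idle,
# so the naive account says E doesn't confirm H at all.
old = {(True, True): 0.8, (True, False): 0.0,
       (False, True): 0.2, (False, False): 0.0}
```

On the `fresh` distribution the degree of confirmation is .3; on the `old` one it collapses to 0, which is Glymour's point.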
The naive account has some accordingly flat-footed responses to this problem of old evidence. The first is what we might call the historical response, which says that, while p(E) may equal 1 now, it was not always 1. It may well be that, before E became old evidence, p(H|E) was significantly higher than p(H), and that is why E confirms H. But this response is troubled on the degree of belief interpretation. What if we knew about the anomaly before ever encountering Einstein’s theory? Then there never was a time when p(H|E) > p(H). This seems to be the case for most of us living now, and was surely the case for Einstein. But even if we did formulate H before learning E, it’s hard to accept that this historical accident is what makes it correct when we say now that E confirms H. A different approach is to abandon historical probabilities in favor of counterfactual probabilities. Maybe the reason we say that E confirms H is that, if we didn’t already know E, then it would be true that p(H|E) > p(H). But this strategy runs into a similar problem, since there is no guarantee that the probabilities in the counterfactual scenario will be as desired. Maybe if we didn’t already know E our degrees of belief would be different in such a way that p(H|E) > p(H) actually wouldn’t hold. See [Maher, 1996, p. 156] for a nice example of this problem.
6.2 Alternate Measures of Confirmation

A more promising line of response to Glymour's problem begins with an appeal to Continuing Regularity (section 3.2), the thesis that we should never assign degree of belief 1 to anything but a tautology. Then it will not be true that p(E) = 1 for the anomaly evidence, and it may be that p(H|E) > p(H). The problem still remains in its quantitative form, however, since p(H|E) and p(H) are still approximately equal, so that the degree to which E confirms H is very little on the naive account, which is almost as bad as it not confirming H at all. This brings us to the second step in the strategy, where we tweak the quantitative aspect of the naive account. The quantity p(H|E) − p(H) is only one possible way of measuring E's support for H. Another candidate is the quantity log[p(E|H)/p(E|¬H)]. The idea behind this measure is that we are using E to test between H and its alternatives. If E is more likely on the supposition that H than on the supposition that ¬H, then this quantity is positive. And if H makes E much more likely than ¬H does, then the quantity is large. This latter measure is called the log-likelihood ratio measure, and our original measure is the difference measure. There are actually quite a few measures that have been considered in the literature, the most prominent ones being:

d(H, E) = p(H|E) − p(H)
r(H, E) = log [ p(H|E) / p(H) ]
l(H, E) = log [ p(E|H) / p(E|¬H) ]
s(H, E) = p(H|E) − p(H|¬E)
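All four measures can be computed from a single joint distribution over H and E, with the negated conditions handled via complements. The numbers are invented for illustration; l and s are undefined when p(E) is 0 or 1.

```python
from math import log

def confirmation_measures(joint):
    """d, r, l, s from a joint distribution over (H, E)."""
    p_H = joint[(True, True)] + joint[(True, False)]
    p_E = joint[(True, True)] + joint[(False, True)]
    p_H_E = joint[(True, True)] / p_E              # p(H|E)
    p_H_notE = joint[(True, False)] / (1 - p_E)    # p(H|¬E)
    p_E_H = joint[(True, True)] / p_H              # p(E|H)
    p_E_notH = joint[(False, True)] / (1 - p_H)    # p(E|¬H)
    return {"d": p_H_E - p_H,
            "r": log(p_H_E / p_H),
            "l": log(p_E_H / p_E_notH),
            "s": p_H_E - p_H_notE}

joint = {(True, True): 0.4, (True, False): 0.1,
         (False, True): 0.1, (False, False): 0.4}
m = confirmation_measures(joint)
```

On this distribution all four measures agree qualitatively (each is positive), though they disagree about the amount of support, which is where the debates below get their grip.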
And corresponding to each candidate measure, we have a variant of the naive account:

The cx Account of Confirmation  E confirms H if and only if cx(H, E) > 0, and the degree to which E confirms H is cx(H, E).

cx(H, E) can be whatever measure we favor, usually one of the four listed. Outside of the context of Glymour's problem, there are various desiderata that have been used to argue that one or another measure provides a superior account. For historical surveys and thorough coverage of this debate, see [Fitelson, 1999] and [Fitelson, 2001]. As far as Glymour's challenge goes, we've already seen that d does poorly, and a similar problem afflicts r. The virtues of l and s are explored by Christensen [1999], who raises the following problem case for l. Suppose we roll a fair die and have yet to observe how it turns up. Let L be the proposition that the roll was a low number — 1, 2, or 3 — and O the proposition that it was an odd number. Intuitively, L should confirm O. Suppose that L becomes old evidence, however; our reliable friend tells us that the toss was low, raising the probability of L to .99, and we update our other probabilities by Jeffrey Conditionalization (section 3.4). A little arithmetic shows that p(L|O) ≈ .99 and p(L|¬O) ≈ .98, so that l(O, L) ≈ .01, saying that L is effectively neutral with respect to O. So l is subject to the problem of old evidence too. But s does better: it says that L confirms O both before and after L becomes old evidence, and even by the same amount at each time. And yet, as Christensen shows, there are problems even for s. Suppose H is the hypothesis that deer live in a given wood, D is the possibility that there are deer droppings in location x, and A is the possibility that there is a shed deer antler at location y. Initially, p(H) = 1/2, p(D) = .001, and p(A) = .0001. Now suppose that we find what look like deer droppings in location x, so that D becomes old evidence, making p(D) and p(H) very high.
As before, s will still say that D confirms H, which is good. The trouble is that s will now say that A provides much less confirmation for H, despite the fact that A is just as good evidence for H as D is. But, because H now has a high probability whether or not A is true, s will miss out on this fact. The moral Christensen draws is that intuitive judgments of confirmation are relative to a sort of base-line body of evidence, wherein D is not already assumed. So, once the probabilities reflect our knowledge of D, the probabilistic correlations will no longer reflect our intuitive judgments of confirmation. Christensen draws the pessimistic conclusion that no Bayesian measure will be able to account for intuitive judgments of confirmation. Eells and Fitelson [2000] disagree, arguing that Christensen gives up on Bayesian confirmation too quickly. Their reasoning, however, appears to me to be incompatible with any Bayesianism that interprets probability as degree of belief and lives towards the subjective end of the subjective-objective continuum (section 3), as many Bayesians do.
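The arithmetic behind Christensen's die case can be checked directly. A sketch of the Jeffrey update to p(L) = .99 and the resulting value of l (natural log is used here; the "effectively neutral" verdict does not depend on the base):

```python
from math import log

# Fair die; L = low roll (1-3), O = odd roll (1, 3, 5).
worlds = range(1, 7)
prior = {w: 1 / 6 for w in worlds}
L, O = {1, 2, 3}, {1, 3, 5}

# Jeffrey Conditionalization on {L, not-L} with new p(L) = .99.
p_L = sum(prior[w] for w in L)
post = {w: (0.99 * prior[w] / p_L) if w in L else
           (0.01 * prior[w] / (1 - p_L)) for w in worlds}

def cond(A, B):
    """p(A|B) under the updated distribution."""
    return sum(post[w] for w in A & B) / sum(post[w] for w in B)

not_O = set(worlds) - O
l_O_L = log(cond(L, O) / cond(L, not_O))  # tiny: L looks neutral on O
```

The two conditional probabilities come out near .99 and .98 as the text says, and l(O, L) lands around .01, so l treats old evidence L as nearly irrelevant to O.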
6.3 The Popper-Rényi Solution
The strategy we just considered, of appealing to an alternate measure of confirmation, depended crucially on the appeal to Continuing Regularity. If p(E) = 1, then d and r become neutral, l becomes a measure of absolute confirmation, and s becomes undefined. But even if Continuing Regularity is true, it may be that it would not help with Glymour's problem. Continuing Regularity says that you shouldn't become certain of your evidence, but if you do, it seems you should still be able to say whether E supports H. Why should your evaluation of the evidential connection between E and H depend crucially on whether you are certain or just nearly certain of E? Arguably, even if E did attain probability 1, it would still be true that it confirms H. To see this, notice that a being with perfect evidence-gathering faculties, for whom assigning p(E) = 1 to evidence is reasonable, should still say that E confirms H, even after E is old evidence. We can eliminate this limitation of the alternate-measure strategy if we appeal to Popper-Rényi functions, which is what Joyce [1999] recommends. Using Popper-Rényi functions, s can still be defined when p(E) = 1, and can even be arbitrarily close to its maximum value of 1. The fact that given evidence still confirms a theory is, according to Joyce, captured by the fact that s takes a positive value when calculated according to the Popper-Rényi function that represents the agent's beliefs. Joyce does not reject d as a measure of confirmation, however. Instead he adopts a pluralist stance on which both d and s measure quantities important to confirmation. Fitelson [2003] faults Joyce for introducing problematic ambiguity into the Bayesian analysis of confirmation. Fitelson points out that, when s is positive despite p(E) = 1, it is still the case that d takes a 0 value.
So the move to Popper-Rényi probabilities allows these two measures to yield different qualitative judgments when p(E) = 1: s can say that the evidence supports the hypothesis even though d says that the evidence is neutral. Fitelson's concern is that both d and s are supposed to be measures of confirmation, and yet they have different qualitative properties, which would seem to make confirmation ambiguous, not just quantitatively, but qualitatively too. The move to Popper-Rényi functions also leaves us with the same old problem of old evidence for the confirmational quantity d(H, E), which is still going to be 0. To avoid the ambiguity concern we might adopt Joyce's Popper-Rényi approach without his pluralism, embracing just s as our measure of confirmation. But Fitelson also finds fault with the measure s on its own terms, since it violates what he takes to be desiderata on any adequate measure of confirmation [Eells and Fitelson, 2000; Fitelson, 2006]. For example, it is possible to have p(H|E1) > p(H|E2) but s(H, E1) < s(H, E2), which Fitelson finds unacceptable. And, as Christensen showed with his deer-in-the-woods example, even s seems to face a version of the old evidence problem.
6.4 Objective, Ternary Solutions
Above I said that the historical and counterfactual solutions don't work for a degree of belief interpretation of probability that lives at the subjective end of the subjective-objective continuum, because the historical or counterfactual degrees of belief can easily fail to be such that p(H|E) > p(H). But a more objectivist view may be able to more successfully exploit the idea behind the historical and counterfactual responses. The driving idea is that, if we delete E from our stock of knowledge, then we do have p(H|E) > p(H). We've seen that the probabilities here can't be historical or counterfactual, but they could be normative. That is, we might be able to say that E confirms H, even after we know E, because if we didn't know E then the right probabilistic evaluation to make would be p(H|E) > p(H). And we might be able to say something similar if the probabilities are logical or primitive. So it looks like the problem of old evidence really only affects subjective, degree of belief brands of Bayesianism. It's not that simple, of course. We can't just use the counterfactual supposition "if we didn't know E," because who knows what other things we would not know if we didn't know E, but which are still relevant to our judgment that E confirms H. So we need a more stable way of "deleting" E from our evidential corpus. One route to go on this is to appeal to the belief revision theory pioneered by Alchourrón, Gärdenfors, and Makinson [1985]. This theory is designed to give answers to questions like "how should I revise my corpus of knowledge if I must add/remove proposition E?". Belief revision lays down a set of axioms that the new corpus of beliefs must satisfy after E is deleted but, unfortunately, these axioms tend to leave the end result extremely under-determined.
The under-determination can be controlled by regarding some beliefs as more "entrenched" than others, but introducing the notion of entrenchment takes us well out of Bayesian territory, and raises the worry that we might be circling back on a probabilistic notion. A more popular approach has been to make the corpus of assumptions explicit in the concept of confirmation, introducing a third variable to represent the background assumptions relative to which E's bearing on H is being evaluated. On this view, the proper question to ask is whether E confirms H relative to background assumptions B. To answer this question, we start with the objectively correct probabilities before any evidence is gathered, p(·), and then consult p(·|B). This yields

The Objective, Ternary Account of Confirmation  E confirms H relative to B if and only if p(H|EB) > p(H|B), and the extent of the confirmation is cx(H, E) computed using the probability function p(·|B).

This account depends crucially on being able to make sense of p(·), the correct probability function sans any evidence, which a strongly objectivist view can do.30

30. Though the "correct" prior probability function may yet be interpretable in a subjectivist-friendly way, as the probabilities the agent thinks someone with no evidence ought to have, for example.
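The ternary account is easy to make concrete. The sketch below uses an invented "evidence-free" prior over eight worlds (truth values for H, E, B), weighted so that worlds where B holds and H and E agree get extra probability:

```python
from itertools import product

# Invented evidence-free prior over worlds (h, e, b): worlds where B
# holds and H, E agree get extra weight, then normalize.
prior = {}
for h, e, b in product([True, False], repeat=3):
    prior[(h, e, b)] = 0.25 if (b and h == e) else 0.05
Z = sum(prior.values())
prior = {w: v / Z for w, v in prior.items()}

def p(cond):
    """Probability of the set of worlds satisfying cond."""
    return sum(v for w, v in prior.items() if cond(w))

def confirms_given_B():
    """E confirms H relative to B iff p(H|EB) > p(H|B)."""
    p_H_given_EB = (p(lambda w: w[0] and w[1] and w[2])
                    / p(lambda w: w[1] and w[2]))
    p_H_given_B = p(lambda w: w[0] and w[2]) / p(lambda w: w[2])
    return p_H_given_EB > p_H_given_B
```

On this prior, conditioning on E raises H's probability relative to B, so E confirms H relative to B even if, in our actual credences, E is old evidence with probability 1.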
We could have p(·) be somewhat under-determined, but it must be fairly tightly constrained to agree with the judgments of confirmation that we actually make. Maher [1996] endorses this kind of objectivism, and gives a version of the objective, ternary account.31 The objective, ternary account might be accused of cheating by introducing the variable B, since we often simply ask whether E confirms H, making no mention of relativity to a body of assumptions. Maher addresses this issue, suggesting that the relevant body of assumptions is implicit in the context of discourse when we talk this way. An account of how the context fixes B would help secure this view's respectability, and Maher does give a sketch [Maher, 1996, pp. 166-168]. Arguably, insisting on a complete account would be overly demanding, since understanding the mechanics of context is a major project in linguistics and philosophy of language. A better approach might be to just see whether 'confirms' has the properties that typify other, paradigmatic context-sensitive terms. This would vindicate the view that B is contextually specified, and then we could leave the details of how for the linguists and philosophers of language. Alternatively, we might treat the relativity to B by analogy with the relativity of simultaneity. Just as relativity theory teaches us that simultaneity is frame-relative, so too confirmation theory teaches us that confirmation is B-relative. We used to talk about confirmation as if it were a two-place relation but our theorizing has shown us the error of our ways. It was just that, in many cases, this relativity was insignificant and hence easy to overlook.
6.5 Ravens and the Tacking Problem

Outside of debates about which analysis of confirmation is correct and how Glymour's problem ought to be solved, Bayesian discussions of confirmation tend to focus on solving classic puzzles of confirmation, especially Hempel's Raven Paradox [Hempel, 1937; Hempel, 1945] and the tacking problem (a.k.a. the problem of irrelevant conjunction). For summaries of this literature and the current state of affairs, see [Vranas, 2004; Fitelson, 2006] on the Raven Paradox, and [Fitelson, 2002; Fitelson and Hawthorne, 2004] on the tacking problem. We will not cover these topics here, as they do not expose deep divides in the Bayesian camp.

7 THEORIES OF BELIEF (A.K.A. ACCEPTANCE)

A good deal of Bayesian theorizing is concerned with the degree to which we ought to believe something. But what about the question whether you should believe something tout court? Over and above questions about probabilities and levels of confidence, there seems to be an additional, qualitative question about

31. Eells and Fitelson adopt a formally similar account of confirmation (they call it "evidence for", to distinguish it from a different notion of confirmation they discuss) in their aforementioned criticism of Christensen [Eells and Fitelson, 2000], though they seem to want to interpret p(·) historically, as the agent's past degrees of belief.
what one ought to believe. Should you believe that there is a God, or that your ticket won’t win the lottery? Probabilistic theorizing may answer the quantitative question of how confident you should be, but it leaves a major epistemological question unanswered. When should/shouldn’t you believe a given proposition, qualitatively speaking? Or, as some prefer to phrase it, which propositions should you accept?32 The obvious thing to conjecture is that you should believe those propositions that have attained a certain minimum threshold of probability, say .99. But Kyburg’s [1961] lottery paradox shows that this conjecture leads to inconsistent belief states. Suppose that there is a fair lottery with 100 tickets, one of which will be the winner. Each ticket has a .99 probability of losing, and so the threshold conjecture says that you should believe of each ticket that it will lose. But the resulting set of beliefs is inconsistent, since you believe of each ticket that it will lose, and you also believe that one will win. Things get even worse if we endorse the principle that, if you believe A and you believe B, then you should believe AB.33 Then you will be led to believe the explicit contradiction that all the tickets will lose and yet one will win. The lottery paradox (and also the related preface paradox [Makinson, 1965]) puts a point on the problem of elaborating the connection between probability and belief, and this might push us in either of two directions. One would be to eliminate belief talk in favor of degree-of-belief talk. Jeffrey [1968; 1970a], for example, seems to have felt that the folk notion of belief should be replaced by the more refined notion of degree of belief, since talk of belief is just a sloppy approximation of degree of belief. 
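The failure of the threshold conjecture in the lottery case is easy to verify numerically; a minimal sketch:

```python
# 100-ticket fair lottery with exactly one winner; threshold rule at .99.
THRESHOLD = 0.99
n = 100

# Each "ticket i loses" has probability (n-1)/n = .99, so the threshold
# rule licenses believing, of every ticket, that it loses.
believes_ticket_loses = [(n - 1) / n >= THRESHOLD for _ in range(n)]

# "Some ticket wins" has probability 1, so it is believed too.
believes_some_ticket_wins = 1.0 >= THRESHOLD

# Yet the believed set is jointly unsatisfiable: in a fair lottery with
# exactly one winner, "all tickets lose" has probability 0.
p_all_lose = 0.0
```

So every individual belief clears the threshold while the set as a whole cannot possibly be true, which is just Kyburg's point.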
If we go this route, then we avoid the lottery paradox and get to skip out on the job of elaborating the probability-belief connection.34 The other direction we might go is to look for a more sophisticated account of that connection. For example, we might say that high probability warrants belief, except when certain logical or probabilistic facts obtain. Pollock [1995], Ryan [1991; 1996], Nelkin [2000], and Douven [2002] represent some recent proposals in this vein. One such proposal is that high probability is sufficient for belief, except when the belief in question is a member of a set of similarly probable propositions

32. The terms 'belief' and 'acceptance' get used in a confusing variety of ways, to mark a variety of distinctions. I will adopt the following convention here. I will use 'belief' and 'acceptance' interchangeably to refer to the same, qualitative propositional attitude, which is to be distinguished from the gradable attitude denoted by 'degree of belief' and 'credence'. I will sometimes use 'qualitative belief' to stress the contrast with degree of belief. The distinction between 'belief' and 'acceptance' often introduced in discussions of scientific realism (e.g., [van Fraassen, 1980]) will be ignored.

33. Foley [2009] responds to the lottery paradox by rejecting this principle.

34. Christensen [2004] endorses a less extreme option in this neighborhood. He holds that qualitative belief talk is a way of sorting more fine-grained epistemic attitudes, similar to our sorting of dogs into the categories 'large', 'medium', and 'small'. Just as sorting dogs in this way has great utility but does not carve nature at the joints, so too sorting our finer-grained attitudes into beliefs and non-beliefs is useful but does not expose any "real" kind. According to Christensen, whatever our scheme for sorting finer-grained attitudes into beliefs and non-beliefs, the states so sorted do not need to obey a norm of deductive cogency. Deductive logic acts as an epistemic norm only insofar as it governs degrees of belief, via the probability axioms.
whose conjunction is improbable.35 Douven and Williamson [2006] argue that such proposals fail because every probable proposition is a member of such a set. They go on to argue that any proposal that restricts itself to probabilistic and logical criteria will fail on similar grounds. To avoid this problem we could look for other variables that, together with probability, tell us what to believe. This approach was pioneered by Levi [1967a], who argues that probability determines what we should believe when conjoined with a notion of epistemic value. Roughly, Levi’s view is that you should believe something if doing so maximizes expected epistemic utility, also known as cognitive utility. Thus belief is a sort of decision problem, where we use a special notion of utility, one that captures cognitive values like true belief. Levi originally took epistemic utility to weigh two competing epistemic concerns, amalgamating them into a single scale of value. On the one hand, our inquiry is aimed at eliminating agnosticism in favor of giving informative answers to questions posed. On the other hand, we want those answers to be true as often as possible. Epistemic utility takes account of both aims: informativeness and truth. We should then believe exactly those propositions that maximize expected epistemic utility. In later work, Levi [1980] incorporates additional epistemic values into the scale of epistemic utility, such as simplicity and explanatory power. There is a troublesome tension for views that, like Levi’s, endorse both degrees of belief and belief simpliciter. On such views, belief threatens to be epiphenomenal, in the sense that it is determined by degree of belief but does nothing to determine degree of belief. Just like with epiphenomenalism in the philosophy of mind, this has the further consequence that there is no room for the epiphenomenon to determine action, since that job is taken. 
On the standard Bayesian view, rational action is determined by your degrees of belief via expected utility maximization. So qualitative belief is doubly idle; it neither influences your levels of confidence, nor what decisions you make. On this epiphenomenalist picture, it becomes hard to see the point of keeping belief in our theory, tempting us to follow Jeffrey's lead and just forget about qualitative belief altogether. To avoid the epiphenomenalist conundrum, we might give qualitative belief a more substantial role, as Levi does in later work. Levi [1980] takes a view of belief where the believed propositions are treated more like evidence, with all accepted propositions receiving probability 1. Going in this direction is tricky though. According to expected utility maximization, attributing probability 1 to a proposition obliges you to bet on it at any odds, yet we believe many things we would not stake our lives on. A related worry is that everything we believe becomes equally certain on this view, and yet we are able to distinguish grades of certainty in the things we believe. In short, if belief is not epiphenomenal because accepted beliefs acquire probability 1, then belief threatens to become too dogmatic. To avoid these worries we might take a more tentative view of belief. Levi's own view is that beliefs, while "infallible" in the sense of having credence 1, are not

35. This is roughly Ryan's [1996] proposal.
“incorrigible”, since they can, and often should, be retracted. In particular, Levi thinks there are two cases where beliefs should be retracted: (i) when an inconsistency is discovered, or (ii) when we want to contemplate accepting a proposition that has been rejected, but might improve our explanatory picture if added. Such retractions are not enough to solve our problems though. My belief that Albany is the capital of New York does not contradict any of my other beliefs, nor is it inconsistent with any hypothesis that promises to improve my explanatory picture, so I have no reason of kind (i) or (ii) to retract it. Nevertheless, I would not bet on it at any odds, and I can appreciate that it is less certain than my belief that Albany is in New York. Because it is difficult to give belief a substantial cognitive role without inappropriately interfering with the role of degrees of belief, some Bayesians prefer to stick to the epiphenomenalist conception of belief, arguing that it is an important epistemic attitude even if it is idle in the ways mentioned. Maher [1993] takes this approach, arguing that the notion of acceptance is crucial to understanding the history of science. Without it, he argues, we cannot explain classic episodes of scientific confirmation, the Kuhnian observation that paradigms are only rejected once an alternative is available, nor the truism that it is never unreasonable to gather cost-free evidence. Would it be an adequate defense of the epiphenomenal view if we showed that belief is an interesting epiphenomenon? As Maher acknowledges, “it is standardly assumed that you believe H just in case you are willing to act as if H were true,” [Maher, 1993, p. 152] and an epiphenomenal view is hard to reconcile with this truism, no matter how good a job we do at showing that belief is crucial to our understanding of science. For how could belief be reliably tied to action if it is epiphenomenal? 
Maher’s reaction is to conclude that the folk concept of belief is simply flawed. It presupposes that two distinct states are really one; or at least that they are necessarily correlated. On the one hand there is the willingness to act as if H were true, and on the other there is the state expressed by sincere assertions that H. Maher’s conception of acceptance is directed at the latter state, whereas the truism about willingness to act is directed at the former. The right approach to take, according to Maher, is to acknowledge the difference between these two states by theorizing about them separately, instead of looking for a notion of belief that covers both. Alternatively, one might try to reconcile full belief with degrees of belief by telling a story on which both states play a significant role in determining assertions, inferences, and actions. Frankish [2009], for example, suggests that full beliefs are realized in certain degrees of belief and utilities. When you believe that H, this is because you are highly confident that you have adopted a policy of using H as a premise in certain deliberative contexts (both epistemic and pragmatic), and you attach a high utility to adhering to that policy. On Frankish’s view, those degrees of belief and utilities are what make it the case that you believe H. Thus, when you assert H, draw conclusions from H, and act based on H, there are two, compatible explanations: that you believe that H, and that you have certain
degrees of belief and utilities. Because you believe H in virtue of having those degrees of belief and utilities, both explanations are correct. One might worry that this story does not reconcile your belief that H with the right degrees of belief. On Frankish’s view, when you use H as a premise in a piece of practical reasoning, the degrees of belief and utilities that explain why you did so are your degrees of belief about what policies you’ve adopted and the utilities you attach to sticking by those policies. But, on the standard Bayesian story, the explanation proceeds in terms of your degrees of belief about the possible outcomes of the act in question. What if you believe that you have adopted a policy that mandates assuming H in this context, but your degrees of belief about the consequences of acting on H make such action sub-optimal? According to the Bayesian, you should not so act, but according to Frankish, it seems that you would. But evaluating the seriousness of this worry would require us to go into more detail on the nature of premising policies and their adoption. See [Frankish, 2009, pp. 83–90] for more detail on premising policies, and for responses to some worries in this neighborhood. Another approach is to allow the distinct existence of beliefs and degrees of belief, while trying to give both states a significant cognitive role. Consider the way we normally operate, unreflectively, when going about our business. In most contexts it feels as if we don’t distinguish between what is more or less probable; we simply work off of a body of assumptions that we take for granted. When I set my alarm at night, wake up to its ring, step in and out of the shower, and head down the street to catch my bus, I am not thinking reflectively and I seem to treat the assumptions that drive my planning and behavior as all on a par. 
None seems more or less likely; I simply accept that my alarm will ring at the right time, that it is time to get up when it rings, that my shower will turn on when I twist the knob, that my bus will be there in the morning just like it always is, and that it will take me to campus on time like it always does. If you asked me which of the assumptions that go into my daily routine are more or less likely, I could surely give some differentiating answers. My bus is more likely to be late than my shower is to not turn on when I twist the knob. But these differences are ones that I have decided not to worry about, because the assumptions in play are all reliable enough that it would not be worth thinking about my morning routine at a more fine-grained level. If I were more careful about how much to rely on each assumption, I might be able to decrease the chances that, one day, I will be late to class. But as it is, I am late rarely enough that it is worth my while to simplify by just taking them all for granted, and to act accordingly. These considerations suggest two levels of cognitive operation, one at which cognition and planning happen in a qualitative mode, and one where they happen in a more fine-grained, quantitative mode. The obvious thing to conjecture is that the purely qualitative mode is useful because it is simpler and more efficient, making daily tasks and life in general manageable. But the qualitative mode is rough, and may need to be reset, rearranged, or overridden altogether when we encounter an eventuality that we assumed would not happen, when we face a new
Varieties of Bayesianism
situation, when greater accuracy is desired, etc. If my shower doesn’t turn on, I’m going to have to think more carefully about how to rearrange my morning routine, and if I move to a new location I’m going to have to create an entirely new morning routine, based on a whole new set of qualitative assumptions and intermediate goals. On such a model, quantitative and qualitative belief might be able to coexist, and even be complementary. But developing this thought into a substantive view poses a serious challenge, since we would want a precise story about when each mode gets engaged, and whether and how the two modes interact, share information, and so on.

8 SUMMARY
We distinguished the varieties of Bayesianism by considering six questions Bayesians disagree about. First we asked what the proper subject matter of Bayesianism is, and we considered three (not necessarily exclusive) answers: degrees of logical truth, degrees of belief, and epistemic probabilities taken as primitive. Second, we asked what rules over and above the probability axioms are correct, exploring eight proposals: five synchronic rules and three diachronic rules of increasing generality. Third, we considered what justificatory arguments were appropriate for these various rules, noting the many rebuttals and criticisms that have been offered over the years. Our discussion of those three questions laid out a variety of Bayesian views on core issues. The remaining three questions exposed a variety of views on applications of Bayesianism to related subject matter. Our fourth question was how Bayesianism applies to decision making. Presupposing a numerical notion of value (“utility”), we considered three versions of the expected utility rule, each resulting from a different view about the salient probabilistic relationship between an act and its possible outcomes. Fifth, we considered Bayesian attempts to analyze discourse about confirmation. In our attempts to deal with Glymour’s challenge for the naive account, we considered four ways of measuring probabilistic support, the merits of Popper-Rényi probabilities, and the possible advantage of treating confirmation as a three-place relation (as opposed to two-place). Sixth and finally, we considered how qualitative belief fits with the Bayesian framework. We saw that the lottery paradox motivates a dispute between those who prefer to eliminate belief in favor of degrees of belief and those who see an important role for belief in cognition. We also saw how degrees of belief threaten to take over the role of qualitative belief in cognition, making qualitative belief epiphenomenal. 
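As a concrete illustration of the measures of probabilistic support alluded to above, three of the best-known proposals from the confirmation-theory literature can be sketched as follows. The function names and the toy numbers are ours, not the chapter's; this is an illustrative sketch, not a definitive formulation.

```python
# Illustrative sketch (hypothetical names and numbers): three common
# measures of how strongly evidence E supports hypothesis H.
import math

def difference(p_h_given_e, p_h):
    # d(H, E) = P(H|E) - P(H): positive iff E raises the probability of H.
    return p_h_given_e - p_h

def ratio(p_h_given_e, p_h):
    # r(H, E) = log[P(H|E) / P(H)]: positive iff E raises P(H).
    return math.log(p_h_given_e / p_h)

def likelihood_ratio(p_e_given_h, p_e_given_not_h):
    # l(H, E) = log[P(E|H) / P(E|~H)]: positive iff E is more
    # expected under H than under its negation.
    return math.log(p_e_given_h / p_e_given_not_h)

# Toy case: evidence raises H from 0.3 to 0.6.
print(difference(0.6, 0.3))  # positive, so E confirms H on this measure
```

The measures agree on whether E confirms H (all are positive exactly when E raises the probability of H), but, as the chapter notes, they can disagree about comparative degrees of support.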
In the course of our discussion we encountered many connections between possible answers to these six questions. For example, some proposed rules were not subject to the same sorts of justificatory arguments as others; there are Dutch book arguments for the probability axioms but not for the Principle of Indifference. Thus those who only feel comfortable with principles that can be defended by a Dutch book argument may prefer a more subjective brand of Bayesianism.
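The Dutch book idea just mentioned can be made concrete with a toy calculation. The sketch below (the function name and numbers are hypothetical, not from the chapter) shows the guaranteed loss facing an agent whose credences in H and in not-H sum to more than 1, violating the additivity axiom.

```python
# Hypothetical sketch of a Dutch book: the agent pays cred * stake for
# each unit-stake bet. Since exactly one of H, not-H wins, the total
# payout is always one stake, but the agent has paid more than that
# whenever the two credences sum to more than 1.

def sure_loss(cred_h, cred_not_h, stake=1.0):
    total_price = (cred_h + cred_not_h) * stake  # paid up front
    total_payout = stake                          # received, whatever happens
    return total_price - total_payout             # > 0: guaranteed loss

print(sure_loss(0.4, 0.7))  # credences sum to 1.1: a sure loss
```

With probabilistically coherent credences (summing to exactly 1) the loss vanishes, which is the core of the Dutch book defense of the probability axioms.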
Another connection we encountered was between Popper-Rényi probabilities and confirmation; Popper-Rényi probabilities allowed us to measure confirmation in ways we could not on the standard axiomatization. Thus considerations in confirmation theory may motivate a preferred axiomatization of probability. Still, the list of possible views to be generated by mixing and matching answers to our six questions is too large to be considered explicitly. And our list of contentious questions is incomplete anyway. Hence the famous quip that “there must be at least as many Bayesian positions as there are Bayesians” [Edwards et al., 1963].

ACKNOWLEDGEMENTS

Thanks to Frank Arntzenius, Kenny Easwaran, Branden Fitelson, James Hawthorne, Franz Huber, James Joyce, Phil Kremer, and Chris Meacham for helpful discussion and feedback.

BIBLIOGRAPHY

[Aczél, 1966] J. Aczél. Lectures on Functional Equations and Their Applications. Academic Press, 1966.
[Adams, 1964] Ernest W. Adams. On rational betting systems. Archiv für Mathematische Logik und Grundlagenforschung, 6:7–29, 1964.
[Albert, 2001] David Albert. Time and Chance. Harvard University Press, 2001.
[Alchourrón et al., 1985] Carlos E. Alchourrón, Peter Gärdenfors, and David Makinson. On the logic of theory change: Partial meet contraction and revision functions. The Journal of Symbolic Logic, 50(2):510–530, 1985.
[Allais, 1979] Maurice Allais. The so-called Allais paradox and rational decisions under uncertainty. In Maurice Allais and Ole Hagen, editors, Expected Utility Hypotheses and the Allais Paradox. D. Reidel, 1979.
[Arntzenius and Hall, 2003] Frank Arntzenius and Ned Hall. On what we know about chance. British Journal for the Philosophy of Science, 54:171–179, 2003.
[Arntzenius et al., 2004] Frank Arntzenius, Adam Elga, and John Hawthorne. Bayesianism, infinite decisions, and binding. Mind, 113:251–283, 2004.
[Bertrand, [1888] 2007] Joseph L. F. Bertrand. Calcul des Probabilités.
Oxford University Press, [1888] 2007.
[Bradley, 2005] Richard Bradley. Probability kinematics and Bayesian conditioning. Philosophy of Science, 72, 2005.
[Briggs, forthcoming] Rachael Amy Briggs. Distorted reflection. The Philosophical Review, forthcoming.
[Carnap, 1950] Rudolf Carnap. Logical Foundations of Probability. Chicago: University of Chicago Press, 1950.
[Carnap, 1952] Rudolf Carnap. The Continuum of Inductive Methods. Chicago: University of Chicago Press, 1952.
[Christensen, 1991] David Christensen. Clever bookies and coherent beliefs. The Philosophical Review, 100(2):229–247, 1991.
[Christensen, 1992] David Christensen. Confirmational holism and Bayesian epistemology. Philosophy of Science, 59, 1992.
[Christensen, 1999] David Christensen. Measuring confirmation. Journal of Philosophy, 96:437–461, 1999.
[Christensen, 2001] David Christensen. Preference-based arguments for probabilism. Philosophy of Science, 68, 2001.
[Christensen, 2004] David Christensen. Putting Logic in its Place. Oxford University Press, 2004.
[Christensen, 2007] David Christensen. Epistemology of disagreement: The good news. The Philosophical Review, 116(2):187–217, 2007.
[Cox, 1946] Richard T. Cox. Probability, frequency, and reasonable expectation. American Journal of Physics, 14:1–13, 1946.
[De Finetti, 1937] Bruno De Finetti. La prévision: ses lois logiques, ses sources subjectives. Annales de l’Institut Henri Poincaré, 17, 1937.
[De Finetti, 1970] Bruno De Finetti. Theory of Probability. New York: John Wiley, 1970.
[De Finetti, 1972] Bruno De Finetti. Probability, Induction, and Statistics. New York: John Wiley, 1972.
[Diaconis and Zabell, 1982] Persi Diaconis and Sandy L. Zabell. Updating subjective probability. Journal of the American Statistical Association, 77(380):822–830, 1982.
[Domotor, 1980] Zoltan Domotor. Probability kinematics and representation of belief change. Philosophy of Science, 47, 1980.
[Doris, 2002] John M. Doris. Lack of Character: Personality and Moral Behavior. Cambridge University Press, 2002.
[Douven and Williamson, 2006] Igor Douven and Timothy Williamson. Generalizing the lottery paradox. British Journal for the Philosophy of Science, 57(4):755–779, 2006.
[Douven, 2002] Igor Douven. A new solution to the paradoxes of rational acceptability. British Journal for the Philosophy of Science, 53(3):391–410, 2002.
[Earman, 1992] John Earman. Bayes or Bust: A Critical Examination of Bayesian Confirmation Theory. The MIT Press, 1992.
[Edwards et al., 1963] Ward Edwards, Harold Lindman, and Leonard J. Savage. Bayesian statistical inference for psychological research. Psychological Review, 70(3):193–242, 1963.
[Eells and Fitelson, 2000] Ellery Eells and Branden Fitelson. Comments and criticism: Measuring confirmation and evidence. Journal of Philosophy, 97(12):663–672, 2000.
[Egan, 2007] Andy Egan. Some counterexamples to causal decision theory.
The Philosophical Review, 116(1):93–114, 2007.
[Elga, 2000] Adam Elga. Self-locating beliefs and the Sleeping Beauty problem. Analysis, 60(2):143–147, 2000.
[Elga, 2007] Adam Elga. Reflection and disagreement. Noûs, 41(3):478–502, 2007.
[Eriksson and Hájek, 2007] Lina Eriksson and Alan Hájek. What are degrees of belief? Studia Logica, 86:185–215, 2007.
[Etchemendy, 1990] John Etchemendy. The Concept of Logical Consequence. Harvard University Press, 1990.
[Field, 1978] Hartry Field. A note on Jeffrey conditionalization. Philosophy of Science, 45, 1978.
[Fine, 1973] Terence L. Fine. Theories of Probability. Academic Press, 1973.
[Fitelson and Hawthorne, 2004] Branden Fitelson and James Hawthorne. Discussion: Resolving irrelevant conjunction with probabilistic independence. Philosophy of Science, 71:505–514, 2004.
[Fitelson, 1999] Branden Fitelson. The plurality of Bayesian measures of confirmation and the problem of measure sensitivity. Philosophy of Science, 66 (Proceedings):S362–S378, 1999.
[Fitelson, 2001] Branden Fitelson. Studies in Bayesian Confirmation Theory. PhD thesis, University of Wisconsin, Madison, 2001.
[Fitelson, 2002] Branden Fitelson. Putting the irrelevance back into the problem of irrelevant conjunction. Philosophy of Science, 69:611–622, 2002.
[Fitelson, 2003] Branden Fitelson. Review of The Foundations of Causal Decision Theory. Mind, 112:545–551, 2003.
[Fitelson, 2006] Branden Fitelson. The paradox of confirmation. Philosophy Compass, 1:95, 2006.
[Foley, 2009] Richard Foley. Beliefs, degrees of belief, and the Lockean thesis. In Franz Huber and Christoph Schmidt-Petri, editors, Degrees of Belief, volume 342 of Synthese Library, pages 37–47. Springer, 2009.
[Frankish, 2009] Keith Frankish. Partial belief and flat-out belief. In Franz Huber and Christoph Schmidt-Petri, editors, Degrees of Belief, volume 342 of Synthese Library. Springer, 2009.
[Gaifman and Snir, 1982] Haim Gaifman and Marc Snir.
Probabilities over rich languages, testing, and randomness. Journal of Symbolic Logic, 47:495–548, 1982.
[Garber, 1980] Daniel Garber. Field and Jeffrey conditionalization. Philosophy of Science, 47, 1980.
[Gärdenfors, 1988] Peter Gärdenfors. Knowledge in Flux: Modelling the Dynamics of Epistemic States. The MIT Press, 1988.
[Gibbard and Harper, 1978] Allan Gibbard and William Harper. Counterfactuals and two kinds of expected utility. In A. Hooker, J. J. Leach, and E. F. McClennen, editors, Foundations and Applications of Decision Theory. D. Reidel, 1978.
[Glymour, 1980] Clark Glymour. Theory and Evidence. Princeton University Press, 1980.
[Goldstick, 2000] Daniel Goldstick. Three epistemic senses of probability. Philosophical Studies, 101:59–76, 2000.
[Goodman, 1954] Nelson Goodman. Fact, Fiction, and Forecast. Cambridge: Harvard University Press, 1954.
[Greaves and Wallace, 2006] Hilary Greaves and David Wallace. Justifying conditionalization: Conditionalization maximizes expected epistemic utility. Mind, 115:607–632, 2006.
[Grove and Halpern, 1997] Adam Grove and Joseph Halpern. Probability update: Conditioning vs. cross-entropy. In Proceedings of the Thirteenth Annual Conference on Uncertainty in Artificial Intelligence, 1997.
[Hacking, 1971] Ian Hacking. Equipossibility theories of probability. British Journal for the Philosophy of Science, 22(4):339–355, 1971.
[Hájek, 2003] Alan Hájek. What conditional probability could not be. Synthese, 137:273–323, 2003.
[Hájek, 2008] Alan Hájek. Dutch book arguments. In Paul Anand, Prasanta Pattanaik, and Clemens Puppe, editors, The Oxford Handbook of Rational and Social Choice. Oxford University Press, 2008.
[Hall, 1994] Ned Hall. Correcting the guide to objective chance. Mind, 103:505–517, 1994.
[Halmos, 1974] Paul R. Halmos. Measure Theory (Graduate Texts in Mathematics). Springer-Verlag, 1974.
[Halpern, 1999] Joseph Y. Halpern. A counterexample to theorems of Cox and Fine. Journal of Artificial Intelligence Research, 10:67–85, 1999.
[Halpern, 2001] Joseph Y. Halpern.
Lexicographic probability, conditional probability, and nonstandard probability. In Proceedings of the Eighth Conference on Theoretical Aspects of Rationality and Knowledge, pages 17–30. Morgan Kaufmann Publishers Inc., 2001.
[Halpern, 2003] Joseph Y. Halpern. Reasoning About Uncertainty. The MIT Press, 2003.
[Hawthorne, 2008] James Hawthorne. Inductive logic, August 2008.
[Hempel, 1937] Carl G. Hempel. Le problème de la vérité. Theoria, 3:206–246, 1937.
[Hempel, 1945] Carl G. Hempel. Studies in the logic of confirmation I. Mind, 54:1–26, 1945.
[Hoover, 1980] Douglas N. Hoover. A note on regularity. In Richard C. Jeffrey, editor, Studies in Inductive Logic and Probability. Berkeley: University of California Press, 1980.
[Howson and Franklin, 1994] Colin Howson and Allan Franklin. Bayesian conditionalization and probability kinematics. British Journal for the Philosophy of Science, 45:451–466, 1994.
[Howson and Urbach, 1993] Colin Howson and Peter Urbach. Scientific Reasoning: The Bayesian Approach. Open Court, 1993.
[Huber, 2005] Franz Huber. What is the point of confirmation? Philosophy of Science, 72(5):1146–1159, 2005.
[Jaynes, 1968] Edwin T. Jaynes. Prior probabilities. IEEE Transactions on Systems Science and Cybernetics, SSC-4(3):227–241, 1968.
[Jaynes, 1973] Edwin T. Jaynes. The well-posed problem. Foundations of Physics, 3:477–493, 1973.
[Jeffrey, 1965] Richard C. Jeffrey. The Logic of Decision. University of Chicago Press, 1965.
[Jeffrey, 1968] Richard C. Jeffrey. Review of Gambling with Truth: An Essay on Induction and the Aims of Science. The Journal of Philosophy, 65(10):313–322, 1968.
[Jeffrey, 1970a] Richard C. Jeffrey. Dracula meets Wolfman: Acceptance vs. partial belief. In Marshall Swain, editor, Induction, Acceptance, and Rational Belief. D. Reidel, 1970.
[Jeffrey, 1970b] Richard C. Jeffrey. Untitled review. The Journal of Symbolic Logic, 35(1):124–127, 1970.
[Jeffrey, 1983] Richard C. Jeffrey. Bayesianism with a human face.
In John Earman, editor, Testing Scientific Theories. University of Minnesota Press, 1983.
[Jeffreys, [1939] 2004] Harold Jeffreys. Theory of Probability. Oxford University Press, [1939] 2004.
[Joyce, 1998] James Joyce. A nonpragmatic vindication of probabilism. Philosophy of Science, 65(4):575–603, 1998.
[Joyce, 1999] James Joyce. The Foundations of Causal Decision Theory. Cambridge University Press, 1999.
[Joyce, 2009] James Joyce. Accuracy and coherence: Prospects for an alethic epistemology of partial belief. In Franz Huber and Christoph Schmidt-Petri, editors, Degrees of Belief, volume 342 of Synthese Library, pages 263–297. Springer, 2009.
[Kahneman and Tversky, 1979] Daniel Kahneman and Amos Tversky. Prospect theory: An analysis of decision under risk. Econometrica, 47, 1979.
[Kanovei and Shelah, 2004] Vladimir Kanovei and Saharon Shelah. A definable non-standard model of the reals. Journal of Symbolic Logic, 69(1):159–164, 2004.
[Kelly, 1996] Kevin T. Kelly. The Logic of Reliable Inquiry. Oxford University Press, 1996.
[Kelly, 2000] Kevin T. Kelly. The logic of success. British Journal for the Philosophy of Science, 51:639–666, 2000.
[Kelly, 2005] Thomas Kelly. The epistemic significance of disagreement. In Oxford Studies in Epistemology, volume 1. Oxford University Press, 2005.
[Keynes, 1921] John Maynard Keynes. A Treatise on Probability. New York: Macmillan, 1921.
[Kyburg, 1961] Henry E. Kyburg. Probability and the Logic of Rational Belief. Wesleyan University Press, 1961.
[Kyburg, 1987] Henry E. Kyburg. Bayesian and non-Bayesian evidence and updating. Artificial Intelligence, 31:271–293, 1987.
[Kyburg, 1992] Henry E. Kyburg. Getting fancy with probability. Synthese, 90:189–203, 1992.
[Lange, 2000] Marc Lange. Is Jeffrey conditionalization defective by virtue of being non-commutative? Remarks on the sameness of sensory experience. Synthese, 123, 2000.
[Levi, 1967a] Isaac Levi. Gambling With Truth. A. Knopf, 1967.
[Levi, 1967b] Isaac Levi. Probability kinematics. British Journal for the Philosophy of Science, 18:197–209, 1967.
[Levi, 1974] Isaac Levi. On indeterminate probabilities. Journal of Philosophy, 71:391–418, 1974.
[Levi, 1980] Isaac Levi. The Enterprise of Knowledge: An Essay on Knowledge, Credal Probability, and Chance. The MIT Press, 1980.
[Lewis, 1973a] David Lewis. Causation. Journal of Philosophy, 70:556–567, 1973.
[Lewis, 1973b] David Lewis. Counterfactuals. Oxford: Blackwell, 1973.
[Lewis, 1980] David Lewis. A subjectivist’s guide to objective chance. In Richard C. Jeffrey, editor, Studies in Inductive Logic and Probability, volume II. University of California Press, 1980.
[Lewis, 1981] David Lewis. Causal decision theory. Australasian Journal of Philosophy, 59(1):5–30, 1981.
[Lewis, 1994] David Lewis. Humean supervenience debugged. Mind, 103:473–490, 1994.
[Lichtenstein and Slovic, 1971] Sarah Lichtenstein and Paul Slovic. Reversals of preferences between bids and choices in gambling decisions. Journal of Experimental Psychology, 89, 1971.
[Lichtenstein and Slovic, 1973] Sarah Lichtenstein and Paul Slovic. Response-induced reversals of preference in gambling decisions: An extended replication in Las Vegas. Journal of Experimental Psychology, 101, 1973.
[Loewer, 2001] Barry Loewer. Determinism and chance. Studies in History and Philosophy of Modern Physics, pages 609–620, 2001.
[Maher, 1993] Patrick Maher. Betting on Theories. Cambridge University Press, 1993.
[Maher, 1996] Patrick Maher. Subjective and objective confirmation. Philosophy of Science, 63, 1996.
[Maher, 2002] Patrick Maher. Joyce’s argument for probabilism. Philosophy of Science, 69:73–81, 2002.
[Makinson, 1965] David C. Makinson. The paradox of the preface. Analysis, 25:205–207, 1965.
[McGee, 1994] Vann McGee. Learning the impossible. In Ellery Eells and Brian Skyrms, editors, Probability and Conditionals: Belief Revision and Rational Decision, pages 179–199. Cambridge University Press, 1994.
[Meacham and Weisberg, unpublished] Christopher J. G. Meacham and Jonathan Weisberg. Debunking representation theorem arguments. Manuscript, unpublished.
[Meacham, 2005] Christopher J. G. Meacham.
Three proposals regarding a theory of chance. Philosophical Perspectives, 19:281–307, 2005.
[Miller, 1966] David W. Miller. A paradox of information. British Journal for the Philosophy of Science, 17:59–61, 1966.
[Nelkin, 2000] Dana K. Nelkin. The lottery paradox, knowledge, and rationality. The Philosophical Review, 109(3):373–408, 2000.
[Parikh and Parnes, 1974] Rohit Parikh and Milton Parnes. Conditional probabilities and uniform sets. In A. Hurd and P. Loeb, editors, Victoria Symposium on Non-Standard Analysis. New York: Springer-Verlag, 1974.
[Plantinga, 2000] Alvin Plantinga. Pluralism: A defense of religious exclusivism. In Philip L. Quinn and Kevin Meeker, editors, The Philosophical Challenge of Religious Diversity. Oxford University Press, 2000.
[Pollock, 1995] John L. Pollock. Cognitive Carpentry. Cambridge: MIT Press, 1995.
[Popper, 1959] Karl Popper. The Logic of Scientific Discovery. London: Hutchinson & Co., 1959.
[Ramsey, [1926] 1990] Frank Plumpton Ramsey. Truth and probability. In D. H. Mellor, editor, Philosophical Papers. Cambridge: Cambridge University Press, [1926] 1990.
[Reichenbach, 1949] Hans Reichenbach. The Theory of Probability: An Inquiry into the Logical and Mathematical Foundations of the Calculus of Probability. English translation by E. H. Hutton and Maria Reichenbach. Berkeley and Los Angeles: University of California Press, 1949.
[Rényi, 1970] Alfred Rényi. Foundations of Probability. San Francisco: Holden-Day Inc., 1970.
[Robinson, 1966] Abraham Robinson. Non-Standard Analysis: Studies in Logic and the Foundations of Mathematics. Amsterdam: North-Holland, 1966.
[Rosenkrantz, 1981] Roger D. Rosenkrantz. Foundations and Applications of Inductive Probability. Ridgeview Press, 1981.
[Ryan, 1991] Sharon Ryan. The preface paradox. Philosophical Studies, 64(3):293–307, 1991.
[Ryan, 1996] Sharon Ryan. The epistemic virtues of consistency. Synthese, 109(22):121–141, 1996.
[Savage, 1954] Leonard J. Savage. The Foundations of Statistics. Wiley Publications in Statistics, 1954.
[Schick, 1986] Frederic Schick.
Dutch books and money pumps. Journal of Philosophy, 83:112–119, 1986.
[Shafer, 1976] Glenn Shafer. A Mathematical Theory of Evidence. Princeton University Press, 1976.
[Shimony, 1955] Abner Shimony. Coherence and the axioms of confirmation. Journal of Symbolic Logic, 20:1–28, 1955.
[Shimony, 1988] Abner Shimony. An Adamite derivation of the calculus of probability. In J. H. Fetzer, editor, Probability and Causality. D. Reidel, 1988.
[Skyrms, 1980] Brian Skyrms. The Role of Causal Factors in Rational Decision, chapter 2. Yale University Press, 1980.
[Skyrms, 1987] Brian Skyrms. Dynamic coherence and probability kinematics. Philosophy of Science, 54:1–20, 1987.
[Skyrms, 1995] Brian Skyrms. Strict coherence, sigma coherence and the metaphysics of quantity. Philosophical Studies, 77:39–55, 1995.
[Talbott, 1991] William J. Talbott. Two principles of Bayesian epistemology. Philosophical Studies, 62:135–150, 1991.
[Teller, 1973] Paul Teller. Conditionalisation and observation. Synthese, 26:218–258, 1973.
[Thau, 1994] Michael Thau. Undermining and admissibility. Mind, 103:491–503, 1994.
[van Fraassen, 1980] Bas van Fraassen. The Scientific Image. Oxford University Press, 1980.
[van Fraassen, 1981] Bas van Fraassen. A problem for relative information minimizers. British Journal for the Philosophy of Science, 32, 1981.
[van Fraassen, 1983] Bas van Fraassen. Calibration: A frequency justification for personal probability. In R. Cohen and L. Laudan, editors, Physics, Philosophy, and Psychoanalysis. D. Reidel, 1983.
[van Fraassen, 1984] Bas van Fraassen. Belief and the will. The Journal of Philosophy, 81(5):235–256, 1984.
[van Fraassen, 1989] Bas van Fraassen. Laws and Symmetry. Oxford University Press, 1989.
[van Fraassen, 1990] Bas van Fraassen. Figures in a probability landscape. In J. Dunn and A. Gupta, editors, Truth or Consequences. Kluwer, 1990.
[van Fraassen, 1995] Bas van Fraassen. Belief and the problem of Ulysses and the Sirens. Philosophical Studies, 77:7–37, 1995.
[van Fraassen, 1999] Bas van Fraassen. Conditionalization, a new argument for. Topoi, 18:93–96, 1999.
[von Mises, [1928] 1981] Richard E. von Mises. Probability, Statistics, and Truth. New York: Dover, [1928] 1981.
[Vranas, 1998] Peter B. M. Vranas. Who’s afraid of undermining? Erkenntnis, 57(2):151–174, 1998.
[Vranas, 2004] Peter B. M. Vranas. Hempel’s raven paradox: A lacuna in the standard Bayesian account. British Journal for the Philosophy of Science, 55:545–560, 2004.
[Wagner, 2002] Carl Wagner. Probability kinematics and commutativity. Philosophy of Science, 69, 2002.
[Wakker and Tversky, 1993] Peter Wakker and Amos Tversky. An axiomatization of cumulative prospect theory. Journal of Risk and Uncertainty, 7(7):147–176, 1993.
[Walley, 1991] Peter Walley. Statistical Reasoning With Imprecise Probabilities. Chapman & Hall, 1991.
[Weatherson, 1999] Brian Weatherson. Begging the question and Bayesianism. Studies in History and Philosophy of Science, 30:687–697, 1999.
[Weisberg, 2007] Jonathan Weisberg. Conditionalization, reflection, and self-knowledge. Philosophical Studies, 135(2):179–197, 2007.
[Weisberg, 2009] Jonathan Weisberg. Commutativity or holism? A dilemma for conditionalizers. British Journal for the Philosophy of Science, forthcoming, 2009.
[White, 2009] Roger White. Evidential symmetry and mushy credence. In Oxford Studies in Epistemology. Oxford University Press, 2009.
[Williams, 1980] P. M. Williams. Bayesian conditionalisation and the principle of minimum information. British Journal for the Philosophy of Science, 32(2):131–144, 1980.
[Williamson, 1999] Jon Williamson. Countable additivity and subjective probability. British Journal for the Philosophy of Science, 50:401–416, 1999.
[Williamson, 2000] Timothy Williamson. Knowledge and its Limits. Oxford University Press, 2000.
[Williamson, 2007] Jon Williamson.
Inductive influence. British Journal for the Philosophy of Science, 58(4):689–708, 2007.
[Zynda, 2000] Lyle Zynda. Representation theorems and realism about degrees of belief. Philosophy of Science, 67, 2000.
INDUCTIVE LOGIC AND EMPIRICAL PSYCHOLOGY

Nick Chater, Mike Oaksford, Ulrike Hahn and Evan Heit
INTRODUCTION

An inductive logic is a system for reasoning that derives conclusions which are plausible or credible, but are nonetheless not certain. Inductive logic thus goes beyond the more familiar systems of deductive logic, in which the truth of the premises guarantees the truth of the conclusions. From All people are mortal, we may deductively infer that Person A is mortal, Person B is mortal, and so on. But from Person A is mortal, Person B is mortal, and so on, we can only inductively derive, with inevitable uncertainty, that All people are mortal. However many instances of the generalization we encounter, it is always possible that there is some counterexample of which we are not yet aware. But inductive inference extends far beyond this type of induction by enumeration. It can be argued, indeed, that many, and perhaps even almost all, inferences outside mathematics involve uncertain, inductive inference. In everyday life, people are routinely forced to work with scraps of information, whether derived from incomplete and noisy sensory input, linguistic information of uncertain provenance, or uncertain background theories or assumptions. Thus, the working of the human mind seems to be more a matter of tentative conjecture than of water-tight argument. To get a sense of the ubiquity of inductive inference, notice that a successful deductive argument cannot be overturned by any additional information that might be added to the premises. Thus, if we know that All quadrilaterals have angles summing to 360 degrees, and we know that a specific square is a quadrilateral, then we can infer with certainty that it has angles summing to 360 degrees. Any additional information that we might learn about the square cannot overturn this conclusion — if we subsequently learn that it is a large, red, metal square, we can still conclude that its angles have the same sum. 
Of course, on learning new information we may come to doubt the premises — for example, if I learn that the “square” has been etched onto a globe, I may come to doubt that it is really a square, in the conventional Euclidean sense, at all; and I may suspect that its angles sum to more than 360 degrees. But, although new information may cast doubt on the premises, it cannot lead us to doubt that the conclusion follows, if the premises are true. This property of deductive logic is known as monotonicity: i.e., adding premises can never overturn existing conclusions.
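The contrast between monotonic deduction and defeasible everyday inference can be made vivid probabilistically. In the sketch below (a toy joint distribution of our own devising, not from the chapter), a conclusion made probable by one premise is undermined when a further premise is added, which a valid deduction could never be.

```python
# Toy joint distribution over worlds (rain, umbrella, wet) -> probability.
# Hypothetical numbers, chosen only to illustrate nonmonotonicity.
joint = {
    (True,  True,  True):  0.02,
    (True,  True,  False): 0.08,
    (True,  False, True):  0.36,
    (True,  False, False): 0.04,
    (False, True,  False): 0.10,
    (False, False, False): 0.40,
}

def prob(pred):
    # Probability that predicate `pred` holds of a world.
    return sum(p for w, p in joint.items() if pred(w))

def cond(pred, given):
    # Conditional probability P(pred | given).
    return prob(lambda w: pred(w) and given(w)) / prob(given)

wet = lambda w: w[2]
rain = lambda w: w[0]
rain_and_umbrella = lambda w: w[0] and w[1]

print(cond(wet, rain))               # high: rain makes getting wet likely
print(cond(wet, rain_and_umbrella))  # low: the extra premise defeats it
```

Here conditioning on rain makes getting wet probable, but conditioning on rain plus the umbrella reverses the verdict: adding a premise overturns the conclusion, which is exactly the nonmonotonicity the next paragraph describes.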
Handbook of the History of Logic. Volume 10: Inductive Logic. Volume editors: Dov M. Gabbay, Stephan Hartmann and John Woods. General editors: Dov M. Gabbay and John Woods. © 2011 Elsevier BV. All rights reserved.
In reasoning about the everyday world, by contrast, nonmonotonicity is the norm: almost any conclusion can be overturned if additional information is acquired. Thus, consider the everyday inference from It’s raining and I am about to go outside to I will get wet. This inference is uncertain — indefinitely many additional premises (the rain is about to stop; I will take an umbrella; there is a covered walkway) can overturn the conclusion, even if the premises are correct. The nonmonotonicity of everyday inference is problematic for the application of logical methods to modelling thought. Nonmonotonic inferences are not logically valid and hence fall outside the scope of deductive logical methods. In psychology, it is clear that many cognitive processes are nonmonotonic. In perception, revealing more information about an object can often change the way in which it is interpreted (e.g., a random dot pattern is seen in depth only when it begins to move [Wallach and O’Connell, 1953]; the greyness of a surface is radically altered when information about its three-dimensional orientation in relation to the light source is revealed [Adelson, 1993]; and so on). Moreover, in the field of learning, nonmonotonicity is clearly the norm: our grammars, causal models or hypotheses may readily be overturned as new sentences are heard, novel actions are performed, or fresh observations are made. Inductive logic may also be required to capture verbally stated inferences that are typically viewed as instances of deduction. For example, consider the argument from if you put 50p in the coke machine, you will get a coke and I’ve put 50p in the coke machine, to I’ll get a coke. This argument appears to be an instance of a canonical monotonic logical inference: modus ponens. Yet in the context of commonsense reasoning, this argument does not appear to be monotonic at all. 
There are innumerable possible additional factors that may block this inference (power failure, the machine is empty, the coin or the can becomes stuck, and so on). Thus, you can put the money in, and no can of coke may emerge. If we attempt to maintain a logical analysis of this argument, these cases could be interpreted as indicating that, from a logical point of view, the conditional rule is simply false — precisely because it succumbs to counterexamples [Politzer and Braine, 1991]. This is, though, an excessively rigorous standpoint, from which almost all everyday conditionals will be discarded as false. But how could a plethora of false conditional statements provide a useful basis for thought and action? From a logical point of view, after all, we can only make inferences from true premises; a logical argument tells us nothing if one or more of its premises is false. In any event, the scope of deductive logic is highly restricted, and it is clear that many psychological processes, from perception, to learning, to everyday inference, are inductive in character. Philosophical attempts to uncover a system for reasoning with uncertainty are typically concerned initially with normative questions: what conclusions can justifiably, if tentatively, be drawn from given premises, and how can such patterns of uncertain inference be systematized? But, from the point of view of the descriptive problem of understanding how the mind operates, closely related questions arise. After all, dealing with uncertainty is, we might expect, an everyday
Inductive Logic and Empirical Psychology
555
challenge for cognitive systems, human or animal. But for the cognitive system to deal with uncertainty reliably presumably requires the application of some kind of method — i.e., conforming with, perhaps only approximately, some set of principles. Without some such foundation, the question of why the cognitive system copes with uncertainty (well enough, most of the time) is left unanswered. Any particular instance of uncertain reasoning may, of course, be explained by postulating that the cognitive system follows some special strategy, rather than general inference principles. But the mind is able to deal with a hugely complex and continually changing informational environment, for which special-purpose strategies cannot credibly pre-exist. Thus, to explain the reliable (if partial) success of the inductive leaps observed in human cognition, we should consider the possibility that thought is based on some set of principles of good inductive reasoning — i.e., perhaps thought can be explained by reference to some form of inductive logic. It turns out, of course, that relatively mild and uncontroversial assumptions about how inductive support should work lead, apparently inexorably, to the probability calculus (e.g., [Fitelson, 2005]). While inductive logic might contain more principles than elementary probability (e.g., principles concerning how to deal with inferential relations between logically complex sentences), it is fairly uncontroversial that inductive logics should include the conventional laws of probability. Thus, in restricted contexts, at least, we may replace the term 'inductive logic' with the term 'probability theory' — and, with some exceptions (such as empirical research on explicit inductive inference outlined below), psychologists primarily talk about probability rather than inductive logic.
From the point of view of empirical psychology, then, the proposal that the mind might, in some sense, embody an inductive logic is generally construed in a relatively restricted way. Thus, early, and now unpopular, theories of inductive logic, which pursued the hope that inductive logic might depend purely on the form of sentences, without reference to the meanings of their non-logical terms, or the state of the world (e.g., [Hempel, 1945; Carnap, 1950]), have been little considered. Moreover, theories in which degrees of inductive support are interpreted in terms of proportions of possible worlds (independent of whether these worlds can be conceived by an individual reasoner) are rarely considered (although some theories of probabilistic reasoning have proposed models which involve counting different types of "mental models," which might be viewed as a psychological analogue to the notion of possible worlds, e.g., [Johnson-Laird et al., 1999]). By far the most psychologically natural perspective on inductive logic is to view inductive support as a matter of subjective probability — i.e., the degree of belief, by a particular individual, in a specific proposition. After all, the key psychological question concerns the dynamics of belief revision: how does the addition of new information modify one's prior states of belief? The subjectivist view of probability is, particularly in the psychological and artificial intelligence communities, known as the Bayesian approach — simply because the probabilistic identity that is Bayes' theorem (discussed below) arises so centrally in the process of belief revision. The extent to which cognition should be viewed as conforming with, or
556
Nick Chater, Mike Oaksford, Ulrike Hahn and Evan Heit
departing from, the principles of probability, i.e., the extent to which a Bayesian view of the mind is productive or misleading, has been a central research theme in empirical research in psychology (e.g., [Edwards, 1954; Kahneman et al., 1982; Gigerenzer, 2002]). As we noted, only very mild restrictions on how "degrees of belief" should behave lead to the conclusion that such degrees of belief can be mapped to the [0,1] interval, and should obey the laws of probability. For example, the celebrated "Dutch book theorem" shows that, under fairly general conditions, any gambler whose subjective probabilities deviate from the laws of probability, however slightly, can be mercilessly exploited — i.e., the gambler will cheerfully accept a combination of bets such that, whatever happens, she is certain to lose money. Moreover, there are many such arguments, starting with different normative assumptions, which converge on the conclusion that "degrees of belief" should be governed by probability. Thus, if we want to explain how it is that people (and, indeed, animals) are able to cope so successfully with their highly uncertain world, the norms of probability provide the beginnings of an answer — to the extent that the mind reasons probabilistically, the normative justifications that imply that this is the "right" way to reason about uncertainty go some way to explaining how it is that the cognitive system deals with uncertainty with a reasonable degree of success. Alongside these a priori normative arguments stands a more practical reason to take probabilistic models of the mind seriously, which arises from artificial intelligence, and related fields such as computer vision and computational linguistics. Understanding any aspect of the biological world is, to some degree, a matter of reverse engineering — of inferring engineering principles from data.
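The Dutch book argument mentioned above can be made concrete with a toy calculation. Suppose an agent's degrees of belief in A and in not-A sum to more than 1 (violating additivity); the beliefs and stakes below are invented for illustration. A bookmaker who sells the agent both bets at prices the agent regards as fair is guaranteed a profit:

```python
# An incoherent agent: degrees of belief in A and not-A sum to 1.2, not 1.
belief_A = 0.6
belief_not_A = 0.6

stake = 1.0  # each bet pays out `stake` if it wins, nothing otherwise

# The agent regards price = belief * stake as fair, so accepts both bets.
cost = (belief_A + belief_not_A) * stake  # 1.2

# Whatever happens, exactly one of the two bets pays out.
nets = []
for A_is_true in (True, False):
    payout = stake  # one and only one bet wins
    nets.append(payout - cost)

print(nets)  # both entries are negative: a sure loss of 0.2 either way
```

A coherent agent, whose beliefs in A and not-A sum to exactly 1, would pay exactly the guaranteed payout and could not be exploited in this way.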
Reverse engineering is, though, of course strongly constrained, in practice, by the range of options offered by current "engineering" technologies. There has been something of a probabilistic revolution in the last two decades in proposals concerning engineering solutions to the types of problems solved by the cognitive system. Probabilistic approaches have become increasingly ubiquitous, and widely used, particularly in the light of technical developments that make complex probabilistic models both formally and computationally more manageable than previously. From knowledge bases, to perception, to language and motor control, there has been considerable application of sophisticated probabilistic methods (e.g., [Chater and Oaksford, 2008; Chater et al., 2006; Oaksford and Chater, 1998; Pearl, 1988; 2000]). So we have two reasons to take Bayesian models of the mind seriously — probability is arguably the "right" way to deal with uncertainty; and it proves practically useful in solving cognitively-relevant engineering problems. But how useful does this approach to cognition prove to be in practice? How far do alternative models provide a better account? In precisely what sense, if any, should the mind be viewed as probabilistic? And does the Bayesian perspective immediately collapse, in the light of the fact that people are known to make numerous, and systematic, errors in probabilistic reasoning problems? In this chapter, we sketch the Bayesian, subjectivist view of inductive probability in relation to psychological
processes. We then survey the application of Bayesian inductive logic in five key areas: language, inductive inference, reasoning, decision making, and argument. Finally, we consider challenges for the attempt to connect inductive logic and empirical psychology.

1 THE BAYESIAN APPROACH TO COGNITION
The vision of probability as a model of thought is as old as the study of probability itself. Indeed, from the outset of the development of the mathematics of probability, the notion had a dual aspect: serving as a normative calculus dictating how people should reason about chance events, such as shipping losses or rolls of a die, and, at the same time, as a descriptive theory of how people reason about uncertainty. The very title of Bernoulli's great work, The Art of Conjecture [Bernoulli, 1713], nicely embodies this ambiguity — suggesting that it is both a manual concerning how this art should be practiced, and an outline of how the art is actually conducted. This dual perspective was, indeed, not confined merely to probability, but applied equally well to logic, the calculus of certain reasoning. Thus Boole's [1958/1854] The Laws of Thought, which deals with both logical and probabilistic reasoning, also embodies the ambiguity implicit in its title — it aims to be a description of how thought works, but also views the laws of thought as providing norms to which reasoning should conform. In retrospect, the identification, or perhaps conflation, of normative and descriptive programmes seems anomalous. Towards the end of the nineteenth century, mathematics began to break away from the morass of psychological intuition; and throughout the twentieth century, increasingly formal and abstract programmes for the foundations of mathematics developed, seeming ever more distant from psychological notions. Thus, in the context of probability, Kolmogorov provided an axiomatization of probability in terms of σ-algebras, which views probability theory as an abstract formal structure, with no particular linkage to psychological notions concerning degree of belief or plausibility.
Indeed, the idea that mathematics should be rooted in psychological notions became increasingly unpopular, and the perspective of psychologism became philosophically disreputable. At a practical level, too, the mathematics and psychology of probability became ever more distant. The mathematics became increasingly formally sophisticated, with spectacular results; but most of this work explicitly disavowed the idea that probability was about beliefs at all. The most popular perspective on probability took the view that probabilities should be interpreted, instead, as limiting frequencies over repeatable events. Thus, to say that the probability of a coin falling heads is 1/2 is to say something like: in the limit, if this event is repeated indefinitely, the proportion of times that the coin comes up heads will tend towards 1/2. This frequentist [von Mises, 1957] interpretation of probability aims to separate probability entirely from the beliefs of any particular person observing the coin — the probability is supposed to be a fact about the coin, not about the degrees of belief of an observer of the coin.
The premise underlying the Bayesian approach to psychology is that this divorce was somewhat premature — and that, at minimum, a limited reconciliation should be attempted. In particular, the conjecture is that many aspects of thought can be understood as, at some level of approximation at least, embodying probabilistic calculations. We mentioned above that, normative considerations aside, one appeal of probabilistic models of cognition is that probability has swept into vogue in fields concerned with engineering solutions to information processing problems analogous to those solved by the brain. And this work has overwhelmingly taken the subjectivist, rather than the frequentist, view of probability. One reason for this is that, in many practical applications, the frequentist interpretation of probability simply does not apply — probabilities can only be viewed as expressing degrees of belief (or, more neutrally, degrees of partial information — after all, we may not want to attribute full-blown belief to a simple computational model, or an elementary cognitive process). Thus, in speech recognition or computational vision, each sensory input is enormously complex and will never be encountered again. Hence, there is no meaningful limiting frequency concerning the probability that this image is a photograph of a dog, or a wolf. It is definitely one or the other (the frequencies are 0 or 1 for each category). Similarly, the frequentist interpretation is not appropriate for interpreting uncertainty concerning scientific hypotheses, because, of course, any scientific hypothesis holds, or it does not; and hence limiting frequencies across many trials make no sense. In cases where the goal is to quantify the uncertainty about a state of the world, the uncertainty resides in the computational system (the human or animal brain, the machine learner) attempting to infer the probability.
But once we interpret probability as concerning subjective states of belief or information — i.e., once we adopt the subjective interpretation of probability — then it is natural to frame the computational challenge of recognizing a word, an animal, an action, or a scientific hypothesis purely as a matter of probabilistic calculation. Indeed, according to results such as the Dutch book theorem, mentioned above, once we start to assign degrees of uncertainty to states of any kind, it is mandatory that we use the laws of probability to manipulate these uncertainties, on pain of demonstrable irrationality (e.g., being willing to accept combinations of gambles leading to a certain loss). In perception, as well as in many aspects of learning and reasoning, the primary goal is working out the probability of various possible hypotheses about the state of the world, given a set of data. This is typically done indirectly, by viewing the various hypotheses about the world as implying probabilities concerning the possible sensory data — i.e., we view these various states of the world as implicitly making claims about the probability of different patterns of data. An elementary identity of probability allows us to express the probabilities that we are interested in, Pr(H_i | D), the probability that hypothesis H_i is true given the observed data D, in terms of the probabilities that are presumed to be implicit in the hypotheses themselves: the probabilities Pr(D | H_i) of the data, given each H_i. The elementary identity follows immediately from the definition of conditional probability:
Pr(H_i | D) Pr(D) = Pr(H_i, D) = Pr(D | H_i) Pr(H_i)

so that we obtain:

Pr(H_i | D) = Pr(D | H_i) Pr(H_i) / Pr(D)

which is Bayes' theorem. The probability of the data is not, of course, known independently of the hypotheses that might generate that data — so in practice Pr(D) is typically expanded using the probabilistic identity:

Pr(D) = Σ_j Pr(D | H_j) Pr(H_j)
Thus, taking a subjective approach to probability, where states of the world may be viewed as uncertain from the point of view of an agent, implies that making inferences about the likely state of the world is a matter of probabilistic calculation; and such calculations typically invoke Bayes' theorem, to invert the relationship between hypothesis and data. The prevalence of Bayes' theorem in this type of calculation has led this approach, in statistics [Bernardo and Smith, 1994], machine learning [MacKay, 2003], and scientific reasoning [Howson and Urbach, 1993], to be known as the Bayesian approach — but the point of controversy is not, of course, the probabilistic identity that is Bayes' theorem, but rather the adoption of the subjective interpretation of probability. Indeed, in cognitive science, given that almost all applications of probability require a subjective interpretation of uncertainty, the probabilistic approach and the Bayesian approach are largely synonymous.
Levels of probabilistic explanation

Probability is, we have suggested, potentially relevant to understanding the mind/brain. But it can be applied in a range of different ways and at different levels of explanation, ranging from probabilistic analysis of the neural processes in perception and motor control, to normative description of how decision makers should act in economic contexts. But these seem to be explanations at very different levels — and it is worth pausing briefly to consider the range of different levels of analysis at which probabilistic ideas may be applied — and hence to clarify the claims that are (and are not) being reviewed in this chapter. We suggest that the variety of types of explanation can usefully be understood in terms of Marr's [1982] celebrated distinction between three levels of computational explanation: the computational level, which specifies the nature of the cognitive problem being solved, the information involved in solving it, and the logic by which it can be solved (this is closely related to the level of rational analysis, see [Anderson, 1990; 1991a; Anderson and Milson, 1989; Anderson and Schooler, 1991; Oaksford and Chater, 1994; 1998a]); the algorithmic level, which specifies
the representations and processes by which solutions to the problem are computed; and the implementational level, which specifies how these representations and processes are realized in neural terms. The Bayesian approach has potential relevance at each of these levels. As we have noted, the very fact that much cognitive processing is naturally interpreted as uncertain inference immediately highlights the relevance of probabilistic methods at the computational level. This level of analysis is focused entirely on the nature of the problem being solved — there is no commitment concerning how the cognitive system actually attempts to solve (or approximately to solve) the problem. Thus, a probabilistic viewpoint on the problem of, say, perception or inference, is compatible with the belief that at the algorithmic level, the relevant cognitive processes operate via a set of heuristic tricks (e.g., [Gigerenzer and Todd, 1999; Ramachandran, 1994]), rather than explicit probabilistic computations. One drawback of the heuristics approach, though, at which we have hinted already, is that it is not easy to explain the remarkable generality and flexibility of human cognition. Such flexibility seems to suggest that cognitive problems involving uncertainty may, in some cases at least, be solved by the application of probabilistic methods. Thus, we may take models such as stochastic grammars for language or vision, or Bayesian networks, as candidate hypotheses about cognitive representation. Yet, when scaled up to real-world problems, full Bayesian computations are intractable, an issue that is routinely faced in engineering applications. From this perspective, the fields of machine learning, artificial intelligence, statistics, information theory and control theory can be viewed as rich sources of hypotheses concerning tractable, approximate algorithms that might underlie probabilistic cognition.
Finally, turning to the implementational level, one may ask whether the brain itself should be viewed in probabilistic terms. Intriguingly, many of the sophisticated probabilistic models that have been developed with cognitive processes in mind map naturally onto highly distributed, autonomous, and parallel computational architectures, which seem to capture the qualitative features of neural architecture. Indeed, computational neuroscience [Dayan and Abbott, 2001] has attempted to understand the nervous system as implementing probabilistic calculations; and neurophysiological findings, ranging from spike trains in the blow-fly visual system [Rieke et al., 1997], to cells apparently involved in decision making in monkeys [Gold and Shadlen, 2000], have been interpreted as conveying probabilistic information. Nonetheless, large-scale probabilistic calculations over complex internal representations, and reasonably large sets of data, are typically computationally intractable. Thus, typically, the number of possible states of the world grows exponentially with the number of facts that are considered. Calculations over this exponentially large set of world-states are typically viable only to an approximation. Thus, the mind cannot credibly be viewed as a "Laplacian demon," making complete and accurate probabilistic calculations [Gigerenzer and Goldstein, 1996; Oaksford and Chater, 1998b] — but rather must, at best, be viewed as approximating such calculations, perhaps using some very drastic simplifications. How far it is possible to tell an integrated probabilistic story across levels of explanation, or whether the picture is more complex, remains to be determined by future research.
Why is probability so hard?

The question of levels is important in addressing what may appear to be direct evidence against the application of inductive logic in psychology — research on how people reason explicitly about probability. Describing probabilities as degrees of belief, as in the subjectivist interpretation of probability, invites comparison with the folk psychological notion of belief, in terms of which our everyday accounts of each other's behaviour are framed (e.g., [Fodor, 1987]). This in turn suggests that people might reasonably be expected to introspect about the probabilities associated with their beliefs. In practice, people often appear poor at making such numerical judgments; and poor, too, at numerical probabilistic reasoning problems, where they appear to fall victim to a range of probabilistic fallacies (e.g., [Kahneman et al., 1982]). The fact that people can appear to be such poor probabilists may seem to conflict with the thesis that many aspects of cognition can fruitfully be modelled in probabilistic terms. Yet this conflict is only apparent. People struggle not just with probability, but with all branches of mathematics. Yet the fact that, e.g., Fourier analysis is hard to understand does not imply that it, and its generalizations, are not fundamental to audition and vision. The ability to introspect about the operations of the cognitive system is the exception rather than the rule — hence, probabilistic models of cognition do not imply the cognitive naturalness of learning and applying probability theory. Indeed, probabilistic models may be most applicable to cognitive processes that are particularly well-optimized, and which solve the probabilistic problem of interest especially effectively. Thus, vision or motor control may be especially amenable to a probabilistic approach; and our explicit attempts to reason about chance might often, ironically, be poorly modelled by probability theory.
Nonetheless, some conscious judgments have proven amenable to probabilistic analyses, such as assessments of covariation or causal efficacy [Cheng, 1997; Griffiths and Tenenbaum, 2005; Waldmann, 2008], uncertain reasoning over causal models [Sloman and Lagnado, 2004], or predicting the prevalence of everyday events [Griffiths and Tenenbaum, 2006]. But unlike textbook probability problems, these are exactly the sorts of critical real-world judgments for which human cognition should be expected to be optimized.
The probabilistic turn in the cognitive and brain sciences

We have suggested that probabilistic analysis may be especially appropriate for highly optimized aspects of cognition — i.e., the domains for which it is credible that the brain has some dedicated computational "module" or system of modules (e.g., [Fodor, 1983; Shallice, 1988]). Thus, the probabilistic approach has been
widely applied in the areas of perception, motor control, and language, where the performance of dedicated computational modules exceeds the abilities of any artificial computational methods by an enormous margin. Before turning to the main topics of this chapter, the somewhat ill-defined area of "central" cognition, we briefly review the much larger and more extensively developed literatures that apply probabilistic methods to these "modular" domains. Consider, for example, the problem of inferring the structure of the world from visual input. There are, notoriously, infinitely many states of the environment that can give rise to any perceptual input (e.g., [Freeman, 1994]) — this is just an example of the standard observation, in the philosophy of science, that theory is underdetermined by data [Laudan and Leplin, 1991]; or, in statistics, that an infinite number of curves can fit any particular set of data points (e.g., [Mackay, 1992]). A natural objective of the perceptual system, faced with an infinite number of possible interpretations of a stimulus, is to aim to choose the interpretation which is most likely. From this perspective, perception is a problem of probabilistic inference almost by definition. The idea that the perceptual system seeks the most likely interpretation can be traced to Helmholtz [1910/1962]. More recently, it has been embodied in the Bayesian approach to visual perception that has become prominent in psychology and in neuroscience.
This viewpoint has been backed by direct experimental evidence (e.g., [Gregory, 1970; Rock, 1983]) for the inferential character of perceptual interpretation; and also by the construction of detailed theories of particular aspects of perceptual processing, from a Bayesian perspective, including low-level image interpretation [Weiss, 1997], shape from shading [Freeman, 1994; Adelson and Pentland, 1996], shape from texture [Blake et al., 1996], image segmentation, object recognition [Tu et al., 2005], and interpolation of boundaries [Feldman, 2001; Feldman and Singh, 2005]. Moreover, the function of neural mechanisms involved in visual perception has also been given a probabilistic interpretation — from lateral inhibition in the retina (e.g., [Barlow, 1959]), to the activity of single cells in the blow-fly [Snippe et al., 2000]. The scope of the probabilistic view of perception may, moreover, be somewhat broader than might at first be thought. Although apparently very different from the likelihood view, the simplicity principle in perception, which proposes that the perceptual system chooses the interpretation of the input which provides the simplest encoding of that input (e.g., [Attneave, 1954; Hochberg and McAlister, 1953; Leeuwenberg, 1969; 1971; Leeuwenberg and Boselie, 1988; Mach, 1959/1914; Restle, 1970; Van der Helm and Leeuwenberg, 1996], though see [Olivers et al., 2004]), turns out to be mathematically equivalent to the likelihood principle [Chater, 1996]. Specifically, under mild mathematical restrictions, for any probabilistic analysis of a perceptual inference (using particular prior probabilistic assumptions) there is a corresponding simplicity-based analysis (using a particular coding language, in which the code-length of an encoding of perceptual data in terms of an interpretation provides the measure of complexity), such that the most likely and the simplest interpretations coincide. Thus, theories of perception based on
simplicity and coding, and theories of neural function based on decorrelation and information compression (e.g., [Barlow, 1959]), can all be viewed as part of the Bayesian probabilistic approach to perception. The study of perceptuo-motor control provides a second important area of Bayesian analysis. Sensory feedback, typically integrated across different modalities (e.g., visual and haptic information about the position of, e.g., the hand), contributes to estimating the current state of the motor system. Knowing this current state, and the location and layout of various aspects of the external environment, is essential for the brain to be able to plan successful motor movements. The precise way in which movements, such as a grasp, are carried out is likely to have consequences in terms of "utility" for the agent. Thus, successfully grasping a glass of orange juice may presage a pleasant drink; a less successful grasp may result in unnecessary delay, a slight spillage, a broken glass, or a stained sofa. The motor system needs to choose actions which, given the precision of the information that it has, and the agent's utilities, give the best expected outcome. The machinery of Bayesian decision theory [Berger, 1985] can be recruited to address this problem. Bayesian decision theory has been widely applied as a theoretical framework for understanding the control of movement (e.g., [Koerding and Wolpert, 2006]). A wide range of experimental evidence has indicated that movement trajectories are indeed accurately predictable in these terms. In a particularly elegant study, Koerding and Wolpert [2004a] showed that people rely on prior knowledge, rather than evidence from sensory input, to a degree that depends on the relative precision of each source of information, in a simple repeated motor task.
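The precision-weighting at issue here can be sketched with the standard Gaussian case: combining a learned prior with a noisy sensory estimate, the posterior mean is a precision-weighted average of the two. The means and variances below are invented for illustration, not taken from the experiments cited.

```python
# Illustrative Gaussian prior (learned over trials) and Gaussian sensory
# likelihood (this trial's noisy estimate); all numbers are assumptions.
prior_mean, prior_var = 0.0, 4.0   # broad prior over, e.g., a target shift
sense_mean, sense_var = 2.0, 1.0   # more precise sensory estimate

# The posterior of a Gaussian prior x Gaussian likelihood is Gaussian:
# its precision (inverse variance) is the sum of the two precisions, and
# its mean is the precision-weighted average of the two means.
post_precision = 1.0 / prior_var + 1.0 / sense_var
post_mean = (prior_mean / prior_var + sense_mean / sense_var) / post_precision
post_var = 1.0 / post_precision

print(post_mean, post_var)  # mean 1.6: pulled toward the more precise cue
```

Shrinking the sensory variance pulls the estimate toward the sensory input; inflating it pulls the estimate back toward the prior, which is the qualitative pattern reported in the motor studies discussed above.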
This suggests that the brain learns to model both the distribution of outcomes in prior trials, and the reliability of sensory input — as performance is accurately tuned to the particular distributions of each to which participants are exposed. Similar effects arise not just in movement trajectories, but in force estimation [Koerding and Wolpert, 2004b] and sensory-motor timing [Miyazaki et al., 2005]. This work can be generalized to consider the on-line planning of motor movements — i.e., the brain must plan trajectories so that its on-line estimation of its own state, and ability to dynamically modify that state, lead to optimal trajectories. The technical extension of Bayesian methods to problems of this type is the subject of the field of on-line feedback control, and there is experimental evidence that people's movements are well predicted by these methods (e.g., [Knill and Saunders, 2003; Todorov and Jordan, 2002]). Overall, the Bayesian framework has proved remarkably productive in analysing human motor control. We now turn to the main topics of this chapter, the somewhat ill-defined area of "central" cognition. However, we begin with language. Despite being characterised as a modular system [Chomsky, 1981; Fodor, 1983], language really sits at the borderline between modular input systems and central systems involved in inference, argument and decision making [Fodor, 1983]. The problems that are solved by central systems are invariably posed linguistically and interact strongly with mechanisms of language interpretation.
2 LANGUAGE
The processing and acquisition of language is a central topic in cognitive science. Yet, perhaps surprisingly, the first steps towards a cognitive science of language involved driving out, rather than building on, probability. Whereas structural linguistics focussed on finding regularities in the statistical complexities of language corpora, the Chomskyan revolution focussed on the abstract rules governing linguistic "competence," based on judgements of linguistic acceptability [Chomsky, 1965]. Whereas behaviorists viewed language as a stochastic process determined by principles of reinforcement between stimuli and responses, the new psycholinguistics viewed language processing as governed by internally represented linguistic rules [Fodor et al., 1974]. And interest in statistical and information-theoretic properties of language [Shannon, 1951] was replaced by the mathematical machinery of formal grammar. In sum, probability has had a bad press in the cognitive science of language. The focus on complex linguistic representations (feature matrices, trees, logical representations) and rules defined over them has crowded out probabilistic notions. And the impression that probabilistic ideas are incompatible with the Chomskyan approach to linguistics has been reinforced by debates which appear to pitch probabilistic and related quantitative/connectionist approaches against the symbolic approach to language [Marcus et al., 1999; Pinker, 1999; Seidenberg, 1997; Seidenberg and Elman, 1997]. The recent development of sophisticated probabilistic models casts these issues in a different light. Such models may be defined over symbolic rules and representations, rather than standing in opposition to them. Thus, grammatical rules may be associated with probabilities of use, capturing what is linguistically likely, not just what is linguistically possible. From this viewpoint, probabilistic ideas augment symbolic models of language [Klavans and Resnik, 1996; Manning, 2003].
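The idea that symbolic rules can carry probabilities of use is easily illustrated with a probabilistic context-free grammar: a derivation's probability is the product of the probabilities of the rules it applies. The tiny grammar below, and its rule probabilities, are invented for illustration only.

```python
# Toy probabilistic context-free grammar: for each left-hand side, the
# probabilities of its alternative expansions sum to 1 (all numbers invented).
pcfg = {
    "S":  [(("NP", "VP"), 1.0)],
    "NP": [(("Det", "N"), 0.7), (("N",), 0.3)],
    "VP": [(("V", "NP"), 0.6), (("V",), 0.4)],
}

def derivation_probability(rules_used):
    """Multiply the probabilities of the rules applied in a derivation."""
    prob = 1.0
    for lhs, rhs in rules_used:
        prob *= dict(pcfg[lhs])[rhs]  # look up this rule's probability
    return prob

# Derivation for a bare-noun sentence like "dogs chase cats":
# S -> NP VP; NP -> N; VP -> V NP; NP -> N
deriv = [("S", ("NP", "VP")), ("NP", ("N",)),
         ("VP", ("V", "NP")), ("NP", ("N",))]
print(derivation_probability(deriv))  # 1.0 * 0.3 * 0.6 * 0.3 = 0.054
```

The grammar is exactly the symbolic object of generative linguistics; the probabilities simply grade the possible derivations, so that the parser can prefer the most likely analysis rather than merely enumerating the grammatical ones.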
Yet this complementarity does not imply that probabilistic methods merely add to symbolic work, without modification. On the contrary, the “probabilistic turn,” broadly characterized, has led to some radical re-thinking in the cognitive science of language, on a number of levels. In linguistics, there has been renewed interest in phenomena that seem inherently graded and/or stochastic, from phonology to syntax [Bod et al., 2003; Fanselow et al., 2006; Hay and Baayen, 2005] — this linguistic work is complementary to the focus of Chomskyan linguistics. There have also been revisionist perspectives on the strict symbolic rules thought to underlie language. Although inspired by a type of probabilistic connectionist network, standard optimality theory attempts to define a middle ground of ranked, violable linguistic constraints, used particularly to explain phonological regularities [Smolensky and Legendre, 2006]. It has been extended to employ increasingly rich probabilistic variants. And in morphology, there is debate over whether “rule+exception” regularities (e.g., English past tense, German plural) are better explained by a single stochastic process [Hahn and Nakisa, 2000].
Inductive Logic and Empirical Psychology
565
While touching on these issues, this review explores a narrower perspective: that language is represented by a probabilistic model [Manning, 2003]; that language processing involves generating or interpreting using this model; and that language acquisition involves learning such models. (Another interesting line of work that we do not review assumes instead that language processing is based on memory for past instances, rather than on the construction of a model of the language [Daelemans and van den Bosch, 2005]). Moreover, for reasons of space, we shall focus mainly on parsing and learning grammar, rather than, for example, exploring probabilistic models of how words are recognized [Norris, 2006] or learned [Xu and Tenenbaum, 2007]. We will see that a probabilistic perspective adds to, but also substantially modifies, modelling the symbolic rules, representations and processes underlying language.
From grammar to probabilistic models

To see the contribution of probability, let us begin without it. According to early Chomskyan linguistics, language is internally represented as a grammar: a system of rules that specify all and only allowable sentences. Thus, parsing is viewed as the problem of inferring an underlying linguistic tree, t ∈ T, from the observed strings of words, s ∈ S. Yet natural language is notoriously ambiguous — there are many ways in which local chunks can be parsed, and exponentially many ways in which these parses can be stitched together to produce a global parse. Searching these possibilities is hugely challenging; and there are often many globally possible parses (many t for a single s). The problem gets dramatically easier if the cognitive system knows that the bracketing [the [old [man]]] is much more likely than [[the old] man] (though this latter reading is possible, as in the old man the boats). This helps locally prune the search space; and helps decide between interpretations for globally ambiguous sentences. In particular, Bayesian methods specify a framework showing how information about the probability of generating different grammatical structures, and their associated word strings, can be used to infer grammatical structure from a string of words. This Bayesian framework is analogous to probabilistic models of vision, inference and learning; what is distinctive is the specific structures (e.g., trees, dependency diagrams) relevant for language. In computational linguistics, the practical challenge of parsing and interpreting corpora of real language (typically text, sometimes speech) has led to a strong focus on probabilistic methods. However, computational linguistics often parts company from standard linguistic theory, which focuses on much more complex grammatical frameworks, where probabilistic and other computational methods cannot readily be applied.
But computational linguistics does, we suggest, provide a valuable source of hypotheses for the cognitive science of language. Formally, probabilistic parsing involves estimating Pr_m(t | s), i.e., estimating the likelihood of different trees, t, given a sentence, s, and given a probabilistic model Pr_m of the language:
566
Nick Chater, Mike Oaksford, Ulrike Hahn and Evan Heit
Pr_m(t | s) = Pr_m(t, s) / Σ_{t′} Pr_m(t′, s)        (1)
The probabilistic model can take as many forms as there are linguistic theories (and linguistic structures, t, may equally be trees, attribute-value matrices, dependency diagrams, etc.). For example, suppose that our grammar is a context-free phrase structure grammar. Probabilities are defined for expanding each node in a tree using a given rule. The product of probabilities in a derivation gives the overall probability of that tree. A particular syntactic ambiguity, much studied in psycholinguistics, concerns prepositional phrase attachment, e.g., she saw the boy with the telescope. The parser has to decide: does the prepositional phrase (e.g., with the telescope) modify the verb phrase describing the girl’s action — i.e., she saw-with-a-telescope the boy; or the noun phrase the boy — i.e., she saw the-boy-with-a-telescope? This question is a useful starting point for discussing the role of probability in the cognitive science of language.
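As a concrete illustration, the following sketch scores the two attachment readings of she saw the boy with the telescope under a toy probabilistic context-free grammar and normalizes, as in equation (1). The rules and probabilities are invented for illustration, not estimates from any corpus; rules shared by both parses (e.g., for the subject she) are omitted, since they cancel in the normalization.

```python
# Toy PCFG fragment: each (parent, expansion) rule has a probability.
# All numbers are hypothetical, chosen only to illustrate the computation.
rules = {
    ("VP", ("V", "NP", "PP")): 0.2,   # verb-attached PP reading
    ("VP", ("V", "NP")): 0.5,
    ("NP", ("NP", "PP")): 0.1,        # noun-attached PP reading
    ("NP", ("Det", "N")): 0.6,
    ("V", ("saw",)): 0.1,             # lexical rules, shared by both parses
    ("PP", ("with", "the", "telescope")): 0.05,
}

def tree_prob(derivation):
    """Probability of a tree = product of the probabilities of its rules."""
    p = 1.0
    for rule in derivation:
        p *= rules[rule]
    return p

# VP attachment: [VP [V saw] [NP the boy] [PP with the telescope]]
vp_attach = [("VP", ("V", "NP", "PP")), ("NP", ("Det", "N")),
             ("V", ("saw",)), ("PP", ("with", "the", "telescope"))]
# NP attachment: [VP [V saw] [NP [NP the boy] [PP with the telescope]]]
np_attach = [("VP", ("V", "NP")), ("NP", ("NP", "PP")), ("NP", ("Det", "N")),
             ("V", ("saw",)), ("PP", ("with", "the", "telescope"))]

p_vp = tree_prob(vp_attach)
p_np = tree_prob(np_attach)
# Posterior over trees given the sentence, as in equation (1):
post_vp = p_vp / (p_vp + p_np)
```

With these invented numbers the verb-attachment tree, having one rule fewer, gets the larger share of the posterior probability.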
Principles, probability, and plausibility in parsing

Classical proposals in psycholinguistics assumed that disambiguation occurs using structural features of the trees. For example, the principle of minimal attachment would prefer the first reading, because it has one fewer node [Frazier and Fodor, 1978]. The spirit of this proposal could, though, be recast probabilistically: the probability of a tree is the product of the probabilities at each node; and hence, other things being equal, fewer nodes imply higher probability. Structural principles in parsing have come under threat from varied parsing preferences within and across languages. But a stochastic grammar may capture different parsing preferences across languages, because the probability of different structures may differ across languages. A structure with fewer nodes, but using highly improbable rules (estimated from a corpus) will be dispreferred. Psycholinguists are increasingly exploring corpus statistics across languages, and parsing preferences do seem to fit the probabilities evident in each language [Desmet et al., 2006; Desmet and Gibson, 2003]. A second problem for structural parsing principles is the influence of lexical information. Thus, the preference for the structurally analogous the girl saw the boy with a book appears to reverse — because books are not aids to sight as telescopes are. The pattern flips back with a change of verb: the girl hit the boy with a book, because books can be aids to hitting. The probabilistic approach seems useful here — because it seems important to integrate the constraint that seeing-with-telescopes is much more likely than seeing-with-books. One way to capture these constraints aims to capture statistical (or even rigid) regularities between head words of phrases. For example, “lexicalized” grammars, which carry information about what material co-occurs with specific words, substantially improve computational parsing performance [Charniak, 1997; Collins, 2003].
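The lexical effects just described can be given a minimal computational sketch. The counts below are invented for illustration (not drawn from any corpus), and the decision rule is a deliberately crude stand-in for the lexicalized head-word statistics used in real parsers [Charniak, 1997; Collins, 2003]:

```python
# Hypothetical head-word co-occurrence counts: how often does a PP headed
# by (prep, object) modify each candidate attachment head?
counts = {
    ("saw", "with", "telescope"): 40, ("boy", "with", "telescope"): 5,
    ("saw", "with", "book"): 2,       ("boy", "with", "book"): 30,
    ("hit", "with", "book"): 35,
}

def attach(verb, noun, prep, pobj):
    """Prefer the attachment site whose head co-occurs more often with the PP."""
    v = counts.get((verb, prep, pobj), 0)
    n = counts.get((noun, prep, pobj), 0)
    return "verb" if v > n else "noun"

# The preference flips with lexical content, as described in the text:
r1 = attach("saw", "boy", "with", "telescope")  # seeing-with-telescope: verb
r2 = attach("saw", "boy", "with", "book")       # boy-with-book: noun
r3 = attach("hit", "boy", "with", "book")       # hitting-with-book: verb again
```

Real lexicalized models smooth such counts and combine them with structural probabilities, but the flip in preference driven purely by the lexical heads is the point of the sketch.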
Plausibility and statistics

Statistical constraints between words are, however, a crude approximation to which sentences are plausible. In an off-line judgement task, we use world knowledge, understanding of the social and environmental context, pragmatic principles, and much more, to determine what people might plausibly say or mean. Determining whether a statement is plausible may involve determining how likely it is to be true; but also whether, given the present context, it might plausibly be said. The first issue requires a probabilistic model of general knowledge [Oaksford and Chater, 1998; Tenenbaum et al., 2006]. The second issue requires engaging “theory of mind” (inferring the other’s mental states), and invoking principles of pragmatics. Models of these processes, probabilistic or otherwise, are very preliminary [Jurafsky, 2003]. A fundamental theoretical debate is whether plausibility is used on-line in parsing decisions. Are statistical dependencies between words used as a computationally cheap surrogate for plausibility? Or are both statistics and plausibility deployed on-line, perhaps in separate mechanisms? Eye-tracking paradigms [Tanenhaus et al., 1995; McDonald and Shillcock, 2003] have been used to suggest that both factors are used on-line, though the interpretations are controversial. Recent work indicates that probabilistic grammar models often predict the time course of processing [Jurafsky, 1996; Narayanan and Jurafsky, 2002; Hale, 2003], though parsing preferences also appear to be influenced by additional factors, including the linear distance between the incoming word and the prior words to which it has a dependency relation [Grodner and Gibson, 2005].

Is the most likely parse favoured?

In the probabilistic framework, it is typically assumed that on-line ambiguity resolution favours the most probable parse.
Yet Chater, Crocker and Pickering [1998] suggest that, for a serial parser, whose chance of “recovery” is highest if the “mistake” is discovered soon, this is overly simple. In particular, they suggest that because parsing decisions are made on-line [Pickering et al., 2000] there should be a bias to choose interpretations which make specific predictions that might rapidly be falsified. For example, after John realized his... the more probable interpretation is that realized introduces a reduced complement clause (i.e., John realized (that) his...). On this interpretation, the rest of the noun phrase after his is unconstrained. By contrast, the less probable transitive reading (John realized his goals/potential/objectives) places very strong constraints on the subsequent noun phrase. Perhaps, then, the parser should favour the more specific reading, because if wrong, it may rapidly and successfully be corrected. Chater et al. [1998] provide a Bayesian analysis of “optimal ambiguity resolution” capturing such cases. The empirical issue of whether the human parser follows this analysis [Pickering et al., 2000], and even the correct probabilistic analysis of sentences of this type [Crocker and Brants, 2000], is not fully resolved.
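The intuition behind "optimal ambiguity resolution" can be illustrated with a toy expected-value calculation. The probabilities below are invented, and this is only a schematic rendering of the trade-off Chater et al. [1998] analyse, not their actual model:

```python
# A serial parser should weigh a reading's probability against how easily
# a wrong choice can be detected and repaired. All values are hypothetical.

def expected_success(p_correct, p_recover_if_wrong):
    """Chance of ending up with the right parse: right first time, or
    wrong but falsified early enough to reanalyse successfully."""
    return p_correct + (1 - p_correct) * p_recover_if_wrong

# Complement-clause reading: more probable, but unconstrained, so errors
# surface late and recovery is hard.
comp = expected_success(p_correct=0.7, p_recover_if_wrong=0.2)
# Transitive reading: less probable, but its strong constraints on the
# next noun phrase mean a mistake is falsified quickly.
trans = expected_success(p_correct=0.3, p_recover_if_wrong=0.9)
```

With these numbers the less probable but more specific reading has the higher expected chance of eventual success, which is the counterintuitive possibility the analysis raises.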
Beyond parsing

We have here focussed on parsing. But the “probabilistic turn” applies across language processing, from modelling lexical semantics to modelling processing difficulty. Note, though, that integrating these diverse approaches into a unified model of language is extremely challenging; and many of the theoretical issues that have traditionally concerned psycholinguistics are re-framed rather than resolved by a probabilistic approach.
Probabilistic perspectives on language acquisition

Probabilistic language processing presupposes a probabilistic model of the language; and uses that model to infer, for example, how sentences should be parsed, or ambiguous words interpreted. But how is such a model, or indeed simply a nonprobabilistic grammar, acquired? Chomsky [1981] frames the problem as follows: the child has a hypothesis-space of candidate grammars; and must choose, on the basis of (primarily linguistic) experience, one of these grammars. From a Bayesian standpoint, each candidate grammar is associated with a prior probability; and these probabilities will be modified by experience using Bayesian updating. The learner will presumably choose a language with high, and perhaps the highest, posterior probability.

The poverty of the stimulus?

Chomsky [1965] influentially argued that the learning problem is unsolvable without strong prior constraints on the language, given the ‘poverty’ (i.e., partiality and errorfulness) of the linguistic stimulus. Indeed, Chomsky [1981] argued that almost all syntactic structure, aside from a finite number of binary parameters, must be innate. Separate mathematical work by Gold [1967] indicated that, under certain assumptions, learners provably cannot converge on a language even “in the limit” as the corpus becomes indefinitely large (see [Pinker, 1979] for discussion). A probabilistic standpoint yields more positive learnability results. For example, Horning [1971] proved that phrase structure grammars are learnable (with high probability) to within a statistical tolerance, if sentences are sampled as independent, identically distributed data. Chater and Vitányi [2007] generalize this result to languages generated by any computable process (i.e., sentences can be interdependent, and generated by any computable grammar), and show that prediction, grammaticality, and semantics are learnable, to a statistical tolerance.
These results are “ideal” however — they consider what would be learned, if the learner could find the shortest representation of linguistic data. In practice, the learner will find a short code, not the shortest, and theoretical results are not available for this case. Nonetheless, from a probabilistic standpoint, learning looks more tractable — partly because learning need only succeed with high probability; and to an approximation (speakers may learn slightly different idiolects).
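The Bayesian picture of grammar selection sketched above can be written down directly. The candidate "grammars" and their likelihood functions below are hypothetical stand-ins, intended only to show how posteriors over candidate grammars shift with experience:

```python
# Bayesian grammar selection, schematically: each candidate grammar assigns
# a likelihood to each observed sentence; posteriors follow from Bayes' rule.

def posterior(priors, likelihoods, corpus):
    """priors: grammar -> prior prob; likelihoods: grammar -> (sentence -> prob)."""
    post = dict(priors)
    for sentence in corpus:
        post = {g: post[g] * likelihoods[g](sentence) for g in post}
        z = sum(post.values())                 # renormalize after each datum
        post = {g: p / z for g, p in post.items()}
    return post

# A "tight" grammar concentrates probability on few sentence types; a
# "loose" one spreads it over many (hypothetical numbers throughout).
likelihoods = {
    "tight": lambda s: 0.5 if s in {"s1", "s2"} else 0.0,
    "loose": lambda s: 0.01,   # spreads mass over ~100 sentence types
}
# Despite a low prior, the tight grammar dominates after a few observations
# that happen to fall within its narrow predictions:
post = posterior({"tight": 0.2, "loose": 0.8}, likelihoods, ["s1", "s2", "s1"])
```

The design point is that the likelihood advantage of a grammar making sharper predictions compounds multiplicatively with each sentence, quickly overwhelming a moderate prior disadvantage.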
Computational models of language learning

Yet the question of learnability, and the potential need for innate constraints, remains. Machine learning methods have successfully learned small artificial context-free languages (e.g., [Lari and Young, 1990]), but profound difficulties in extending these results to real language corpora have led computational linguists to focus on learning from parsed trees [Charniak, 1997; Collins, 2003] — presumably not available to the child. Connectionism is no panacea here — indeed, connectionist simulations of language learning typically use small artificial languages [Elman, 1990; Christiansen and Chater, 2001] and, despite having considerable psychological interest, they scale poorly. By contrast, many simple but important aspects of language structure have successfully been learned from linguistic corpora by distributional methods. For example, good approximations to syntactic categories and semantic classes have been learned by clustering words based on their linear distributional contexts (e.g., the distribution over the word that precedes and follows each token of a type) or broad topical contexts (e.g., [Schütze, 1995; Redington et al., 1998]). One can even simultaneously cluster words exploiting local syntactic and topical similarity [Griffiths et al., 2005]. Recently, though, Klein and Manning [2002; 2004] have made significant progress in solving the problem of learning syntactic constituency from corpora of unparsed sentences. Klein and Manning [2002] extended the success of distributional clustering methods for learning word classes by using the left and right word context of a putative constituent and its content as the basis of similarity calculations.
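A minimal version of such distributional clustering can be sketched as follows. The toy corpus is invented; real systems [Schütze, 1995; Redington et al., 1998] use far larger corpora, richer context representations, and proper clustering algorithms, but the core idea — words of the same category have similar context distributions — is the same:

```python
# Represent each word by the distribution of words that immediately
# precede ("L") and follow ("R") it, then compare those context vectors.
from collections import Counter

corpus = ("the cat sat on the mat . the dog sat on the rug . "
          "a cat ran on a mat . a dog ran on a rug .").split()

def context_vector(word):
    """Counts of (direction, neighbour) pairs for every token of `word`."""
    c = Counter()
    for i, w in enumerate(corpus):
        if w == word:
            if i > 0:
                c[("L", corpus[i - 1])] += 1
            if i < len(corpus) - 1:
                c[("R", corpus[i + 1])] += 1
    return c

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = lambda v: sum(x * x for x in v.values()) ** 0.5
    return dot / (norm(a) * norm(b))

# Nouns pattern with nouns; nouns and verbs occupy different contexts:
noun_sim = cosine(context_vector("cat"), context_vector("dog"))
cross_sim = cosine(context_vector("cat"), context_vector("sat"))
```

In this tiny corpus cat and dog occur in identical contexts, so their similarity is maximal, while cat and sat share no contexts at all; clustering words by such similarities recovers approximate syntactic categories.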
Such a model better realizes ideas from traditional linguistic constituency tests which emphasize (i) the external context of a phrase (“something is a noun phrase if it appears in noun phrase contexts”) at least as much as its internal structure, and (ii) proform tests (testing whether a large constituent can be replaced by a single-word member of the same category). Klein and Manning [2004] extended this work by combining such a distributional phrase clustering model with a dependency-grammar-based model. The dependency model uses data on word co-occurrence to bootstrap word-word dependency probabilities, but the work crucially shows that more is needed than simply a model based on word co-occurrence. One appears to need two types of prior constraint: one making dependencies more likely between nearby words than far away words, and the other making it more likely for a word to have few rather than many dependents. Both of Klein and Manning’s models capture a few core features of language structure, while still being simple enough to support learning. The resulting combined model is better than either model individually, suggesting a certain complementarity of knowledge sources. Klein and Manning show that high-quality parses can be learned from surprisingly little text, from a range of languages, with no labeled examples and no language-specific biases. The resulting model provides good results, building binary trees which are correct on over 80% of the constituency decisions in hand-parsed English text. This work is a promising demonstration of empirical language learning, but
most linguistic theories use richer structures than surface phrase structure trees; and a particularly important objective is finding models that map to meaning representations. This remains very much an area of ongoing research, but inter alia there is work on probabilistic parsing with richer formalized grammar models based on learning from parsed data [Johnson and Riezler, 2002; Toutanova et al., 2005] some work on mapping to meaning representations of simple data sets [Zettlemoyer and Collins, 2005], and work on unsupervised learning of a mapping from surface text to semantic role representations [Swier and Stevenson, 2005].
Poverty of the stimulus, again

The status of Chomsky’s poverty of the stimulus argument remains unclear, beginning with the question of whether children really do face a poverty of linguistic data (see the debate between Pullum and Scholz [2002] and Legate and Yang [2002]). Perhaps no large and complex grammar can be learned from the child’s input; or perhaps certain specific linguistic patterns (e.g., those encoded in an innate universal grammar) are in principle unlearnable. Probabilistic methods provide a potential way of assessing such questions. Oversimplifying somewhat, suppose that a learner wonders whether to include constraint C in her grammar. C happens, perhaps coincidentally, to fit all the data so far encountered. If the learner does not assume C, the probability of a sentence x is, say, Pr(x). Constraint C applies only to probability mass p of these sentences, where p = Σ_{x:C(x)} Pr(x).
Thus, each sentence obeying C is 1/p times more probable if the constraint is true than if it is not (if we simply rescale the probability of all sentences obeying the constraint). Thus, after n sentences, the probability of the corpus is (1/p)^n greater if the constraint is included. Yet a more complex grammar will typically have a lower prior probability. If the prior for the grammar without the constraint exceeds the prior for the grammar with the constraint by a factor greater than (1/p)^n, then, by Bayes’ theorem, the constraint is unlearnable from n items. Presently, theorists using probabilistic methods diverge widely on the severity of prior “innate” constraints they assume. Some theorists focus on applying probability to learning parameters of Chomskyan Universal Grammar [Gibson and Wexler, 1994; Niyogi, 2006]; others focus on learning relatively simple aspects of language, such as syntactic or semantic categories, or approximate morphological decomposition, with relatively weak prior assumptions [Redington et al., 1998; Brent and Cartwright, 1996; Landauer and Dumais, 1997]. Probabilistic methods should be viewed as a framework for building and evaluating theories of language acquisition, and for concretely formulating questions concerning the poverty of the stimulus, rather than as embodying any particular theoretical viewpoint. This point arises throughout cognition — although probability provides natural models of learning, it is an open question whether initial structure may be critical in facilitating such learning. For example, Culicover [1999] argues that prior structure over Bayesian networks is crucial to support learning.
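This arithmetic is easy to make concrete. Under the rescaling argument just given, a constraint covering probability mass p multiplies the corpus likelihood by (1/p)^n after n consistent sentences, so one can compute how much data is needed to overcome a given prior disadvantage (the numbers below are illustrative):

```python
# How many sentences are needed before the likelihood advantage of a
# constrained grammar overtakes its prior disadvantage?
import math

def sentences_needed(p, prior_odds_against):
    """Smallest n with (1/p)**n > prior odds against the constrained grammar."""
    n = math.ceil(math.log(prior_odds_against) / math.log(1 / p))
    if (1 / p) ** n <= prior_odds_against:  # bump n on an exact tie
        n += 1
    return n

# Hypothetical case: C covers half the probability mass (p = 0.5), and the
# constrained grammar starts out a million times less probable a priori.
n = sentences_needed(0.5, 1e6)
```

With p = 0.5 each consistent sentence doubles the odds in favour of the constraint, so even a million-to-one prior disadvantage is overcome after 20 sentences; conversely, for p close to 1 (a constraint with little empirical bite), n grows very large, which is the sense in which such a constraint is effectively unlearnable.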
Language acquisition and language structure

How far do probabilistic perspectives on language structure, and language acquisition, interact? Some theorists argue that language is best described not as rules and exceptions, but as a system of graded “quasi-regular” mappings. Notable examples of such mappings include the English past tense, the German plural, and spelling-to-sound correspondences in English; but a closely related viewpoint has been advocated for syntax [Culicover, 1999; Tomasello, 2003] and aspects of semantics [Baayen and Moscoso del Prado, 2005]. Some theorists argue [Pierrehumbert, 2001] that such mappings are better learned using statistical or connectionist methods, which learn according to probabilistic principles. By contrast, traditional rule-and-exception views are typically associated with nonprobabilistic hypothesis generation and test. Nonetheless, we see no necessary connection between these debates on the structure of language, and models of acquisition.
Language: Summary

Understanding and producing language involves complex patterns of uncertain inference, from processing noisy and partial speech input to lexical identification, syntactic and semantic analysis, to language interpretation in context. Acquiring language involves uncertain inference from linguistic and other data, to infer language structure. These uncertain inferences are naturally framed using probability theory: the calculus of uncertainty. Historically, probabilistic approaches to language are associated with simple models of language structure (e.g., local dependencies between words); but, across the cognitive sciences, technical advances have reduced this type of limitation. Probabilistic methods are also often associated with empiricist views of language acquisition — but the framework is equally compatible with nativism — that there are prior constraints on the class of language models. Indeed, as we have seen, probabilistic analysis may provide one line of attack (alongside the empirical investigation of child language) for assessing the relative contribution of innate constraints and corpus input, in language acquisition. Overall, probabilistic methods provide a rich framework for theorising about language structure, processing, and acquisition, which may prove valuable in developing and contrasting a wide range of theoretical perspectives.
3 INDUCTIVE REASONING
Historically, in empirical psychology, inductive reasoning has typically been studied separately from deductive reasoning, by separate groups of researchers using different theoretical frameworks. In the next few sections after this one on inductive reasoning, we will review recent attempts to apply inductive logic to psychological studies of deductive reasoning. This will raise the possibility that a unified approach across these diverse reasoning tasks might be achievable. Even in inference
tasks that might have a deductive solution, people might be more concerned with their inductive strength. In this section, we concentrate on empirical studies of inductive reasoning, and address the question of whether normative inductive logic can explain the factors on which people’s judgements of inductive strength depend. In moving to reasoning behaviour, we are now more directly in the realm of central processes and of explicit verbal reasoning tasks of the type dealt with in logic, be it deductive or inductive. Inductive reasoning, in its broadest sense, concerns inference from specific premises to general statements or to other non-logically related specific statements. So, for example, we might be given observations that robins have anatomical feature X, and be asked how likely it is that all birds have anatomical feature X. Or, more usually in experimental tasks, people are asked about the likelihood that eagles or sparrows also have that anatomical feature. Inductive reasoning involves drawing conclusions that are probably true, given a set of premises. Inductive reasoning can thus be contrasted with deductive reasoning, in which the conclusion must necessarily follow from a set of premises. For example, the following two arguments (1 and 2) each have some degree of inductive strength.

(1) Cows have sesamoid bones.
    All mammals have sesamoid bones.

(2) Ferrets have sesamoid bones.
    All mammals have sesamoid bones.

Whereas all valid deductive arguments are perfectly strong, inductive arguments can differ in their perceived inductive strength. In the examples above, the conclusion in argument (1) may seem stronger, or more probable given the evidence, than the conclusion in (2). Inductive reasoning is sometimes characterized as drawing inferences from specific statements to more general statements (as in arguments [1] and [2]), in contrast to deductive reasoning which would run from general statements to specifics.
Although there is a grain of truth in this characterization, there is actually a broader variety of deductive as well as inductive arguments [Skyrms, 1977]. For example, the following deductively valid argument (3) does not draw a more specific inference from general statements:

(3) Gorillas are apes.
    Apes are mammals.
    Gorillas are mammals.

Likewise it would be possible to draw inductive inferences that involve reasoning from one fairly specific statement to another, as in argument (4).

(4) Ferrets have sesamoid bones.
    Squirrels have sesamoid bones.
There is now a well-documented set of empirical regularities on inductive reasoning. We provide an introduction to these empirical regularities and then describe theoretical accounts of inductive reasoning (see Heit, 2000, for a more extensive review).
Key Results in Inductive Reasoning

One of the early experimental studies of inductive reasoning, by Rips [1975], looked at how people project properties of one category of animals to another. Subjects were told to assume that on a small island, it has been discovered that all members of a particular species have a new type of contagious disease. Then subjects judged for various other species what proportion would also have the disease. For example, if all rabbits have this disease, what proportion of dogs have the disease? Rips used a variety of animal categories in the premise and conclusion roles. It was found that two factors consistently promoted inferences from a premise category to a conclusion category. First, similarity between premises and conclusions promoted strong inferences. For example, subjects made stronger inferences from rabbits to dogs than from rabbits to bears. Second, the typicality of the premise, with respect to its superordinate category, was critical in promoting inferences. The result was that more typical premise categories led to stronger inferences than atypical premise categories. For example, with the bird stimuli, having bluejay as a premise category led to stronger inferences overall compared to having goose as a premise category. Using multiple regression analyses, Rips found distinct contributions of premise-conclusion similarity and premise typicality. Interestingly, there was no evidence for a role of conclusion typicality. For example, all other things being equal, people would be as willing to draw a conclusion about a bluejay or about a goose, despite the difference in typicality of these two categories (see [Osherson et al., 1990], for further investigations of similarity and typicality effects). The next major study of induction was by Nisbett et al., [1983], who also asked subjects to draw inferences about items (animals, people, and objects) found on a remote island.
For example, subjects were told to imagine that one member of the Barratos tribe is observed to be obese, and they estimated the proportion of all members of this group that would be obese. Likewise, subjects were told that one sample of the substance “floridium” was observed to conduct electricity, and they estimated the proportion of all members of this set that would conduct electricity. One key finding was that subjects were very sensitive to perceived variability of the conclusion category. For a variable category such as Barratos people (and their potential obesity), subjects were rather unwilling to make strong inferences about other Barratos, after just one case. But for a non-variable category such as floridium samples, subjects were willing to generalize the observation of electrical conductance to most or all of the population. This result, that subjects are more willing to draw inferences about less variable conclusion categories, makes a striking contrast to the results of Rips [1975]. Whereas Rips found that typicality of the conclusion did not affect inductive strength, Nisbett et al. showed that
conclusion categories do matter, at least in terms of their variability. The preceding results show how people reason based on a single premise. However, when people try to make an inference about some object or event, they are typically faced with a great deal of information. Rather than just one past case being available or relevant, in many realistic situations there will be an extensive set of cases or premises that could be relied on. What makes a set of premises seem strong, or useful for promoting inferences? One factor is numerosity. In their study involving inferences about people and objects on an island, Nisbett et al. [1983] systematically varied the given number of observations. For example, subjects were told that 1, 3, or 20 obese members of the Barratos group had been observed, and asked what proportion of all Barratos are obese. In general, inferences were stronger with increased sample size (see also [Osherson et al., 1990]). Although sheer numerosity of cases does have some effect on induction, there is also substantial evidence that variability or diversity of cases affects inductive strength. Intuitively, repeating the same evidence, or highly similar pieces of evidence, again and again should not be much more convincing than just giving the evidence once. Consider the following arguments (adapted from [Osherson et al., 1990]).

(5) Cows require vitamin K for the liver to function.
    Horses require vitamin K for the liver to function.
    All mammals require vitamin K for the liver to function.

(6) Cows require vitamin K for the liver to function.
    Ferrets require vitamin K for the liver to function.
    All mammals require vitamin K for the liver to function.

Although both arguments seem to have some argument strength, most people find argument (6) to be stronger than argument (5), due to the greater diversity of premise information. Again, there is an interesting comparison to Nisbett et al.
[1983], who found that variable conclusions led to weaker inductive inferences. In contrast, it has been found that diverse premise categories lead to stronger inductive inferences. Another fascinating aspect of the diversity effect is that it runs in the opposite direction to the typicality effect: Whereas a typical premise category leads to a fairly strong inductive argument (1), an argument with two typical premise categories (5) is actually weaker than an argument with a typical premise and an atypical premise (6).
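One influential formal account of these effects is the similarity-coverage model of Osherson et al. [1990]. The sketch below uses invented similarity values and a four-member stand-in for the mammal category, but it shows how greater premise diversity yields greater "coverage" of the superordinate, and hence a stronger general conclusion:

```python
# Similarity-coverage sketch: the strength of "premises -> all mammals ..."
# grows with how well the premise categories cover the mammal category.
# All similarity values are hypothetical.
sim = {
    ("cow", "horse"): 0.8, ("cow", "ferret"): 0.3, ("horse", "ferret"): 0.3,
    ("cow", "mouse"): 0.3, ("horse", "mouse"): 0.3, ("ferret", "mouse"): 0.7,
    ("cow", "cow"): 1.0, ("horse", "horse"): 1.0, ("ferret", "ferret"): 1.0,
}

def s(a, b):
    """Symmetric lookup into the similarity table."""
    return sim.get((a, b), sim.get((b, a), 0.0))

MAMMALS = ["cow", "horse", "ferret", "mouse"]  # toy superordinate category

def coverage(premises, superordinate=MAMMALS):
    """Mean, over the superordinate's members, of each member's maximum
    similarity to any premise category."""
    return sum(max(s(m, p) for p in premises)
               for m in superordinate) / len(superordinate)

# Diverse premises cover the superordinate better, so the conclusion
# "all mammals..." is stronger from {cow, ferret} than from {cow, horse}:
c_similar = coverage(["cow", "horse"])
c_diverse = coverage(["cow", "ferret"])
```

The same coverage computation also reproduces the typicality effect: a typical premise category (one similar to many category members) covers the superordinate better than an atypical one.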
Effects of Knowledge on Inductive Reasoning

Unlike deductive reasoning, where it should be possible to determine just from the form of an argument whether the conclusion must necessarily follow, inductive reasoning is uncertain by nature. Hence it should be rational to go beyond the information given, seeking other knowledge that could reduce this uncertainty and make inductive inferences more accurate. Indeed, all of the examples of inductive
Inductive Logic and Empirical Psychology
575
reasoning in this section rely on some use of world knowledge that is not explicitly stated in the inductive arguments, such as that cows and horses are more similar than are cows and ferrets. However, in other ways researchers have aimed to study the “essence” of inductive reasoning by discouraging the use of outside knowledge. For example, Rips [1975] used fictional diseases that people would not have strong prior beliefs about, and Osherson et al. [1990] used “blank” properties such as “has sesamoid bones,” which sounded somewhat biological but were fairly unfamiliar. These decisions were indeed helpful in uncovering various empirical regularities such as similarity, typicality, and diversity effects. Still, other researchers have studied the role of knowledge in induction more directly. For example, Medin et al. [1997] looked at inductive reasoning about categories of plants, by various kinds of tree experts, such as taxonomists and tree maintenance workers. Here the main interest was effects of similarity, for groups that differed in their notions of similarity. For example, in a sorting task, maintenance workers tended to organize tree species in terms of their shape or purpose for various landscaping tasks. Medin et al. devised questions on a test of inductive reasoning that pitted scientific matches against alternative, functional category structures. For example, two tree species might be distant in terms of the scientific taxonomy but they could both be useful for providing shade. It was found that taxonomists (not surprisingly) sorted trees on the basis of scientific taxonomy and likewise favored inductive arguments between categories that were close in the scientific taxonomy. Maintenance workers seemed to favor a more functional category organization for both sorting and reasoning.
In sum, the groups of experts generally showed the similarity effects that had been documented in other studies of induction, but their knowledge about trees mediated these similarity effects. Other evidence for knowledge effects has highlighted the effects of the property that is being inferred. The Nisbett et al. [1983] study is a good illustration of how knowledge about the scope of a property affects inductive inference. As already reviewed, seeing that just one member of the Barratos group is obese does not seem to promote the inference that other people in this group will be obese. Obesity seems to be more of an individual characteristic rather than a group characteristic. On the other hand, Nisbett et al. found that people make stronger inferences for the same category but another property, skin color. Here, seeing the skin color of just one Barratos promotes inferences about other members of this group, on the assumption that members of the same ethnic group will likely have some shared physical characteristics. (See [Goodman, 1955] for further discussion of how properties differ in their tendency to promote induction.) Although it might seem from the previous section that some properties have a wider scope for inference than others, the picture is actually more complicated. Depending on the categories in an inductive argument, a particular property may lead to strong inferences or weak inferences or something in between. Consider the following example, from [Heit and Rubinstein, 1994]. For an anatomical property, such as “has a liver with two chambers,” people will make stronger inferences from chickens to hawks than from tigers to hawks. Because chickens and hawks are from
576
Nick Chater, Mike Oaksford, Ulrike Hahn and Evan Heit
the same biological category, and share many internal properties, people are quite willing to project a novel anatomical property from one bird to another. But since tigers and hawks differ in terms of many known internal biological properties, it seems less likely that a novel anatomical property will project from one to the other. However, now consider the behavioral property “prefers to feed at night.” Heit and Rubinstein [1994] found that inferences for behavioral properties concerning feeding and predation were weaker between the categories chicken and hawk than between the categories tiger and hawk — the opposite of the result for anatomical properties. Here, it seems that despite the major biological differences between tigers and hawks, people were influenced by the known similarities between these two animals in terms of predatory behavior, thus making strong inferences about a novel behavioral property.
Theoretical Accounts of Inductive Reasoning

So far, we have described several empirical regularities in inductive reasoning, including similarity effects, typicality effects, diversity effects, and effects based on knowledge about the property being inferred. Together, these results pose a challenge for psychological accounts of induction. Although there have been a number of proposals (see, in particular, [Osherson et al., 1990; Sloman, 1993]), we will focus on a model of inductive reasoning by Heit [1998] (see also [Tenenbaum and Griffiths, 2001; Kemp and Tenenbaum, 2009]) that has been applied to all of these results. This is a model derived from Bayesian statistics and we will show that people’s inductive reasoning behaviour does indeed seem to follow the dictates of inductive logic. According to the Bayesian model, evaluating an inductive argument is conceived of as learning about a property, in particular learning for which categories the property is true or false. For example, in argument (1) above, the goal would be to learn which animals have sesamoid bones and which animals do not. The model assumes that for a novel property such as in this example, people would rely on prior knowledge about familiar properties, to derive a set of hypotheses about what the novel property may be like. For example, people know some facts that are true of all mammals (including cows), but they also know some facts that are true of cows but not some other mammals. The question is which of these known kinds of properties does the novel property, “has sesamoid bones,” resemble most. Is it an all-mammal property, or a cow-only property? What is crucial is that people assume that novel properties follow the same distribution as known properties. Because many known properties of cows are also true of other mammals, argument (1) regarding a novel property seems fairly strong. The Bayesian model addresses many of the key results in inductive reasoning.
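To make the flavour of such a model concrete, here is a deliberately simplified sketch, not Heit's actual implementation: the categories, the invented "known properties", and the smoothing constant are all illustrative assumptions. Hypotheses about the novel property's extension receive prior probability in proportion to how many familiar properties share that extension, and argument strength is the posterior probability that the conclusion category falls in the extension.

```python
import itertools

# Toy sketch (assumed data, for illustration only): each familiar
# property is represented by the set of categories it holds for.
categories = ["cow", "horse", "ferret", "whale"]
known_properties = [
    {"cow", "horse", "ferret", "whale"},   # an all-mammal property
    {"cow", "horse", "ferret", "whale"},   # another all-mammal property
    {"cow", "horse"},                      # a large-farm-animal property
    {"cow"},                               # a cow-only property
]

def prior(hypothesis):
    # Prior that the novel property has this extension: proportional to
    # how many familiar properties share the extension, plus smoothing.
    return 0.1 + sum(1 for p in known_properties if p == hypothesis)

def argument_strength(premises, conclusion):
    # P(conclusion category has the property | premise categories do),
    # summing over all hypotheses (nonempty subsets of categories)
    # consistent with the premises.
    hyps = [set(s) for n in range(1, len(categories) + 1)
            for s in itertools.combinations(categories, n)]
    consistent = [h for h in hyps if set(premises) <= h]
    total = sum(prior(h) for h in consistent)
    favoring = sum(prior(h) for h in consistent if conclusion in h)
    return favoring / total

# Diversity effect: cows + ferrets generalize to whales more strongly
# than cows + horses do.
print(argument_strength(["cow", "horse"], "whale"))
print(argument_strength(["cow", "ferret"], "whale"))
```

The diversity effect falls out because the similar premise pair is well covered by the narrow "farm animal" hypothesis, which blocks generalization to other mammals, whereas the diverse pair is only covered by broad hypotheses.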
For example, the model can predict similarity effects as in [Rips, 1975]. Given that rabbits have some kind of disease, it seems more plausible to infer that dogs have the same disease rather than bears, because rabbits and dogs are more alike in terms of known properties than are rabbits and bears. The Bayesian model also
addresses typicality effects, under the assumption that according to prior beliefs, atypical categories, such as geese, would have a number of idiosyncratic features. Hence a premise asserting a novel property about geese would suggest that this property is likewise idiosyncratic and not to be widely projected. In contrast, prior beliefs about typical categories, such as bluejays, would indicate that they have many properties in common with other categories, hence a novel property of a typical category should generalize well to other categories. The Bayesian model also addresses diversity effects, with a rationale similar to that for typicality effects. An argument with two similar premise categories, such as cows and horses in (5), could bring to mind a lot of idiosyncratic properties that are true just of large farm animals. Therefore a novel property of cows and horses might seem idiosyncratic to farm animals, and not applicable to other mammals. In contrast, an argument with two diverse premise categories, such as cows and ferrets in (6), could not bring to mind familiar idiosyncratic properties that are true of just these two animals. Instead, the prior hypotheses would be derived from known properties that are true of all mammals or all animals. Hence a novel property of cows and ferrets should generalize fairly broadly. To give a final illustration of the Bayesian approach, when reasoning about the anatomical and behavioral properties in [Heit and Rubinstein, 1994], people could draw on prior knowledge about different known properties for the two kinds of properties. Reasoning about anatomical properties could cause people to rely on prior knowledge about familiar anatomical properties. In contrast, when reasoning about a behavioural property such as “prefers to feed at night,” the prior hypotheses could be drawn from knowledge about familiar behavioural properties. 
These two different sources of prior knowledge would lead to different patterns of inductive inferences for the two kinds of properties.
Summary: Inductive reasoning

To conclude, the Bayesian model does address a fairly broad set of phenomena (see [Heit, 1998; 2000] for further applications, in greater detail). There are other models, such as those proposed by Osherson et al. [1990] and Sloman [1993], that can address many of the same results; however, a key advantage of the Bayesian model is that it derives from the same principles (probability theory and Anderson’s [1990; 1991] rational analysis) as the recent models of deduction to which we now turn.

4 DEDUCTIVE REASONING

In this section, we review recent work which suggests that empirical research on putatively deductive reasoning tasks is better characterised using inductive logic. Empirical studies of deductive reasoning have concentrated on three main experimental tasks: conditional inference, data selection, and quantified syllogistic reasoning. A subsection is devoted to each task. In each, we describe recent
Bayesian probabilistic models that seem able to account for the deviations from deductive prescriptions seen in the experimental results. The key idea behind all these models is to use conditional probability, P (q|p), to account for the meaning of conditional statements, if p then q (e.g., if you turn the key then the car starts). For each area of reasoning, we introduce the task, and the standard findings. We then introduce a Bayesian rational analysis for each problem, show how it accounts for the core data, and how it generalises to a sample of further important data in the area.
Conditional Inference

In conditional inference, four inference patterns have been extensively studied experimentally: the valid inference forms modus ponens (MP) and modus tollens (MT) and the fallacies denying the antecedent (DA) and affirming the consequent (AC). Each inference consists of the conditional premise and one of four possible categorical premises, which relate either to the antecedent or consequent of the conditional, or their negations (p, ¬p, q, ¬q where “¬” = not). For example, the inference modus ponens (MP) combines the conditional premise if p then q with the categorical premise p, and yields the conclusion q. According to standard logic, we would expect everyone to endorse the valid inferences and not to endorse the fallacies. However, people tend to endorse all four inferences at rates above 50% and in a characteristic order: MP > MT > AC > DA [Schroyens and Schaeken, 2003]. All the differences in endorsement rate between adjacent pairs in this order are highly statistically significant. This performance reveals a large divergence between people’s behaviour and the predictions of the standard logical model.

A Probabilistic Approach

In empirical psychology, there are a variety of probabilistic approaches to conditional inference [Anderson, 1995; Liu, 2003; Evans and Over, 2004; Pfeifer and Kleiter, 2005; Oaksford and Chater, 2007; Oaksford et al., 2000]. Apart from Evans and Over [2004], these approaches have attempted to explain human reasoning performance without invoking a particular psychological implementation of inductive logic. All these accounts share three key ideas. First, the probability of a conditional is the conditional probability, i.e., P (if p then q) = P (q|p). In the normative literature, this identification is simply called “The Equation” [Adams, 1998; Bennett, 2003; Edgington, 1995].
In the psychological literature, the Equation has been confirmed experimentally by Evans, Handley, and Over [2003] (see also [Over et al., 2007]) and by Oberauer and Wilhelm [2003]. Second, as discussed above, probabilities are interpreted “subjectively,” that is, as degrees of belief. It is this interpretation of probability that allows us to provide a probabilistic theory of inference as belief updating. Third, conditional probabilities are determined by a psychological process called the “Ramsey Test” [Bennett, 2003; Ramsey, 1931/1990]. For example, suppose you want to evaluate your conditional
degree of belief that if it is sunny in Wimbledon, then John plays tennis. By the Ramsey test, you make the hypothetical supposition that it is sunny in Wimbledon and revise your other beliefs so that they fit with this supposition. You then “read off” your hypothetical degree of belief that John plays tennis from these revised beliefs. Liu [2003] and Oaksford et al. [2000] (see also [Oaksford and Chater, 2007]) treat conditional inference as belief revision. We concentrate on this approach because it seems to provide the possibility of accounting for human performance with minimal additional assumptions about the cognitive system. Treating conditional inference as belief revision concerns how we reason when the categorical premise is not merely supposed, but is actually believed or known to be true. This process is known as conditionalisation. Consider an MP inference, e.g., if it is sunny in Wimbledon, then John plays tennis, and it is sunny in Wimbledon, therefore, John plays tennis. Conditionalisation applies when we know (instead of merely supposing) that it is sunny in Wimbledon; or when a high degree of belief can be assigned to this event (e.g., because we know that it is sunny in nearby Bloomsbury). By conditionalisation, our new degree of belief that John plays tennis should be equal to our prior degree of belief that if it is sunny in Wimbledon, then John plays tennis (here “prior” means before learning that it is sunny in Wimbledon). More formally, by the Equation, we know that P0 (if it is sunny in Wimbledon, then John plays tennis) equals P0 (John plays tennis|it is sunny in Wimbledon), where “P0 (x)” = prior probability of x. When we learn it is sunny in Wimbledon, then P1 (it is sunny in Wimbledon) = 1, where “P1 (x)” = posterior probability of x. Conditionalising on this knowledge tells us that our new degree of belief in John plays tennis, P1 (John plays tennis), should be equal to P0 (John plays tennis|it is sunny in Wimbledon).
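The updating step just described can be sketched in a few lines; the value .9 comes from the running example, while the value for P0(q|¬p) and the probability that it is sunny in Wimbledon given sun in nearby Bloomsbury are invented for illustration (Jeffrey conditionalisation handles the uncertain-premise case):

```python
def conditionalise(q_given_p):
    # Bayesian conditionalisation: on learning p for certain (P1(p) = 1),
    # the new degree of belief in q equals the prior conditional probability.
    return q_given_p

def jeffrey_conditionalise(q_given_p, q_given_not_p, p1_p):
    # Jeffrey conditionalisation: the evidence makes p probable only to
    # degree p1_p, so we average over both possibilities.
    return q_given_p * p1_p + q_given_not_p * (1 - p1_p)

# P0(John plays tennis | sunny in Wimbledon) = .9, as in the text.
print(conditionalise(0.9))                    # 0.9

# Assumed values: P0(q|not-p) = .2; sun in Bloomsbury makes sun in
# Wimbledon probable only to degree .8, so P1(q) = .9*.8 + .2*.2 = .76.
print(jeffrey_conditionalise(0.9, 0.2, 0.8))
```

Note that standard conditionalisation is the special case of Jeffrey conditionalisation with p1_p = 1.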
That is, P1 (q) = P0 (q|p), where p = it is sunny in Wimbledon, and q = John plays tennis. (The case where the categorical premise is uncertain can be accommodated, somewhat controversially, using a generalization of this idea, Jeffrey conditionalisation [Jeffrey, 1983]. The new degree of belief that John plays tennis (q), on learning that it is sunny in Bloomsbury (which confers only a high probability that it is sunny in Wimbledon (p)), is: P1 (q) = P0 (q|p)P1 (p) + P0 (q|¬p)P1 (¬p).) So from a probabilistic perspective, MP provides a way of updating our degrees of belief in the consequent, q, on learning that the antecedent, p, is true. Quantitatively, if you believe that P0 (John plays tennis|it is sunny in Wimbledon) = .9, then given you discover that it is sunny in Wimbledon (P1 (it is sunny in Wimbledon) = 1) your new degree of belief that John plays tennis should be .9, i.e., P1 (John plays tennis) = .9. This contrasts with the logical approach, in which believing the conditional premise entails with certainty that the conclusion follows from the minor premise, so that P0 (John plays tennis|it is sunny in Wimbledon) = 1. This is surely too strong a claim. The extension to the other conditional inferences is not direct, however. Take an example of (AC), if it is sunny in Wimbledon, John plays tennis and John plays tennis, therefore, it is sunny in Wimbledon. In this case, one knows or strongly
believes that John plays tennis (perhaps we were told by a very reliable source), so P1 (q) = 1. But to use Bayesian conditionalisation to infer one’s new degree of belief that it is sunny in Wimbledon, P1 (p), one needs to know one’s conditional degree of belief that it is sunny in Wimbledon given John plays tennis, i.e., P0 (p|q). However, the conditional premise of AC, like that of MP, is about P0 (q|p) not about P0 (p|q) [Sober, 2002]. The solution proposed by Oaksford et al. [2000] (see also [Wagner, 2004]) is that people also know the prior marginal probabilities (at least approximately). That is, they know something about the probability of a sunny day in Wimbledon, P0 (p), and the probability that John plays tennis, P0 (q), before learning that it is in fact a sunny day in Wimbledon. With this additional information, P0 (p|q) can be calculated from the converse conditional probability, P0 (q|p), using Bayes’ Theorem. The same approach also works for the two other types of conditional inference, Denying the Antecedent (DA) and Modus Tollens (MT), where the relevant probabilities are P0 (¬q|¬p) and P0 (¬p|¬q) respectively. The fact that the conditional premises of AC, DA and MT do not determine the appropriate conditional probability marks an important asymmetry with MP. For these inferences, further knowledge is required to infer the relevant conditional degrees of belief.

The Empirical Data

We now show how some of the errors and biases observed in conditional inference can be seen as a consequence of this rational probabilistic model. The first set of “biases” are called “the inferential asymmetries” [Oaksford and Chater, 2008]. That is, MP is drawn more than MT and AC is drawn more than DA (MT is also drawn more than AC). Oaksford and Chater [2003; 2007; 2008] calculated the values of P0 (q|p), P0 (p) and P0 (q) that best fit the data, i.e., they minimize the sum of squared error between the data and the model’s predictions.
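The mapping from these three parameters to the four inference probabilities can be sketched as follows (a reconstruction, using the best-fitting values reported in the text; it reproduces the observed ordering MP > MT > AC > DA):

```python
def inference_probabilities(q_given_p, p, q):
    # Derive the conditional probabilities relevant to all four inference
    # patterns from P0(q|p) and the marginals P0(p), P0(q), via the joint
    # distribution and Bayes' theorem.
    p_and_q = q_given_p * p                    # P0(p, q)
    not_p_and_not_q = 1 - p - q + p_and_q      # P0(not-p, not-q)
    return {
        "MP": q_given_p,                  # P0(q|p)
        "DA": not_p_and_not_q / (1 - p),  # P0(not-q|not-p)
        "AC": p_and_q / q,                # P0(p|q) = P0(q|p)P0(p)/P0(q)
        "MT": not_p_and_not_q / (1 - q),  # P0(not-p|not-q)
    }

# Best-fitting values reported in the text.
probs = inference_probabilities(q_given_p=0.88, p=0.54, q=0.70)
for pattern in ("MP", "MT", "AC", "DA"):
    print(pattern, round(probs[pattern], 3))
```

Note that DA, AC and MT, unlike MP, require the marginal probabilities, which is the asymmetry discussed above.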
(Recall that Bayes’ theorem is the elementary identity of probability theory mentioned above that allows a conditional probability to be calculated from its converse conditional probability and the priors: P (p|q) = P (q|p)P (p)/P (q).) The fits were good (R2 = .84) and the probabilities, P0 (q|p) = .88, P0 (p) = .54, and P0 (q) = .70, seem reasonable, i.e., P0 (q|p) is high, P0 (p) ≈ .5, and P0 (q) > P0 (p). To predict John’s tennis playing behaviour well, P0 (q|p) should be high. Further, one would be unlikely to draw inferences about John’s tennis playing behaviour using this rule in contexts where the probability that it was sunny was less than chance [Adams, 1998]. Moreover, as long as P0 (q|p) is high, P0 (q) > P0 (p) is most likely to hold. However, this probabilistic model [Oaksford et al., 2000] does not capture the magnitudes of the inferential asymmetries [Evans and Over, 2004; Schroyens and Schaeken, 2003]. It underestimates the MP–MT asymmetry and overestimates the DA–AC asymmetry. Oaksford and Chater [2007] argued that this is because learning that the categorical premise is true can have two inferential roles. The first inferential role is in conditionalisation, as we have described. The second inferential role is based on
the pragmatic inference that being told that the categorical premise is true often suggests that there is a counterexample to the conditional premise. For example, consider the MT inference on the rule if I turn the key the car starts. If you were told that the car did not start, it seems unlikely that you would immediately infer that the key was not turned. Telling someone that the car did not start seems to presuppose that an attempt has been made to start it, presumably by turning the key. Consequently, the categorical premise here seems to suggest a counterexample to the conditional itself, i.e., a case where the key was turned but the car did not start. Hence one’s degree of belief in the conditional should be reduced on being told that the car did not start. Notice, here, the contrast between being told that the car did not start (and drawing appropriate pragmatic inferences), and merely observing a car that has not started (e.g., a car parked in the driveway). In this latter situation, it is entirely natural to use the conditional rule to infer that the key has not been turned. Where the second, pragmatic, inferential role of the categorical premise is operative, this violates what is called the rigidity condition on conditionalisation, P0 (q|p) = P1 (q|p) [Jeffrey, 1983]. That is, learning the categorical premise alters one’s degree of belief in the conditional premise. Oaksford and Chater [2007; 2008] argue that taking account of such rigidity violations helps capture the probability of the conditional; and that, for MT, this modified probability is then used in conditionalisation. Furthermore, they argue that DA and AC also suggest violations of the rigidity condition, concerning the case where the car starts without turning the key. These violations lead to reductions in one’s degree of belief that the car starts, given that the key is turned (P0 (q|p)).
Using this lower estimate to calculate the relevant probabilities for DA, AC and MT can rationally explain the relative magnitudes of the MP–MT and DA–AC asymmetries (see Figure 2, Panel D). Another one of the key empirical biases of conditional inference is negative conclusion bias. This bias arises when negations are used in conditional statements, e.g., if a bird is a swan, then it is not red. In Evans’ [1972] Negations Paradigm, four such rules are used, if p then q, if p then not-q, if not-p then q, and if not-p then not-q. The most robust finding is that people endorse DA, AC, and MT more when the conclusion contains a negation. So, for example, DA on if p then q yields a negated conclusion, not-q, whereas, DA on if p then not-q yields an affirmative conclusion, q (because not-not-q = q). In the data, the frequency with which DA is endorsed for if p then q is much higher than for if p then not-q. To explain negative conclusion bias, Oaksford et al. [2000] appealed to the idea that most categories apply only to a minority of objects [Oaksford and Stenning, 1992]. Hence, the probability of an object being, say, red is lower than the probability of it not being red, i.e., P0 (Red ) < P0 (¬Red ). Consequently, the marginal probabilities (P0 (p) and P0 (q)) will take on higher values when p or q are negated. Higher values of the prior probabilities of the conclusion imply higher values of the relevant conditional probabilities for DA, AC and MT, and hence higher values of the posterior probability of the conclusion. So, for example, for our rule if
a bird is a swan, then it is white, the prior probability of the conclusion of the DA inference (P0 (¬White)) is high. This means that the conditional probability (P0 (¬White|¬Swan)) is also high and, consequently, so is the probability of the conclusion (P1 (¬White)). Therefore, an apparently irrational negative conclusion bias can be seen as a rational “high probability conclusion” effect. Oaksford et al. [2000] tested this explanation by manipulating P0 (p) and P0 (q) directly rather than using negations and showed results closely analogous to negative conclusion bias. To conclude this section on conditional inference, we briefly review one of the most cited problems for a probabilistic account. Like any computational level analysis, this account avoids theorising about the specific mental representations or algorithms involved in conditional reasoning. This may seem unsatisfactory. We suggest, by contrast, that it is premature to attempt an algorithmic analysis. The core of the probabilistic approach interprets conditionals in terms of conditional probability, i.e., using the Equation; and our current best understanding of conditional probability is given by the Ramsey test [Bennett, 2003]. But there is currently no possibility of building a full algorithmic model to carry through the Ramsey test, because this involves solving the notorious frame problem [Pylyshyn, 1987]. That is, it involves knowing how to update one’s knowledge-base, in the light of a new piece of information — and this problem has defied 40 years of artificial intelligence research. Nonetheless, an illustrative small-scale implementation of the Ramsey test is provided by the operation of a constraint satisfaction neural network [Oaksford, 2004; Oaksford and Chater, in press]. In such a model, performing a Ramsey test means clamping on or off the nodes or neurons corresponding to the categorical premise of a conditional inference. 
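The clamping operation just described might be sketched as follows; this is a toy, hand-built illustration (the single weight and bias are assumed values, chosen so that the settled activation approximates the conditional probability of .9 used in the tennis example), not the implementation of Oaksford [2004]:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Toy constraint-satisfaction sketch: one node per proposition. Clamping
# the "sunny" node on or off and letting activation propagate gives a
# hypothetical degree of belief in "tennis" -- a miniature Ramsey test.
# Weight and bias are invented for illustration.
WEIGHT_SUNNY_TENNIS = 2.0
BIAS_TENNIS = 0.2

def settle(sunny_clamped_to):
    # A single propagation step suffices in this tiny feed-forward case;
    # a real constraint-satisfaction network would iterate to a fixed point.
    net_input = BIAS_TENNIS + WEIGHT_SUNNY_TENNIS * sunny_clamped_to
    return sigmoid(net_input)

print(settle(1.0))  # clamp "sunny" on: belief in "tennis" is about .9
print(settle(0.0))  # clamp "sunny" off: belief drops toward the base rate
```

A full model would need recurrent settling over many interconnected nodes, which is where the frame problem mentioned above begins to bite.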
Network connectivity determines relevance relations and the weight matrix encodes prior knowledge. Under appropriate constraints, such a network can be interpreted as computing true posterior probabilities [McClelland, 1998]. A challenge for the future is to see whether such small-scale implementations can capture the full range of empirically observed effects in conditional inference.

Data Selection

Data selection involves choosing data to confirm or disconfirm a hypothesis, and it has been extensively investigated empirically using Wason’s [1968] selection task. This task has featured prominently in the philosophical discussions about human rationality (e.g., [Cohen, 1980; Stich, 1985; Stein, 1996]). In this task, people see four double-sided cards, with a number on one side and a letter on the other. They are asked which cards they should turn over, in order to test the hypothesis that if there is an A (p) on one side of a card, then there is a 2 (q) on the other. The upturned faces of the four cards show an A (p), a K (¬p), a 2 (q), and a 7 (¬q). As Popper [1959/1935] argued, logically one can never be certain that a scientific hypothesis is true in the light of observed evidence, as the very next piece of
evidence one discovers could be a counterexample. So just because all the swans you have observed up until now have been white is no guarantee that the next one will not be black. Instead, Popper argues that the only logically sanctioned strategy for hypothesis testing is to seek falsifying cases. In testing a conditional rule if p then q, this means seeking out p, ¬q cases. This means that, in the standard selection task, one should select the A (p) and the 7 (¬q) cards, because these are the only cards that could potentially falsify the hypothesis. However, as for conditional inference, there is a large divergence between this logical prediction and the data. Indeed, rather than seek falsifying evidence, participants seem to select the cases that confirm the conditional, i.e., the A (p) and the 2 (q). This is called “confirmation bias.” A Probabilistic Approach As with conditional inference, a variety of probabilistic approaches to data selection have been proposed [Evans and Over, 1996a; 1996b; Klauer, 1999; Nickerson, 1996; Over and Evans, 1994, Over and Jessop, 1998], which they all originate from the optimal data selection (ODS) model of Oaksford and Chater [1994] (see also, [1996; 2003b]). This model is derived from the normative literature on optimal experimental design in Bayesian statistics [Lindley, 1956]. The idea again relies on interpreting a conditional in terms of conditional probability. For example, the hypothesis, if swan (p) then white (p), is interpreted as making the claim that the probability of a bird being white given that it is a swan, P (q|p), is high, certainly higher than the base rate of being a white bird, P (q). This hypothesis is called the dependence hypothesis (HD ). Bayesian hypothesis testing is comparative rather than exclusively concentrating on falsification. 
Specifically, in the ODS model, it is assumed that people compare HD with an independence hypothesis (HI ) in which the probability of a bird being white, given it is a swan, is the same as the base rate of a bird being white, i.e., P (q|p) = P (q). We assume that, initially, people are maximally uncertain about which hypothesis is true (P (HD ) = P (HI ) = 0.5) and that their goal in selecting cards is to reduce this uncertainty as much as possible while turning the fewest cards. Take, for example, the card showing swan (p). This card could show white on the other side (p, q) or another color (p, ¬q). The probabilities of each outcome will be quite different according to the two hypotheses. For example, suppose that the probability of a bird being white, given that it is a swan, is .9 (P (q|p, HD ) = .9) in the dependence hypothesis; the marginal probability that a bird is a swan is .2 (P (p) = .2); and the marginal probability that a bird is white is .3 (P (q) = .3). Then, according to the dependence hypothesis, the probability of finding white (q) on the other side of the card is .9, whereas according to the independence hypothesis it is .3 (as the antecedent and consequent are, in this model, independent, we need merely consult the relevant marginal probability). And, according to the dependence hypothesis, the probability of finding a colour other than white (¬q) on the other side of the card is .1, whereas, according to the independence hypothesis,
it is .7. With this information, it is now possible to calculate one’s new degree of uncertainty about the dependence hypothesis after turning the swan card to find white on the other side (P (HD |p, q)). According to Bayes’ theorem, this probability is .75. Hence, one’s new degree of belief in the dependence model should be .75 and one’s degree of belief in the independence model should be .25. Hence, the degree of uncertainty about which hypothesis is true has been reduced. More specifically, the ODS model is based on information gain, where information is measured in bits as in standard communication theory [Shannon and Weaver, 1949]. Here, the initial uncertainty is 1 bit (because P (HD ) = P (HI ) = 0.5, equivalent to the uncertainty of a single fair coin flip) and in this example this is reduced to .81 bits (because now P (HD ) = .75 and P (HI ) = 0.25). This is an information gain of .19 bits. In Wason’s task, though, participants do not actually turn the cards, and hence they cannot know how much information they will gain by turning a card before doing so. Consequently, they must base their decision on expected information gain, taking both possible outcomes (p, q and p, ¬q) into account. The ODS model assumes that people select each card in direct proportion to its expected information gain. The ODS model also makes a key assumption about the task environment, the rarity assumption: that the properties that occur in the antecedents and consequents of hypotheses are almost always rare and so have a low base rate of occurrence. For example, most birds are not swans and most birds are not white. That people make this assumption has received extensive independent verification [McKenzie et al., 2001; McKenzie and Mikkelsen, 2000; 2007].

The Empirical Data

The ODS model predicts that the two cards that lead to the greatest expected information gain are the p and the q cards.
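The expected information gain calculation can be sketched as follows; the probabilities are those of the worked example above, while the implementation details are a reconstruction rather than the published model code. Under these rarity-respecting values, the gains come out in the order p > q > ¬q > ¬p, so the p and q cards are indeed the most informative.

```python
import math

def entropy(probabilities):
    # Shannon entropy in bits.
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

def ods_expected_gains(q_given_p, p, q):
    # Joint distributions over (antecedent, consequent) card faces under
    # the dependence hypothesis HD and the independence hypothesis HI.
    joints = {
        "HD": {(1, 1): q_given_p * p,
               (1, 0): (1 - q_given_p) * p,
               (0, 1): q - q_given_p * p,
               (0, 0): 1 - p - q + q_given_p * p},
        "HI": {(1, 1): p * q, (1, 0): p * (1 - q),
               (0, 1): (1 - p) * q, (0, 0): (1 - p) * (1 - q)},
    }
    prior = {"HD": 0.5, "HI": 0.5}
    gains = {}
    # Each card fixes one visible face: (index into the pair, visible value).
    for card, idx, val in [("p", 0, 1), ("not-p", 0, 0),
                           ("q", 1, 1), ("not-q", 1, 0)]:
        expected_entropy = 0.0
        for hidden in (1, 0):  # possible value on the hidden side
            face = [0, 0]
            face[idx], face[1 - idx] = val, hidden
            # Likelihood of the hidden value given the visible face.
            like = {h: joints[h][tuple(face)] /
                       sum(v for k, v in joints[h].items() if k[idx] == val)
                    for h in joints}
            p_hidden = sum(prior[h] * like[h] for h in joints)
            posterior = [prior[h] * like[h] / p_hidden for h in joints]
            expected_entropy += p_hidden * entropy(posterior)
        gains[card] = entropy(prior.values()) - expected_entropy
    return gains

# Probabilities from the worked example in the text.
gains = ods_expected_gains(q_given_p=0.9, p=0.2, q=0.3)
for card in ("p", "q", "not-q", "not-p"):
    print(card, round(gains[card], 3))
```

Turning the p card and finding q reduces the hypothesis uncertainty from 1 bit to about .81 bits, matching the .19-bit gain in the text; the values printed here are expectations over both possible hidden faces.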
Fitting the model to the data reveals a good fit [Oaksford and Chater, 2003b]: when P (q|p, HD ) was set to .9, the best fitting values of P (p) and P (q) were .22 and .27 respectively, i.e., very close to the values used in the above example. The ODS model suggests that performance on the selection task displays rational hypothesis testing behaviour, rather than irrational confirmation bias. Taking rarity to an extreme provides a simple intuition here. Suppose we consider the (rather implausible) conditional if a person is bitten by a vampire bat (p), they will develop pointed teeth (q). Clearly, we should check people who we know to have been bitten, to see if their teeth are pointed (i.e., turn the p card); and, uncontroversially, we can learn little from people we know have not been bitten (i.e., do not turn the ¬p card). If we see someone with pointed teeth, it is surely worth finding out whether they have been bitten — if they have, this raises our belief in the conditional, according to a Bayesian analysis (this is equivalent to turning the q card). But it seems scarcely productive to investigate someone without pointed teeth (i.e., do not turn the ¬q card) to see if they have been bitten. To be sure, it is possible that such a person might have been bitten,
Inductive Logic and Empirical Psychology
585
which would disconfirm our hypothesis, and lead to maximum information gain; but this has an almost infinitesimal probability. Almost certainly, we shall find that they have not been bitten, and learn nothing. Hence, with rarity, the expected informativeness of the q card is higher than that of the ¬q card, diverging sharply from the falsificationist perspective, but agreeing with the empirical data. It has been suggested, however, that behaviour on this task might be governed by what appears to be a wholly non-rational strategy: matching bias. This bias arises in the same context as the negative conclusion bias that we discussed above, i.e., in Evans' [1972] negations paradigm. Take, for example, the rule if there is an A on one side, then there is not a 2 on the other side (if p then ¬q). The cards in this task are described using their logical status, so for this rule, 2 is the false consequent (FC) card and 7 is the true consequent (TC) card. For this negated consequent rule, participants tend to select the A card (TA: true antecedent) and the 2 card (FC). That is, participants now seem to make the falsifying response. However, as Evans [1972] pointed out, participants may simply ignore the negations entirely and match the values named in the conditional, i.e., A and 2. Prima facie, this is completely irrational. However, the “contrast set” account of negation shows that, due to the rarity assumption — that most categories apply to a minority of items — negated categories are high probability categories (see above). Having a high probability antecedent or consequent alters the expected information gains associated with the cards. If the probability of the consequent is high then the ODS model predicts that people should make the falsifying TA and FC responses, because these are associated with the highest information gain. Consequently, matching bias is a rational hypothesis testing strategy after all. 
Probabilistic effects were first experimentally demonstrated using the reduced array version of Wason's selection task [Oaksford et al., 1997], where participants can successively select up to 15 q and 15 ¬q cards (there are no upturned p and ¬p cards that can be chosen). As predicted by the ODS model, where the probability of q is high (i.e., where rarity is violated), participants select more ¬q cards and fewer q cards. Other experiments have also revealed similar probabilistic effects [Green and Over, 1997; 2000; Kirby, 1994; Oaksford et al., 1999; Over and Jessop, 1998]. There have also been some failures to produce probabilistic effects, however (e.g., [Oberauer et al., 1999; 2004]). It has been argued that these arise because of weak probability manipulations or other procedural problems [Oaksford and Chater, 2003b; Oaksford and Moussakowski, 2004; Oaksford and Wakefield, 2003]. Using a natural sampling [Gigerenzer and Hoffrage, 1995] procedure, in which participants sample the frequencies of the card categories while performing a selection task, probabilistic effects have been observed using the same materials as Oberauer et al. [1999], where these effects were not evident [Oaksford and Wakefield, 2003]. In further work on matching bias, Yama [2001] devised a crucial experiment to contrast the matching bias and the information gain accounts. He used rules
586
Nick Chater, Mike Oaksford, Ulrike Hahn and Evan Heit
that introduced a high and a low probability category, relating to the blood types Rhesus Negative (Rh-) and Positive (Rh+). People were told that one of these categories, Rh-, was rare. Therefore, according to the ODS model, the rule if p then ¬Rh+ should lead participants to select the rare Rh- card. In contrast, according to matching bias they should select the Rh+ card. Yama's [2001] data were largely consistent with the information gain model. Moreover, this finding was strongly confirmed by using the natural sampling procedure with these materials [Oaksford and Moussakowski, 2004]. Alternative probabilistic accounts of the selection task have also been proposed [Evans and Over, 1996a; 1996b; Klauer, 1999; Nickerson, 1996; Over and Evans, 1994; Over and Jessop, 1998]. Recently, Nelson [2005] directly tested the measures of information underpinning these models, including Bayesian diagnosticity [Over and Evans, 1994; Evans and Over, 1996b; McKenzie and Mikkelsen, 2007], information gain [Oaksford and Chater, 1994; 1996; 2003b; Hattori, 2002], Kullback-Leibler distance [Klauer, 1999; Oaksford and Chater, 1996], probability gain (error minimization) [Baron, 1981; 1985], and impact (absolute change) [Nickerson, 1996]. Using a related data selection task, he looked at a range of cases where these norms predicted different orderings of informativeness, for various data types. Nelson found the strongest correlations between his data and information gain (.78). Correlations with diagnosticity (-.22) and log diagnosticity (-.41) were actually negative. These results mirrored Oaksford, Chater, and Grainger's [1999] results in the Wason selection task. Nelson's work provides strong convergent evidence for information gain as the index that most successfully captures people's intuitions about the relative importance of evidence.
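The competing indices Nelson compared can be illustrated with simplified, single-datum versions of each measure. The prior and likelihoods below are invented for illustration and are not Nelson's stimuli; with a non-uniform prior, the indices assign clearly different values to the very same datum:

```python
import math

def entropy(p):
    """Shannon entropy (bits) of a binary distribution {p, 1 - p}."""
    return 0.0 if p in (0.0, 1.0) else -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# Two hypotheses and one datum d (all numbers are illustrative assumptions).
prior = 0.7                    # P(H1); P(H2) = .3
lik_H1, lik_H2 = 0.8, 0.4      # P(d|H1), P(d|H2)
post = prior * lik_H1 / (prior * lik_H1 + (1 - prior) * lik_H2)

# Single-datum versions of the competing indices of evidential value:
info_gain = entropy(prior) - entropy(post)                 # uncertainty reduction
kl = (post * math.log2(post / prior)
      + (1 - post) * math.log2((1 - post) / (1 - prior)))  # Kullback-Leibler distance
prob_gain = max(post, 1 - post) - max(prior, 1 - prior)    # error minimization
impact = abs(post - prior)                                 # absolute change
diagnosticity = lik_H1 / lik_H2                            # likelihood ratio
log_diagnosticity = math.log2(diagnosticity)

# The norms come apart: here information gain and KL distance differ markedly.
assert abs(info_gain - kl) > 0.1
```

Note that with a uniform prior (.5/.5), information gain and KL distance would coincide for a single datum; a non-uniform prior is needed to separate them, which is why experiments contrasting the norms require carefully chosen cases.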
Quantified Syllogistic Reasoning Quantified syllogistic reasoning relates two quantified premises. Logic defines four types of quantified premise: All, Some, Some...not, and None. An example of a logically valid syllogistic argument is:

Some Londoners (P) are soldiers (Q)
All soldiers (Q) are well fed (R)
Therefore, Some Londoners (P) are well fed (R)
In this example, P and R are the end terms and Q is the middle term which is common to both premises. In the premises, these terms can only appear in four possible configurations, which are called figures. When one of these terms appears before the copula verb (“are”) it is called the subject term (in the example, P and Q) and when one appears after this verb it is called the predicate term (Q and R). As the premises can appear in either order there are 16 combinations, and as each can be in one of four figures there are 64 different syllogisms. There are 22 logically valid syllogisms. If people are reasoning logically, they should endorse these syllogisms and reject the rest. However, observed behaviour is graded, across both valid and invalid syllogisms; and some invalid syllogisms are
endorsed more than some valid syllogisms. Table 1 shows the graded behaviour over the 22 logically valid syllogisms. There are natural breaks dividing the valid syllogisms into three main groups. Those above the single line are endorsed most, those below the double line are endorsed least, and those in between are endorsed at an intermediate level.
Table 1. Meta-analysis of the logically valid syllogisms showing the form of the conclusion, the number of mental models that an alternative, non-probabilistic psychological account (mental models theory [Johnson-Laird, 1983]) needs to reach that conclusion, and the percentage of times the valid conclusion was drawn, in each of the five experiments analysed by Chater and Oaksford [1999].
Syllogism                     Conc.        MMs   Mean
All(Q,P), All(R,Q)            All          1     89.87
All(P,Q), All(Q,R)            All          1     75.32
All(Q,P), Some(R,Q)           Some         1     86.71
Some(Q,P), All(Q,R)           Some         1     87.97
All(Q,P), Some(Q,R)           Some         1     88.61
Some(P,Q), All(Q,R)           Some         1     86.71
No(Q,P), All(R,Q)             No           1     92.41
All(P,Q), No(R,Q)             No           1     84.81
No(P,Q), All(R,Q)             No           1     88.61
All(P,Q), No(Q,R)             No           1     91.14
------------------------------------------------------ (single line)
All(P,Q), Some...not(R,Q)     Some...not   2     67.09
Some...not(P,Q), All(R,Q)     Some...not   2     56.33
All(Q,P), Some...not(Q,R)     Some...not   2     66.46
Some...not(Q,P), All(Q,R)     Some...not   2     68.99
====================================================== (double line)
Some(Q,P), No(R,Q)            Some...not   3     16.46
No(Q,P), Some(R,Q)            Some...not   3     66.46
Some(P,Q), No(R,Q)            Some...not   3     30.38
No(P,Q), Some(R,Q)            Some...not   3     51.90
Some(Q,P), No(Q,R)            Some...not   3     32.91
No(Q,P), Some(Q,R)            Some...not   3     48.10
Some(P,Q), No(Q,R)            Some...not   3     44.30
No(P,Q), Some(Q,R)            Some...not   3     26.56

Note: The means in the final column are weighted by sample size.
A Probabilistic Approach There has only been one probabilistic approach developed for syllogisms. This is the Probability Heuristics Model (PHM) [Chater and Oaksford, 1999], which was developed at both the computational and the algorithmic levels. One of the primary motivations for this model was the hypothesis that, from a probabilistic point of view, reasoning about all and some might be continuous with reasoning about more transparently probabilistic quantifiers, such as most and few. By contrast, from a logical standpoint, such generalised quantifiers require a different, and far more complex, treatment [Barwise and Cooper, 1981], far beyond the resources of existing logic-based accounts in psychology. Perhaps for this reason, although generalised quantifiers were discussed in early mental models theory [Johnson-Laird, 1983], no empirical work on these quantifiers was carried out in the psychology of reasoning. In deriving PHM, the central first step is to assign probabilistic meanings to the central terms of quantified reasoning using conditional probability. Take the universally quantified statement, All P are Q (we use capitals to denote predicates; these should be applied to variables x which are bound by the quantifier, e.g., P(x), but we usually leave this implicit). Intuitively, the claim that All Londoners are soldiers can naturally be cast in probabilistic terms: as asserting that the probability that a person is a soldier, given that they are a Londoner, is 1. More generally, the probabilistic interpretation of All is straightforward, because its underlying logical form can be viewed as a conditional, i.e., (x)(if P(x) then Q(x)). Thus, the meaning is given as P(Q|P) = 1, i.e., as a constraint on the conditional probability of the predicate term (Q) given the subject term (P). Similar constraints can be imposed on this conditional probability to capture the meanings of the other logical quantifiers. 
So, Some P are Q means that P (Q|P ) > 0; Some P are not Q means that P (Q|P ) < 1; and No P are Q means that P (Q|P ) = 0. Thus, for example, “Some Londoners are soldiers” is presumed to mean that the probability that a person is a soldier given that they are a Londoner is greater than zero, and similarly for the other quantifiers. Such an account generalises smoothly to the generalised quantifiers most and few. Most P are Q means that 1 − Δ < P (Q|P ) < 1 and Few P are Q means that 0 < P (Q|P ) < Δ, where Δ is small. So, for example, Most Londoners are soldiers may be viewed as stating that the probability that a person is a soldier, given that they are a Londoner is greater than, say, .8, but less than 1. At the computational level, these interpretations are used to build very simple graphical models (e.g., [Pearl, 1988]) of quantified premises, to see if they impose constraints on the conclusion probability. For example, take the syllogism:
Some P are Q
All Q are R
Therefore, Some P are R

P → Q → R
The syllogistic premises on the left define the dependencies on the right because of their figure, i.e., the arrangement of the middle term (Q) and the end terms
(P and R) in the premises. There are four different arrangements or figures. The different figures lead to different dependencies, with different graphical structures. Note that these dependency models all imply that the end terms (P and R) are conditionally independent given the middle term, because there is no arrow linking P and R except via the middle term Q. Assuming conditional independence as a default is a further assumption about the environment, an assumption not made in, for example, Adams' [1998] probability logic. These dependency models can be parameterised. Two of the parameters will always be the conditional probabilities associated with the premises. One can then deduce whether the constraints on these probabilities, implied by the above interpretations, impose constraints on the possible conclusion probabilities, i.e., P(R|P) or P(P|R). In this example, the constraints that P(Q|P) > 0 and P(R|Q) = 1, together with the conditional independence assumption, entail that P(R|P) > 0. Consequently, the inference to the conclusion Some P are R is probabilistically valid (p-valid). If each of the two possible conclusion probabilities, P(R|P) or P(P|R), can fall anywhere in the [0, 1] interval given the constraints on the premises, then no p-valid conclusion follows. It is then a matter of routine probability theory to determine which of the 144 two-premise syllogisms that arise from combining most and few with the four logical quantifiers are p-valid [Chater and Oaksford, 1999]. In PHM, however, this rational analysis is also supplemented by an algorithmic account. It is assumed that people approximate the dictates of this rational analysis by using simple heuristics. Before introducing these heuristics, though, we must introduce two key notions: the informativeness of a quantified claim, and probabilistic entailment between quantified statements. 
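The interval argument above (that P(Q|P) > 0 and P(R|Q) = 1, together with conditional independence, force P(R|P) > 0) can be checked numerically. The following sketch simply samples the free parameters of the dependency model; it is a demonstration, not a general p-validity decision procedure:

```python
import random

random.seed(1)

# Dependency model P -> Q -> R with conditional independence:
# P(R|P) = P(R|Q) * P(Q|P) + P(R|not-Q) * (1 - P(Q|P)).
def conclusion_prob(q_given_p, r_given_q, r_given_not_q):
    return q_given_p * r_given_q + (1 - q_given_p) * r_given_not_q

# Premise constraints: "Some P are Q"  =>  P(Q|P) > 0
#                      "All Q are R"   =>  P(R|Q) = 1
# The nuisance parameter P(R|not-Q) is left completely free.
worst = 1.0
for _ in range(10_000):
    q_given_p = random.uniform(1e-9, 1.0)       # any value satisfying > 0
    r_given_not_q = random.uniform(0.0, 1.0)    # unconstrained
    worst = min(worst, conclusion_prob(q_given_p, 1.0, r_given_not_q))

# In every sampled parameterisation P(R|P) >= P(Q|P) > 0, so the conclusion
# probability never reaches 0: "Some P are R" is p-valid.
assert worst > 0.0
```

The algebra behind the assertion is immediate: with P(R|Q) = 1 the first summand alone already equals P(Q|P), so P(R|P) ≥ P(Q|P) > 0.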
According to communication theory, a claim is informative in proportion to how surprising it is: informativeness varies inversely with probability. But what is the probability of an arbitrary quantified claim? To make sense of this idea, we begin by making a rarity assumption, as in our models of conditional reasoning and the selection task, i.e., the subject and predicate terms apply to only small subsets of objects. On this assumption, if we selected a subject term P and a predicate term Q at random, then it is very likely that they will not cross-classify any object (this is especially true given the hierarchical character of classification [Rosch, 1975]). Consequently, P(Q|P) = 0, and so No P are Q is very likely to be true, e.g., No toupees are tables. Indeed, for any two randomly chosen subject and predicate terms it is probable that No P are Q. Such a statement is therefore quite uninformative. Some P are not Q is even more likely to be true, and hence still less informative, because the probability interval it covers includes that for No P are Q. The quantified claim least likely to be true is All P are Q, which is therefore the most informative. Overall, the quantifiers have the following order in informativeness: I(All) > I(Most) > I(Few) > I(Some) > I(None) > I(Some...not) (see [Oaksford et al., 2002] for further analysis and discussion). Informativeness applies to individual quantified propositions. The second background idea, probabilistic entailment, concerns inferential relations between quantified propositions. Specifically, the use of one quantifier frequently provides evidence that another quantifier could also have been used. Thus, the claim that All swans are white is strong evidence that Some swans are white — because P(white|swan) = 1 is included in the interval P(white|swan) > 0 (according to standard logic, this does not follow logically, as there may be no swans). Thus, we say that All probabilistically entails (or p-entails) Some. Similarly, Some and Some...not are mutually p-entailing, because the probability intervals P(Q|P) > 0 and P(Q|P) < 1 overlap almost completely. With this background in place, we can now state the Probability Heuristics Model (PHM) for syllogistic reasoning. There are two types of heuristic: generate heuristics, which produce candidate conclusions, and test heuristics, which evaluate the plausibility of the candidate conclusions. The PHM account also admits the possibility that putative conclusions may be tested by more analytic test procedures. The generate heuristics are:

(G1) Min-heuristic: The conclusion quantifier is the same as that of the least informative premise (min-premise).

(G2) P-entailments: The next most preferred conclusion quantifier will be the p-entailment of the min-conclusion.

(G3) Attachment-heuristic: If just one possible subject noun phrase (e.g., Some R) matches the subject noun phrase of just one premise, then the conclusion has that subject noun phrase.

The two test heuristics are:

(T1) Max-heuristic: Be confident in the conclusion generated by G1–G3 in proportion to the informativeness of the most informative premise (max-premise).

(T2) Some...not-heuristic: Avoid producing or accepting Some...not conclusions, because they are so uninformative.
We show how the heuristics combine in the example below:

All P are Q                   (max-premise)
Some R are not Q              (min-premise)
Some...not                    (by min-heuristic)
Therefore, Some R are not P   (by attachment-heuristic)

and a further conclusion can be drawn:

Some R are P   (by p-entailment)
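A minimal sketch of the generate heuristics, applied to this example. The attachment-heuristic is simplified here (we just take the end term that appears as a premise subject, preferring the min-premise), and the p-entailment lookup table is partial; both are our own approximations, not the published model:

```python
# Informativeness ordering from the text (least informative first):
ORDER = ["Some...not", "No", "Some", "Few", "Most", "All"]

# p-entailments mentioned in the text (a partial, assumed lookup table):
P_ENTAILS = {"All": "Some", "Some...not": "Some", "Some": "Some...not"}

def info(quantifier):
    return ORDER.index(quantifier)

def phm_conclusion(prem1, prem2, middle):
    """Each premise is (quantifier, subject, predicate).  Applies the
    generate heuristics G1-G3 in a simplified form."""
    min_p, max_p = sorted([prem1, prem2], key=lambda p: info(p[0]))
    quant = min_p[0]                                # G1: min-heuristic
    # G3: attachment (simplified): conclusion subject is the end term that
    # is the subject of a premise, preferring the min-premise.
    subject = next(p[1] for p in (min_p, max_p) if p[1] != middle)
    predicate = ({prem1[1], prem1[2], prem2[1], prem2[2]}
                 - {middle, subject}).pop()
    return (quant, subject, predicate), P_ENTAILS.get(quant)  # G2

concl, entailed = phm_conclusion(("All", "P", "Q"),
                                 ("Some...not", "R", "Q"), middle="Q")
assert concl == ("Some...not", "R", "P")   # "Some R are not P"
assert entailed == "Some"                  # further conclusion "Some R are P"
```

The max-heuristic (T1) would then attach high confidence to this conclusion, since the max-premise quantifier All is maximally informative.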
Comparing the results of these heuristics with probabilistic validity, it can be shown that where there is a p-valid conclusion, the heuristics generally identify it. For example, the idea behind the min-heuristic is to identify the most informative conclusion that validly follows from the premises. Out of the 69 p-valid syllogisms,
the min-heuristic identifies that conclusion for 54; for 14 syllogisms the p-valid conclusion is less informative than the min-conclusion. There is only one violation, where the p-valid conclusion is more informative than the min-conclusion. The Empirical Data In turning to the experimental results, we first show how all the major distinctions between standard syllogisms captured by other theories are also captured by PHM. So, returning to Table 1, all the syllogisms above the double line have the most informative max-premise, All (see heuristic T1). Moreover, all the syllogisms below the single line have uninformative conclusions, Some...not (see heuristic T2), and those below the double line violate the min-heuristic (heuristic G1) and require a p-entailment (heuristic G2), i.e., Some...not ↔ Some. Consequently, this simple set of probabilistic heuristics makes the same distinctions among the valid syllogisms as the mental models account, perhaps the most influential account of syllogistic reasoning [Johnson-Laird, 1983]. In this review, we concentrate on novel predictions that allow us to put clear water between PHM and other theories. As we discussed above, the most important feature of PHM is the extension to generalised quantifiers, like most and few. No other theory of reasoning has been applied to syllogistic reasoning with generalised quantifiers. Table 2 shows the p-valid syllogisms involving generalised quantifiers, showing the conclusion type and the percentage of participants selecting that conclusion type in Chater and Oaksford's [1999] Experiments 1 and 2. The single lines divide syllogisms with different max-premises, showing a clear ordering in levels of endorsements dependent on heuristic T1. All those above the double line conform to the min-heuristic (heuristic G1), whereas those below do not and require a p-entailment (heuristic G2). 
Table 2. The p-valid syllogisms less the syllogisms that are also logically valid (shown in Table 1), showing the form of the conclusion and the proportion of participants picking the p-valid conclusion in Chater and Oaksford's [1999] Experiments 1 and 2.

Syllogism                         Conc.        Mean
All(Q,P), Most(R,Q)               Most         85
Most(Q,P), All(R,Q)               Most         65
All(P,Q), Most(Q,R)               Most         70
Most(P,Q), All(Q,R)               Most         55
Few(P,Q), All(R,Q)                Few          80
All(P,Q), Few(R,Q)                Few          85
Few(P,Q), All(R,Q)                Few          85
All(P,Q), Few(Q,R)                Few          75
---------------------------------------------------- (single line)
Most(Q,P), Most(R,Q)              Most         65
Most(P,Q), Most(Q,R)              Most         50
Few(Q,R), Most(R,Q)               Few          60
Most(Q,R), Few(R,Q)               Few          75
Most(P,Q), Few(Q,R)               Few          70
Most(Q,P), Some...not(R,Q)        Some...not   80
Some...not(Q,P), Most(R,Q)        Some...not   60
Some...not(Q,P), Most(Q,R)        Some...not   75
Most(Q,P), Some...not(Q,R)        Some...not   65
Most(P,Q), Some...not(Q,R)        Some...not   75
Some...not(P,Q), Most(Q,R)        Some...not   75
---------------------------------------------------- (single line)
Few(Q,P), Some...not(R,Q)         Some...not   60
Some...not(Q,P), Few(R,Q)         Some...not   40
Some...not(Q,P), Few(Q,R)         Some...not   30
Few(Q,P), Some...not(Q,R)         Some...not   60
Few(P,Q), Some...not(Q,R)         Some...not   60
Some...not(P,Q), Few(Q,R)         Some...not   40
==================================================== (double line)
All(P,Q), Most(R,Q)               Some...not   35
Most(P,Q), All(R,Q)               Some...not   35
Few(Q,P), Few(R,Q)                Some...not   35
Few(P,Q), Few(Q,R)                Some...not   30
Few(P,Q), Most(Q,R)               Some...not   30

Note: This table excludes the eight MI, IM, FI, and IF syllogisms which have two p-valid conclusions only one of which was available in Chater and Oaksford's [1999] Experiment 2.

As Chater and Oaksford [1999] pointed out, one difference with experiments using standard logical quantifiers was that the Some...not conclusion was not judged to be as uninformative, i.e., heuristic T2 was not as frequently in evidence. However, in general, in experiments using generalised quantifiers in syllogistic arguments the heuristics of PHM predict the findings just as well as for the logical quantifiers [Chater and Oaksford, 1999]. Many further results have emerged that confirm PHM. We briefly discuss three of these results. First, the min-heuristic captures an important novel distinction between strong and weak possible conclusions introduced by Evans, Handley, Harper and Johnson-Laird [1999]. They distinguished conclusions that are necessarily true, possibly true or impossible. For example, taking the syllogism discussed earlier (with premises Some P are Q, All Q are R), the conclusion Some P are R follows necessarily, No P are R is impossible, and Some P are not R is possible. Some possible conclusions are endorsed by as many participants as the necessary conclusions [Evans et al., 1999]. Moreover, some of the possible conclusions were endorsed by as few participants as the impossible conclusions. Evans et al. [1999] observe that possible conclusions that are commonly endorsed all conform to the min-heuristic, whereas those which are rarely endorsed violate the min-heuristic (with one exception). Hence, PHM captures this important finding. Second, recent work relating memory span measures to syllogistic reasoning has also confirmed PHM [Copeland and Radvansky, 2004]. PHM makes similar predictions to mental models theory because the number of heuristics that need to be applied mirrors the one, two and three model syllogism distinction (see Table 1). For one model syllogisms just the min-heuristic and attachment are required (two heuristics). For two model syllogisms, the some...not-heuristic is also required (three heuristics). In addition, for three model syllogisms a p-entailment is required (four heuristics). The more mental operations that need to be performed, the more complex the inference will be and the more working memory it will require. Copeland and Radvansky [2004] found significant correlations between working memory span and strategy use, for both mental models and PHM. While not discriminating between theories, this work confirmed the independent predictions of each theory for the complexity of syllogistic reasoning and its relation to working memory span. Third, Copeland [2006] has provided detailed model fits to experimental data on “extended” syllogisms, i.e., syllogisms involving three quantified premises (he used only the four logical quantifiers). He fitted three different psychological models (see [Rips, 1994; Johnson-Laird and Byrne, 1991]) to these data, including PHM. Using a measure of fit that penalised for complexity, he found that PHM provided better fits to the data across two experiments. This is impressive, as these data only involved the logical quantifiers, which these other theories were explicitly designed to explain.
Summary: Deductive reasoning To conclude, a Bayesian probabilistic approach to the psychology of deductive reasoning seems to make sense of a fairly broad set of phenomena that would otherwise appear to question human rationality. There are other models, such as those proposed by Rips [1994] and Johnson-Laird and Byrne [1991; 2007], that address many of the same results. However, these theories invariably deal with deviations from rationality at the algorithmic level.

5 DECISION MAKING
Whereas reasoning concerns how people use given information to derive new information, the study of decision making concerns how people’s beliefs and values determine their choices. In the context of reasoning, there is fundamental debate concerning the most basic elements of a normative framework against which human performance should be compared (e.g., whether the framework should be logical [e.g., Johnson-Laird and Byrne, 1991; Rips, 1994] or probabilistic [Oaksford and Chater, 2007]). By contrast, expected utility theory is fairly widely assumed to be the appropriate normative theory to determine how, in principle, people ought to make decisions.
Expected utility theory works by assuming that each outcome, i, of a choice can be assigned a probability Pr(i) and a utility U(i), and that the expected utility of an uncertain choice (e.g., a lottery ticket; or, more generally, any action whose consequences are uncertain) is:

∑i Pr(i)U(i)

where the sum ranges over the possible outcomes i.
Expected utility theory recommends the choice with the maximum expected utility. This normative account is breathtakingly simple, but hides what may be enormous practical complexities — both in estimating probabilities; and establishing what people's utilities are. Thus, when faced with a practical personal decision (e.g., whether to take a new job, which house to buy, whether or whom to marry), decision theory is not easy to apply — because the possible consequences of each choice are extremely complex, their probabilities ill-defined, and moreover, we often have little idea what preferences we have, even if the outcomes were definite (e.g., [Gigerenzer, 2002]). Thus, one difficulty with expected utility theory is practicability in relation to many real-world decisions. Nonetheless, where probabilities and utilities can be estimated with reasonable accuracy, expected utility is a powerful normative framework. How far can expected utility theory be used as an explanation not merely for how agents should behave, but of how agents actually do behave? Rational choice theory, which provides a foundation for explanation in microeconomics and sociology (e.g., [Becker, 1976; 1996; Elster, 1986]) as well as perception and motor control [Körding and Wolpert, 2006], animal learning [Courville et al., 2006] and behavioral ecology [Krebs and Davies, 1996; Stephens and Krebs, 1986], assumes that it does. This style of explanation involves inferring the probabilities and utilities that agents possess; and using expected utility theory to infer their choices according to those probabilities and utilities. Typically, there is no specific commitment concerning whether or how the relevant probabilities and utilities are represented — instead, the assumption is that preferences and subjective probabilities are “revealed” by patterns of observed choices. 
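The expected-utility calculation can be made concrete as follows; the gamble, the monetary amounts, and the square-root utility function are illustrative assumptions, not drawn from any particular experiment:

```python
import math

def expected_utility(prospect, u=lambda x: x):
    """prospect: list of (probability, monetary outcome) pairs;
    u: a utility function over money (linear by default)."""
    return sum(p * u(x) for p, x in prospect)

lottery = [(0.5, 10.0), (0.5, 0.0)]   # 50-50 chance of £10 or nothing
certain = [(1.0, 5.0)]                # £5 for sure

# With linear utility the two options are valued identically:
assert expected_utility(lottery) == expected_utility(certain)

# With a concave (risk-averse) utility such as u(x) = sqrt(x),
# the certain option is strictly preferred:
assert expected_utility(certain, math.sqrt) > expected_utility(lottery, math.sqrt)
```

The second assertion illustrates why concavity of the utility function is the standard expected-utility gloss on risk aversion, a point that recurs in the discussion of prospect theory below.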
Indeed, given fairly natural consistency assumptions concerning how people choose, it can be shown that the observed pattern of choices can be represented in terms of expected utility — i.e., appropriate utilities and subjective probabilities can be inferred [Savage, 1954], with no commitment to their underlying psychological implementation. Indeed, this type of result can sometimes be used as reassurance that the expected utility framework is appropriate, even in complex real-world decisions, where people are unable to estimate probabilities or utilities. The descriptive study of how people make decisions has, as with the study of reasoning, taken the normative perspective as its starting point; and aimed to test experimentally how far normative assumptions hold good. In a typical experiment, outcomes are made as clear as possible: for example, people may choose between monetary gambles, with known probabilities; or between gambles
and fixed amounts of money. A wide range of systematic departures from the norms of expected utility are observed in such experiments, as demonstrated by the remarkable research programme initiated by Kahneman, Tversky and their colleagues (e.g., [Kahneman et al., 1982; Kahneman and Tversky, 2000]). Thus, for example, people can be induced to make different decisions, depending on how the problem is “framed.” For instance, if a person is given £10 at the outset, and told that they must choose either a gamble, with a 50% chance of keeping the £10, and a 50% chance of losing it all; or they must give back £5 for certain, they tend to prefer to take the risk. But if they are given no initial stake, but asked whether they prefer a 50-50 chance of £10, or a certain £5, they tend to play safe. Yet, from a formal point of view these choices are identical — the only difference is that in one case the choice is framed in terms of losses (where people tend to be risk-seeking) rather than gains (where they tend to be risk-averse). Expected utility theory cannot account for framing effects of this type — only the formal structure of the problem should matter, from a normative point of view; the way in which it is described should be irrelevant. Indeed, expected utility theory cannot easily account for the more basic fact that people are not risk neutral (i.e., neutral between gambles with the same expected monetary value) for small stakes [Rabin, 2000]. This is because, from the standpoint of expected utility theory, people ought to evaluate the possible outcomes of a gamble in “global” terms — i.e., in relation to the impact on their life overall. Hence, if a person has an initial wealth of £10,000, then both the gambles above amount to choosing between a 50-50 chance of ending up with a wealth of £10,010 or £10,000, or a certain wealth of £10,005. 
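The formal identity of the two frames can be spelled out by writing each option as a distribution over final wealth (the background wealth W is arbitrary):

```python
# Final-wealth distributions for the two frames described in the text,
# assuming an arbitrary initial (background) wealth W.
W = 10_000

# Loss frame: given £10, then either gamble (50% keep the £10, 50% lose it
# all) or hand back £5 for certain.
loss_frame_gamble  = {W + 10: 0.5, W + 0: 0.5}
loss_frame_certain = {W + 5: 1.0}

# Gain frame: no initial stake; a 50-50 chance of £10, or a certain £5.
gain_frame_gamble  = {W + 10: 0.5, W + 0: 0.5}
gain_frame_certain = {W + 5: 1.0}

# The two frames present extensionally identical choices:
assert loss_frame_gamble == gain_frame_gamble
assert loss_frame_certain == gain_frame_certain
```

Since the distributions over final wealth coincide exactly, any normative theory that evaluates choices by their outcomes must treat the two frames alike; the observed preference reversal is therefore a genuine violation.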
One reaction to this type of clash between human behaviour and rational norms is the observation that the human behaviour is error-prone — and hence, where this is true, expected utility will be inadequate as a descriptive theory of choice. A natural follow-up to this, though, is to attempt to modify the normative theory so that it provides a better fit with the empirical data. A wide range of proposals of this sort have been put forward, including prospect theory [Kahneman and Tversky, 1979; Tversky and Kahneman, 1992], regret theory [Loomes and Sugden, 1982], and rank-dependent utility theory [Quiggin, 1993]. Indeed, prospect theory, by far the most influential framework, was deliberately conceived as an attempt to find “the minimal set of modifications of expected utility theory that would provide a descriptive account” of risky choices ([Kahneman, 2000, p. 411], as cited in [Brandstätter et al., 2006]). In essence, prospect theory modifies expected utility theory in three main ways. First, monetary outcomes are considered in isolation, rather than aggregated as part of total wealth. This fits with the wider observation that people tend to view different amounts of money, or indeed goals, quantities or events of any kind, one by one, rather than forming a view of an integrated whole. This observation is the core of Thaler's [1985] “mental accounting” theory of how people make real-world financial decisions.
Second, prospect theory assumes that while the value function (i.e., relating money to subjective value) for positive gains is concave (indicating risk aversion in an expected utility framework), the value function for losses is convex. This implies that the marginal extra pain for an additional unit of loss (e.g., each extra pound or dollar lost) decreases with the size of the loss. Thus, people are risk-seeking when a gamble is framed in terms of losses, but risk-averse when it is framed in terms of gains, as we noted above. Moreover, the value function is steeper for losses than for gains, which captures the fact that most people are averse to gambles with a 1/2 chance of winning £10, and a 1/2 chance of losing £10 [Kahneman and Tversky, 1979]. This phenomenon, loss aversion, has been used to explain a wide range of real world phenomena, including the status quo bias (losing one thing and gaining another tends to seem unappealing, because the loss is particularly salient [Samuelson and Zeckhauser, 1988]) and the equity premium puzzle (share returns may be “unreasonably” high relative to fixed interest bonds, because people dislike falls in stock prices more than they like the equivalent gains [Benartzi and Thaler, 1995]). The final key modification of expected utility theory is that prospect theory assumes that people operate with a distorted representation of probability. They overestimate probabilities near zero and underestimate probabilities near 1, such that the relation between the probability, p(i), and the “decision weight”, w(i), which is assumed to determine people's choices, follows an inverse-S shape. According to prospect theory, this distortion can explain the so-called “four-fold pattern” of risky decision making — that, for small probabilities, risk-preferences reverse both for gains and losses. 
So, for example, when probabilities are high, e.g., .5, people prefer a certain gain of £500 to the probable gain of £1000, but they prefer the probable loss of £1000 to the certain loss of £500. When probabilities are low, e.g., .0005, people prefer a probable gain of £1000 to the certain gain of 50p, but they prefer the certain loss of 50p to the probable loss of £1000. The machinery of prospect theory integrates values and decision weights to assign a value to each gamble (where a gamble is any choice with an uncertain outcome), just as in expected utility theory, so that the value of a risky option is:
Σᵢ w(i)v(i)
where w(i) is the decision weight (i.e., distorted probability) for outcome i, and v(i) is the value of that outcome.

Psychological Models not Rooted in Economics

Prospect theory and other variants of expected utility hold to the assumption that people represent value and probability on some kind of absolute internal scale, and that they integrate these values by summing the product of weight and value over possible outcomes, to obtain the value of each gamble.
Inductive Logic and Empirical Psychology
597
Two recent psychological theories, however, set aside the structure of expected utility theory; they are inspired not by the attempt to modify normative considerations, but instead by the attempt to trace the consequences of assumptions about the cognitive system. One recent approach [Brandstätter et al., 2006] focuses on processing limitations, and on the consequences of assuming that the cognitive system is not able to integrate different pieces of information, and that, instead, people can only focus on one piece of information at a time. This assumption is controversial. In perceptual judgements (e.g., concerning the identity of a phoneme, or the depth of a surface), many theories explicitly assume (linear) integration between different sources of information [Massaro, 1987; Schrater and Kersten, 2000] — in a probabilistic framework, this corresponds, roughly, to adding logs of the strength of evidence provided by each cue. Note, moreover, that such cue integration appears to be computationally natural in neural hardware (e.g., [Deneve et al., 2001]). Many models of higher-level judgement have also assumed that information is integrated, typically linearly (e.g., [Anderson, 1981; Hammond, 1996]). However, Gigerenzer and colleagues (e.g., [Gigerenzer and Goldstein, 1996; Gigerenzer et al., 1999]) have influentially argued that high-level judgements — most famously, concerning the larger of pairs of German cities — do not involve integration. Instead, judgement is assumed to involve considering cues one at a time — if a cue determines which city is likely to be larger, that city is selected; if not, a further cue is chosen, and the process is repeated. There has been considerable, and ongoing, controversy concerning the circumstances under which integration does or does not occur in the context of judgement [Hogarth and Karelaia, 2005a; 2005b].
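A minimal sketch of such a one-reason procedure, in the spirit of Gigerenzer and Goldstein’s [1996] take-the-best heuristic (the cue coding here is invented for illustration), makes the contrast with linear integration plain:

```python
def take_the_best(cues_a, cues_b):
    """Decide which of two objects (e.g., cities) scores higher on a criterion.

    cues_a and cues_b hold cue values (1, 0, or None for unknown) for the two
    objects, listed in descending order of cue validity. The first cue that
    discriminates decides; information is never integrated across cues.
    """
    for ca, cb in zip(cues_a, cues_b):
        if ca is not None and cb is not None and ca != cb:
            return "a" if ca > cb else "b"
    return None  # no cue discriminates: guess
```

Each object is described by binary cues ordered by validity (e.g., “is a capital”, “has a top-league football team”); the first discriminating cue settles the judgement, and every later cue is simply ignored.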
Brandstätter, Gigerenzer and Hertwig’s [2006] innovation is to show that a non-integrative model can make in-roads into understanding how people make risky decisions — a situation which has been viewed as involving the trade-off between “risk” and “return” almost by definition. Their model, the priority heuristic, has the following basic form. For gambles which contain only gains (or £0), the heuristic recommends considering features of the gambles in the order: minimum gain, probability of minimum gain, maximum gain. If gains differ by at least 1/10 of the maximum gain (or, for comparison of probabilities, if probabilities differ by at least 1/10), choose the gamble which is “best” on that feature (defined in the obvious way). Otherwise, move to the next feature in the list, and repeat. To see how this works, consider the gambles used above to illustrate the “four-fold” pattern of risky choice, described by Kahneman and Tversky [1979]. For the high probability gamble over gains, the minimum gain for the certain outcome is £500, but the minimum gain for the risky gamble is £0; this difference is far more than 1/10 of the maximum gain, £1000. Hence, the safe option is preferred. By contrast, for the low probability gamble, the difference between the minimum gains for the options is just 50p, which is much less than 1/10 of the maximum gain of £1000. Hence, this reason is abandoned, and we switch to probability of minimum gain — this is clearly higher for a certain gamble, as there is only one outcome, which is by definition the minimum. The risky gamble, with the smaller
probability of minimum gain, is therefore preferred. Thus, we have risk seeking with small probabilities of large gains (and hence an explanation of why people buy lottery tickets). Brandstätter, Gigerenzer and Hertwig propose a modification of the heuristic for gambles containing just losses, where “gain” is replaced by “loss” throughout, so that the feature order is: minimum loss, probability of minimum loss, maximum loss. If losses differ by at least 1/10 of the maximum loss (or probabilities differ by at least 1/10), choose the gamble which is “best” on that feature (defined in the obvious way). Otherwise, move to the next feature in the list, and repeat. Tracing through the reasoning described above, for the “loss” gambles in the “four-fold” pattern of risky choice, shows that people should appear risk seeking for losses, except where there is a small probability of a large loss; here people will again be risk averse (e.g., they will buy insurance). The priority heuristic does, however, make some extremely strong and counterintuitive predictions — e.g., that if the minimum gains differ sufficiently, then all other features of the gambles (including the probability of obtaining those gains) will have no impact on choice. In extreme cases, this seems implausible. For example, a certain 11p would be preferred to a .999999 probability of £1 (and otherwise £0). Brandstätter, Gigerenzer and Hertwig [2006] restrict their account, however, to cases for which the expected values of the gambles are roughly comparable — where they are not, the gamble with the obviously higher expected value is chosen, and the priority heuristic is not invoked.

Another recent approach to risky decision making, starting from cognitive principles rather than a normative economic account, is Decision by Sampling (DbS) [Stewart et al., 2006]; see also [Stewart and Simpson, in press].
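Before taking up DbS, the gains version of the priority heuristic can be rendered as a short sketch (a simplified reconstruction from the verbal description above, not Brandstätter et al.’s own implementation). Gambles are lists of (gain, probability) pairs:

```python
def min_gain(g):
    return min(x for x, _ in g)

def max_gain(g):
    return max(x for x, _ in g)

def prob_min(g):
    """Total probability of receiving the minimum gain."""
    m = min_gain(g)
    return sum(p for x, p in g if x == m)

def priority_choice_gains(a, b):
    """Choose between two gain-only gambles by the priority heuristic."""
    aspiration = 0.1 * max(max_gain(a), max_gain(b))
    # Reason 1: minimum gain (higher is better).
    if abs(min_gain(a) - min_gain(b)) >= aspiration:
        return a if min_gain(a) > min_gain(b) else b
    # Reason 2: probability of the minimum gain (lower is better).
    if abs(prob_min(a) - prob_min(b)) >= 0.1:
        return a if prob_min(a) < prob_min(b) else b
    # Reason 3: maximum gain (higher is better).
    return a if max_gain(a) > max_gain(b) else b
```

On the gambles above, this sketch picks the certain £500 over the .5 chance of £1000, and the risky .0005 chance of £1000 over the certain 50p, matching the choices described in the text.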
The starting point of DbS is the psychophysical observation that people can accurately make binary comparisons concerning the louder, or brighter, of two sensory magnitudes, but are extremely poor at judging the absolute magnitudes of such stimuli. Thus, for example, people can typically assign sensory magnitudes, however widely spaced, to no more than about five classes [Miller, 1956]; and even these crude judgements are subject to influences of the previous stimuli (e.g., [Garner, 1953]). Indeed, to a first approximation, people’s judgements can be well modeled by assuming that they have little or no coding of absolute magnitudes, but merely make relative judgements based on the “jumps” between successive magnitudes (for a detailed model along these lines, see [Stewart et al., 2005]). It seems natural to assume that the representation of non-sensory magnitudes may behave similarly. If so, then the “gut feel” of how much value is associated with a particular amount of money or a particular probability may be dissociated from the absolute quantities involved. Instead, the DbS framework argues that such magnitudes are judged against a small number of other similar magnitudes, derived either from immediate context, or from memory. The rank of an item is, on this view, all that influences its subjective representation. Thus, if people have been thinking about small sums of money, a medium-sized sum of money may seem large; if they have been thinking about larger sums, the same medium-sized sum
may seem small. This viewpoint assumes that people have no underlying internal “scales” for utility or probability — but, nonetheless, it turns out to be possible to reconstruct something analogous to the value and decision weight functions from prospect theory. If people assess the gut feel of a magnitude in relation to prior examples, the statistical distribution of such magnitudes is likely to be important. Other things being equal, this distribution will provide an estimate of the probabilities of different comparison items being considered in particular judgements. Thus, if small sums of money are much more commonly encountered than large sums of money, then it is much more likely that people will consider small sums of money as comparison items, other things being equal. Therefore, the difference in “gut” feel between £5 and £50 will be much greater than that between £1005 and £1050, because sampling an item in the first interval (so that the lower and upper items will be assigned different ranks) is much more likely than sampling in the second. More generally, the attractiveness of an option, according to DbS, is determined by its rank in the set of comparison items; and hence its typical attractiveness (across many sampling contexts) can be estimated by its rank position in a statistical sample of occurrences of the relevant magnitude. To examine this hypothesis, Stewart, Chater, and Brown [2006] examined a sample of “positive” sums of money — credits into accounts from a high street bank — and showed that plotting monetary value against rank produces a concave function, reminiscent of those in utility theory and prospect theory. Thus, the “gut” attractiveness of a sum of money is, on average, a diminishing function of amount. A similar analysis for losses (using bank account debits as a proxy) yields a convex function of value against losses, as in prospect theory.
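The rank-based story can be illustrated with a toy sketch; the sample of bank credits below is invented for illustration, and matters only in being skewed toward small amounts, as real credits are:

```python
def subjective_value(x, sample):
    """Rank-based 'gut feel': the proportion of comparison amounts that x matches or beats."""
    return sum(1 for s in sample if s <= x) / len(sample)

# Hypothetical, positively skewed sample of account credits (in pounds).
credits = [1, 2, 3, 5, 8, 10, 15, 20, 30, 50,
           80, 100, 150, 200, 300, 500, 800, 1000, 1500, 2000]
```

Because small amounts dominate the sample, moving from £5 to £50 changes rank substantially, while moving from £1005 to £1050 changes it hardly at all: a concave “utility” curve recovered from ranks alone.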
Moreover, for losses, the statistical distribution is more skewed towards small items, which has the consequence that ranks change more rapidly for small values for losses than for gains. This corresponds to a steeper value curve for losses than for gains, and hence captures loss aversion. Indeed, putting the curves of rank against value together yields a curve strikingly reminiscent of that postulated in prospect theory. Applying the same logic to probability requires estimating the probabilities that people typically consider. Stewart et al. [2006] attempt this by recording the corpus frequencies of probability-related phrases (e.g., likely, slight chance, probable, extremely doubtful, and so on); and, secondly, by asking people to assign numerical probabilities to these phrases. This analysis yielded an estimate of the probabilities that people typically consider — and, perhaps not surprisingly, these are dense near 0 and 1. According to DbS, the gut feel of how large a probability seems depends on relative rank in this distribution — yielding an inverse S-shaped curve, as in prospect theory. Thus, DbS can capture many of the insights of prospect theory, and explain, rather than postulate, the relevant functional forms (e.g., concerning an analog of the inverse S-shaped probability weighting function in prospect theory); but it is also able to predict strong local contextual effects, which are presumed to be determined by local sampling biases (e.g., [Stewart et al., 2003]). The key gap in DbS is, though, the lack of a detailed theory of how sampling occurs,
in any specific decision making context.

Most decision making research has concentrated on verbally stated “one-shot” problems. But there has been a long tradition in psychology of studying how people (and animals) make repeated decisions, typically under some schedule of reinforcement [Shanks, 1995], which has led to a range of computational models, many within a fully or partially Bayesian framework [Kruschke, 2006]. There has also been recent interest in directly comparing “decision-by-experience” with performance on descriptive decision problems [Hertwig et al., 2004]. Early results appear to indicate that behaviour is different — for example, it has been argued that people may under- rather than over-weight small probabilities when learning from experience (although see [Fox and Hadar, 2006]). This work is particularly interesting in the light of recent Bayesian models of reinforcement learning in animals and humans (e.g., [Courville et al., 2006]), and of imaging studies which are beginning to connect Bayesian decision making models with brain function [Daw et al., 2006]. If aspects of human, and even rat, learning are Bayesian, the normative failure of human choice in descriptive problems seems all the more puzzling. Moving further away from descriptive decision problems, there has been recent investigation of how decision problems framed in terms of perceptuo-motor tasks are performed (e.g., [Trommershäuser et al., 2006]). This is particularly interesting, given the recent surge of interest in sophisticated Bayesian decision-theoretic models of perceptuo-motor control. The spirit of these models is that the motor system may implement (approximations to) highly elaborate probabilistic calculations, in order to reduce costs concerning energy consumption, motor error, or costs learned directly from experience in an experimental set-up (e.g., [Körding and Wolpert, 2006]). Trommershäuser et al.
[2006] have shown how people adjust the direction of pointing towards a “target,” which provides monetary reward, but for which losses are incurred if the target is missed. It appears that the motor system rapidly adapts so that gain is maximized, in a way that is adapted to the intrinsic motor error involved in pointing. Again, the contrast of such apparently Bayesian behaviour with performance on descriptive choice problems is intriguing.
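The logic of such perceptuo-motor decisions can be sketched as expected-gain maximization under Gaussian motor noise; the regions, payoffs, and noise level below are invented for illustration, not taken from Trommershäuser et al.’s experiments:

```python
from math import erf, sqrt

def region_prob(aim, lo, hi, sigma):
    """Probability that a Gaussian end-point centred on `aim` lands in [lo, hi]."""
    cdf = lambda z: 0.5 * (1 + erf(z / sqrt(2)))
    return cdf((hi - aim) / sigma) - cdf((lo - aim) / sigma)

def expected_gain(aim, sigma):
    # Illustrative layout: a reward region [0, 1] pays 100, and an
    # adjacent penalty region [-1, 0] costs 500 if hit.
    return 100 * region_prob(aim, 0, 1, sigma) - 500 * region_prob(aim, -1, 0, sigma)

# Grid-search the aim point that maximizes expected gain for motor noise sigma = 0.3.
best_aim = max((a / 100 for a in range(0, 201)), key=lambda a: expected_gain(a, 0.3))
```

The optimal aim point shifts away from the centre of the reward region, towards its far edge, precisely because the penalty region makes undershooting costly; this is the kind of adjustment observed in pointing behaviour.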
Summary: Decision Making

In both reasoning and decision making, indeed, there is a certain air of paradox in human performance [Oaksford and Chater, 1998]. Human common-sense reasoning is far more sophisticated than any current artificial intelligence models can capture; yet people’s performance on, e.g., simple conditional inference, while perhaps explicable in probabilistic terms, is by no means effortless and noise-free; and similarly, in decision making, it appears that “low-level” repeated decision making may be carried out effectively (where, in the context of motor control, the complexity of the decision problem of planning trajectories for the motor system typically far exceeds the capabilities of current methods [Todorov, 2004]). But perhaps this situation is not entirely paradoxical. It may be that both human reasoning and decision-making function best in the context of highly adapted cognitive processes
such as basic learning, deploying world knowledge, or perceptuo-motor control. Indeed, what is striking about human cognition is the ability to handle, even to a limited extent, reasoning and decision making in novel, hypothetical, verbally stated scenarios, for which our past experience and evolutionary history may have provided us only minimal preparation.

6 ARGUMENTATION

Reasoning and decision making often take place in the service of argumentation, i.e., the attempt to persuade yourself or others of a particular, perhaps controversial, position [van Eemeren and Grootendorst, 1992]. Argumentation is the overarching human activity that studies of deductive reasoning, inductive reasoning, judgment and decision making are really required to explain. So one might attempt to persuade someone else to accept a controversial standpoint p by trying to persuade them that p is actually a logical consequence of their prior beliefs or current commitments; or that p has strong inductive support; or, where p is an action, that p will help to achieve their, our, or the country’s current goals. Recently, a Bayesian inductive logic approach has been extended to at least some aspects of argumentation (e.g., [Hahn and Oaksford, 2007]). The approach is very similar to the accounts of conditional inference we reviewed above. We are concerned with how the premises, P, of an argument affect the probability of the conclusion, C. If P(C|P) is high, then the argument has high inductive strength. This account has been applied most directly to reasoning fallacies, in the attempt to understand how some instances of a fallacy seem to be good arguments while others do not. Fallacies — arguments that seem correct but aren’t, e.g., denying the antecedent — have been a longstanding focus of debate. Catalogues of reasoning and argumentative fallacies originate with Aristotle and populate books on logic and informal reasoning to this day.
The classic tool brought to the analysis of fallacies is formal logic, and it is widely acknowledged to have failed to provide a satisfactory account. Testament to this is the fact that fallacies figure in logic textbooks under the heading of ‘informal reasoning fallacies’ (see e.g., [Hamblin, 1970]) — an acknowledgement of the inability to provide a sufficient formal logical treatment. In particular, logical accounts have proved unable to capture the seeming exceptions to fallacies that arise with simple changes in content that leave the structure of the argument unaffected. This suggests either that it is not formal aspects of fallacies that make them fallacious, or else that the relevant formal aspects are not being tapped into by classical logics. Oaksford and Hahn [2004] (see also [Hahn and Oaksford, 2006; 2007; Hahn et al., 2005a; Hahn et al., 2005b]) provided evidence of such variation and put forward an alternative, Bayesian account: individual arguments are composed of a conclusion and premises expressing evidence for that conclusion. Both conclusion and premises have associated probabilities, which are viewed as expressions of subjective degrees of belief. Bayes’ theorem then provides an update rule for the degree of belief associated with the conclusion in light of the evidence. Inductive strength,
then, on this account, is a function of the degree of prior conviction, the probability of the evidence, and the relationship between the claim and the evidence — in particular, how much more likely the evidence would be if the claim were true. That is, different instances of argumentative fallacies may vary in inductive strength, conceived of as the probability of the conclusion given the premises. Oaksford and Hahn [2007] also show how the concept of inductive strength in argumentation is related to the probabilistic analysis of the conditional (see above) and to recent discussion in [Rips, 2001]. We illustrate this approach by appeal to a particular informal reasoning fallacy: the argument from ignorance.

A Probabilistic Approach

A classic informal argument fallacy, which dates back to John Locke, is the so-called argument from ignorance, or argumentum ad ignorantiam.

(7) Ghosts exist, because nobody has proven that they don’t.

This argument does indeed seem weak. One would hesitate to posit the existence of all manner of things whose non-existence simply had not been proven, whether these be UFOs or flying pigs with purple stripes. However, is it really the general structure of this argument that makes it weak, and, if so, what aspect of it is responsible? Other arguments from negative evidence are routine in scientific and everyday discourse and seem perfectly acceptable:

(8) This drug is safe, because no-one has found any toxic effects.

Should all arguments from negative evidence be avoided, or can a systematic difference between the two examples be recognized and explained? A Bayesian account can capture the difference between (7) and (8), as we show below.
Moreover, it can capture the difference between positive and negative evidence, which allows one to capture the intuition that the positive argument (9) is stronger than the negative argument (10):[3]

(9) Drug A is toxic because a toxic effect was observed (positive argument)

(10) Drug A is not toxic because no toxic effects were observed (negative argument, i.e., the argument from ignorance)

[3] One might argue that (9) and (10) are problematic because replacing “not toxic” with “safe” would alter the status of these arguments. This is not the case, because we do not have a concept of a “safe effect.” The tests are tests for toxic effects. So (10) could be rephrased as “Drug A is safe because no toxic effects were observed,” but not as “Drug A is safe because safe effects were observed.” As the observation of toxic effects is driving these distinctions, what “safe” means in this context must be defined in terms of toxicity in order to define the relevant probabilities.

Though (10) too can be acceptable where a legitimate test has been performed, i.e.:
If drug A were toxic, it would produce toxic effects in legitimate tests.
Drug A has not produced toxic effects in such tests.
Therefore, A is not toxic.

Demonstrating the relevance of Bayesian inference for negative vs. positive arguments involves defining the conditions for a legitimate test. Let e stand for an experiment where a toxic effect is observed and ¬e for an experiment where a toxic effect is not observed; likewise, let T stand for the hypothesis that the drug produces a toxic effect and ¬T for the alternative hypothesis that the drug does not produce toxic effects. The strength of the argument from ignorance is given by the conditional probability that the hypothesis, T, is false given that a negative test result, ¬e, is found, P(¬T|¬e). This probability is referred to as negative test validity. The strength of the argument we wish to compare with the argument from ignorance is given by positive test validity, i.e., the probability that the hypothesis, T, is true given that a positive test result, e, is found, P(T|e). These probabilities can be calculated from the sensitivity (P(e|T)) and the selectivity (P(¬e|¬T)) of the test and the prior belief that T is true (P(T)) using Bayes’ theorem. Let n denote sensitivity, i.e., n = P(e|T), l denote selectivity, i.e., l = P(¬e|¬T), and h denote the prior probability of drug A being toxic, i.e., h = P(T); then:

(11) P(T|e) = nh / (nh + (1 − l)(1 − h))

(12) P(¬T|¬e) = l(1 − h) / (l(1 − h) + (1 − n)h)
Sensitivity corresponds to the “hit rate” of the test, and 1 minus the selectivity corresponds to the “false positive rate.” Positive test validity is greater than negative test validity as long as the following inequality holds:

(13) h²(n − n²) > (1 − h)²(l − l²)

Assuming maximal uncertainty about the toxicity of drug A, i.e., P(T) = .5 = h, this means that positive test validity, P(T|e), is greater than negative test validity, P(¬T|¬e), when selectivity (l) is higher than sensitivity (n) (given that both exceed .5). As Oaksford and Hahn [2004] argue, this condition is often met in practice for a variety of clinical and psychological tests. Therefore, in a variety of settings, positive arguments are stronger than negative arguments.

The Empirical Data

Oaksford and Hahn [2004] provided experimental evidence to the effect that positive arguments such as (9) are indeed viewed as more convincing than their negative counterparts under the conditions just described. The evidence from their experiment further showed that people are sensitive to manipulations in the amount
of evidence (one versus 50 studies or tests), as predicted by the account. Finally, participants in their experiment displayed sensitivity to the degree of prior belief a character in a dialogue initially displayed toward the conclusion, as the Bayesian account predicts. This finding captures the ‘audience dependence’ of argumentation assumed in the rhetorical research tradition (e.g., [Perelman and Olbrechts-Tyteca, 1969]). Hahn et al. [2005a] generalised this account to other versions of the argument from ignorance and addressed an outstanding problem. The ghosts example (7) differs from Oaksford and Hahn’s [2004] experimental materials in one, possibly important, way. The argument for ghosts not only involves negative evidence, but also a flip in polarity between evidence and conclusion: negative evidence is provided to support the positive existence of something. In other words, the inference is of the form:

(14) not proven (not exist) → exist

as opposed to merely:

(15) not proven (exist) → not exist

The examples in Oaksford and Hahn [2004] have the structure in (15), not the structure in (14). But it may be the opposite polarity case (14) that constitutes the true fallacy of the argument from ignorance. Classical logic licenses an inference from not(not p) to p, but not the inference underlying (14), which might be rendered as:

(16) not says (not p) → ?

This is because when one has not said ‘not p,’ one can either have said ‘p’ or not have spoken about ‘p’ at all. For example, in an argument one might defend oneself with the claim “I didn’t say you were rude”, which could be true either because one had specifically claimed the opposite or because one had not mentioned rudeness at all. So maybe nothing at all can be inferred in such cases? Hahn et al.
[2005a] established that (16) can be a strong argument by using a form of the argument from ignorance based on epistemic closure, which is related to the negation-as-failure procedure in artificial intelligence [Clark, 1978]. The case can be made with an informal example: imagine your work colleagues are having a staff picnic. You ask the picnic organizer whether your colleague Smith is coming and receive the reply that “Smith hasn’t said that he’s not coming”. Should this allow you to infer that he is in fact coming, or has he simply failed to send the required reply by e-mail? Your confidence that Smith will be attending will vary depending on the number of people that have replied. If you are told that no one has replied so far, assuming Smith’s attendance seems premature; if, by contrast, you are told that everyone has replied, you would be assured of his presence. In between these two extremes, your degree of confidence will be scaled: the more people have replied, the more confident you will be. In other words, the epistemic closure of the database in question (the e-mail inbox of the organizer) can
vary from no closure whatsoever to complete closure, giving rise to corresponding changes in the probability that not says (not p) does in fact suggest that p. Hahn et al.’s [2005a] experiments confirmed that people are sensitive to variations in the epistemic closure of a database and that this affects their willingness to endorse arguments like (16). Moreover, they found that arguments like (16) can be regarded as stronger than the standard negative evidence case (15). Therefore, as our example suggested, there would seem to be nothing in the structure of arguments like the ghosts example that makes them inherently unacceptable. The real reasons why negative evidence on ghosts is weak, i.e., why (7) is a weaker argument than (8), are the lack of sensitivity (ability to detect ghosts) of our tests, as well as our low prior belief in their existence; i.e., (7) is weak because of the probabilistic factors that affect the strength of the argument. Hahn and Oaksford [2006; 2007] have shown how this account generalises to other inferential fallacies, such as circularity and the slippery slope argument.
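The computations in (11) and (12) are easy to check numerically; the sensitivity, selectivity, and prior in this sketch are illustrative values only:

```python
def positive_validity(n, l, h):
    """P(T|e), eq. (11): probability the drug is toxic, given a positive test."""
    return n * h / (n * h + (1 - l) * (1 - h))

def negative_validity(n, l, h):
    """P(not-T|not-e), eq. (12): probability the drug is not toxic, given a negative test."""
    return l * (1 - h) / (l * (1 - h) + (1 - n) * h)

# Maximal uncertainty (h = .5) with selectivity above sensitivity (l = .9 > n = .7):
# the positive argument comes out stronger, as inequality (13) predicts.
```

Here positive_validity(.7, .9, .5) is 0.875 against 0.75 for negative_validity, so observing a toxic effect supports toxicity more strongly than a clean test supports safety.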
Summary: Argumentation

In summary, in this section we have shown how comparing arguments in terms of their inductive strength can resolve the problem of why some instances of informal argument fallacies nonetheless seem like perfectly acceptable arguments that should rationally persuade an audience.

7 CHALLENGES AND FUTURE DIRECTIONS

The models we have discussed in this review generally treat probabilistic methods as shedding important light on cognitive processes, although in a variety of ways, and at a variety of levels of explanation, as we have seen. Yet these applications of probability can, individually and collectively, be criticized — and the debates between proponents of probabilistic methods and advocates of alternative viewpoints have played an important role in the development of the cognitive sciences, and are likely to continue to do so. We briefly consider here some of the many concerns that may be raised against probabilistic approaches. Probabilistic approaches may be especially vulnerable, as noted above, when considered as models of explicit reasoning. As we have mentioned, there have been repeated demonstrations that explicit human decision making systematically deviates from Bayesian decision theory [Kahneman et al., 1982; Kahneman and Tversky, 2000]. Why might such deviations occur? Since Simon [1957], computational tractability has been a primary concern — with the conclusion that computationally cheap heuristic methods, which function reasonably well in the ecological environment in which the task must be performed, should be viewed as an alternative paradigm. Bounded rationality considerations have gradually become increasingly important in economics (e.g., [Rubinstein, 1998]) — and hence economists have increasingly begun to question the usefulness of strong rationality assumptions, such that agents are viewed as implicit probabilists and decision
theorists. Gigerenzer [Gigerenzer et al., 1999] has led a particularly influential programme of research, aiming to define an “ecological” rationality, in which good reasoning is that which works quickly and effectively in the real world, rather than necessarily being justified in terms of normative mathematical foundations. This viewpoint may still see a role for probabilistic analysis — but as providing an explanation of why particular heuristics work in particular environments, rather than as characterizing the calculations that the cognitive system performs (a similar approach is adopted in the probability heuristics model of quantified syllogistic reasoning). A very different reason why people may not, in some contexts, be viewed as probabilists or decision theorists concerns representation, rather than processing power. Some researchers (e.g., [Laming, 1997]) argue that people can only represent sensory magnitudes in relative rather than absolute terms, and that even this relative coding is extremely inaccurate and unstable. Indeed, the radical assumption that, to an approximation, people can make only simple qualitative binary judgements (e.g., “tone A is louder than tone B”; “the difference in loudness between tones A and B is smaller than the difference in loudness between tones B and C”) is the basis for a recent model, the Relative Judgment Model [Stewart et al., 2005], that provides a simple and comprehensive account of how people can assign sensory magnitudes to discrete categories. If the same principles apply to magnitudes involved in decision making (e.g., time, probability, value, quality, and so on), then this suggests that people may not have a stable cardinal representation of the relevant decision variables from which probabilistic calculations (of expected utility and the like) must begin — and hence the issue of computational considerations does not even arise.
The recent model of risky decision making, Decision by Sampling [Stewart et al., 2006], mentioned above, shows how the assumption that people have no access to internal scales, but rely instead purely on binary judgments, can provide a straightforward account of many well-known phenomena concerning risky decision making. This type of approach has been extended to consider how far anomalies of choice between items with multiple dimensions, which must be traded off, can be explained in this framework. The concern that people do not have the appropriate representations over which probabilistic calculations can be performed may be most pressing in the context of explicit reasoning — where the underlying computational machinery has not been finely adapted over a long evolutionary history to solve a stable class of problems (such as perceiving depth, or reaching and grasping), but rather the cognitive system is finding an ad hoc solution, as best it can, to each fresh problem. Thus, as noted above, we may accept that explicit reasoning with probability may be poor, while proposing that underlying computational processes of perception, motor control, learning and so on should be understood in probabilistic terms. Interestingly, though, related challenges to the Bayesian approach have arisen in perception. For example, Purves and colleagues (e.g., [Howe and Purves, 2004; 2005; Long and Purves, 2002; Nundy and Purves, 2002]) argue that the perceptual system should not be viewed as attempting to reconstruct the external world using
Inductive Logic and Empirical Psychology
607
Bayesian methods. Instead, they suggest that the output of the perceptual system should be seen as determined by the ranking of the present input in relation to the statistical distribution of previous inputs. This viewpoint is particularly clearly expressed in the context of lightness perception. The perceived lightness of a patch in the sensory array is determined not merely by the amount of incident energy in that patch, and its spectral composition, but is also a complex function of the properties of the area surrounding that patch. For example, a patch on the sensory array will be perceived as light if it is surrounded by a dark field; and may be perceived as relatively dark, if surrounded by a light field. A natural Bayesian interpretation of this type of phenomenon is that the perceptual system is attempting to factor out the properties of the light source, and to represent only the reflectance function of the surface of the patch (i.e., the degree to which that patch reflects incident light). Thus, the dark surrounding field is viewed as prima facie evidence that the lighting is dim; and hence the patch itself is viewed as reflective; a bright surrounding field appears to support the opposite inference. This type of analysis can be formulated elegantly in probabilistic terms [Knill and Richards, 1996]. Purves and colleagues argue, instead, that the percept should not be viewed as reconstructing an underlying reflectance function — or indeed any other underlying feature of the external world. Instead, they suggest that the background field provides a context in which statistics concerning the amount of incident light are collected; and the lightness of a particular patch, in that context, is determined by its rank in that statistical distribution. 
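The rank-based proposal can be sketched computationally. This is our own minimal illustration, not the model of Purves and colleagues; the function name and all luminance values are invented.

```python
import bisect

def perceived_lightness(patch_luminance, context_samples):
    """Rank (in [0, 1]) of the patch's luminance within the
    distribution of luminances previously encountered in the same
    kind of context.  No reflectance function is reconstructed."""
    ordered = sorted(context_samples)
    return bisect.bisect_left(ordered, patch_luminance) / len(ordered)

# Invented luminance statistics: patches seen inside dark surrounds
# tend to be dim; patches inside light surrounds tend to be bright.
dark_surround = [2, 4, 5, 7, 9, 12]
light_surround = [40, 55, 60, 70, 85, 90]

# An identical patch (luminance 30) tops the dark-surround
# distribution but sits at the bottom of the light-surround one,
# so it looks light in the first context and dark in the second.
print(perceived_lightness(30, dark_surround))   # 1.0
print(perceived_lightness(30, light_surround))  # 0.0
```

Note that the computation uses only the ordinal position of the input within context-conditional statistics, with no inference about light sources or surface properties.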
Thus, when the surround is dark, patches within that surround tend to be dark (e.g., because both may be explained by the presence of a dim light source); when the surround is light, patches in that surround tend to be light. Hence, the rank position of an identical patch will differ in the two cases, leading to contrasting lightness percepts. Nundy and Purves [2002] conduct an extensive analysis of the statistical properties of natural images, and argue that the resulting predictions frequently depart from the predictions of the Bayesian analysis; and that the rank-based statistical analysis better fits the psychophysical data. Various responses from a Bayesian standpoint are possible — including, most naturally, the argument that, where statistical properties of images diverge from the properties of an underlying probabilistic model, this is simply an indication that the probabilistic model is incomplete. Thus, a revised Bayesian approach may account for apparent anomalies, as the model comes more accurately to capture the statistical properties of images. To some degree, this response may seem unsatisfying, as the ability to choose between the enormous variety of probabilistic image models may seem to give the Bayesian excessive theoretical latitude. On the other hand, the choice of model is actually strongly constrained, precisely because its output can directly be tested, to see how far it reproduces the statistical properties of natural images [Yuille and Kersten, 2006]. But the challenge posed by Purves’s approach is the claim that the probabilistic machinery of the Bayesian approach is unnecessary — that there is a much more direct explanation of perceptual experience, which does not involve factoring apart luminance levels and reflectance
608
Nick Chater, Mike Oaksford, Ulrike Hahn and Evan Heit
functions; but which works directly with the amount of incident light in field and surround; and which considers only ordinal properties of relevant statistical distributions, rather than the absolute magnitudes that appear to be appropriate to a Bayesian analysis. Whether such calculations should best be viewed as departing entirely from the probabilistic approach, or rather as an illustration of how probabilistic calculations can be approximated cheaply, by analogy with heuristic-based approaches to decision making, is not clear. A more general objection to the probabilistic approach to cognition, which we have touched on already, is the complexity of the approach. In one sense, the probabilistic approach is elegantly simple — we need simply assign prior probabilities, and then remorselessly follow the laws of the probability calculus, as further data arise. But in another sense, it is often highly complex — because assigning priors to patterns of belief, images, or sentences may require specifying an extremely complex probabilistic model, from which such information can be generated. Thus, the cognitive modeller may sometimes be accused of putting so much complexity into the model that the ability to capture the relevant data is hardly impressive. This chapter illustrates, however, that the balance between model complexity and the data explained need not be unfavourable. Moreover, the contribution of Bayesian models may often be in providing qualitative explanations (e.g., for why there should be a direct relationship between the probability of recurrence of an item, and its retrievability from memory; e.g., [Anderson and Milson, 1989; Anderson and Schooler, 1991; Schooler and Anderson, 1997]). Despite this, however, the question of how to constrain probabilistic models as far as possible is an important one. One approach, for example, is to take representation, rather than probability, as the basic construct. 
According to this approach, the preferred interpretation of a set of data is that which can be used to provide the shortest encoding of that data. Thus, the problem of probabilistic inference is replaced by a problem of finding short codes. It turns out that there are very close relationships between the two approaches, based on both Shannon’s theory of communication [Shannon and Weaver, 1949; MacKay, 2003] and the more general concept of algorithmic information, quantified by Kolmogorov complexity theory [Li and Vitányi, 1997]. These relationships are used to argue that the two approaches make identical behavioral predictions [Chater, 1996]. Roughly, the idea is that representations may be viewed as defining priors, such that, for any object x, with a shortest code of length c(x), the prior Pr(x) is 2^(−c(x)). Conversely, for any prior distribution Q(x) (subject to mild computability constraints that need not detain us here), there will be a corresponding system of representation (i.e., a coding language) c_Q, such that, for any data x, probable representations or hypotheses, H_i, will correspond to those which provide the shortest codes for x. This means, roughly, that the probabilistic objective of finding the most probable hypothesis can be replaced by the coding objective of finding the hypothesis that supports the shortest code. The equivalence of these frameworks can be viewed as resolving a long-standing dispute between simplicity and likelihood (i.e., probabilistic) views of perceptual organization (e.g., [Pomerantz and Kubovy, 1987]),
as argued by Chater [1996]. Despite these close relationships, taking representation and coding as basic notions has certain advantages. First, the cognitive sciences arguably already have a substantial body of information concerning how different types of information are represented — certainly this has been a central topic of experimental and theoretical concern; by contrast, the project of assessing probabilistic models directly seems more difficult. Second, priors are frequently required for representations which presumably have not been considered by the cognitive system. In a standard Bayesian framework, we typically define a space of hypotheses, and assign priors over that space; but we may also wonder what prior would be assigned to a new hypothesis, if it were considered (e.g., if a particular pattern is noticed by the perceptual system; or if a new hypothesis is proposed by the scientist). Assuming that the coding language is universal, these priors are well-defined, even for an agent that has not considered them — the prior probability of any hypothesis H is presumed to be 2^(−c(H)). Third, rooting priors in a coding language frees the cognitive system from the problem of explicitly having to represent such prior information (though this may be done in a very elegant and compact form; see, e.g., [Tenenbaum et al., 2006]). Work on coding-based approaches to inference (e.g., [Barron et al., 1998; Hutter, 2004; Li and Vitányi, 1997; Rissanen, 1987; 1996; Wallace and Freeman, 1987]), as well as on applications to cognition (e.g., [Brent and Cartwright, 1996; Chater and Vitányi, 2007; Dowman, 2000; Feldman, 2000; Goldsmith, 2001; Pothos and Chater, 2002]), has been divided over whether a coding-based approach to inference should be viewed as a variant of the probabilistic account (i.e., roughly, as using code lengths as a particular way of assigning priors), or whether it should be viewed as an alternative approach. 
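The correspondence between code lengths and priors can be illustrated with a toy calculation. All code lengths below are invented for illustration; the two-part-code reading follows the general minimum-description-length idea, not any specific model discussed in the text.

```python
# Hypothetical code lengths (in bits) for three hypotheses -- all
# values invented.  A shortest code of length c(H) induces the
# prior Pr(H) proportional to 2 ** -c(H).
code_length = {"H1": 2.0, "H2": 3.0, "H3": 3.0}

def prior_from_codes(lengths):
    # The Kraft inequality only guarantees the raw masses sum to <= 1,
    # so renormalise to obtain a proper prior distribution.
    raw = {h: 2.0 ** -c for h, c in lengths.items()}
    z = sum(raw.values())
    return {h: p / z for h, p in raw.items()}

priors = prior_from_codes(code_length)
assert priors["H1"] > priors["H2"] == priors["H3"]  # shorter code, larger prior

# Two-part code: describe the hypothesis (c(H)), then the data under
# it (c(D|H)).  Minimising c(H) + c(D|H) corresponds to maximising
# Pr(H) * Pr(D|H), i.e. to choosing the Bayesian MAP hypothesis,
# since -log2 of the posterior is, up to a constant, this sum.
data_bits = {"H1": 5.0, "H2": 3.5, "H3": 6.0}  # invented c(D|H) values
best = min(code_length, key=lambda h: code_length[h] + data_bits[h])
print(best)  # "H2": 6.5 total bits beats 7.0 (H1) and 9.0 (H3)
```

The shortest-total-code hypothesis and the maximum-posterior hypothesis coincide here by construction, which is the equivalence the text describes.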
One argument for the former, harmonious, interpretation is that the probabilistic interpretation appears necessary if we consider choice. Thus, for example, maximizing expected utility (or similar) requires computing expectations — i.e., knowing the probability of various outcomes. Thus, rather than viewing simplicity-based approaches as a rival to the probabilistic account of the mind, we instead tentatively conclude that they should be viewed as an alternative, and often useful, perspective on probabilistic inference.

CONCLUSION

This chapter has introduced the ways in which inductive logic has been applied in empirical psychology to provide models of a range of high-level cognitive abilities. Although Bayesian methods have been applied at a variety of levels of explanation of cognition and perception, we have concentrated in the main on central cognitive processes [Fodor, 1983]. These are the processes of central concern in philosophical logic, i.e., those where the inferences involved can be expressed verbally and where a clear delineation between premises and conclusion can be made. In language, inductive reasoning, deductive reasoning, argumentation, and decision making, we
have shown that inductive logic has been able to provide new insights into the processes involved. Thus, in recent years it seems inductive logic has facilitated many promising developments in the attempt to understand human cognition.

ACKNOWLEDGEMENTS

Nick Chater is supported by a Senior Research Fellowship from the Leverhulme Trust, and the ESRC Centre for Economic Learning and Social Evolution (ELSE).

BIBLIOGRAPHY

[Adams, 1975] E. Adams. The logic of conditionals: An application of probability to deductive logic. Dordrecht: Reidel, 1975. [Adams, 1998] E. Adams. A primer of probability logic. Stanford, CA: CSLI Publications, 1998. [Adelson, 1993] E. H. Adelson. Perceptual organization and the judgment of brightness. Science, 262, 2042-2044, 1993. [Adelson and Pentland, 1996] E. H. Adelson and A. P. Pentland. The perception of shading and reflectance. In D. Knill and W. Richards (Eds.), Perception as Bayesian Inference (pp. 409-423). Cambridge: Cambridge University Press, 1996. [Akaike, 1974] H. Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19, 716–723, 1974. [Anderson, 1990] J. R. Anderson. The adaptive character of thought. Hillsdale, NJ: Lawrence Erlbaum Associates, 1990. [Anderson, 1991a] J. R. Anderson. Is human cognition adaptive? Behavioral and Brain Sciences, 14, 471-517, 1991. [Anderson, 1991b] J. R. Anderson. The adaptive nature of human categorization. Psychological Review, 98, 409-429, 1991. [Anderson and Matessa, 1998] J. R. Anderson and M. Matessa. The rational analysis of categorization and the ACT-R architecture. In M. Oaksford and N. Chater (Eds.), Rational models of cognition (pp. 197-217). Oxford, England: Oxford University Press, 1998. [Anderson and Milson, 1989] J. R. Anderson and R. Milson. Human memory: An adaptive perspective. Psychological Review, 96, 703–719, 1989. [Anderson and Schooler, 1991] J. R. Anderson and L. J. Schooler. Reflections of the environment in memory. 
Psychological Science, 1, 396–408, 1991. [Anderson, 1981] N. H. Anderson. Foundations of information integration theory. New York, NY: Academic Press, 1981. [Aristotle, 1908] Aristotle. Nicomachean Ethics (W. D. Ross, trans.). Oxford, England: Clarendon Press, 1908. [Attneave, 1954] F. Attneave. Some informational aspects of visual perception. Psychological Review, 61, 183-193, 1954. [Baayen and Moscoso del Prado Martín, 2005] R. H. Baayen and F. Moscoso del Prado Martín. Semantic density and past-tense formation in three Germanic languages. Language, 81(3), 666-698, 2005. [Barlow, 1959] H. B. Barlow. Possible principles underlying the transformation of sensory messages. In W. Rosenblith (Ed.), Sensory Communication (pp. 217-234). Cambridge, MA: MIT Press, 1959. [Baron, 1981] J. Baron. An analysis of confirmation bias. Paper presented at the 22nd Annual Meeting of the Psychonomic Society, 1981. [Baron, 1985] J. Baron. Rationality and intelligence. Cambridge, England: Cambridge University Press, 1985. [Barron et al., 1998] A. R. Barron, J. Rissanen, and B. Yu. The minimum description length principle in coding and modeling. IEEE Transactions on Information Theory, IT-44, 2743-2760, 1998.
[Barsalou, 1987] L. W. Barsalou. The instability of graded structure: Implications for the nature of concepts. In U. Neisser (Ed.), Emory Symposia in Cognition 1, Concepts and Conceptual Development: Ecological and Intellectual Factors in Categorisation (pp. 101-140). Cambridge, England: Cambridge University Press, 1987. [Barwise and Cooper, 1981] J. Barwise and R. Cooper. Generalized quantifiers and natural language. Linguistics and Philosophy, 4, 159–219, 1981. [Becker, 1976] G. Becker. The economic approach to human behavior. Chicago: Chicago University Press, 1976. [Becker, 1996] G. Becker. Accounting for tastes. Cambridge, MA: Harvard University Press, 1996. [Benartzi and Thaler, 1995] S. Benartzi and R. H. Thaler. Myopic loss aversion and the equity premium puzzle. Quarterly Journal of Economics, 110, 73-92, 1995. [Bennett, 2003] J. Bennett. A philosophical guide to conditionals. Oxford, England: Oxford University Press, 2003. [Berger, 1985] J. Berger. Statistical decision theory and Bayesian analysis. New York, NY: Springer–Verlag, 1985. [Bernardo and Smith, 1994] J. M. Bernardo and A. F. M. Smith. Bayesian theory. New York, NY: Wiley, 1994. [Bernoulli, 1713] J. Bernoulli. Ars conjectandi (The art of conjecturing; trans. and notes by E. D. Sylla). Baltimore, MD: Johns Hopkins University Press, 1713/2005. [Blake et al., 1996] A. Blake, H. H. Bulthoff, and D. Sheinberg. Shape from texture: ideal observers and human psychophysics. In D. Knill and W. Richards (Eds.), Perception as Bayesian Inference (pp. 287–321). Cambridge: Cambridge University Press, 1996. [Blakemore, 1990] C. Blakemore. Vision Coding and Efficiency. Cambridge: Cambridge University Press, 1990. [Blei et al., 2004] D. M. Blei, T. L. Griffiths, M. I. Jordan, and J. B. Tenenbaum. Hierarchical topic models and the nested Chinese restaurant process. Advances in Neural Information Processing Systems 16, Cambridge, MA: MIT Press, 2004. [Bod et al., 2003] R. Bod, J. Hay, and S. Jannedy, eds. 
Probabilistic Linguistics. Cambridge, MA: MIT Press, 2003. [Bogacz, 2007] R. Bogacz. Optimal decision-making theories: linking neurobiology with behaviour. Trends in Cognitive Sciences, 11, 118-125, 2007. [Boole, 1854] G. Boole. An investigation of the laws of thought. London: Macmillan, 1854. Reprinted by Dover Publications, New York (1958). [Boolos and Jeffrey, 1980] G. Boolos and R. C. Jeffrey. Computability and logic (2nd Edition). Cambridge, England: Cambridge University Press, 1980. [Bovens and Hartmann, 2003] L. Bovens and S. Hartmann. Bayesian Epistemology. Oxford: Clarendon Press, 2003. [Braine, 1978] M. D. S. Braine. On the relation between the natural logic of reasoning and standard logic. Psychological Review, 85, 1-21, 1978. [Brandstätter et al., 2006] E. Brandstätter, G. Gigerenzer, and R. Hertwig. The priority heuristic: Making choices without trade-offs. Psychological Review, 113, 409-432, 2006. [Brent and Cartwright, 1996] M. R. Brent and T. A. Cartwright. Distributional regularity and phonotactic constraints are useful for segmentation. Cognition, 61, 93-125, 1996. [Brunswik, 1955] E. Brunswik. Representative design and probabilistic theory in a functional psychology. Psychological Review, 62, 193–217, 1955. [Carnap, 1950] R. Carnap. Logical foundations of probability. 2nd Edition. Chicago: University of Chicago Press, 1950. [Charniak, 1997] E. Charniak. Statistical parsing with a context-free grammar and word statistics. In Proceedings of the 14th National Conference on Artificial Intelligence (pp. 598-603). Cambridge, MA: AAAI Press, 1997. [Chater, 1996] N. Chater. Reconciling simplicity and likelihood principles in perceptual organization. Psychological Review, 103, 566–581, 1996. [Chater, 2004] N. Chater. What can be learned from positive data? 
Insights from an ‘ideal learner.’ Journal of Child Language, 31, 915-918, 2004. [Chater and Manning, 2006a] N. Chater and C. Manning. Probabilistic models of language processing and acquisition. Trends in Cognitive Sciences, 10, 287-291, 2006.
[Chater and Manning, 2006b] N. Chater and C. Manning. Probabilistic models of language processing and acquisition. Trends in Cognitive Sciences, 10, 335-344, 2006. [Chater and Oaksford, 1999] N. Chater and M. Oaksford. The probability heuristics model of syllogistic reasoning. Cognitive Psychology, 38, 191-258, 1999. [Chater and Oaksford, 2008] N. Chater and M. Oaksford, eds. The probabilistic mind: Prospects for Bayesian cognitive science. Oxford, England: Oxford University Press, 2008. [Chater and Vitányi, 2007] N. Chater and P. Vitányi. ‘Ideal learning’ of natural language: Positive results about learning from positive evidence. Journal of Mathematical Psychology, 51, 135-162, 2007. [Chater et al., 1998] N. Chater, M. Crocker, and M. Pickering. The rational analysis of inquiry: The case of parsing. In M. Oaksford and N. Chater (Eds.), Rational models of cognition (pp. 441-469). Oxford, England: Oxford University Press, 1998. [Chater et al., 2006] N. Chater, J. B. Tenenbaum, and A. Yuille. Special issue on probabilistic models of cognition. Trends in Cognitive Sciences, 10, 287-344, 2006. [Cheng, 1997] P. W. Cheng. From covariation to causation: A causal power theory. Psychological Review, 104, 367–405, 1997. [Cheng and Holyoak, 1985] P. W. Cheng and K. J. Holyoak. Pragmatic reasoning schemas. Cognitive Psychology, 17, 391-416, 1985. [Chomsky, 1957] N. Chomsky. Syntactic Structures. The Hague: Mouton, 1957. [Chomsky, 1965] N. Chomsky. Aspects of the theory of syntax. Cambridge, Massachusetts: MIT Press, 1965. [Chomsky, 1981] N. Chomsky. Lectures on Government and Binding. Dordrecht: Foris, 1981. [Cristianini and Shawe-Taylor, 2000] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. 
Cambridge: Cambridge University Press, 2000. [Christiansen and Chater, 2001] M. H. Christiansen and N. Chater, eds. Connectionist psycholinguistics. Westport, CT: Ablex, 2001. [Clark, 1978] K. L. Clark. Negation as failure. In H. Gallaire and J. Minker (Eds.), Logic and databases (pp. 293-322). New York: Plenum Press, 1978. [Cohen, 1981] L. J. Cohen. Can human irrationality be experimentally demonstrated? Behavioral and Brain Sciences, 4, 317-370, 1981. [Collins, 2003] M. Collins. Head-driven statistical models for natural language parsing. Computational Linguistics, 29(4), 589-637, 2003. [Copeland, 2006] D. Copeland. Theories of categorical reasoning and extended syllogisms. Thinking and Reasoning, 12, 379-412, 2006. [Copeland and Radvansky, 2004] D. Copeland and G. A. Radvansky. Working memory and syllogistic reasoning. Quarterly Journal of Experimental Psychology, 57A, 1437-1457, 2004. [Cosmides, 1989] L. Cosmides. The logic of social exchange: Has natural selection shaped how humans reason? Studies with the Wason selection task. Cognition, 31, 187-276, 1989. [Cosmides and Tooby, 2000] L. Cosmides and J. Tooby. Evolutionary psychology and the emotions. In M. Lewis and J. M. Haviland-Jones (Eds.), Handbook of Emotions, 2nd Edition (pp. 91-115). New York, NY: Guilford, 2000. [Courville et al., 2006] A. C. Courville, N. D. Daw, and D. S. Touretzky. Bayesian theories of conditioning in a changing world. Trends in Cognitive Sciences, 10, 294-300, 2006. [Crocker and Brants, 2000] M. W. Crocker and T. Brants. Wide-coverage probabilistic sentence processing. Journal of Psycholinguistic Research, 29, 647-669, 2000. [Culicover, 1999] P. W. Culicover. Syntactic Nuts. Oxford: Oxford University Press, 1999. [Cummins, 1995] D. D. Cummins. Naïve theories and causal deduction. Memory and Cognition, 23, 646-658, 1995. [Daelemans and van den Bosch, 2005] W. Daelemans and A. van den Bosch. Memory-based language processing. Cambridge: Cambridge University Press, 2005. [Daston, 1988] L. Daston. 
Classical probability in the enlightenment. Princeton, NJ: Princeton University Press, 1988. [Davidson, 1984] D. Davidson. Inquiries into truth and interpretation. Oxford: Oxford University Press, 1984. [Daw et al., 2006] N. D. Daw, J. P. O’Doherty, B. Seymour, P. Dayan, and R. J. Dolan. Cortical substrates for exploratory decisions in humans. Nature, 441, 876-879, 2006.
[Dayan and Abbott, 2001] P. Dayan and L. F. Abbott. Theoretical neuroscience: Computational and mathematical modeling of neural systems. Cambridge, MA: MIT Press, 2001. [Deneve et al., 2001] S. Deneve, P. E. Latham, and A. Pouget. Efficient computation and cue integration with noisy population codes. Nature Neuroscience, 4, 826-831, 2001. [Dennis, 2005] S. Dennis. A memory-based theory of verbal cognition. Cognitive Science, 29, 145-193, 2005. [Desmet et al., 2006] T. Desmet, De Baecke, Drieghe, Brysbaert and Vonk. Relative clause attachment in Dutch: On-line comprehension corresponds to corpus frequencies when lexical variables are taken into account. Language and Cognitive Processes, 21, 453-485, 2006. [Desmet and Gibson, 2003] T. Desmet and E. Gibson. Disambiguation preferences and corpus frequencies in noun phrase conjunction. Journal of Memory and Language, 49, 353-374, 2003. [Dickstein, 1978] L. S. Dickstein. The effect of figure on syllogistic reasoning. Memory and Cognition, 6, 76-83, 1978. [Dowman, 2000] M. Dowman. Addressing the learnability of verb subcategorizations with Bayesian inference. In L. R. Gleitman and A. K. Joshi (Eds.), Proceedings of the Twenty-Second Annual Conference of the Cognitive Science Society. Mahwah, NJ: Erlbaum, 2000. [Earman, 1992] J. Earman. Bayes or bust? Cambridge, MA: MIT Press, 1992. [Edgington, 1995] D. Edgington. On conditionals. Mind, 104, 235–329, 1995. [Edwards, 1954] W. Edwards. The theory of decision making. Psychological Bulletin, 51, 380-417, 1954. [Eemeren and Grootendorst, 1992] F. H. van Eemeren and R. Grootendorst. Argumentation, communication, and fallacies. Hillsdale, NJ: Lawrence Erlbaum, 1992. [Elman, 1990] J. L. Elman. Finding structure in time. Cognitive Science, 14, 179-211, 1990. [Elster, 1986] J. Elster, ed. Rational choice. Oxford: Basil Blackwell, 1986. [Evans, 1972] J. St. B. T. Evans. Reasoning with negatives. British Journal of Psychology, 63, 213-219, 1972. [Evans et al., 2003] J. St. B. T. 
Evans, S. H. Handley, and D. E. Over. Conditionals and conditional probability. Journal of Experimental Psychology: Learning, Memory and Cognition, 29, 321-355, 2003. [Evans et al., 1993] J. St. B. T. Evans, S. E. Newstead, and R. M. J. Byrne. Human Reasoning. Hillsdale, NJ: Lawrence Erlbaum Associates, 1993. [Evans and Handley, 1999] J. St. B. T. Evans and S. J. Handley. The role of negation in conditional inference. Quarterly Journal of Experimental Psychology, 52A, 739-769, 1999. [Evans and Over, 1996a] J. St. B. T. Evans and D. E. Over. Rationality and reasoning. Hove, Sussex: Psychology Press, 1996. [Evans and Over, 1996b] J. St. B. T. Evans and D. E. Over. Rationality in the selection task: Epistemic utility versus uncertainty reduction. Psychological Review, 103, 356-363, 1996. [Evans et al., 1999] J. St. B. T. Evans, S. J. Handley, C. N. J. Harper, and P. N. Johnson-Laird. Reasoning about necessity and possibility: A test of the mental model theory of deduction. Journal of Experimental Psychology: Learning, Memory, and Cognition, 25, 1495-1513, 1999. [Evans and Over, 2004] J. St. B. T. Evans and D. E. Over. If. Oxford, England: Oxford University Press, 2004. [Fanselow et al., 2006] G. Fanselow, C. Féry, R. Vogel, and M. Schlesewsky, eds. Gradience in Grammar: Generative Perspectives. Oxford: Oxford University Press, 2006. [Feeney and Handley, 2000] A. Feeney and S. J. Handley. The suppression of q card selections: Evidence for deductive inference in Wason’s selection task. Quarterly Journal of Experimental Psychology, 53, 1224-1242, 2000. [Feldman and Singh, 2005] J. Feldman and M. Singh. Information along curves and closed contours. Psychological Review, 112, 243-252, 2005. [Feldman, 2000] J. Feldman. Minimization of Boolean complexity in human concept learning. 
Nature, 407, 630–633, 2000. [Feldman, 2001] J. Feldman. Bayesian contour integration. Perception and Psychophysics, 63, 1171-1182, 2001. [Fiedler and Freytag, 2004] K. Fiedler and P. Freytag. Pseudocontingencies. Journal of Personality and Social Psychology, 87, 453-467, 2004.
[Fiedler and Juslin, 2006] K. Fiedler and P. Juslin. Information sampling and adaptive cognition. New York: Cambridge University Press, 2006. [Fitelson, 2005] B. Fitelson. Inductive logic. In J. Pfeifer and S. Sarkar (Eds.), The philosophy of science. Oxford, UK: Routledge, 2005. [Fodor, 1983] J. A. Fodor. Modularity of mind. Cambridge, MA: MIT Press, 1983. [Fodor, 1987] J. A. Fodor. Psychosemantics. Cambridge, MA: MIT Press, 1987. [Fodor et al., 1974] J. A. Fodor, T. G. Bever, and M. F. Garrett. The Psychology of Language. New York: McGraw-Hill, 1974. [Fox and Hadar, 2006] C. R. Fox and L. Hadar. “Decisions from experience” = sampling error + prospect theory: Reconsidering Hertwig, Barron, Weber and Erev (2004). Judgment and Decision Making, 1, 2006. [Frazier and Fodor, 1978] L. Frazier and J. D. Fodor. The sausage machine: A new two-stage parsing model. Cognition, 13, 187-222, 1978. [Frazier, 1979] L. Frazier. On Comprehending Sentences: Syntactic Parsing Strategies. Ph.D. Dissertation, University of Connecticut, 1979. [Freeman, 1994] W. T. Freeman. The generic viewpoint assumption in a framework for visual perception. Nature, 368, 542–545, 1994. [Garner, 1953] W. R. Garner. An informational analysis of absolute judgments of loudness. Journal of Experimental Psychology, 46, 373–380, 1953. [Geman and Geman, 1984] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721-741, 1984. [Geurts, 2003] B. Geurts. Reasoning with quantifiers. Cognition, 86, 223-251, 2003. [Gibson and Wexler, 1994] E. Gibson and K. Wexler. Triggers. Linguistic Inquiry, 25, 407-454, 1994. [Gigerenzer, 2002] G. Gigerenzer. Reckoning with risk: Learning to live with uncertainty. Harmondsworth, UK: Penguin Books, 2002. [Gigerenzer, 1991] G. Gigerenzer. From tools to theories: A heuristic of discovery in cognitive psychology. Psychological Review, 98, 254-267, 1991. 
[Gigerenzer and Goldstein, 1996] G. Gigerenzer and D. Goldstein. Reasoning the fast and frugal way: Models of bounded rationality. Psychological Review, 103, 650–669, 1996. [Gigerenzer and Hoffrage, 1995] G. Gigerenzer and U. Hoffrage. How to improve Bayesian reasoning without instruction: Frequency formats. Psychological Review, 102, 684-704, 1995. [Gigerenzer and Murray, 1987] G. Gigerenzer and D. J. Murray. Cognition as intuitive statistics. Hillsdale, NJ: Erlbaum, 1987. [Gigerenzer et al., 1989] G. Gigerenzer, Z. Swijtink, T. Porter, L. Daston, J. Beatty, and L. Kruger. The empire of chance. Cambridge, England: Cambridge University Press, 1989. [Gigerenzer et al., 1999] G. Gigerenzer, P. Todd, and The ABC Group, eds. Simple heuristics that make us smart. Oxford: Oxford University Press, 1999. [Ginsberg, 1987] M. Ginsberg. Readings in nonmonotonic reasoning. Morgan Kaufmann Publishers, 1987. [Glymour, 1980] C. Glymour. Theory and evidence. Princeton: Princeton University Press, 1980. [Gold, 1967] E. M. Gold. Language identification in the limit. Information and Control, 10, 447-474, 1967. [Gold and Shadlen, 2000] J. I. Gold and M. N. Shadlen. Representation of a perceptual decision in developing oculomotor commands. Nature, 404, 390-394, 2000. [Goldsmith, 2001] J. Goldsmith. Unsupervised learning of the morphology of a natural language. Computational Linguistics, 27, 153-198, 2001. [Goodman, 1951] N. Goodman. The structure of appearance. Cambridge, MA: Harvard University Press, 1951. [Goodman, 1954] N. Goodman. Fact, fiction, and forecast. London: The Athlone Press, 1954. [Gopnik et al., 2004] A. Gopnik, C. Glymour, D. M. Sobel, L. E. Schulz, T. Kushnir, and D. Danks. A theory of causal learning in children: Causal maps and Bayes nets. Psychological Review, 111, 1-31, 2004. [Green and Over, 1997] D. W. Green and D. E. Over. 
Causal inference, contingency tables and the selection task. Current Psychology of Cognition, 16, 459-487, 1997.
[Green and Over, 2000] D. W. Green and D. E. Over. Decision theoretical effects in testing a causal conditional. Current Psychology of Cognition, 19, 51-68, 2000. [Gregory, 1970] R. L. Gregory. The Intelligent Eye. London: Weidenfeld and Nicolson, 1970. [Griffiths and Tenenbaum, 2006] T. L. Griffiths and J. B. Tenenbaum. Optimal predictions in everyday cognition. Psychological Science, 17, 767-773, 2006. [Griffiths and Tenenbaum, 2005] T. L. Griffiths and J. B. Tenenbaum. Structure and strength in causal induction. Cognitive Psychology, 51, 354-384, 2005. [Griffiths et al., 2007] T. L. Griffiths, M. Steyvers, and J. B. Tenenbaum. Topics in semantic representation. Psychological Review, 114, 211-244, 2007. [Griffiths et al., 2005] T. L. Griffiths, M. Steyvers, D. M. Blei, and J. B. Tenenbaum. Integrating topics and syntax. Advances in Neural Information Processing Systems 17, 2005. [Griffiths and Steyvers, 2004] T. L. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, 101, 5228-5235, 2004. [Grodner and Gibson, 2005] D. Grodner and E. Gibson. Consequences of the serial nature of linguistic input. Cognitive Science, 29, 261-291, 2005. [Hacking, 1975] I. Hacking. The emergence of probability. Cambridge, England: Cambridge University Press, 1975. [Hacking, 1990] I. Hacking. The taming of chance. Cambridge, England: Cambridge University Press, 1990. [Hahn and Nakisa, 2000] U. Hahn and R. Nakisa. German inflection: Single route or dual route? Cognitive Psychology, 41, 313-360, 2000. [Hahn and Oaksford, 2006] U. Hahn and M. Oaksford. A Bayesian approach to informal argument fallacies. Synthese, 152, 207-236, 2006. [Hahn and Oaksford, 2007] U. Hahn and M. Oaksford. The rationality of informal argumentation: A Bayesian approach to reasoning fallacies. Psychological Review, 114, 704-732, 2007. [Hahn et al., 2005a] U. Hahn, M. Oaksford, and H. Bayindir. How convinced should we be by negative evidence? In B. Bara, L. 
Barsalou, and M. Bucciarelli (Eds.), Proceedings of the 27 th Annual Conference of the Cognitive Science Society, (pp. 887-892), Mahwah, N.J.: Lawrence Erlbaum Associates, 2005. [Hahn et al., 2005b] U. Hahn, M. Oaksford, and A. Corner. Circular arguments, begging the question and the formalization of argument strength. In A. Russell, T. Honkela, K. Lagus, and M. P¨ oll¨ a, (Eds.), Proceedings of AMKLC’05, International Symposium on Adaptive Models of Knowledge, Language and Cognition, (pp. 34-40), Espoo, Finland, June 2005. [Hale, 2003] J. Hale. The Information Conveyed by Words in Sentences. Journal of Psycholinguistic Research. 32, 101-123, 2003. [Hamblin, 1970] C. L. Hamblin. Fallacies. London: Methuen, 1970. [Hammond, 1996] K. R. Hammond. Human judgment and social policy: Irreducible uncertainty, inevitable error, unavoidable injustice. Oxford: Oxford University Press, 1996. [Hattori, 2002] M. Hattori. A quantitative model of optimal data selection in Wason’s selection task. Quarterly Journal of Experimental Psychology,55A, 1241-1272. 2002. [Hay and Baayen, 2005] J. Hay and H. Baayen. Shifting paradigms: gradient structure in morphology. Trends in Cognitive Sciences, 9, 342-348, 2005. [Heit, 2000] E. Heit. Properties of inductive reasoning. Psychonomic Bulletin and Review, 7, 569-592, 2000. [Heit, 1998] E. Heit. A Bayesian analysis of some forms of inductive reasoning. In M. Oaksford and N. Chater (Eds.), Rational models of cognition (pp. 248-274). Oxford: Oxford University Press, 1998. [Heit and Rubinstein, 1994] E. Heit and J. Rubinstein. Similarity and property effects in inductive reasoning. Journal of Experimental Psychology: Learning, Memory, and Cognition, 20, 411-422, 1994. [Helmholtz, 1910/1962] H. von Helmholtz. Treatise on Physiological Optics, Vol. 3, J. P. Southall (Ed. and translation). New York, NY: Dover Publications, 1910/1932. [Helpel, 1945] C. G. Hempel. Studies in the logic of confirmation. Mind, 54, 1-26, 97-121 1945. [Henle, 1978] M. 
Henle. Foreword to R. Revlin and R. E. Mayer (Eds.), Human reasoning. Washington: Winston, 1978. [Hertwig et al., 2004] R. Hertwig, G. Barron, E. U. Weber, and I. Erev. Decisions from experience and the effect of rare events in risky choices. Psychological Science, 15, 534-539, 2004. [Hochberg and McAlister, 1953] J. E. Hochberg and E. and McAlister. A quantitative approach to figural “goodness.” Journal of Experimental Psychology, 46, 361–364, 1953.
616
Nick Chater, Mike Oaksford, Ulrike Hahn and Evan Heit
INDUCTIVE LOGIC AND STATISTICS

Jan-Willem Romeijn
1 FROM INDUCTIVE LOGIC TO STATISTICS
There are strong parallels between statistics and inductive logic. An inductive logic is a system of inference that describes the relation between propositions on data, and propositions that extend beyond the data, such as predictions over future data, and general conclusions on data. Statistics, on the other hand, is a mathematical discipline that describes procedures for deriving results about a population from sample data. These results include decisions on rejecting or accepting a hypothesis about the population, the determination of probability assignments over such hypotheses, predictions on future samples, and so on. Both inductive logic and statistics are calculi for getting from the given data to propositions or results that transcend the data.

Despite this fact, inductive logic and statistics have evolved more or less separately. This is partly because there are objections to viewing statistics, especially classical statistical procedures, as inferential. A more important reason, to my mind, is that inductive logic has been dominated by the Carnapian programme, and that statisticians have perhaps not recognised Carnapian inductive logic as a discipline that is much like their own. Statistical hypotheses and models do not appear in the latter, but they are the start and finish of most statistical procedures. Much of the mutual misunderstanding stems from this difference between the roles of hypotheses in the two programmes, or so I believe.

In this chapter I aim to show that Carnapian inductive logic can be developed to encompass inference over statistical hypotheses, and that the resulting inductive logic can, at least partly, capture statistical procedures. In doing so, I hope to bring the philosophical discipline of inductive logic and the mathematical discipline of statistics closer together. I believe both disciplines can benefit from such a rapprochement.
First, framing statistical procedures as inferences in an inductive logic may help to clarify the presuppositions and foundations of these procedures. Second, by relating statistics to inductive logic, insights from inductive logic may be used to enrich and improve statistics. And finally, showing the parallels between inductive logic and statistics may show the relevance, also to inductive logicians themselves, of their discipline to the sciences, and thereby direct further philosophical research.

The reader may wonder where in this chapter she can read about the history of inductive logic in relation to the historical development of statistics. Admittedly, positions and theories from both disciplines are here discussed from a systematic viewpoint, and not so much as historical entities. I aim to provide a unified picture of inductive inference to which both inductive logic and statistics, past or present, can be related. At the heart of this picture lies the notion of statistical hypothesis. I think the fact that inductive logic and statistics have had comparatively little common past can be traced back to the absence of this notion from inductive logic. In turn, this absence can be traced back to the roots of inductive logic in logical empiricism. In that derived sense, the exposition of this chapter is related to the history of inductive logic.

The plan of the chapter is as follows. I start by describing induction and observations in formal terms. Then I introduce a general notion of probabilistic inductive inference over these observations. Following that I present Carnapian inductive logic, and I show that it can be related to Bayesian statistical inference via de Finetti's representation theorem. This in turn suggests how Carnapian inductive logic can be extended to include inferences over statistical hypotheses. Finally, I consider two classical statistical procedures, maximum likelihood estimation and Neyman-Pearson hypothesis testing, and I discuss how they can be accommodated in this extended inductive logic.

Given the nature of the chapter, the discussion of statistical procedures is relatively short. Many statistical procedures are not dealt with. Similarly, I cannot discuss in detail the many inductive logics devised within Carnapian inductive logic. For the latter, the reader may consult chapters 9 and 10 in this volume, and the further references contained therein. For the former, I refer to a recent volume on the philosophy of statistics, edited by [Bandyopadhyay and Forster, 2009].

Handbook of the History of Logic. Volume 10: Inductive Logic. Volume editors: Dov M. Gabbay, Stephan Hartmann and John Woods. General editors: Dov M. Gabbay and John Woods. © 2011 Elsevier B.V. All rights reserved.
2 OBSERVATIONAL DATA

As indicated, inductive inference starts from propositions on data, and ends in propositions that extend beyond the data. An example of an inductive inference is that, from the proposition that up until now all observed pears were green, we conclude that the next few pears will be green as well. Another example is that from the green pears we have seen we conclude that all pears are green, period. The key characteristic is that the conclusion says more than what is classically entailed by the premises.

Let me straighten out the notion of observations a bit more. First, I restrict attention to propositions on empirical facts, thus leaving aside such propositions as that pears are healthy, or that God made them. Second, I focus on the results of observations of particular kinds of empirical fact. For example, the empirical fact at issue is the colour of pears, and the results of the observations are therefore colours of individual pears. There can in principle be an infinity of such observation results, but what I call data is always a finite sequence of them. Third, the result of an observation is always one from a designated partition of properties, typically finite and always countable. In the pear case, it may be {red, green, yellow}. I leave aside observations that cannot be classified in terms of a mutually exclusive set of properties.

I now make these ideas on what counts as data a bit more formal. The concept I want to get across is that of a sample space, in which single observations and sequences of observations can be represented as sets, called events. After introducing the observations in terms of a language, I define sample space. All the probabilities in this chapter will be defined over such spaces because, strictly speaking, probability is a measure function over sets. However, the arguments of the probability functions may be taken as sentences from a logical language just as well.

We denote the observation of individual i by $Q_i$. This is a propositional variable, and we denote assignments or valuations of this variable by $q_i^k$, which represents the sentence that the result of observing individual i is the property k. A sequence of such results of length t, starting at 1, is denoted with the propositional variable $S_t$, with the assignment $s_{k_1 \ldots k_t}$, often abbreviated as $s_t$. In order to simplify notation, I denote properties with natural numbers, so $k \in K = \{0, 1, \ldots, n-1\}$. For example, if the observations are the aforementioned colours of pears, then n = 3. I write red as 0, green as 1, and yellow as 2, so that $s_{012}$ means that the first three pears were red, green, and yellow respectively. Note further that there are logical relations among the sentences, like $s_{012} \to q_2^1$. Together, the expressions $s_t$ and $q_i^k$ form the observation language.

It will be convenient to employ a set-theoretical representation of the observations, a so-called sample space, otherwise known as an observation algebra. To this aim, consider the set of all infinitely long sequences $K^\Omega$, that is, all sequences like 012002010211112..., each encoding the observations of infinitely many pears. Denote such sequences with ω, and write ω(i) for the i-th element in the sequence ω.
Every sentence $q_i^k$ can then be associated with a particular set of such sequences, namely the set of ω whose i-th element is k:

$q_i^k = \{\omega \in K^\Omega : \omega(i) = k\}$.

Clearly, we can build up all finite sequences of results $s_{k_1 \ldots k_t}$ as intersections of such sets:

$s_{k_1 \ldots k_t} = \bigcap_{i=1}^{t} q_i^{k_i}$.
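As a concrete illustration of this set-theoretic construction, here is a small Python sketch (my own, not from the chapter; the helper names `q_event` and `s_event` are invented). It models events as predicates on finite sequences of outcomes and verifies the inclusion of $s_{012}$ in $q_2^1$ by brute force:

```python
from itertools import product

# Observations take values in K = {0, 1, 2} (red, green, yellow).
# An "event" is modelled as a predicate on sequences, here truncated to
# finite prefixes long enough to decide the event.

def q_event(i, k):
    """Event q_i^k: the i-th observation (1-indexed) has property k."""
    return lambda omega: omega[i - 1] == k

def s_event(outcomes):
    """Event s_{k1...kt}: the first t observations are exactly `outcomes`."""
    return lambda omega: all(omega[j] == k for j, k in enumerate(outcomes))

s_012 = s_event((0, 1, 2))
q_2_1 = q_event(2, 1)

# Entailment as set inclusion: every length-3 prefix in s_012 is in q_2^1.
omegas = list(product(range(3), repeat=3))
assert all(q_2_1(w) for w in omegas if s_012(w))

# The converse fails: q_2^1 does not entail s_012.
assert not all(s_012(w) for w in omegas if q_2_1(w))
```

The brute-force check mirrors the idea that entailment comes out as set inclusion in the sample space.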
Note that entailments in the language now come out as set inclusions: we have $s_{012} \subset q_2^1$. Instead of using a language with sentences $q_i^k$ and logical relations among such sentences, I will in the following use a so-called algebra Q, built up from the sets $q_i^k$ by unions and intersections.

I want to emphasise that the notion of a sample space introduced here is really quite general. It excludes a continuum of individuals and a continuum of properties, but apart from that, any data recording that involves individuals and that ranges over a set of properties can serve as input. For example, instead of pears having colours we may think of subjects having test scores. Or of companies having certain stock prices. The sample space used in this chapter follows the basic structure of most applications in statistics, and of almost all applications in inductive logic.

3 INDUCTIVE INFERENCE

Now that I have made the notion of data more precise, let me turn to inductive inference. Consider the case in which I have observed three red pears: $s_{000}$. What can I conclude about the next pear? Or about pears in general? From the data itself, it seems that we can conclude depressingly little. We might say that the next pear is red, $q_4^0$. But as it stands, each of the sets $s_{000k} = s_{000} \cap q_4^k$, for k = 0, 1, 2, is a member of the sample space. In other words, we cannot derive any such $q_4^k$ from $s_{000}$. The event of observing three red pears is consistent with any colour for the next pear. Purely on the basis of the classical relations among observations, as expressed in the sample space, we cannot draw any inductive conclusion.

Perhaps we can say that given three red pears, the next pear being red is more probable? This is where we enter the domain of probabilistic inductive logic. We can describe the complete population of pears by a probability function over the observational facts, $P : Q \to [0, 1]$. Every possible pear $q_{t+1}^k$, and also every sequence of such pears $s_{k_1 \ldots k_t}$, receives a distinct probability. The probability of the next pear being of a certain colour, conditional on a given sequence, is expressed as $P(q_{t+1}^k \mid s_{k_1 \ldots k_t})$. Similarly, we may wonder about the probability that all pears are green, which is again determined by the probability assignment, in this case $P(\{\forall i : q_i^1\})$.¹ All such probabilistic inductive inferences are completely determined by the full probability function P. The central question of any inductive inference or procedure is therefore how to determine the function P, relative to the data that we already have. What must the probability of the next observation be, given a sequence of observations gone before?
And what is the right, or preferable, distribution over all observations, given the sequence? In the framework of this chapter, both statistics and inductive logic aim to provide an answer to these questions, but they do so in different ways.

It will be convenient to keep in mind a particular understanding of probability assignments P over the sample space, or observation algebra, Q. Recall that in classical two-valued logic, a model of the premises is a complete truth valuation over the language, subject to the rules of logic. Because of the correspondence between language and algebra, the model is also a complete function over the algebra, taking the values {0, 1}. By analogy, we may consider a probability function over an observation algebra as a model too. Only this model takes values in the interval [0, 1], and it is subject to the axioms of probability. In the following I will use probability functions over sample space as models, that is, as the building blocks of a formal semantics. We must be careful with the terminology here, because in statistics, models often refer to sets of statistical hypotheses. In the following, I will therefore refer to complete probability functions over the algebra as statistical hypotheses. A hypothesis is denoted h, the associated probability function is $P_h$. In statistics, these probability functions are also often referred to as distributions over a population.

¹ The set $\{\forall i : q_i^1\}$ is included in the domain Q if the latter is a so-called σ-algebra generated by the observations $Q_i$.

All probabilistic inductive logics use probability functions over sample space for the purpose of inductive inference. But there are widely different ways of understanding the inductive inferential step. The most straightforward of these, and the one that is closest to statistical practice, is to map each sample $s_t$ onto a hypothesis h, or otherwise onto a set of such hypotheses. The inferential step then runs from the data $s_t$ and a set of statistical hypotheses, each associated with a probability function $P_h$, towards a more restricted set, or even to a single h and $P_h$. The resulting inductive logic is called ampliative, because the restriction on the set of probability functions that is effected by the data, i.e. the conclusion, is often stronger than what is deductively entailed by the data and the initial set of probability functions, i.e. the premises.

We can also make the inferential step precise by analogy to a more classical, non-ampliative notion of entailment. As will become apparent, this kind of inferential step is more naturally associated with what is traditionally called inductive logic. It is also associated with a basic kind of probabilistic logic, as elaborated in [Hailperin, 1996] and more recently in [Haenni et al., 2009], especially section 2. Finally, this kind of inference is strongly related to Bayesian logic, as advocated by [Howson, 2003]. Recall that an argument is classically valid if and only if the set of models satisfying the premises is contained in the set of models satisfying the conclusion.
The same idea of classical entailment may now be applied to the probabilistic models over sample space. In that case, the inferential step is from one set of probability assignments, characterised by a number of restrictions associated with premises, towards another set of probability assignments, characterised by a different restriction that is associated with a conclusion. The inductive inference is called valid if the former is contained in the latter. In such a valid inferential step, the conclusion does not amplify the premises.

As an example, say that we fix $P(q_1^0) = \frac{1}{2}$ and $P(q_1^1) = \frac{1}{3}$. Both these probability assignments can be taken as premises in a logical argument, and the models of these premises are simply all probability functions P over Q for which these two valuations hold. By the axioms of probability, we can derive that any such function P will also satisfy $P(q_1^2) = \frac{1}{6}$. On its own, the latter expression amounts to a set of probability functions over the sample space Q in which the probability functions that satisfy both premises are included. In other words, the latter assignment is classically entailed by the two premises. Along exactly the same lines, we may derive a probability assignment for a statistical hypothesis h conditional on the data $s_t$, written as $P(h \mid s_t)$, from the input probabilities $P(h)$, $P(s_t)$, and $P(s_t \mid h)$, using the theorem of Bayes. The classical understanding of entailment may thus be used to reason inductively, namely towards statistical hypotheses that themselves determine a probability assignment over data.
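The example can be replayed mechanically. The following sketch (illustrative only; exact fractions are used so the entailments hold without rounding, and the two hypotheses and their likelihoods are invented) derives $P(q_1^2) = \frac{1}{6}$ from the two premises, and applies Bayes' theorem to a toy pair of hypotheses:

```python
from fractions import Fraction

# Premises: probabilities of two of the three colours for pear 1.
p_red   = Fraction(1, 2)   # P(q_1^0)
p_green = Fraction(1, 3)   # P(q_1^1)

# The axioms of probability entail the third value.
p_yellow = 1 - p_red - p_green
assert p_yellow == Fraction(1, 6)

# Bayes' theorem for hypotheses: P(h|s) = P(h) * P(s|h) / P(s),
# with P(s) computed by the law of total probability.
def posterior(prior, likelihood):
    """prior: dict h -> P(h); likelihood: dict h -> P(s|h)."""
    p_s = sum(prior[h] * likelihood[h] for h in prior)
    return {h: prior[h] * likelihood[h] / p_s for h in prior}

post = posterior({"h1": Fraction(1, 2), "h2": Fraction(1, 2)},
                 {"h1": Fraction(2, 3), "h2": Fraction(1, 3)})
assert post["h1"] == Fraction(2, 3)
```

Nothing here amplifies the premises: both conclusions follow from the axioms of probability alone.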
In the following the focus will be on non-ampliative inductive logic, because Carnapian inductive logic is most easily related to non-ampliative logic. Therefore, viewing statistical procedures in this perspective makes the latter more amenable to inductive logical analysis. To be sure, I do not want to claim that I thereby lay bare the real nature of the statistical procedures, or that I am providing indubitable norms for statistical inference. Rather, I hope to show that the investigation of statistics along these specific logical lines clarifies and enriches statistical procedures. Furthermore, as indicated, I hope to stimulate research in inductive logic that is directed at problems in statistics.

4 CARNAPIAN LOGICS
With the notions of observation and induction in place, I can present the logic of induction developed by [Carnap, 1950; Carnap, 1952]. Historically, Carnapian inductive logic can lay most claim to the title of inductive logic proper. It was the first systematic study into probabilistic predictions on the basis of data. The central concept in Carnapian inductive logic is logical probability. Recall that the sample space Q corresponds to an observation language, comprising sentences such as "the second pear is green", or formally, $q_2^1$. The original idea of Carnap was to derive a probability assignment over the language on the basis of symmetries within the language. In the example, we have three mutually exclusive properties for each pear, and in the absence of any further knowledge, there is no reason to think of any of these properties as special or as more, or less, appropriate than the other two. The symmetry inherent to the language suggests that each of the sentences $q_i^k$ for k = 0, 1, 2 should get equal probability:

$P(q_i^0) = P(q_i^1) = P(q_i^2) = \frac{1}{3}$.

The idea of logical probability is to fix a unique probability function over the observation language, or otherwise a strongly restricted set of such functions, on the basis of such symmetries. Next to symmetries, the set of probability functions can also be restricted by certain predictive properties. As an example, we may feel that yellow pears are more akin to green pears, so that finding a yellow pear decreases the probability for red pears considerably, while it decreases the probability for green pears much less dramatically. That is,

$\dfrac{P(q_{t+1}^1 \mid s_{t-1} \cap q_t^2)}{P(q_{t+1}^0 \mid s_{t-1} \cap q_t^2)} > \dfrac{P(q_{t+1}^1 \mid s_{t-1})}{P(q_{t+1}^0 \mid s_{t-1})}$.
How such relations among properties may play a part in determining the probability assignment P is described in the literature on analogy reasoning. See [Festa, 1996; Maher, 2000; Romeijn, 2006]. Interesting recent findings on relations between analogical predictive properties can also be found in [Paris and Waterhouse, 2008].
All Carnapian inductive logics are defined by a number of symmetry principles and predictive properties, determining a probability function, or otherwise a set of such functions. One very well-known inductive logic, discussed at length in [Carnap, 1952], employs a probability assignment characterised by the following symmetries:

(1) $P(q_i^k) = P(q_i^{k'})$,
(2) $P(s_{k_1 \ldots k_i \ldots k_t}) = P(s_{k_i \ldots k_1 \ldots k_t})$,

for all values of i, t, k, and k′, and for all values $k_i$ with $1 \le i \le t$. The latter of these is known as the exchangeability of observations: the order in the observations does not matter to their probability. The inductive logic at issue employs a particular version of exchangeability, known as the requirement of restricted relevance,

(3) $P(q_{t+1}^k \mid s_t) = f(t_k, t)$,

where $t_k$ is the number of earlier instances $q_i^k$ in the sequence $s_t$ and t the total number of observations. These symmetries together determine a set of probability assignments, for which we can derive the following consequences:

(4) $P(q_{t+1}^k \mid s_t) = \dfrac{t_k + \frac{\lambda}{n}}{t + \lambda}$,

where n is the number of values for k, and $t_k$ is the number of earlier instances $q_i^k$ in the sequence $s_t$. The parameter $0 \le \lambda < \infty$ can be chosen at will. Predictive probability assignments of this form are called Carnapian λ-rules. The probability distributions satisfying the afore-mentioned symmetries have some striking features. Most importantly, we have that

(5) $P(q_{t+1}^k \mid s_{t-1} \cap q_t^k) > P(q_{t+1}^k \mid s_{t-1})$.
This predictive property is called instantial relevance: the occurrence of $q_t^k$ increases the probability for $q_{t+1}^k$. It was considered a success for Carnap that this typically inductive effect is derivable from the symmetries alone. By providing an independent justification for these symmetries, Carnap effectively provided a justification for induction, thereby answering the age-old challenge of Hume.²

² As recounted in [Zabell, 1982], earlier work that connects exchangeability to the predictive properties of probability functions was done by [Johnson, 1932] and [de Finetti, 1937]. But the specific relation with Hume's problem noted here is due to Carnap: he motivated predictive properties such as Equation (4) independently, by the definition of logical probability, whereas for the subjectivist de Finetti these properties did not have any objective grounding.

Note that the outlook of Carnapian logic is very different from the outlook of the inductive logics discussed in Section 3. Any such logic starts with a set of probability functions, or hypotheses, over a sample space and then imposes a further restriction on this set, or derives consequences from it, on the basis of the data. By contrast, Carnapian logic starts with a sample space and a number of symmetry principles and predictive properties, which together fix a set of probability functions over the sample space. Just like the truth tables restrict the possible truth valuations, so do these principles restrict the logical probability functions, albeit not to a singleton, as λ can still be chosen freely. But from the point of view of statistics, Carnap is thereby motivating, from logical principles, the choice for a particular set of hypotheses.

If we ignore the notion of logical probability and concentrate on the inferential step, then Carnapian inductive logics fit best in the template for non-ampliative inductive logic. As said, we fix a set of probability assignments over the sample space by means of a number of symmetry principles and predictive properties. But subsequently the conclusions are reached by working out specific consequences for probability functions within this set, using the axioms of probability only. In particular, Carnapian inductive logic looks at the probability assignments conditional on various samples $s_t$. Importantly, in this template the symmetries in the language, like Equation (1) and Equation (2), appear as premises in the inductive logical inference. They restrict the set of probability assignments that is considered in the inference.

Insofar as they both concern sets of probability functions over sample space, Carnapian logic and statistical inference are clearly similar. However, while statistics frames these probability functions in terms of statistical hypotheses, these hypotheses do not appear in Carnapian logic. Instead, the emphasis is on characterising probability functions in terms of symmetries and predictive properties. The background of this is logical empiricism: the symmetries directly relate to the empirical predicates in the language of inductive logic, and the predictive properties relate to properties of the probability functions that show up for finite data.
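The λ-rule of Equation (4), and the instantial-relevance property of Equation (5), are easy to check numerically. A minimal sketch (the function name `lambda_rule` is mine, not the chapter's):

```python
# Carnapian lambda-rule: P(q_{t+1}^k | s_t) = (t_k + lam/n) / (t + lam),
# where t_k counts earlier observations of property k among t observations,
# n is the number of properties, and lam >= 0 is a free parameter.

def lambda_rule(counts, k, lam=2.0):
    """Predictive probability of property k given the observed counts."""
    t = sum(counts)
    n = len(counts)
    return (counts[k] + lam / n) / (t + lam)

# Three properties (red=0, green=1, yellow=2), no data yet: each gets 1/3.
assert abs(lambda_rule([0, 0, 0], 0) - 1/3) < 1e-12

# Instantial relevance (Equation (5)): observing another green pear raises
# the predictive probability of green on the next draw.
before = lambda_rule([1, 1, 0], 1)   # one red, one green seen so far
after  = lambda_rule([1, 2, 0], 1)   # the same data extended by a green pear
assert after > before
```

For any counts the predictive probabilities sum to one, as the axioms of probability require.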
By contrast, statistical hypotheses are rather elusive: they cannot be formulated in terms of finite combinations of empirical predicates because they concern chances. If anywhere, these chances only show up in the limit of the data size going to infinity, as the limiting relative frequency.

The overview of Carnapian logics given here is admittedly very brief. For example, I have not dealt with a notable exception to the horror hypothesi of inductive logicians, Hintikka systems. For more on the rich research programme of Carnapian inductive logic, I refer to chapter 9, and for Hintikka systems in particular, to chapter 10. For present purposes the thing to remember is that, leaving aside the specifics of logical probability, Carnapian logic can be viewed as a non-ampliative inductive logic, and that it does not make use of statistical hypotheses.

5 BAYESIAN STATISTICS

The foregoing introduced Carnapian inductive logic. Now we can start answering the central question of this chapter. Can inductive logic, Carnapian or otherwise, accommodate statistical procedures? The first statistical procedure under scrutiny is Bayesian statistics. The defining characteristic of this kind of statistics is that probability assignments do not just range over data, but that they can also take statistical hypotheses as arguments. As will be seen in the following, Bayesian inference is naturally represented in terms of a non-ampliative inductive logic. Moreover, it relates very naturally to Carnapian inductive logic.

Let H be the space of statistical hypotheses $h_\theta$, and let Q be the sample space as before. The functions P are probability assignments over the entire space $H \times Q$. Since the hypotheses $h_\theta$ are members of the combined algebra, the conditional functions $P(s_t \mid h_\theta)$ range over the entire algebra Q. We can then define Bayesian statistics as follows.

DEFINITION 1 Bayesian Statistical Inference. Assume the prior probability $P(h_\theta)$ assigned to hypotheses $h_\theta \in H$, with $\theta \in \Theta$, the space of parameter values. Further assume $P(s_t \mid h_\theta)$, the probability assigned to the data $s_t$ conditional on the hypotheses, called the likelihoods. Bayes' theorem determines that

(6) $P(h_\theta \mid s_t) = P(h_\theta) \dfrac{P(s_t \mid h_\theta)}{P(s_t)}$.
Bayesian statistics outputs a posterior probability assignment, $P(h_\theta \mid s_t)$. I refer to [Barnett, 1999] and [Press, 2003] for a detailed discussion. The further results from a Bayesian inference, such as estimations and measures for the accuracy of the estimations, can all be derived from the posterior distribution over the statistical hypotheses. In this definition the probability of the data $P(s_t)$ is not presupposed, because it can be computed from the prior and the likelihoods by the law of total probability,

$P(s_t) = \int_\Theta P(h_\theta) P(s_t \mid h_\theta)\, d\theta$.
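Definition 1 can be sketched computationally. The following illustrative grid approximation (my construction; the chapter itself works analytically) discretises Θ = [0, 1], uses a uniform prior, and computes the posterior of Equation (6), with the probability of the data obtained by the law of total probability:

```python
# Grid approximation of the Bayesian inference in Definition 1 for the
# Bernoulli-type hypotheses h_theta: the likelihood of the data depends
# only on the counts t1 (green pears) and t0 (red pears).

def posterior_grid(t1, t0, m=1001):
    """Return grid points over [0, 1] and the posterior mass at each."""
    thetas = [i / (m - 1) for i in range(m)]
    prior = [1.0 / m] * m                       # uniform prior over the grid
    like = [th ** t1 * (1 - th) ** t0 for th in thetas]
    p_data = sum(p * l for p, l in zip(prior, like))   # law of total probability
    post = [p * l / p_data for p, l in zip(prior, like)]
    return thetas, post

# After observing 6 green and 2 red pears the posterior peaks near 0.75.
thetas, post = posterior_grid(6, 2)
mode = thetas[max(range(len(post)), key=post.__getitem__)]
assert abs(mode - 0.75) < 0.01
```

The grid replaces the integral over Θ by a finite sum, which is enough to make the structure of the inference visible.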
The result of a Bayesian statistical inference is not always a complete posterior probability. Often the interest is only in comparing the ratio of the posteriors of two hypotheses. By Bayes' theorem we have

$\dfrac{P(h_\theta \mid s_t)}{P(h_{\theta'} \mid s_t)} = \dfrac{P(h_\theta) P(s_t \mid h_\theta)}{P(h_{\theta'}) P(s_t \mid h_{\theta'})}$,

and if we assume equal priors $P(h_\theta) = P(h_{\theta'})$, we can use the ratio of the likelihoods of the hypotheses, the so-called Bayes factor, to compare the hypotheses.

Let me give an example of a Bayesian procedure. Say that we are interested in the colour composition of pears from Emma's farm, and that her pears are red, $q_i^0$, or green, $q_i^1$. Any ratio between these two kinds of pears is possible, so we have a set of so-called multinomial hypotheses $h_\theta$ for which

(7) $P_{h_\theta}(q_t^1 \mid s_{t-1}) = \theta, \qquad P_{h_\theta}(q_t^0 \mid s_{t-1}) = 1 - \theta$,
where θ is a parameter in the interval [0, 1]. The hypothesis $h_\theta$ fixes the portion of green pears at θ, and therefore, independently of what pears we saw before, the probability that a randomly drawn pear from Emma's farm is green is θ. The type of distribution over Q that is induced by these hypotheses is sometimes called a Bernoulli distribution, or a multinomial distribution.

We now define a Bayesian statistical inference over these hypotheses. Instead of directly choosing among the hypotheses on the basis of the data, as classical statistics advises, we assign a probability distribution over the hypotheses, expressing our epistemic uncertainty. For example, we may choose a so-called Beta distribution,

(8) $P(h_\theta) = \text{Norm} \times \theta^{\lambda/2 - 1} (1 - \theta)^{\lambda/2 - 1}$,

with $\theta \in \Theta = [0, 1]$ and Norm a normalisation factor. For λ = 2, this function is uniform over the domain. Now say that we observe a sequence of pears $s_t = s_{k_1 \ldots k_t}$, and that we write $t_1$ for the number of green pears, or 1's, in the sequence $s_t$, and $t_0$ for the number of 0's, so $t_0 + t_1 = t$. The probability of this sequence $s_t$ given the hypothesis $h_\theta$ is

(9) $P(s_t \mid h_\theta) = \prod_{i=1}^{t} P_{h_\theta}(q_i^{k_i} \mid s_{i-1}) = \theta^{t_1} (1 - \theta)^{t_0}$.

Note that the probability of the data only depends on the number of 0's and the number of 1's in the sequence. Applying Bayes' theorem then yields, omitting a normalisation constant,

(10) $P(h_\theta \mid s_t) = \text{Norm} \times \theta^{\lambda/2 - 1 + t_1} (1 - \theta)^{\lambda/2 - 1 + t_0}$.

This is the posterior distribution over the hypotheses. It is derived from the choice of hypotheses, the prior distribution over them, and the data by means of the axioms of probability theory, specifically by Bayes' theorem.

Most of the controversy over the Bayesian method concerns the determination and interpretation of the probability assignment over hypotheses. As will become apparent in the following, classical statistics objects to the whole idea of assigning probabilities to hypotheses. The data have a well-defined probability, because they consist of repeatable events, and so we can interpret the probabilities as frequencies, or as some other kind of objective probability. But the probability assigned to a hypothesis cannot be understood in this way, and instead expresses an epistemic state of uncertainty. One of the distinctive features of classical statistics is that it rejects such an epistemic interpretation of the probability assignment, and that it restricts itself to a straightforward interpretation of probability as relative frequency.

Even if we buy into this interpretation of probability as epistemic uncertainty, how do we determine a prior probability? At the outset we do not have any idea of which hypothesis is right, or even which hypothesis is a good candidate. So how are we supposed to assign a prior probability to the hypotheses? The literature
proposes several objective criteria for filling in the priors, for instance by maximum entropy or by other versions of the principle of indifference, but something of the subjectivity of the starting point remains. The strength of classical statistical procedures is that they do not need any such subjective prior probability.

6 INDUCTIVE LOGIC WITH HYPOTHESES
Bayesian statistics is closely related to the inductive logic of Carnap. In this section I will elaborate on this relation, and indicate how Bayesian statistical inference and inductive logic may have a fruitful common future. To see how Bayesian statistics and Carnapian inductive logic hang together, note first that the result of a Bayesian statistical inference, namely a posterior, is naturally related to the result of a Carnapian inductive logic, namely a prediction, 1 1 1 P (qt+1 |hθ ∩ st )P (hθ |st )dθ, (11) P (qt+1 |st ) = 0
by the law of total probability. We can elaborate this further by considering the multinomial hypotheses given in Equation (7). Recall that conditional on the hypothesis hθ the probability for the next pear to be green is θ, which can therefore replace P(q^1_{t+1}|hθ ∩ st):

(12) P(q^1_{t+1}|st) = ∫_Θ θ P(hθ|st) dθ = E[θ].
This shows that in the case of multinomial statistical hypotheses, the expectation value for the parameter is the same as a predictive probability. But as it turns out, the relation between Carnapian logic and Bayesian statistics is more fundamental. We can work out the integral of Equation (11), assuming a Beta distribution as prior and hence using Equation (10) as the posterior, to obtain

(13) P(q^1_{t+1}|st) = (t1 + λ/2) / (t + λ).
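This correspondence can be checked numerically. The sketch below (the counts and λ values are illustrative, not from the text) verifies that the λ-rule prediction of Equation (13) coincides with the posterior mean of the Beta distribution that arises from Equation (10) under a symmetric Beta(λ/2, λ/2) prior.

```python
from math import isclose

def lambda_rule(t1, t0, lam):
    """Carnapian lambda-rule prediction for the next observation being a 1, Eq. (13)."""
    t = t1 + t0
    return (t1 + lam / 2) / (t + lam)

def beta_posterior_mean(t1, t0, lam):
    """Posterior mean E[theta] under a Beta(lam/2, lam/2) prior: by Eq. (10) the
    posterior is Beta(lam/2 + t1, lam/2 + t0), whose mean is a / (a + b)."""
    a, b = lam / 2 + t1, lam / 2 + t0
    return a / (a + b)

# Illustrative counts: 7 green (1) and 3 red (0) pears, for several lambda values.
for lam in (1, 2, 4):
    assert isclose(lambda_rule(7, 3, lam), beta_posterior_mean(7, 3, lam))
```

With λ = 2 the prior is uniform and the prediction after 7 green and 3 red pears is (7 + 1)/(10 + 2) = 2/3, the familiar Laplacean rule of succession.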
This means that there is a specific correspondence between certain kinds of predictive probabilities, as described by the Carnapian λ-rules, and certain kinds of Bayesian statistical inferences, namely with multinomial hypotheses and priors of a particular shape. The correspondence of Carnapian logic and Bayesian statistical inference is in fact more general than this. Instead of the well-behaved priors just considered, we might consider as prior any functional form over the hypotheses hθ, and then wonder what the resulting predictive probability is. As [de Finetti, 1937] showed in his representation theorem, the resulting predictive probability will always comply with the predictive property of exchangeability, as given in Equation (2). Conversely, and perhaps more surprisingly, any predictive probability complying with the property of exchangeability can be written down in terms of a Bayesian statistical
636
Jan-Willem Romeijn
inference with multinomial hypotheses and some prior over these hypotheses. In other words, de Finetti showed that there is a one-to-one correspondence between the predictive property of exchangeability on the one hand, and Bayesian statistical inferences using multinomial hypotheses on the other. It may be useful to make this result by de Finetti explicit in terms of the non-ampliative inductive logic discussed in the foregoing. Recall that a Bayesian statistical inference takes a prior and likelihoods as premises, leading to a single probability assignment over the space H × Q as the only assignment that satisfies the premises. We infer probabilistic consequences, specifically predictions, from this probability assignment. A Carnapian inductive logic, on the other hand, is characterised by a single probability assignment, defined over the space Q, from which the predictions can be derived. So the representation theorem by de Finetti effectively shows an equivalence between these two probability assignments: when it comes to predictions, we can reduce the probability assignment over H × Q to an assignment over Q only.

For de Finetti, this equivalence was very welcome. He had a strictly subjectivist interpretation of probability, believing that probability expresses uncertain belief only. Moreover, he was eager to rid science of its metaphysical excess baggage to which, in his view, the notion of objective chance belonged. So in line with the logical empiricists working in inductive logic, de Finetti applied his representation theorem to argue against the use of multinomial hypotheses, and thereby against the use of statistical hypotheses more generally. Why refer to these obscure chances if we can achieve the very same statistical ends by employing the unproblematic notion of exchangeability? The latter is a predictive property, and it can therefore be interpreted as an empirical and a subjective notion.
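The predictive property of exchangeability can itself be checked numerically. The sketch below (a minimal illustration; the λ-rule of Equation (13) is used as the predictive system, and the sequences are arbitrary) computes sequence probabilities by the chain rule and confirms that they depend only on the counts, not on the order of the outcomes.

```python
from math import isclose

def sequence_probability(seq, lam=2.0):
    """Probability of a 0/1 sequence under the lambda-rule, via the chain rule:
    P(s) = product over t of P(next outcome | counts observed so far)."""
    t1 = t0 = 0
    p = 1.0
    for x in seq:
        t = t0 + t1
        p_next_is_1 = (t1 + lam / 2) / (t + lam)
        p *= p_next_is_1 if x == 1 else (1 - p_next_is_1)
        if x == 1:
            t1 += 1
        else:
            t0 += 1
    return p

# Exchangeability: reorderings with the same counts have the same probability.
assert isclose(sequence_probability([1, 1, 0, 1]), sequence_probability([0, 1, 1, 1]))
assert isclose(sequence_probability([1, 0]), sequence_probability([0, 1]))
```

For λ = 2, for instance, the sequences 10 and 01 both receive probability 1/2 × 1/3 = 1/6, as de Finetti's theorem requires.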
The fact is that statistics, as it is used in the sciences, is persistent in its use of statistical hypotheses. Therefore I want to invite the reader to consider the inverse application of de Finetti's theorem. Why does science use these obscure objective chances? As argued in [Romeijn, 2004; Romeijn, 2005; Romeijn, 2006], the reason is that statistical hypotheses provide invaluable help in, indirectly, pinning down the probability assignments over Q that have the required predictive properties. Rather than reducing the Bayesian inferences over statistical hypotheses to inductive predictions over observations, we can use the representation theorem to capture relations between observations in an insightful way, namely by citing the statistical hypotheses that may be true of the data. This use of statistical hypotheses hence seems a rather natural extension of traditional Carnapian inductive logic. Bayesian statistics, as it has been presented here, is a ready-made specification of this extended inductive logic, which may be called Bayesian inductive logic. The premises of the inference are restrictions on the set of probability assignments over H × Q, and the conclusions are simply the probabilistic consequences of these restrictions, derived by means of the axioms of probability, often by Bayes' theorem. The inferential step, as in Carnapian logic, is non-ampliative. When it comes to the predictive consequences, the extension of the probability space with H may be considered unnecessary because, as indicated, we can always project
the probability P over the extended space back onto Q. However, the probability function resulting from that projection may be very hard to define in terms of its predictive properties alone. Naturally, capturing Bayesian statistics in the inductive logic thus defined is immediate. The premises are the prior over the hypotheses, P(hθ) for θ ∈ Θ, and the likelihood functions, P(st|hθ) over the sample space Q, which are determined for each hypothesis hθ separately. These premises are such that only a single probability assignment over the space H × Q remains. In other words, the premises have a unique probability model. The conclusions all follow from the posterior probability over the hypotheses. They can be derived from the assignment by applying theorems of probability theory. The present view on inductive logic has some important precursors. First, it shows similarities with the so-called presupposition view expounded in [Festa, 1993]. The view of Festa with regard to the choice of λ in Carnapian inductive logic runs parallel to what I here argue concerning the choice of hypotheses more generally. Second, the present view is related to the views expressed by Hintikka in [Auxier and Hahn, 2006], and I want to highlight certain aspects of this latter view in particular. In response to Kuipers' overview of inductive logic, Hintikka writes that "Inductive inference, including rules of probabilistic induction, depends on tacit assumptions concerning the nature of the world. Once these assumptions are spelled out, inductive inference becomes in principle a species of deductive inference." Now the symmetry principles and predictive properties used in Carnapian inductive logic are exactly the tacit assumptions Hintikka speaks about.
As explained in the foregoing, the use of particular statistical hypotheses in a Bayesian inference comes down to the very same set of assumptions, but now these assumptions are not tacit anymore: they have been made explicit as the choice for a particular set of statistical hypotheses. Therefore, the use of statistical hypotheses that I have advertised above may help us to get closer to the ideal of inductive logic envisaged by Hintikka.

7 NEYMAN-PEARSON TESTING

In the foregoing, I have presented Carnapian inductive logic and Bayesian statistical inference. I have shown that these two are strongly related, and that they both fit the template of non-ampliative inductive logic introduced in section 3. This led to the introduction of Bayesian inductive logic in the preceding section. In the following, I will consider two classical statistical procedures, Neyman-Pearson hypothesis testing and Fisher's maximum likelihood estimation, and see whether they can be captured in this inductive logic.

Neyman-Pearson hypothesis testing concerns the choice between two statistical hypotheses, that is, two fully specified probability functions over sample space. Let H = {h0, h1} be the set of hypotheses, and let Q be the sample space introduced earlier on. Each of the hypotheses is associated with a complete probability function Phj over the sample space. But note that, unlike in Bayesian statistics, the hypotheses hj are not part of the probability space. No probability is assigned
to the hypotheses themselves, and we cannot write P(·|hj) anymore. Instead we compare the hypotheses h0 and h1 by means of a so-called test function. See [Barnett, 1999] and [Neyman and Pearson, 1967] for more details.

DEFINITION 2 Neyman-Pearson Hypothesis Test. Let F be a function over the sample space Q,

(14) F(st) = 1 if Ph1(st)/Ph0(st) > r, and F(st) = 0 otherwise,

where Phj is the probability over the sample space determined by the statistical hypothesis hj. If F = 1 we decide to reject the null hypothesis h0, else we accept h0 for the time being.

Note that, in this simplified setting, the test function is defined for each set of sequences st separately. For each sample plan, and associated sample size t, we must define a separate test function. The decision to accept or reject a hypothesis is associated with the so-called significance and power of the test:

SignificanceF = α = ∫Q F(st) Ph0(st) dst,
PowerF = 1 − β = ∫Q F(st) Ph1(st) dst.
The significance is the probability, according to the hypothesis h0, of obtaining data that leads us to reject the hypothesis h0, or in short, the type-I error of falsely rejecting the null hypothesis, denoted α. Similarly, the power is the probability, according to h1, of obtaining data that leads us to reject the hypothesis h0, or in short, the probability under h1 of correctly rejecting the null hypothesis, so that β = 1 − Power is the type-II error of falsely accepting the null hypothesis. An optimal test is one that minimizes the significance level, and maximizes the power. Neyman and Pearson prove that the decision has optimal significance and power for, and only for, likelihood-ratio test functions F. That is, an optimal test depends only on a threshold for the ratio Ph1(st)/Ph0(st).

Let me illustrate the idea of Neyman-Pearson tests. Say that we have a pear whose colour is described by q^k, and we want to know from what farm it originates, from farmer Maria (h0) or Lisa (h1). We know that the colour compositions of the pears from the two farms are as follows:

Hypothesis \ Data    q^0     q^1     q^2
h0                   0.00    0.05    0.95
h1                   0.40    0.30    0.30
If we want to decide between the two hypotheses, we need to fix a test function. Say that we choose

(15) F(q^k) = 0 if k = 2, and F(q^k) = 1 otherwise.

In the definition above, which uses a threshold for the likelihood ratio, this comes down to choosing a value for r somewhere between 6/19 and 6, for example r = 1. The significance level is Ph0(q^0 ∪ q^1) = 0.05, and the power is Ph1(q^0 ∪ q^1) = 0.70. Now say that the pear we have is green, so F = 1 and we reject the null hypothesis, concluding that Maria did not grow the pear with the aforementioned power and significance.

From the perspective of ampliative inductive logic, it is not too far-fetched to read an inferential step into the Neyman-Pearson procedure. The test function F brings us from a sample st and two probability functions, Ph0 and Ph1, to a single probability function over the sample space Q. So we might say that the test function is the procedural analogue of an inductive inferential step, as discussed in Section 3. This step is ampliative because both probability functions Phj are consistent with the data. Ruling out one of them cannot be done deductively.3

Neyman-Pearson hypothesis testing is sometimes criticised because its results depend on the entire probability P over sample space, and not just on the probability of the observed sample. That is, the decision to accept or reject the null hypothesis against some alternative hypothesis depends not just on the probability of what has actually been observed, but also on the probability of what could have been observed. A well-known illustration of this problem concerns so-called optional stopping. But here I want to illustrate the same point with an example that can be traced back to [Jeffreys, 1931] p. 357, and of which a variant is discussed in [Hacking, 1965].4 Instead of the hypotheses h0 and h1 above, say that we compare the hypotheses h0′ and h1′.
Hypothesis \ Data    q^0     q^1     q^2
h0′                  0.05    0.05    0.90
h1′                  0.40    0.30    0.30
3 There are attempts to make these ampliative inferences more precise, for example by means of default logic, or a logic that otherwise employs a preferential ordering over probability models. Specifically, so-called evidential probability, proposed by [Kyburg, 1974] and more recently discussed by [Wheeler, 2006], is concerned with inferences that combine statistical hypotheses, which are each accepted with certain significance levels. However, in this chapter I will not investigate these logics. They are not concerned with inferences from the data to predictions or to hypotheses, but rather with inferences from hypotheses to other hypotheses, and from hypotheses to predictions. 4 I would like to thank Jos Uffink for bringing this example to my attention. To the best of my knowledge, the exact formulation of this example is his.
We determine the test function F(q^k) = 1 iff k = 0, by requiring the same significance level, Ph0′(q^0) = 0.05, resulting in the power Ph1′(q^0) = 0.40. Now imagine that we observe q^1 again, so that we accept h0′. But this is a bit odd, because the hypotheses h0 and h0′ have the same probability for q^1! So how can the test procedure react differently to this observation? It seems that, in contrast to h0, the hypothesis h0′ escapes rejection because it allocates some probability to q^0, an event that does not occur, thus shifting the area in sample space on which it is rejected. Examples like this gave rise to the famed complaint by Jeffreys that "the null hypothesis can be rejected because it fails to predict an event that never occurred". This illustrates how the results of a Neyman-Pearson procedure depend on the entire probability assignment over the sample space, and not just on the actual observation. From the perspective of an inductive logician, it may therefore seem "a remarkable procedure", to cite Jeffreys again. But it must be emphasised that Neyman-Pearson statistics was never intended as an inference in disguise. It is a procedure that allows us to decide between two hypotheses on the basis of data, generating error rates associated with that decision. Neyman and Pearson themselves were very explicit that the procedure must not be interpreted inferentially. Rather than inquiring into the truth and falsity of a hypothesis, they were interested in the probability of mistakenly deciding to reject or accept a hypothesis. The significance and power concern the probability over data given a hypothesis, not the probability of hypotheses given the data.
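The two tests discussed above can be replayed in a few lines. The sketch below is illustrative only: the outcome probabilities are those of the two tables, and the test functions are the ones given in the text. It recovers the error rates and exhibits Jeffreys' point, that the two null hypotheses give the observed outcome the same probability, yet the tests respond differently.

```python
# Outcome probabilities for q0, q1, q2, taken from the two tables.
h0 = {0: 0.00, 1: 0.05, 2: 0.95}    # farmer Maria
h1 = {0: 0.40, 1: 0.30, 2: 0.30}    # farmer Lisa
h0p = {0: 0.05, 1: 0.05, 2: 0.90}   # the primed null hypothesis
h1p = {0: 0.40, 1: 0.30, 2: 0.30}

def F(k):   # first test: rejects h0 on {q0, q1}
    return 1 if k != 2 else 0

def Fp(k):  # second test: rejects h0' on {q0} only
    return 1 if k == 0 else 0

def significance(null, test):
    """Type-I error: probability under the null of data that rejects it."""
    return sum(p for k, p in null.items() if test(k) == 1)

def power(alt, test):
    """Probability under the alternative of rejecting the null."""
    return sum(p for k, p in alt.items() if test(k) == 1)

assert abs(significance(h0, F) - 0.05) < 1e-9 and abs(power(h1, F) - 0.70) < 1e-9
assert abs(significance(h0p, Fp) - 0.05) < 1e-9 and abs(power(h1p, Fp) - 0.40) < 1e-9

# Jeffreys' oddity: h0 and h0' assign q1 the same probability, yet observing
# q1 rejects h0 (F(1) = 1) while h0' is accepted (F'(1) = 0).
assert h0[1] == h0p[1] and F(1) != Fp(1)
```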
8 NEYMAN-PEARSON TEST AS AN INFERENCE

In this section, I investigate whether we can turn the Neyman-Pearson procedure of Section 7 into an inference within Bayesian inductive logic. This might come across as a pointless exercise in statistical yoga, trying to make Neyman and Pearson relax in a position that they would not naturally adopt. However, the exercise will nicely illustrate how statistics may be related to inductive logic, and thus invite research on the intersection of inductive logic and statistics in the sciences. An additional reason for investigating Neyman-Pearson hypothesis testing in this framework is that in many practical applications, scientists are tempted to read the probability statements about the hypotheses inversely after all: the significance is often taken as the probability that the null hypothesis is true. Although emphatically wrong, this inferential reading has a strong intuitive appeal to users. The following will make explicit that in this reading, the Neyman-Pearson procedure is effectively taken as a non-ampliative entailment.

First, we construct the space H × Q, and define the probability functions Phj over the sample spaces {hj} × Q. For the prior probability assignment over the two hypotheses, we take P(h0) ∈ (l, u), meaning that l < P(h0) < u. We write P̲(hj) = min P(hj) and P̄(hj) = max P(hj). Finally, we adopt the restriction that P(h0) + P(h1) = 1. This defines a set of probability functions over the entire
space, serving as a starting point of the inference. Next we include the data in the probability assignments. Crucially, we coarse-grain the observations to the simple observation f^j, with f^j = {st : F(st) = j}, so that the observation simply encodes the value of the test function. Then the type-I and type-II errors can be equated to the likelihoods of the observations according to

P(f^1|h0) = α,   P(f^0|h1) = β.

Finally we use Bayes' theorem to derive a set of posterior probability distributions over the hypotheses, according to

P(h1|f^j) / P(h0|f^j) = P(f^j|h1) P(h1) / ( P(f^j|h0) P(h0) ).

Note that the quality of the test, in terms of size and power, will be reflected in the posteriors. If, for example, we find an observation st that allows us to reject the null hypothesis, so f^1, then for the posterior interval we will generally have P̲(h0|f^1) < P̲(h0) and P̄(h0|f^1) < P̄(h0). With this representation, we have not yet decided on a fully specified prior probability over the statistical hypotheses. This echoes the fact that classical statistics does not make use of a prior probability. However, it is only by restricting the prior probability over hypotheses in some way or other that we can make the Bayesian rendering of the results of Neyman and Pearson work. In particular, if we choose (l, u) = (0, 1) for the prior, then we find the interval (0, 1) for the posterior as well. However, if we choose

l ≥ β / (β + 1 − α),   u ≤ (1 − β) / (1 − β + α),
we find for all P(h0) ∈ (l, u) that P(h0|f^1) < 1/2 < P(h1|f^1). Similarly, we find P(h0|f^0) > 1/2 > P(h1|f^0). So with this interval prior, an observation st for which F(st) = 1 tilts the balance towards h1 for all the probability functions P in the interval, and vice versa. Let me illustrate this by means of the example on the farmers Lisa and Maria. We set up the sample space and hypotheses as before, and we then coarse-grain the observations to f^j, corresponding to the value of the test function, f^1 = q^0 ∪ q^1 and f^0 = q^2. We obtain

P(f^1|h0) = P(q^0 ∪ q^1|h0) = α = 0.05,
P(f^0|h1) = P(q^2|h1) = β = 0.30.

Choosing P(h0) ∈ (0.24, 0.93), this results in P(h0|f^0) ∈ (0.50, 0.98), and P(h0|f^1) ∈ (0.02, 0.50).
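This Bayesian rendering can be sketched numerically (an illustration only, using the α and β of the example; the interval endpoints are checked against the values reported above, with rounding at the boundaries):

```python
alpha, beta = 0.05, 0.30   # type-I and type-II errors of the example

def posterior_h0(prior_h0, f):
    """Posterior of h0 after observing the coarse-grained outcome f in {0, 1},
    with likelihoods P(f=1|h0) = alpha and P(f=0|h1) = beta."""
    like_h0 = alpha if f == 1 else 1 - alpha
    like_h1 = 1 - beta if f == 1 else beta
    num = like_h0 * prior_h0
    return num / (num + like_h1 * (1 - prior_h0))

l = beta / (beta + 1 - alpha)          # = 0.24
u = (1 - beta) / (1 - beta + alpha)    # = 14/15, roughly 0.933

# At the interval endpoints the posterior just reaches 1/2, so every prior
# strictly inside (l, u) is tilted past 1/2 by the data, as in the text.
assert abs(posterior_h0(l, 0) - 0.5) < 1e-9
assert abs(posterior_h0(u, 1) - 0.5) < 1e-9

# Posterior intervals of the example: P(h0|f0) in (0.50, 0.98), P(h0|f1) in (0.02, 0.50).
assert abs(posterior_h0(0.24, 1) - 0.022) < 0.001
assert round(posterior_h0(0.93, 0), 2) == 0.98
```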
Depending on the choice of prior, we might claim that the resulting Bayesian inference replicates the Neyman-Pearson procedure: if the probability over hypotheses expresses our preference over them, then indeed f^0 makes us prefer h0 and f^1 makes us prefer h1. Importantly, the inference fits the entailment relation mentioned earlier: we have a set of probabilistic models on the side of the premises, namely the set of priors over H, coupled to the full probability assignments over {hj} × Q for each of the hypotheses. And we have a set of models on the conclusion side, namely the set of posteriors over H. Because the latter is computed from the former by the axioms of probability, the two sets cover the same probability functions over sample space. Therefore the conclusion is classically entailed by the premises, meaning that any element from the set of probability functions that features as premise is also included in the set of probability functions that features as conclusion.

The above example shows that we can imitate the workings of a Neyman-Pearson test in Bayesian inductive logic, and thus in terms of a non-ampliative inductive inference. But the imitation is far from perfect. For one, the result of a Bayesian inference will always be a probability function. By contrast, Neyman-Pearson statistics ends in a decision to accept or reject, which is a binary decision instead of some sort of weak or inconclusive preference. Of course, there are many attempts to weld a binary decision onto the probabilistic end result of a Bayesian inference, for example in [Levi, 1980] and in the discussion on rational acceptance, e.g., [Douven, 2002]. In particular, we might supplement the probabilistic results of a Bayesian inference with rules for translating the probability assignments into decisions, e.g., we choose h0 if we have P(h0|st) > 1/2, and similarly for h1. However, the bivalence of Neyman-Pearson statistics cannot be replicated in a Bayesian inference itself.
It will have to result from a decision-theoretic add-on to the inferential part of Bayesian statistics. More generally, the representation in probabilistic logic will probably not appeal to advocates of classical statistics. Quite apart from the issue of binary acceptance, the whole idea of assuming a prior probability, however unspecific, may be objected to on the principled ground that probability functions express long-term frequencies, and that hypotheses cannot have such frequencies.

There is one feature of the above rendering, attractive at least to my mind, that may be of interest in its own right. With the representation in place, we can ask again how to understand the example by Jeffreys, as considered in Section 7. Following [Hacking, 1965; Edwards, 1972], it illustrates that Neyman-Pearson tests do not respect the likelihood principle, because they depend on the probability assignment over the entire sample space and not just on the probability of the observed sample. However, in the Bayesian representation we do respect the likelihood principle, but in addition we condition on f^j, not on q^k. Instead of adopting the diagnosis by Hacking concerning the likelihood principle, we could therefore say that the approach of Neyman and Pearson takes the observations in terms of a rather coarse-grained partition of information. In other words, rather than saying that Neyman-Pearson procedures violate the likelihood principle, we
can also say that the procedures violate the principle of total evidence.

9 FISHER'S PARAMETER ESTIMATION

Let me turn to another important classical statistical procedure, so-called parameter estimation. I focus in particular on an estimation procedure first devised by [Fisher, 1956], maximum likelihood estimation. The two sections following this one will be devoted to the question whether and how we can capture this classical statistical procedure in Bayesian inductive logic. The maximum likelihood estimator determines the best among a much larger, possibly infinite, set of hypotheses. It depends on the probability that the hypotheses assign to points in the sample space. See [Barnett, 1999] for more detail.

DEFINITION 3 Maximum Likelihood Estimation. Let H = {hθ : θ ∈ Θ} be a set of hypotheses, labeled by the parameter θ, and let Q be the sample space. Then the maximum likelihood estimator of θ,

(16) θ̂(st) = {θ ∈ Θ : for all θ′, Phθ′(st) ≤ Phθ(st)},

is a function over the elements st in the sample space.

So the estimator is a set, typically a singleton, of those values of θ for which the likelihood of hθ on the data st is maximal. The associated best hypothesis we denote with hθ̂(st), or hθ̂ for short. The estimator is a function over the sample space, associating each st with a hypothesis, or a set of them. Often the estimation is coupled to a so-called confidence interval. Restricting the parameter space to Θ = [0, 1] for convenience, and assuming that the true value is θ, we can define a region in sample space within which the estimator function is not too far off the mark. Specifically, we might set the region in such a way that it covers 1 − ε of the probability Phθ over sample space:

(17) Conf1−ε(θ) = { θ̂ : |θ̂ − θ| < Δ and ∫_{θ−Δ}^{θ+Δ} Phθ(θ̂) dθ̂ = 1 − ε }.
We can provide an unproblematic frequentist interpretation of the so-called confidence interval θ̂ ∈ [θ − Δ, θ + Δ]: in a series of estimations, the fraction of times in which the estimator θ̂ is further off the mark than Δ will tend to ε. The smaller the region, the more reliable the estimate. Note, however, that this interval is defined in terms of the unknown true value θ. In Section 11, I will introduce an alternative notion of confidence interval that avoids this drawback. For now, let me illustrate parameter estimation in a simple example on pears, concerning the statistical hypotheses defined in Equation (7). The general idea is that we choose the value of θ for which the probability that the hypothesis gives to the data is maximal. Recall that the likelihoods of the multinomial hypotheses hθ are θ^t1 (1 − θ)^t0.
This function is maximal at θ = t1/t, so the maximum likelihood estimator is

(18) θ̂(st) = t1/t.

For a true value θ, the probability of finding the estimate in the confidence interval of Equation (17), t1/t ∈ [θ − Δ, θ + Δ], increases for larger data sequences because of the law of large numbers. Fixing the probability at 1 − ε, the size of the interval will therefore decrease. This completes the introduction to parameter estimation. Note that the statistical procedure can be taken as the procedural analogue of an ampliative logical inference, running from the data to a probability assignment over the sample space. We have H as the set of probability assignments or hypotheses from which the inference starts, and by means of the data we then choose a single hθ̂ from these as our conclusion. However, in the following I aim to investigate whether there is a non-ampliative logical representation of this inductive inference.

10 ESTIMATIONS IN INDUCTIVE LOGIC
There are at least two ways in which parameter estimation can be turned into a non-ampliative logic. One of these, fiducial inference, generates a probability assignment over statistical hypotheses without presupposing a prior probability at the outset. We deal with this inference in the next section. In this section, we investigate the relation between parameter estimation and the non-ampliative inductive logics devised in the foregoing.

To spot the similarity between parameter estimation and Carnapian inductive logic, note that the procedure of parameter estimation can be used to determine the probability of the next piece of data. In the example on pears, once we have observed s000101, say, we choose h_{1/3} as our best estimate, and we may on that basis predict that the next pear has a probability of 1/3 to be green. The function θ̂ is then used as a predictive system, much like any other Carnapian inductive logic:

P(q^k_{t+1}|st) = Pθ̂(st)(q^k_{t+1}),

where Pθ̂(st) refers to the probability function induced by the hypothesis hθ̂(st). The estimation function θ̂ by Fisher is thereby captured in a single probability function P. So we can present the latter as a probability assignment over sample space, from which estimations can be derived by a non-ampliative inference. Let me make this concrete by means of the example on red and green pears. In the Carnapian prediction rule of Equation (4), choosing λ = 0 will yield the observed relative frequencies as predictions. And according to Equation (18) these relative frequencies are also the maximum likelihood estimators. Thus, for each
set of possible observations, {s_{k1...kt} : ki = 0, 1}, the Carnapian rule with λ = 0 predicts according to the Fisherian estimate.5

The alignment of Fisher estimation and Carnapian inductive logic is not exactly easy. Already for estimations on multinomial hypotheses, it is not immediately clear how we can define the corresponding probability assignment over sample space. For more complicated sets of hypotheses, and the more complicated estimators associated with them, the corresponding probability assignment P may be even less natural. Moreover, the principles and predictive properties that motivate the choice of that probability function will be very hard to come by. In the following I will therefore not discuss the further intricacies of capturing Fisher's estimation functions by Carnapian prediction rules. Instead, I want to devote some attention to capturing parameter estimation in Bayesian statistical inference, and thereby in inductive logic with hypotheses.

Bayesian inductive logic, the non-ampliative inductive logic that emulates Bayesian statistics, is more suitable for capturing parameter estimation than Carnapian inductive logic. Note that in both parameter estimation and Bayesian statistics, we consider a set of statistical hypotheses and we are looking to find the best fitting one. Moreover, in both of these our choice among the hypotheses is informed by the probability of the data according to the hypotheses, i.e., the likelihoods. To capture something like parameter estimation, the posterior over hypotheses can be used to generate the kind of choices between hypotheses that classical statistics provides. As for parameter estimation, we can use the posterior to derive an expectation for the parameter θ, as in Equation (12):

E[θ] = ∫_Θ θ P(hθ|st) dθ.
Clearly, E[θ] is a function that brings us from the data st to a preferred value for the parameter. The function depends on the prior probability over the hypotheses, but it is nevertheless analogous to the maximum likelihood estimator. In analogy to the confidence interval, we can also define a so-called credal interval from the posterior probability distribution:

Cred1−ε(st) = { θ : |θ − E[θ]| < Δ and ∫_{E[θ]−Δ}^{E[θ]+Δ} P(hθ|st) dθ = 1 − ε }.

5 Note that the probability function P that describes the estimations is a rather unusual one. After three red pears for example, s000, the probability for the next pear to be green will be 0, so that P(s0001) = 0. Then, by the standard axiomatisation and definitions of probability, the probability of any observation q^0_5 conditional on s0001 is not defined. But if the probability function P is supposed to follow the Fisherian estimations, then we must have P(q^0_5|s0001) = 3/4. To accommodate the probability function imposed by Fisher's estimations, we must therefore change the axiomatisation of probability. In particular, we may adopt an axiomatisation in which conditional probability is primitive, as described in [Rényi, 1970] and in chapter 15 of this volume. Alternatively, we can restrict ourselves to estimations based on the observation of more than one property.

Therefore,
(19) P({θ : θ ∈ Cred1−ε(st)}|st) = 1 − ε.

This set of values for θ is such that the posterior probabilities of the corresponding hθ jointly add up to 1 − ε. We might argue that this expression is an improvement over the classical confidence interval of Equation (17). The latter only expresses how far an estimate is off the mark on average, while it does not warrant an inference about how far the specific estimate that we have obtained lies from the true value of the parameter. By contrast, a credal interval does allow for such an inferential reading.

Of course there are also large differences between the results of parameter estimation and the results of a Bayesian analysis. One difference is that in parameter estimation, and in classical statistics more generally, the choice for some hypothesis is an all-or-nothing affair: we accept or reject, we choose a single best estimate, and so on. In the Bayesian procedure, by contrast, the choice is expressed in a posterior probability assignment over the set of hypotheses. As indicated in the discussion of Neyman-Pearson hypothesis testing, this difference remains problematic. In addition, there is a well-known, but no less grave drawback to the way in which the Bayesian conclusions are reached: we have to assume a prior probability assignment over the statistical hypotheses. Any expectation and credal interval depends on the exact prior that is chosen. This dependence can only be avoided by assuming that we have sufficient data to swamp the impact of the prior or, in some sense equivalently, by assuming that the prior is sufficiently smooth in comparison to the likelihoods for the data.
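The posterior expectation and credal mass can be sketched with a simple grid approximation (illustrative only: a uniform prior, arbitrary counts, and a grid discretization of Θ that are not from the text):

```python
# Discretized posterior over theta for 0/1 data, uniform prior on Theta = [0, 1].
N = 10001
grid = [i / (N - 1) for i in range(N)]

def posterior(t1, t0):
    """Grid approximation of P(h_theta | s_t), proportional to theta^t1 (1-theta)^t0."""
    w = [th ** t1 * (1 - th) ** t0 for th in grid]
    z = sum(w)
    return [x / z for x in w]

def expectation(post):
    """E[theta] under the discretized posterior, as in Equation (12)."""
    return sum(th * p for th, p in zip(grid, post))

def credal_mass(post, centre, delta):
    """Posterior mass of {theta : |theta - centre| < delta}, cf. the credal interval."""
    return sum(p for th, p in zip(grid, post) if abs(th - centre) < delta)

post = posterior(7, 3)
e = expectation(post)
# A uniform prior is Beta(1, 1), so E[theta] = (t1 + 1) / (t + 2) = 8/12.
assert abs(e - 8 / 12) < 1e-3
# Widening Delta sweeps the credal mass towards 1 - epsilon and beyond.
assert credal_mass(post, e, 0.10) < credal_mass(post, e, 0.30) <= 1.0 + 1e-9
```

In practice one fixes ε and searches for the smallest Δ whose credal mass reaches 1 − ε; the sketch only exhibits the monotonicity that makes this search well defined.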
11
FIDUCIAL PROBABILITY
This latter problem, of how to choose the prior, motivated [Fisher, 1930; Fisher, 1935; Fisher, 1956] to devise an alternative way of making parameter estimation inferential, the so-called fiducial argument. This argument yields a probability assignment over hypotheses without assuming a prior probability over statistical hypotheses at the outset. The fiducial argument is controversial, however, and its applicability is limited to particular statistical problems. See [Hacking, 1965] and [Seidenfeld, 1979] for detailed critical discussions, and [Barnett, 1999] for a good overview. In the following, I will only provide a brief sketch of the argument. A good way of introducing fiducial probability is by the notion of confidence intervals, introduced in Section 9. In some cases, as detailed below, we can also derive a region of parameter values within which the true value θ can be expected to lie. The general idea is to define a set of parameter values R within which the data are not too unlikely, and to then say that the true parameter value most likely lies within that set. Specifically, in terms of the integral in Equation (17), we can swap the roles of θ and θ̂ and define:
(20) Fid_{1−ε}(s^t) = {θ : |θ − θ̂(s^t)| < Δ}, with Δ such that ∫_{θ̂−Δ}^{θ̂+Δ} P_{hθ}(θ̂) dθ = 1 − ε.

Inductive Logic and Statistics
647
Crucially, the integral runs not over the data θ̂, but over the true parameter values θ. Every element of the sample space s^t is thus assigned a so-called fiducial interval Fid_{1−ε}, containing the parameter values that are considered good candidates for truth.

The integral of Equation (20) only properly concerns a probability if the parameter θ and the estimation θ̂ can indeed swap roles like that. We need to have that P_{hθ}(θ̂ + δ) = P_{h(θ−δ)}(θ̂) for all values of δ, so that the distribution over the estimator for a given parameter can be read as a distribution over the parameter for a given estimator. In that case, we can interpret the fiducial interval in much the same way as the credal interval of Equation (19), namely as a probability:

(21) P({θ : θ ∈ Fid_{1−ε}(s^t)} | s^t) = 1 − ε.

But if the condition is not met, the interval cannot be taken as expressing a probability that the true value of the parameter lies within a certain interval around the estimate. Or at least, we cannot interpret it in this way without further consideration.

The determination of the intervals of Equation (20) is an example of the determination of fiducial probability. It relies on a strong requirement: we must presuppose the equivalence of two distinct functions, both written P_{hθ}(θ̂), one taking θ and one taking θ̂ as argument. A much more general formulation of this requirement is provided by [Dawid and Stone, 1982]. They argue that in order to run the fiducial argument, one has to assume that the statistical problem can be captured in a functional model that is smoothly invertible. I want to conclude this exposition of the fiducial argument with an explanation of the notion of a smoothly invertible functional model. It brings out the presuppositions of the fiducial argument very nicely.

The central assumption for every fiducial argument is that there is a so-called pivotal quantity, i.e. some estimator function over the data θ̂(s^t), relating the statistical parameter θ and an error term ω according to

θ̂(s^t) = f(θ, ω).

We can think of the parameter θ as the systematic component of the process that brings about the data, and of the term ω as the stochastic component, causing individual variation around the systematic component. We further assume a probability function P(ω) over the error terms, so that the functional relation and the probability over error terms together determine a probability

(22) P(θ̂(s^t) | hθ) = P({ω : f(θ, ω) = θ̂}).
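To make the functional model concrete, the following sketch simulates both directions of such a model; the location model f(θ, ω) = θ + ω, the standard normal error distribution, and all numbers are my own illustrative assumptions, not taken from the text:

```python
import random

# Illustrative functional model: theta_hat = f(theta, omega) = theta + omega,
# with error term omega ~ N(0, 1). All choices here are assumptions made for
# the sake of the sketch.
def f(theta, omega):
    return theta + omega

def f_inv(theta_hat, omega):
    # The smooth inverse of f in omega, as the fiducial argument requires.
    return theta_hat - omega

random.seed(0)
omegas = [random.gauss(0.0, 1.0) for _ in range(200_000)]

# Forward direction (Eq. 22): the probability over error terms induces a
# probability over the estimator, here P(theta_hat <= 1.5 | h_theta).
theta = 1.0
p_forward = sum(1 for w in omegas if f(theta, w) <= 1.5) / len(omegas)

# Inverse direction: after observing theta_hat, the distribution over omega
# is transferred onto the parameter via f_inv (the fiducial distribution).
theta_hat_obs = 2.3
fiducial_sample = [f_inv(theta_hat_obs, w) for w in omegas]
fiducial_mean = sum(fiducial_sample) / len(fiducial_sample)
```

Because f is smoothly invertible, the same error distribution can be carried back and forth between estimator and parameter, which is exactly the presupposition discussed in the surrounding text.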
Suppose that the function f is invertible: we also have a function f^{−1}(θ̂, ω) = θ. And finally, we assume that the error terms and the hypotheses are probabilistically independent:

(23) P(hθ, ω) = P(hθ)P(ω).

This means that the systematic and stochastic components of the data generating process are independent: every value of the parameter θ is associated with the same probability assignment over the stochastic terms. Given this independence, we can write down the overall probability assignment in terms of a graphical structure, a Bayesian network, as depicted below. I refer to [Neapolitan, 2003] for further details on this way of representing probability assignments.
hθ → θ̂ ← ω

(Bayesian network with nodes hθ, θ̂, and ω: the estimator θ̂ is determined by the hypothesis node hθ and the error node ω, which are themselves unconnected.)
ˆ t ), and that we condition on Say that we observe st , thus fixing the value for θ(s this observed data. Then, because of the network structure and the further fact that the relation f (hθ , ω) is deterministic, the variables ω and hθ become perfectly ˆ ω). And because the correlated: each ω is associated with a unique θ = f −1 (θ, observation of st does not itself influence the probability of ω either, we can write ˆ t )) = P ({ω : f −1 (θ, ˆ ω) = θ}), (24) P (hθ |θ(s which is the inverse of Equation (22). This means that after observing st we can transfer the probability distribution over ω onto hθ according to the function f −1 . The fiducial probability over the hypotheses hθ is, I think, a surprising result. No prior probability has been assumed, and nevertheless the construction is such that we can derive something that looks like a posterior. Moreover, the inductive inference in this construction is non-ampliative. The set of probability assignments over hθ , st , and ω is such that P (hθ ) can be any convex combination of elements from a set of functions over θ, while P (hθ |st ) is a distinct element from that set. Given the controversy that surrounds the interpretation and determination of prior probabilities, it is a real pity that the fiducial argument can only be run under such strict conditions. 12
IN CONCLUSION
In the foregoing I have introduced a setting in which inductive logic and statistics may be unified. I have discussed how inductive logic can be developed to encompass and emulate a number of inductive procedures from statistics. In particular, the discussion of Bayesian statistical inference has led to the extension of the language of inductive logic with statistical hypotheses. The resulting inductive logic was applied to two classical procedures, to wit, Neyman-Pearson hypothesis testing and Fisher's maximum likelihood estimation. While these procedures are best
understood as ampliative inductive inferences, I have shown that they can also be modelled, at least partly, in terms of this extended inductive logic.

I hope that portraying statistical procedures in the setting of inductive logic has been illuminating. In particular, I hope that the relation between Carnapian inductive logic and Bayesian statistics stimulates research on the intersection of the two. Certainly, some research in this area has already been conducted; see for example [Skyrms, 1991; Skyrms, 1993; Skyrms, 1996] and [Festa, 1993]. Following these contributions, [Romeijn, 2005] argues that an inductive logic that includes statistical hypotheses in its language is closely related to Bayesian statistical inference, and some of these views have been reiterated in this chapter. However, I believe that there is much room for improvement. Research on the intersection of inductive logic and statistical inference can certainly enhance the relevance of inductive logical systems to scientific method and the philosophy of science. In parallel, I believe that insights from inductive logic may help to clarify the foundations of statistics.

ACKNOWLEDGEMENTS

This research was carried out as part of a project funded by the Dutch Organization of Scientific Research (NWO VENI-grant nr. 275-20-013). I also thank the Spanish Ministry of Science and Innovation (Research project FFI2008-1169) for generous support. Finally, my thanks go to Theo Kuipers and Roberto Festa for teaching me about inductive logic over the past years. Needless to say, the mistakes and omissions in this paper are my own doing.

BIBLIOGRAPHY

[Auxier and Hahn, 2006] R. E. Auxier and L. E. Hahn, editors. The Philosophy of Jaakko Hintikka. Open Court, Chicago, 2006.
[Bandyopadhyay and Forster, 2009] P. Bandyopadhyay and M. Forster, editors. Handbook for the Philosophy of Science: Philosophy of Statistics. Elsevier, 2009.
[Barnett, 1999] V. Barnett. Comparative Statistical Inference. John Wiley, New York, 1999.
[Carnap, 1950] R. Carnap. Logical Foundations of Probability. University of Chicago Press, 1950.
[Carnap, 1952] R. Carnap. The Continuum of Inductive Methods. University of Chicago Press, Chicago, 1952.
[Dawid and Stone, 1982] A. P. Dawid and M. Stone. The functional-model basis of fiducial inference (with discussion). Annals of Statistics, 10(4):1054–1074, 1982.
[de Finetti, 1937] B. de Finetti. La prévision: ses lois logiques, ses sources subjectives. Annales de l'Institut Henri Poincaré, 7(1):1–68, 1937.
[Douven, 2002] I. Douven. A new solution to the paradoxes of rational acceptability. The British Journal for the Philosophy of Science, 53:391–410, 2002.
[Edwards, 1972] A. W. F. Edwards. Likelihood. Cambridge University Press, 1972.
[Festa, 1993] R. Festa. Optimum Inductive Methods. Kluwer, Dordrecht, 1993.
[Festa, 1996] R. Festa. Analogy and exchangeability in predictive inferences. Erkenntnis, 45:89–112, 1996.
[Fisher, 1930] R. A. Fisher. Inverse probability. Proceedings of the Cambridge Philosophical Society, 26:528–535, 1930.
[Fisher, 1935] R. A. Fisher. The fiducial argument in statistical inference. Annals of Eugenics, 6:317–324, 1935.
[Fisher, 1956] R. A. Fisher. Statistical Methods and Scientific Inference. Oliver and Boyd, Edinburgh, 1956.
[Hacking, 1965] I. Hacking. The Logic of Statistical Inference. Cambridge University Press, Cambridge, 1965.
[Haenni et al., 2009] R. Haenni, J.W. Romeijn, G. Wheeler, and J. Williamson. Probabilistic Logics and Probabilistic Networks. Springer, 2009.
[Hailperin, 1996] T. Hailperin. Sentential Probability Logic. Lehigh University Press, 1996.
[Howson, 2003] C. Howson. Probability and logic. Journal of Applied Logic, 1(3–4):151–165, 2003.
[Jeffreys, 1931] H. Jeffreys. Scientific Inference. Cambridge University Press, Cambridge, 1931.
[Johnson, 1932] W. Johnson. Probability: the deductive and inductive problems. Mind, 49:409–423, 1932.
[Kyburg, 1974] H. E. Kyburg, Jr. The Logical Foundations of Statistical Inference. D. Reidel, Dordrecht, 1974.
[Levi, 1980] I. Levi. The Enterprise of Knowledge: An Essay on Knowledge, Credal Probability, and Chance. MIT Press, Cambridge MA, 1980.
[Maher, 2000] P. Maher. Probabilities for two properties. Erkenntnis, 52:63–81, 2000.
[Neapolitan, 2003] R. E. Neapolitan. Learning Bayesian Networks. Prentice Hall, 2003.
[Neyman and Pearson, 1967] J. Neyman and E. Pearson. Joint Statistical Papers. University of California Press, Berkeley, 1967.
[Paris and Waterhouse, 2008] J. Paris and P. Waterhouse. Atom exchangeability and instantial relevance. Unpublished manuscript, 2008.
[Press, 2003] J. Press. Subjective and Objective Bayesian Statistics: Principles, Models, and Applications. John Wiley, New York, 2003.
[Rényi, 1970] A. Rényi. Probability Theory. North Holland, Amsterdam, 1970.
[Romeijn, 2004] J.W. Romeijn. Hypotheses and inductive predictions. Synthese, 141(3):333–64, 2004.
[Romeijn, 2005] J.W. Romeijn. Bayesian Inductive Logic. PhD dissertation, University of Groningen, 2005.
[Romeijn, 2006] J.W. Romeijn. Analogical predictions for explicit similarity. Erkenntnis, 64:253–280, 2006.
[Seidenfeld, 1979] T. Seidenfeld. Philosophical Problems of Statistical Inference: Learning from R. A. Fisher. Reidel, Dordrecht, 1979.
[Skyrms, 1991] B. Skyrms. Carnapian inductive logic for Markov chains. Erkenntnis, 35:35–53, 1991.
[Skyrms, 1993] B. Skyrms. Analogy by similarity in hyper-Carnapian inductive logic. In J. Earman, A. I. Janis, G. Massey, and N. Rescher, editors, Philosophical Problems of the Internal and External Worlds, pages 273–282. University of Pittsburgh Press, Pittsburgh, 1993.
[Skyrms, 1996] B. Skyrms. Carnapian inductive logic and Bayesian statistics. In Statistics, Probability, and Game Theory, pages 321–336. IMS Lecture Notes, 1996.
[Wheeler, 2006] G. Wheeler. Rational acceptance and conjunctive/disjunctive absorption. Journal of Logic, Language and Information, 15(1–2):49–63, 2006.
[Zabell, 1982] S. Zabell. W. E. Johnson's "sufficientness" postulate. Annals of Statistics, 10:1091–99, 1982.
STATISTICAL LEARNING THEORY: MODELS, CONCEPTS, AND RESULTS

Ulrike von Luxburg and Bernhard Schölkopf
1
INTRODUCTION
Statistical learning theory provides the theoretical basis for many of today's machine learning algorithms and is arguably one of the most beautifully developed branches of artificial intelligence in general. It originated in Russia in the 1960s and gained wide popularity in the 1990s following the development of the so-called Support Vector Machine (SVM), which has become a standard tool for pattern recognition in a variety of domains ranging from computer vision to computational biology. Providing the basis of new learning algorithms, however, was not the only motivation for developing statistical learning theory. It was just as much a philosophical one, attempting to answer the question of what it is that allows us to draw valid conclusions from empirical data.

In this article we attempt to give a gentle, non-technical overview of the key ideas and insights of statistical learning theory. We do not assume that the reader has a deep background in mathematics, statistics, or computer science. Given the nature of the subject matter, however, some familiarity with mathematical concepts and notations and some intuitive understanding of basic probability is required. There exist many excellent references to more technical surveys of the mathematics of statistical learning theory: the monographs by one of the founders of statistical learning theory ([Vapnik, 1995], [Vapnik, 1998]), a brief overview of statistical learning theory in Section 5 of [Schölkopf and Smola, 2002], more technical overview papers such as [Bousquet et al., 2003], [Mendelson, 2003], [Boucheron et al., 2005], [Herbrich and Williamson, 2002], and the monograph [Devroye et al., 1996].

2
THE STANDARD FRAMEWORK OF STATISTICAL LEARNING THEORY

2.1

Background
[Handbook of the History of Logic. Volume 10: Inductive Logic. Volume editors: Dov M. Gabbay, Stephan Hartmann and John Woods. General editors: Dov M. Gabbay and John Woods. © 2011 Elsevier B.V. All rights reserved.]

In our context, learning refers to the process of inferring general rules by observing examples. Many living organisms show some ability to learn. For instance, children can learn what "a car" is, just by being shown examples of objects that are cars
and objects that are not cars. They do not need to be told any rules about what it is that makes an object a car; they can simply learn the concept "car" by observing examples.

The field of machine learning does not study the process of learning in living organisms, but instead studies the process of learning in the abstract. The question is how a machine, a computer, can "learn" specific tasks by following specified learning algorithms. To this end, the machine is shown particular examples of a specific task. Its goal is then to infer a general rule which can both explain the examples it has seen already and generalize to previously unseen, new examples. Machine learning has roots in artificial intelligence, statistics, and computer science, but by now has established itself as a scientific discipline in its own right. As opposed to artificial intelligence, it does not try to explain or generate "intelligent behavior"; its goal is more modest: it just wants to discover mechanisms by which very specific tasks can be "learned" by a computer. Once put into a formal framework, many of the problems studied in machine learning sound familiar from statistics or physics: regression, classification, clustering, and so on. However, machine learning looks at those problems with a different focus: that of inductive inference and generalization ability.

The most well-studied problem in machine learning is the problem of classification. Here we deal with two kinds of spaces: the input space X (also called the space of instances) and the output space (label space) Y. For example, if the task is to classify certain objects into a given, finite set of categories such as "car", "chair", "cow", then X consists of the space of all possible objects (instances) in a certain, fixed representation, while Y is the space of all available categories.
In order to learn, an algorithm is given some training examples (X1, Y1), ..., (Xn, Yn), that is, pairs of objects with the corresponding category label. The goal is then to find a mapping f : X → Y which makes "as few errors as possible". That is, among all the elements in X, the number of objects which are assigned to the wrong category is as small as possible. The mapping f : X → Y is called a classifier.

In general, we distinguish between two types of learning problems: supervised ones and unsupervised ones. Classification is an example of a supervised learning problem: the training examples consist both of instances Xi and of the correct labels Yi on those instances. The goal is to find a functional relationship between instances and outputs. This setting is called supervised because, at least on the training examples, the learner can evaluate whether an answer is correct; that is, the learner is being supervised. Contrary to this, the training data in the unsupervised setting only consist of instances Xi, without any further information about what kind of output is expected on those instances. In this setting, the question of learning is more about discovering some "structure" on the underlying space of instances. A standard example of such a setting is clustering. Given some input points X1, ..., Xn, the learner is requested to construct "meaningful groups" among the instances. For example, an online retailer might want to cluster his customers based on shopping profiles. He collects all kinds of potentially meaningful information about his customers (this will lead to the input Xi for each customer)
and then wants to discover groups of customers with similar behavior. As opposed to classification, however, it is not specified beforehand which customer should belong to which group; it is the task of the clustering algorithm to work that out.

Statistical learning theory (SLT) is a theoretical branch of machine learning that attempts to lay the mathematical foundations for the field. The questions asked by SLT are fundamental:

• Which learning tasks can be performed by computers in general (positive and negative results)?

• What kind of assumptions do we have to make such that machine learning can be successful?

• What are the key properties a learning algorithm needs to satisfy in order to be successful?

• Which performance guarantees can we give on the results of certain learning algorithms?

To answer those questions, SLT builds on a certain mathematical framework, which we are now going to introduce. In the following, we will focus on the case of supervised learning, more particularly on the case of binary classification. We made this choice because the theory for supervised learning, in particular classification, is rather mature, while the theory for many branches of unsupervised learning is still in its infancy.
2.2
The formal setup
In supervised learning, we deal with an input space (space of instances, space of objects) X and an output space (label space) Y. In the case of binary classification, we identify the label space with the set {−1, +1}. That is, each object can belong to one out of two classes, and by convention we denote those classes by −1 and 1. The question of learning is reduced to the question of estimating a functional relationship of the form f : X → Y, that is a relationship between input and output. Such a mapping f is called a classifier. In order to do this, we get access to some training points (training examples, training data) (X1 , Y1 ), ..., (Xn , Yn ) ∈ X × Y. A classification algorithm (classification rule) is a procedure that takes the training data as input and outputs a classifier f . We do not make specific assumptions on the spaces X or Y, but we do make an assumption on the mechanism which generates those training points. Namely, we assume that there exists a joint probability distribution P on X × Y, and the training examples (Xi , Yi ) are sampled independently from this distribution P . This type of sampling is often denoted as iid sampling (independent and identically distributed). There are a few important facts to note here.
1. No assumptions on P . In the standard setting of SLT we do not make any assumption on the probability distribution P : it can be any distribution on X × Y. In this sense, statistical learning theory works in an agnostic setting which is different from standard statistics, where one usually assumes that the probability distribution belongs to a certain family of distributions and the goal is to estimate the parameters of this distribution. 2. Non-deterministic labels due to label noise or overlapping classes. Note that P is a probability distribution not only over the instances X, but also over the labels Y. As a consequence, labels Yi in the data are not necessarily just a deterministic function of the objects Xi , but can be random themselves. There are two main reasons why this can be the case. The first reason is that the data generating process can be subject to label noise. That is, it can happen that label Yi we get as a training label in the learning process is actually wrong. This is an important and realistic assumption. For example, to generate training data for email spam detection, humans are required to label emails by hand into classes “spam” and “not spam”. All humans make mistakes from time to time. So it will happen that some emails accidentally get labeled as “spam” even though they are not spam, or vice versa. Of course, the hope is that such wrong labels only occur with a relatively small probability. The second major reason which can lead to non-deterministic labels is the case of overlapping classes. As an example, consider the task of predicting the gender of a person based on their height. It is clear that a person of height 1.80 meters, say, could in principle be male or female, thus we cannot assign a unique label Y to the input X = 1.80. For the purpose of learning, we will see that in the end it does not matter which of the reasons is the one leading to non-deterministic labels. 
It will turn out that the important quantity which covers both cases is the conditional likelihood of the labels, namely the probability that the label Y is 1, under the condition that the data point under consideration is the point x:

(1) η(x) := P(Y = 1 | X = x).
Note that we only consider the case Y = 1, as the case Y = −1 can simply be computed by P(Y = −1 | X = x) = 1 − η(x). In the case of small label noise, the conditional probability η(x) is either close to 1 or close to 0 (depending on whether the true label is +1 or −1). For large label noise, on the other hand, the probability η(x) gets closer to 0.5 and learning becomes more difficult. Similar reasoning applies to the case of overlapping classes. In the example of predicting the gender of a person based on height, the classes overlap quite strongly. For example, the probability P(Y = "male" | X = 1.70) might only be 0.6. That is, if we want to predict the gender of a person of height 1.70 we will, on average, make an error of at least 40%. Both in cases of label noise and of overlapping classes, learning becomes more difficult the
closer the function η(x) comes to 0.5, and it becomes unavoidable that a classifier makes a relatively large number of errors.

3. Independent sampling. It is an important assumption of SLT that data points are sampled independently. This is a rather strong assumption, which is justified in many applications, but not in all of them. For example, consider pattern recognition for handwritten digits. Given some images of handwritten digits, the task is to train a machine to automatically recognize new handwritten digits. For this task, the training set usually consists of a large collection of digits written by many different people. Here it is safe to assume that those digits form an independent sample from the whole "population" of all handwritten digits.

As an example where the independence assumption is heavily violated, consider the case of drug discovery. This is a field in pharmacy where people try to identify chemical compounds which might be helpful for designing new drugs. Machine learning is being used for this purpose: the training examples consist of chemical compounds Xi with a label Yi which indicates whether the compound is useful for drug design or not. It is expensive to find out whether a chemical compound possesses certain properties that render it a suitable drug, because this would require running extensive lab experiments. As a result, only rather few compounds Xi have known labels Yi, and those compounds have been carefully selected in the first place. Here, we cannot assume that the Xi are a representative sample drawn independently from some distribution of chemical compounds, as the labeled compounds are hand-selected according to some non-random process.

Note that in some areas of machine learning, researchers try to relax the independence assumption. For example, in active learning one deals with the situation where users can actively select the points they want to get labeled.
Another case is time series prediction, where training instances are often generated from overlapping (and thus dependent) windows of a temporal sequence. We are not going to discuss those areas in this paper.

4. The distribution P is fixed. In the standard setting of SLT, we do not have any "time" parameter. In particular, we do not assume any particular ordering of the training examples, and the underlying probability distribution does not change over time. This assumption would not be true if we wanted to argue about time series, for example. Another situation that has recently attracted attention is the case where training and test distributions differ in certain aspects (e.g., under the heading of "covariate shift").

5. The distribution P is unknown at the time of learning. It is important to recall that at the time of training the underlying distribution is not known. We will see below that if we knew P, then learning would be trivial, as we could simply write down the best classifier by a given formula. Instead, we only have access to P indirectly, by observing training examples. Intuitively
this means that if we get enough training examples, we can "estimate" all important properties of P pretty accurately, but we are still prone to errors. It is one of the big achievements of statistical learning theory to provide a framework to make theoretical statements about this error.

As already mentioned above, the goal of supervised learning is to learn a function f : X → Y. In order to achieve this, we need to have some measure of "how good" a function f is when used as a classifier. To this end, we introduce a loss function. This is a function which tells us the "cost" of classifying instance X ∈ X as Y ∈ Y. For example, the simplest loss function in classification is the 0-1 loss or misclassification error: the loss of classifying X by label f(X) is 0 if f(X) is the correct label for X, and 1 otherwise:

ℓ(X, Y, f(X)) = 0 if f(X) = Y, and 1 otherwise.
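A minimal sketch of the 0-1 loss just defined (the function name is mine, not notation from the text):

```python
# The 0-1 loss: cost 0 for a correct classification of x, cost 1 for a
# mistake. The instance x is unused by this particular loss, but kept in
# the signature to match the general form l(X, Y, f(X)).
def zero_one_loss(x, y, prediction):
    return 0 if prediction == y else 1
```

Averaging this loss over data drawn from P yields the risk of a classifier, which the text turns to next.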
In regression, where the output variables Y take values that are real numbers rather than class labels, a well-known loss function is the squared error loss ℓ(X, Y, f(X)) = (Y − f(X))^2. The general convention is that a loss of 0 denotes perfect classification, and higher loss values represent worse classification performance.

While the loss function measures the error of a function on some individual data point, the risk of a function is the average loss over data points generated according to the underlying distribution P:

R(f) := E(ℓ(X, Y, f(X))).

That is, the risk of a classifier f is the expected loss of the function f at all points X ∈ X. Intuitively, this risk "counts" how many elements of the instance space X are misclassified by the function f. Of course, a function f is a better classifier than another function g if its risk is smaller, that is, if R(f) < R(g). To find a good classifier f we need to find one for which R(f) is as small as possible. The best classifier is the one with the smallest risk value R(f).

One aspect we have left open so far is what kind of functions f to consider. To formalize this, we consider some underlying space F of functions which map X to Y. This is the space of functions from which we want to choose our solution. At first glance, the most natural way would be to allow all possible functions from X to Y as classifiers, that is, to choose Fall = {f : X → Y}. (We ignore issues about measurability at this point; for readers familiar with measure theory, note that one usually defines the space Fall to be the space of measurable functions between X and Y.) In this case, one can formally write down what the optimal classifier should be. Given the underlying probability distribution P, this classifier is defined as follows:
(2) fBayes(x) := 1 if P(Y = 1 | X = x) ≥ 0.5, and −1 otherwise.
This is the so-called "Bayes classifier". Intuitively, what it does is as follows. For each point in the space X, it looks at the function η(x) := P(Y = 1 | X = x) introduced in Eq. (1). If we assume that P(Y = 1 | X = x) = 1, this means that the true label Y of the point X = x satisfies Y = 1 with certainty (probability 1). Hence, an optimal classifier should also take this value, that is, it should choose f(x) = 1. Now assume that the classes slightly overlap, for example P(Y = 1 | X = x) = 0.9. This still means that in an overwhelming number of cases (in 90% of them), the label of object x is +1, and thus this is what the classifier f should choose. The same holds as long as the overlap is so small that η(x) ≥ 0.5. By choosing f(x) = 1, the classifier f will be correct in the majority of all cases. Only when η(x) goes below 0.5 does the situation flip, and the optimal choice is f(x) = −1. We will come back to this example in Section 3.

In practice, it is impossible to directly compute the Bayes classifier. The reason is that, as we explained above, the underlying probability distribution is unknown to the learner. Hence, the Bayes classifier cannot be computed, as we would need to evaluate the conditional probabilities P(Y = 1 | X = x).

With all those definitions in mind, we can formulate the standard problem of binary classification as follows: Given some training points (X1, Y1), ..., (Xn, Yn) which have been drawn independently from some unknown probability distribution P, and given some loss function ℓ, how can we construct a function f : X → Y which has risk R(f) as close as possible to the risk of the Bayes classifier?

At this point, note that not only is it impossible to compute the Bayes error, but also the risk of a function f cannot be computed without knowledge of P.
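Under a toy model in which η is known (an illustrative assumption of mine; in the actual learning setting η is precisely what we do not know), the Bayes classifier of Eq. (2) and its risk can be written down directly:

```python
# Toy model in which eta(x) = P(Y = 1 | X = x) is known; all numbers here
# are illustrative assumptions, not taken from the text.
def eta(x):
    return 0.9 if x > 0.5 else 0.2

def f_bayes(x):
    # Eq. (2): predict the label that is more probable at x.
    return 1 if eta(x) >= 0.5 else -1

def risk(f, xs):
    # 0-1 risk under the toy model, approximated by averaging the pointwise
    # error probability over a grid of x values (X taken uniform on [0, 1]).
    errors = [(1 - eta(x)) if f(x) == 1 else eta(x) for x in xs]
    return sum(errors) / len(errors)

grid = [i / 1000 for i in range(1000)]
bayes_risk = risk(f_bayes, grid)  # about 0.15: the unavoidable error floor
```

Note that even the Bayes classifier has nonzero risk here, because the classes overlap; no classifier can do better than min(η(x), 1 − η(x)) at each point.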
All in all, it looks like a pretty desperate situation: we have defined the goal of binary classification (to minimize the risk of the classifier), and can even formally write down its solution (the Bayes classifier). But at the time of training, we do not have access to the important quantities to compute either of them. This is where SLT comes in. It provides a framework to analyze this situation, to come up with solutions, and to provide guarantees on the goodness of these solutions.
2.3 Generalization and consistency
There are a few more important notions we need to explain at this point. The most important one is “generalization”. Assume we are given some training set (X1, Y1), ..., (Xn, Yn), and by some algorithm come up with a classifier fn. Even though we cannot compute the true underlying risk R(fn) of this classifier, what we can do is “count” the number of mistakes it makes on the training points. The resulting quantity also has a name: it is called the empirical risk or the training error. Formally, for any function f it is defined as
Ulrike von Luxburg and Bernhard Schölkopf

Remp(f) := (1/n) ∑_{i=1}^{n} ℓ(Xi, Yi, f(Xi)).
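As a minimal sketch, the empirical risk under the 0-1 loss is just the fraction of misclassified training points; the names and the toy data below are our own illustration:

```python
import numpy as np

def empirical_risk(f, X, Y):
    """Empirical risk under the 0-1 loss: the fraction of training
    points (X_i, Y_i) on which the classifier f errs."""
    predictions = np.array([f(x) for x in X])
    return float(np.mean(predictions != np.array(Y)))

# Toy sample: the true label is the sign of x; the classifier thresholds
# at 0.1, so it misclassifies points falling in (0, 0.1).
X = [-0.5, -0.2, 0.05, 0.3, 0.8]
Y = [-1, -1, 1, 1, 1]
f = lambda x: 1 if x >= 0.1 else -1
print(empirical_risk(f, X, Y))  # -> 0.2 (one error out of five)
```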
Usually, for a classifier fn learned on a particular training set, the empirical risk Remp(fn) is relatively small: otherwise, the learning algorithm does not even seem able to explain the training data. However, it is not clear whether a function fn which makes few errors on the training set also makes few errors on the rest of the space X, that is, whether it has a small overall risk R(fn). We say that a classifier fn generalizes well if the difference |R(fn) − Remp(fn)| is small. Note that with this definition, good generalization performance does not necessarily mean that a classifier has a small overall risk R(fn). It just means that the empirical error Remp(fn) is a good estimate of the true error R(fn). Particularly bad in practice is the situation where Remp(fn) is much smaller than R(fn). In this case, using the empirical risk as an estimator of the true risk would lead us to be overly optimistic about the quality of our classifier. Consider the following regression example. We are given empirical observations (x1, y1), ..., (xm, ym) ∈ X × Y, where for simplicity we take X = Y = ℝ. For example, the data could have been collected in a physical experiment where X denotes the weight of an object, and Y the force we need to pull this object over a rough surface. Figure 1 shows a plot of such a dataset (indicated by the round points), along with two possible functional dependencies that could underlie the data. The dashed line fdashed represents a fairly complex model and fits the training data perfectly, that is, it has a training error of 0. The straight line fstraight, on the other hand, does not completely “explain” the training data, in the sense that there are some residual errors, leading to a small positive training error (for example, measured by the squared loss function). But what about the true risks R(fdashed) and R(fstraight)? The problem is that we cannot compute these risks from the training data.
Moreover, the functions fdashed and fstraight have very different behavior. For example, if the straight line was the true underlying function, then the dashed function fdashed would have a high true risk, as the “distance” between the true and the estimated function is very large. The same also holds the other way around. In both cases the true risk would be much higher than the empirical risk. This example points out an important choice we have to make. Do we prefer to fit the training data with a relatively “complex” function, leading to a very small training error, or do we prefer to fit it with a “simple” function at the cost of a slightly higher training error? In the example above, a physicist measuring these data points would argue that it cannot be by chance that the measurements lie almost on a straight line and would much prefer to attribute the residuals to measurement error than to an erroneous model. But is it possible to characterize the way in which the straight line is simpler, and why this should imply that it is, in some sense, closer to an underlying true dependency? What is the
Statistical Learning Theory
Figure 1. Suppose we want to estimate a functional dependence from a set of examples (black dots). Which model is preferable? The complex (dashed) model perfectly fits all data points, whereas the solid line exhibits residual errors. Statistical learning theory formalizes the role of the capacity of the model class, and gives probabilistic guarantees for the validity of the inferred model (from [Schölkopf and Smola, 2002]).
“amount of increase in training error” we should be willing to tolerate for fitting a simpler model? In one form or another, this issue has long occupied the minds of researchers studying the problem of learning. In classical statistics, it has been studied as the bias-variance dilemma. If we computed a linear fit for every data set that we ever encountered, then every functional dependency we would ever “discover” would be linear. But this would not come from the data; it would be a bias imposed by us. If, on the other hand, we fitted a polynomial of sufficiently high degree to any given data set, we would always be able to fit the data perfectly, but the exact model we came up with would be subject to large fluctuations, depending on how accurate our measurements were in the first place — the model would suffer from a large variance. A related dichotomy is the one between estimation error and approximation error. If we use a small class of functions, then even the best possible solution will poorly approximate the “true” dependency, while a large class of functions will lead to a large statistical estimation error. We will discuss these dichotomies in more detail in Section 2.4. In the terminology of applied machine learning, the complex explanation shows overfitting, while an overly simple explanation imposed by the learning machine design would lead to underfitting. A concept which is closely related to generalization is the one of consistency. However, as opposed to the notion of generalization discussed above, consistency is not a property of an individual function, but a property of a set of functions. As in classical statistics, the notion of consistency aims to make a statement about what happens in the limit of infinitely many sample points. Intuitively, it seems
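The bias-variance dilemma just described can be reproduced numerically. The following sketch (our own illustration, with hypothetical data generated from a linear law plus measurement noise) fits both a “simple” degree-1 polynomial and a “complex” degree-9 polynomial; the complex model drives the training error to nearly zero, while its error on fresh data is typically worse than that of the simple model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a linear law y = 2x observed with small noise.
x_train = np.linspace(0, 1, 10)
y_train = 2 * x_train + rng.normal(scale=0.05, size=10)

# A "simple" (degree 1) and a "complex" (degree 9) model.
simple = np.polyfit(x_train, y_train, deg=1)
complex_ = np.polyfit(x_train, y_train, deg=9)

def mse(coeffs, x, y):
    """Mean squared error of a polynomial model on data (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

# Fresh test data from the same underlying law.
x_test = rng.uniform(0, 1, 1000)
y_test = 2 * x_test + rng.normal(scale=0.05, size=1000)

print("train:", mse(simple, x_train, y_train), mse(complex_, x_train, y_train))
print("test: ", mse(simple, x_test, y_test), mse(complex_, x_test, y_test))
```

The degree-9 fit interpolates the ten training points almost exactly, yet the exact polynomial found fluctuates strongly with the noise realization: large variance, small bias. The linear fit shows the opposite behavior.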
reasonable to request that a learning algorithm, when presented with more and more training examples, should eventually “converge” to an optimal solution. There exist two different types of consistency in the literature, depending on the taste of the authors, and both are usually just called “consistency” without any distinction. To introduce these concepts, let us fix some notation. Given any particular classification algorithm, we denote by fn its outcome on a sample of n training points. It is not important how exactly the algorithm chooses this function. But note that any algorithm chooses its functions from some particular function space F. For some algorithms this space is given explicitly, for others it only exists implicitly via the mechanism of the algorithm. No matter how this space F is defined, the algorithm attempts to choose the function fn ∈ F which it considers the best classifier in F, based on the given training points. On the other hand, in theory we know precisely what the best classifier in F is: it is the one that has the smallest risk. For simplicity, we assume that it is unique and denote it by fF, that is,

fF = argmin_{f ∈ F} R(f).
(3)
The third classifier we will talk about is the Bayes classifier fBayes introduced in Equation (2) above. This is the best classifier that exists at all. In the notation above we could also denote it by fFall (recall the notation Fall for the space of all functions). But as it is unknown to the learner, it might not be contained in the function space F under consideration, so it is very possible that R(fF) > R(fBayes). With the notation for these three classifiers fn, fF, and fBayes we can now define different types of convergence:

DEFINITION 1. Let (Xi, Yi)_{i∈ℕ} be an infinite sequence of training points which have been drawn independently from some probability distribution P. Let ℓ be a loss function. For each n ∈ ℕ, let fn be a classifier constructed by some learning algorithm on the basis of the first n training points.

1. The learning algorithm is called consistent with respect to F and P if the risk R(fn) converges in probability to the risk R(fF) of the best classifier in F, that is, for all ε > 0, P(R(fn) − R(fF) > ε) → 0 as n → ∞.

2. The learning algorithm is called Bayes-consistent with respect to P if the risk R(fn) converges in probability to the risk R(fBayes) of the Bayes classifier, that is, for all ε > 0, P(R(fn) − R(fBayes) > ε) → 0 as n → ∞.

3. The learning algorithm is called universally consistent with respect to F (resp. universally Bayes-consistent) if it is consistent with respect to F (resp. Bayes-consistent) for all probability distributions P.
Note that for simplicity in the following we often say “the classifier fn is consistent”, meaning that “the classification algorithm which, based on n samples, chooses fn as classifier is consistent”. Let us try to rephrase the meaning of those definitions in words. We start with Part 1 of the definition. The statement requests that the larger the sample size n gets, the closer the risk of the classifier fn should get to the risk of the best classifier fF in the space F. This should happen “with high probability”: Note that the risk R(fn ) is a random quantity, as it depends on the underlying sample. In rare circumstances (with probability < δ, where δ is supposed to be a small, positive number), it might be the case that we get mainly misleading sample points (for example, lots of points with “wrong” labels due to label noise). In those circumstances, it might happen that our classifier is not very good. However, in the vast majority of all cases (with probability ≥ 1 − δ), our training points will not be misleading, at least if we have many of them (n → ∞). Then the classifier fn picked by the algorithm will be close to the best classifier the algorithm could have picked at all, the classifier fF , that is R(fn ) − R(fF ) > ε only with small probability δ, where δ will converge to 0 as n → ∞. Part 2 of the definition is similar to Part 1, except that we now compare to the overall best classifier fBayes . The difference between those statements is obvious: Part 1 deals with the best the algorithm can do under the given circumstances (namely, in the function space F). Part 2 compares this to the overall best possible result. Traditionally, statistical learning theory often focuses on Part 1, but ultimately we will be more interested in Part 2. Both parts will be treated in the following sections. To understand Part 3, note the following. In the first two parts, consistency is defined for a fixed probability distribution P . 
This means that if the true underlying distribution is P, then the sequence of classifiers fn will converge to the correct result. However, the whole point of machine learning is that we do not know what the underlying distribution is. So it would make little sense if a learning algorithm were consistent for a certain distribution P, but inconsistent for some other distribution P′. Hence, we define a stronger notion of consistency, termed universal consistency. It states that no matter what the underlying distribution might be, we will always have consistency. A mathematical detail we would like to skim over is the exact type of convergence. For readers familiar with probability theory: consistency as stated above is called weak consistency, as it is a statement about convergence in probability; the analogous statement for almost sure convergence would be called strong consistency. For exact details see Section 6 in [Devroye et al., 1996]. There is one further important fact to note about these definitions: they never mention the empirical risk Remp(fn) of a classifier, but are only concerned with the true risk R(fn). On the one hand, it is clear why this is the case: our measure of the quality of a classifier is the true risk, and we want the true risk to become as small as possible. On the other hand, the empirical risk is our first and most important estimator of the true risk of a classifier. So it seems natural that in addition to the convergence of the true risk, such as R(fn) → R(fBayes), we also
request convergence of the empirical risk: Remp (fn ) → R(fBayes ). We will see below that such statements about the empirical risk are the most important steps to prove consistency in the standard approach to statistical learning theory. So even though we did not explicitly require convergence of the empirical risk, it usually comes out as a side result of consistency.
2.4 The bias-variance and estimation-approximation trade-off

The example illustrated in Figure 1 above already pointed out the problem of model complexity in an intuitive way: when is a model “simpler” than another one? Is it good that a model is simple? How simple? We have already stated above that the goal of classification is to achieve a risk as close as possible to that of the Bayes classifier. Could we just choose F as the space Fall of all functions, define the classifier fn := argmin_{f ∈ Fall} Remp(f), and obtain consistency? Unfortunately, the answer is no. In the sections below we will see that if we optimize over too large a function class F, and in particular if we make F so large that it contains the Bayes classifiers of all different probability distributions P, this leads to inconsistency. So if we want to learn successfully, we need to work with a smaller function class F. To investigate the competing properties of model complexity and generalization, we want to introduce a few notions which will be helpful later on. Recall the definitions of fn, fF and fBayes introduced above. We have seen that Bayes-consistency deals with the convergence of the term R(fn) − R(fBayes). Note that we can decompose this quantity in the following way:

R(fn) − R(fBayes) = [R(fn) − R(fF)] + [R(fF) − R(fBayes)]    (4)
The two terms on the right-hand side have particular names: the first is called the estimation error and the second the approximation error; see also Figure 2 for an illustration. The reasons for these names are as follows. The first term deals with the uncertainty introduced by the random sampling process: given the finite sample, we need to estimate the best function in F, and in this process we will make some (hopefully small) error. This error is called the estimation error. The second term is not influenced by any random quantities. It deals with the error we make by looking for the best function in a (small) function space F, rather than looking for the best function in the entire space Fall of all functions. The fundamental question in this context is how well functions in F can approximate functions in the space Fall of all functions. Hence the name approximation error. In statistics, the estimation error is also called the variance, and the approximation error the bias of an estimator. Originally, these terms were coined for the special situation of regression with squared error loss, but by now they are used in more general settings, like the one outlined above. The intuitive meaning
Figure 2. Illustration of estimation and approximation error: the Bayes classifier fBayes lies in the space Fall of all functions, while the small function space F used by the algorithm contains fF and fn.
is the same: the first term measures the variation of the risk of the function fn estimated on the sample, the second one measures the “bias” introduced in the model by choosing too small a function class. At this point, we would already like to point out that the space F is the means to balance the trade-off between estimation and approximation error; see Figure 3 for an illustration and Sections 4 and 5 for an in-depth discussion. If we choose a very large space F, then the approximation term will become small (the Bayes classifier might even be contained in F or can be approximated closely by some element in F). The estimation error, however, will be rather large in this case: the space F will contain complex functions which will lead to overfitting. The opposite effect will happen if the function class F is very small. In the following, we will deal with the estimation error and approximation error separately. We will see that they have rather different behavior and that different methods are needed to control both. Traditionally, SLT has a strong focus on the estimation error, which we will discuss in greater depth in Sections 4 and 5. The approximation error will be treated in Section 7.
Figure 3. The trade-off between estimation and approximation error. If the function space F used by the algorithm has a small complexity, then the estimation error is small, but the approximation error is large (underfitting). If the complexity of F is large, then the estimation error is large, while the approximation error is small (overfitting). The best overall risk is achieved for “moderate” complexity.
3 CONSISTENCY AND GENERALIZATION FOR THE K-NEAREST NEIGHBOR CLASSIFIER
For quite some time, until 1977, it was not known whether a universally consistent classifier exists. This question was answered positively by [Stone, 1977], who showed by an elegant proof that a particular classifier, the so-called k-nearest neighbor classifier, is universally consistent. As the k-nearest neighbor classifier is one of the simplest classifiers and is still widely used in practice, we would like to spend this section illustrating the notions introduced in the last section, such as generalization, overfitting, underfitting, and consistency, using the example of the k-nearest neighbor classifier. Assume we are given a sample of points and labels (X1, Y1), ..., (Xn, Yn) which live in some metric space. This means that we have some way of computing distances between points in this space. Very generally, the paradigm of learning is to assign “similar outputs to similar inputs”. That is, we believe that points which are “close” in the input space tend to have the same label in the output space. Note that if such a statement does not hold, learning becomes very difficult or even impossible. For successful learning, there needs to be some way to relate the labels of training points to those of test points, and this always involves some prior assumptions about relations between input points. The easiest such relation is a distance between points, but other ways of measuring similarity, such as “kernels”, exist and indeed form the basis of some of the most popular learning algorithms [Schölkopf and Smola, 2002]. So assume that there exists a distance function on the input space, that is, a
function d : X × X → ℝ which assigns a distance value d(X, X′) to each pair of points X, X′. Given some training points, we now want to predict a good label for a new test point X. A simple idea is to search for the training point Xi which has the smallest distance to X, and then give X the corresponding label Yi of that point. To define this more formally, denote by NN(X) the nearest neighbor of X among all training points, that is,

NN(X) = argmin_{X′ ∈ {X1, ..., Xn}} d(X, X′).

We can then define the classifier fn based on the sample of n points by fn(X) = Yi where Xi = NN(X). This classifier is also called the 1-nearest neighbor classifier (1NN classifier). A slightly more general version is the k-nearest neighbor classifier (kNN classifier). Instead of just looking at the closest training point, we consider the closest k training points and take the average over all their labels. That is, we define the k-nearest neighbors kNN(X) of X as the set of those k training points which are closest to X. Then we set the kNN classifier fn(X) =
+1 if ∑_{Xi ∈ kNN(X)} Yi > 0, and −1 otherwise.
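The majority-vote rule can be sketched in a few lines; the function names and the one-dimensional toy sample are our own illustration, not part of the text:

```python
import numpy as np

def knn_classify(x, X_train, Y_train, k=3):
    """k-nearest neighbor rule: majority vote over the labels of the
    k training points closest to x (choose k odd to avoid ties)."""
    distances = np.abs(np.asarray(X_train) - x)  # 1-d inputs for simplicity
    nearest = np.argsort(distances)[:k]
    vote = np.sum(np.asarray(Y_train)[nearest])
    return 1 if vote > 0 else -1

# Toy 1-d sample: negative labels on the left, positive on the right.
X_train = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
Y_train = [-1, -1, -1, 1, 1, 1]
print(knn_classify(0.25, X_train, Y_train, k=3))  # -> -1
print(knn_classify(0.75, X_train, Y_train, k=3))  # -> 1
```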
That is, we decide on the label of X by majority vote among the labels of the training points in the k-nearest neighborhood of X. To avoid ties one usually chooses k as an odd number. Let us first consider the simpler case of the 1-nearest neighbor classifier. Is this classifier Bayes-consistent? Unfortunately, the answer is no. To see this, we consider the following counter-example: Assume our data space is the real interval X = [0, 1]. As distribution P we choose the one that puts uniform weight on all points X ∈ [0, 1], and which assigns labels with noise such that P(Y = 1 | X = x) = 0.9 for all x ∈ X. That is, the “correct” label (the one assigned by the Bayes classifier) is +1 for all points x ∈ X. We have already mentioned this example when we introduced the Bayes classifier. In this case, the Bayes classifier is simply the function which outputs +1 for all points in X, and its Bayes risk with respect to the 0-1 loss is 0.1. Now let us investigate the behavior of the 1NN classifier in this case. When we draw some training points (Xi, Yi)_{i=1,...,n} according to the underlying distribution, they will be roughly “evenly spread” on the interval [0, 1]. On average, every 10th point will have training label −1; all others have training label +1. If we now consider the behavior of the 1NN classifier fn on this example, we can write its risk with respect to the 0-1 loss function as follows:
R(fn) = P(Y ≠ fn(X)) = P(Y = 1, fn(X) = −1) + P(Y = −1, fn(X) = 1) ≈ 0.9 · 0.1 + 0.1 · 0.9 = 2 · 0.1 · 0.9 = 0.18

The approximation sign ≈ is used because in this argument we suppress the variations introduced by the random sampling process, to keep things simple. We can see that the risk R(fn) of the classifier fn is approximately 0.18, independently of the sample size n, while the Bayes risk is 0.1. Hence, the 1NN classifier is not consistent, as R(fn) does not converge to R(fBayes). On the other hand, note that if we consider the 100-nearest neighbor classifier instead of the 1-nearest neighbor classifier in this example, we would make many fewer mistakes: it is just very unlikely to have a neighborhood of 100 points in which the majority vote of training labels is −1. Thus, the 100-nearest neighbor classifier, while still not being consistent, makes a smaller error than the 1-nearest neighbor classifier. The trick to achieve consistency is related to this observation. Essentially, one has to allow the size k of the neighborhood under consideration to grow with the sample size n. Formally, one can prove the following theorem:

THEOREM 2 [Stone, 1977]. Let fn be the k-nearest neighbor classifier constructed on n sample points. If n → ∞ and k → ∞ such that k/n → 0, then R(fn) → R(fBayes) for all probability distributions P. That is, the k-nearest neighbor classification rule is universally Bayes-consistent.

Essentially, this theorem tells us that if we choose the neighborhood parameter k such that it grows “slowly” with n, for example k ≈ log(n), then the kNN classification rule is universally Bayes-consistent. In the last sections we mentioned that the function class F from which the classifier is chosen is an important ingredient of statistical learning theory. In the case of the kNN classifier, this is not as obvious as it will be for the classifiers we are going to study in later sections.
Intuitively, one can say that for a fixed parameter k, the function class Fk is a space of piecewise constant functions. The larger k is, the larger the k-neighborhoods become, and thus the larger the regions on which the functions have to be constant. This means that for very large k, the function class Fk is rather “small” (the functions cannot “wiggle” very much). In the extreme case of k = n, the k-neighborhood simply includes all training points, so the kNN classifier cannot change its sign at all: it has to be constant on the whole input space X. In this case, the function class Fk contains only two elements: the function which is constantly +1 and the function which is constantly −1. On the other hand, if k is small, then Fk becomes rather large (the functions can change their labels often and abruptly). In the terminology explained in the last sections, one can say that if we choose k too small, then the function class overfits: this happens, for example, in the extreme case of the 1NN classifier. On the other hand, if k is too large, then the function class underfits: it simply does not contain functions which are able to model the training data. In this section we merely touched on the very basics of the kNN classifier and
did not elaborate on any proof techniques. For a very thorough treatment of theoretical aspects of kNN classifiers we recommend the monograph [Devroye et al., 1996].
4 EMPIRICAL RISK MINIMIZATION
In the previous section we encountered our first simple classifier: the kNN classifier. In this section we want to turn to a more powerful way of classifying data, the so-called empirical risk minimization principle. Recall the assumption that the data are generated iid (independent and identically distributed) from an unknown underlying distribution P(X, Y). As we have already seen, the learning problem consists in minimizing the risk (the expected loss on test data),

R(f) = E(ℓ(X, Y, f(X)))
(5)
where f is a function mapping the input space X into the label space Y, and ℓ is the loss function. The difficulty of this task stems from the fact that we are trying to minimize a quantity that we cannot actually evaluate: since we do not know the underlying probability distribution P, we cannot compute the risk R(f). What we do know, however, are the training data sampled from P. We can thus try to infer a function f from the training sample whose risk is close to the best possible risk. To this end, we need what is called an induction principle. Maybe the most straightforward way to proceed is to approximate the true risk by the empirical risk computed on the training data. Instead of looking for a function which minimizes the true risk R(f), we try to find the one which minimizes the empirical risk

Remp(f) = (1/n) ∑_{i=1}^{n} ℓ(Xi, Yi, f(Xi))
(6)
That is, given some training data (X1, Y1), ..., (Xn, Yn), a function space F to work with, and a loss function ℓ, we define the classifier fn as the function

fn := argmin_{f ∈ F} Remp(f).
This approach is called the empirical risk minimization induction principle, abbreviated by ERM. The motivation for this principle is given by the law of large numbers, as we will now explain.
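The ERM principle can be sketched for a hypothetical, very simple function class: threshold classifiers f_t(x) = +1 if x ≥ t, else −1, on the interval [0, 1]. The data-generating distribution and all names below are our own illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Sample from a distribution whose Bayes classifier is a threshold at 0.5.
n = 200
X = rng.uniform(size=n)
Y = np.where(X >= 0.5, 1, -1)

# Function class F: threshold classifiers f_t(x) = +1 if x >= t else -1.
thresholds = np.linspace(0, 1, 101)

def emp_risk(t):
    """Empirical 0-1 risk of the threshold classifier f_t on the sample."""
    predictions = np.where(X >= t, 1, -1)
    return float(np.mean(predictions != Y))

# ERM: pick the threshold with the smallest empirical risk.
risks = [emp_risk(t) for t in thresholds]
t_erm = thresholds[int(np.argmin(risks))]
print(t_erm, min(risks))
```

Since the labels here are noise-free, the empirical risk minimizer attains training error 0 with a threshold near the true value 0.5; the point of the sections below is to explain when such a minimizer also has a small true risk.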
4.1 The law of large numbers
The law of large numbers is one of the most important theorems in statistics. In its simplest form it states that under mild conditions, the mean of random variables
ξi which have been drawn iid from some probability distribution P converges to the mean of the underlying distribution itself as the sample size goes to infinity:

(1/n) ∑_{i=1}^{n} ξi → E(ξ) for n → ∞.
Here the notation assumes that the sequence ξ1, ξ2, ... has been sampled iid from P and that ξ is also distributed according to P. This theorem can be applied to the case of the empirical and the true risk. To see this, note that the empirical risk is defined as the mean of the loss ℓ(Xi, Yi, f(Xi)) on the individual sample points, and the true risk is the mean of this loss over the whole distribution. That is, from the law of large numbers we can conclude that for a fixed function f, the empirical risk converges to the true risk as the sample size goes to infinity:

Remp(f) = (1/n) ∑_{i=1}^{n} ℓ(Xi, Yi, f(Xi)) → E(ℓ(X, Y, f(X))) for n → ∞.
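A quick numerical sketch of the law of large numbers (with hypothetical Bernoulli variables of mean 0.3):

```python
import numpy as np

rng = np.random.default_rng(2)

# Bernoulli(0.3) variables: the running mean approaches E(xi) = 0.3.
xi = (rng.uniform(size=100000) < 0.3).astype(float)
running_mean = np.cumsum(xi) / np.arange(1, len(xi) + 1)
print(running_mean[99], running_mean[-1])  # the latter is close to 0.3
```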
Here the loss function ℓ(X, Y, f(X)) plays the role of the random variable ξ above. For a given, finite sample this means that we can approximate the true risk (the one we are interested in) very well by the empirical risk (the one we can compute on the sample). A famous inequality due to [Chernoff, 1952], later generalized by [Hoeffding, 1963], characterizes how well the empirical mean approximates the expected value. Namely, if the ξi are random variables which only take values in the interval [0, 1], then

P(|(1/n) ∑_{i=1}^{n} ξi − E(ξ)| ≥ ε) ≤ 2 exp(−2nε²).    (7)
This theorem states that the probability that the sample mean deviates by more than ε from the expected value of the distribution is bounded by a very small quantity, namely by 2 exp(−2nε²). Note that the larger n is, the smaller this quantity becomes, that is, the probability of large deviations decreases very fast with n. Again, we can apply this theorem to the setting of empirical and true risk. This leads to a bound which states how likely it is that, for a given function f, the empirical risk is close to the actual risk:

P(|Remp(f) − R(f)| ≥ ε) ≤ 2 exp(−2nε²).
(8)
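The bound can be checked numerically: for bounded variables we compare the empirical frequency of an ε-deviation against the quantity 2 exp(−2nε²). The sample size, ε, and distribution below are our own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(3)
n, eps, trials = 100, 0.1, 20000

# Variables uniform on [0, 1], so E(xi) = 0.5 and the bound applies.
samples = rng.uniform(size=(trials, n))
deviations = np.abs(samples.mean(axis=1) - 0.5)
empirical = float(np.mean(deviations >= eps))
bound = 2 * np.exp(-2 * n * eps ** 2)
print(empirical, bound)  # empirical frequency, well below the bound here
```

Note that the bound is distribution-free and therefore loose for any particular distribution; the observed deviation frequency is typically far below it.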
For any fixed function (and sufficiently large n), it is thus highly probable that the training error provides a good estimate of the test error. There are a few important facts concerning the Chernoff bound (8). First, a crucial property of the Chernoff bound is that it is probabilistic in nature. It states that the probability of a large deviation between the test error and the training error of f is small; the larger the sample size n, the smaller the probability. Hence, it does not rule out the presence of cases where the deviation is large, it just says
that for a fixed function f , this is very unlikely to happen. The reason why this has to be the case is the random generation of training points. It could be that in a few unlucky cases, our training data is so misleading that it is impossible to construct a good classifier based on it. However, as the sample size gets larger, such unlucky cases become very rare. In this sense, any consistency guarantee can only be of the form “the empirical risk is close to the true risk, with high probability”. At first sight, it seems that the Chernoff bound (8) is enough to prove consistency of empirical risk minimization. However, there is an important caveat: the Chernoff bound only holds for a fixed function f which does not depend on the training data. However, the classifier fn of course does depend on the training data (we used the training data to select fn ). While this seems like a subtle mathematical difference, this is where empirical risk minimization can go completely wrong. We will now discuss this problem in detail, and will then discuss how to adapt the strong law of large numbers to be able to deal with data-dependent functions.
4.2 Why empirical risk minimization can be inconsistent
Assume our underlying data space is X = [0, 1]. Choose the uniform distribution on X as probability distribution and define the label Y for an input point X deterministically as follows:

Y = −1 if X < 0.5, and Y = +1 if X ≥ 0.5.    (9)

Now assume we are given a set of training points (Xi, Yi)_{i=1,...,n}, and consider the following classifier:

fn(X) = Yi if X = Xi for some i = 1, ..., n, and fn(X) = 1 otherwise.    (10)

This classifier fn perfectly classifies all training points, that is, it has empirical risk Remp(fn) = 0. Consequently, as the empirical risk cannot become negative, fn is a minimizer of the empirical risk. However, fn clearly has not learned anything: the classifier just memorizes the training labels and otherwise simply predicts the label 1. Formally, this means that the classifier fn will not be consistent. To see this, suppose we are given a test point (X, Y) drawn from the underlying distribution. Usually, this test point will not be identical to any of the training points, and in this case the classifier simply predicts the label 1. If X happens to be larger than 0.5, this is the correct label, but if X < 0.5, it is the wrong label. Thus the classifier fn will err on half of all test points, that is, it has test error R(fn) = 1/2. This is the same error one would make by random guessing! In fact, this is a nice example of overfitting: the classifier fn fits the training data perfectly, but does not learn anything about new test data. It is easy to see that the classifier fn is inconsistent.
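The memorizing classifier (10) is easy to simulate; the sketch below (names our own) shows zero empirical risk together with a test error of roughly 1/2:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
X_train = rng.uniform(size=n)
Y_train = np.where(X_train >= 0.5, 1, -1)

def memorizer(x):
    """Predicts the stored label on training points, +1 everywhere else."""
    matches = np.flatnonzero(X_train == x)
    return int(Y_train[matches[0]]) if matches.size else 1

# Zero empirical risk: the classifier reproduces every training label.
train_errors = sum(memorizer(x) != y for x, y in zip(X_train, Y_train))

# But on fresh test points it errs on roughly half of them.
X_test = rng.uniform(size=10000)
Y_test = np.where(X_test >= 0.5, 1, -1)
test_risk = float(np.mean([memorizer(x) != y
                           for x, y in zip(X_test, Y_test)]))
print(train_errors, test_risk)  # -> 0 and roughly 0.5
```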
Ulrike von Luxburg and Bernhard Schölkopf
Just note that as the labels are a deterministic function of the input points, the Bayes classifier has risk 0. Hence R(fn) = 1/2 does not converge to R(fBayes) = 0.

We have constructed an example where empirical risk minimization fails miserably. Is there any way we can rescue the ERM principle? Luckily, the answer is yes. The main object we have to look at is the function class F from which we draw our classifier. If we allow our function class to contain functions which just memorize the training data, then the ERM principle cannot work. In particular, if we choose the empirical risk minimizer from the space Fall of all functions between X and Y, then the values of the function fn at the training points X1, . . . , Xn do not necessarily carry any information about the values at other points. Thus, unless we make restrictions on the space of functions from which we choose our estimate f, we cannot hope to learn anything. Consequently, machine learning research has studied various ways to implement such restrictions. In statistical learning theory, these restrictions are enforced by taking into account the capacity of the space of functions that the learning machine can implement.
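The failure of the memorizing classifier (10) is easy to reproduce numerically. The following sketch (variable names are ours) draws training points from the distribution defined in (9), builds the classifier that memorizes training labels and otherwise predicts 1, and checks that its empirical risk is 0 while its test error is close to 1/2:

```python
import random

def label(x):
    return -1 if x < 0.5 else 1

rng = random.Random(1)
n = 200
train_x = [rng.random() for _ in range(n)]
memory = {x: label(x) for x in train_x}  # memorize the training labels

def f_n(x):
    return memory.get(x, 1)  # predict 1 on every unseen point

emp_risk = sum(f_n(x) != label(x) for x in train_x) / n
test_x = [rng.random() for _ in range(100000)]
test_risk = sum(f_n(x) != label(x) for x in test_x) / len(test_x)
print(emp_risk, test_risk)  # 0.0 on the training set, roughly 0.5 on fresh data
```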
4.3 Uniform convergence

It turns out that the conditions required to render empirical risk minimization consistent involve restricting the set of admissible functions. The main insight of VC (Vapnik-Chervonenkis) theory is that the consistency of empirical risk minimization is determined by the worst case behavior over all functions f ∈ F that the learning machine could choose. We will see that instead of the standard law of large numbers introduced above, this worst case corresponds to a version of the law of large numbers which is uniform over all functions in F.

Figure 4 gives a simplified depiction of the uniform law of large numbers and the question of consistency. Both the empirical risk and the actual risk are plotted as functions of f. For simplicity, we have summarized all possible functions f by a single axis of the plot. Empirical risk minimization consists in picking the f that yields the minimal value of Remp. It is consistent if the minimum of Remp converges to that of R as the sample size increases.

One way to ensure the convergence of the minimum over all functions in F is uniform convergence over F: we require that for all functions f ∈ F, the difference between R(f) and Remp(f) becomes small simultaneously. That is, we require that there exists some large n such that for sample size at least n, we know for all functions f ∈ F that |R(f) − Remp(f)| is smaller than a given value ε. In the figure, this means that the two plots of R and Remp become so close that their distance is never larger than ε. Mathematically, the statement "|R(f) − Remp(f)| ≤ ε for all f ∈ F" can be expressed using a supremum (readers not familiar with the notion of "supremum" should think of a maximum instead):

sup_{f ∈ F} |R(f) − Remp(f)| ≤ ε.
Intuitively it is clear that if we know that for all functions f ∈ F the difference |R(f) − Remp(f)| is small, then this holds in particular for any function fn we might have chosen based on the given training data. That is, for any function f ∈ F we have

|R(f) − Remp(f)| ≤ sup_{f ∈ F} |R(f) − Remp(f)|.

Figure 4. Simplified depiction of the convergence of empirical risk to actual risk. The x-axis gives a one-dimensional representation of the function class F; the y-axis denotes the risk. For each fixed function f, the law of large numbers tells us that as the sample size goes to infinity, the empirical risk Remp(f) converges towards the true risk R(f) (indicated by the arrow). This does not imply, however, that in the limit of infinite sample sizes, the minimizer of the empirical risk, fn, will lead to a value of the risk that is as good as the risk of the best function fF in the function class. For the latter to be true, we require the convergence of Remp(f) towards R(f) to be uniform over all functions in F (from [Schölkopf and Smola, 2002]).
In particular, this also holds for a function fn which has been chosen based on a finite sample of training points. From this we can draw the following conclusion:

P(|R(fn) − Remp(fn)| ≥ ε) ≤ P(sup_{f ∈ F} |R(f) − Remp(f)| ≥ ε).   (11)

The quantity on the right hand side is now what the uniform law of large numbers deals with. We say that the law of large numbers holds uniformly over a function class F if for all ε > 0,

P(sup_{f ∈ F} |R(f) − Remp(f)| ≥ ε) → 0 as n → ∞.
One can now use (11) to show that if the uniform law of large numbers holds for some function class F, then empirical risk minimization is consistent with respect to F. To see this, consider the following derivation:

|R(fn) − R(fF)|
= R(fn) − R(fF)   (by definition of fF we know that R(fn) − R(fF) ≥ 0)
= R(fn) − Remp(fn) + Remp(fn) − Remp(fF) + Remp(fF) − R(fF)
≤ R(fn) − Remp(fn) + Remp(fF) − R(fF)   (note that Remp(fn) − Remp(fF) ≤ 0 by def. of fn)
≤ 2 sup_{f ∈ F} |R(f) − Remp(f)|

So we can conclude:

P(|R(fn) − R(fF)| ≥ ε) ≤ P(sup_{f ∈ F} |R(f) − Remp(f)| ≥ ε/2).
Under the uniform law of large numbers, the right hand side tends to 0, which then leads to consistency of ERM with respect to the underlying function class F. In other words, uniform convergence over F is a sufficient condition for consistency of empirical risk minimization over F. What about the other way round? Is uniform convergence also a necessary condition? Part of the elegance of VC theory lies in the fact that this is the case (see for example [Vapnik and Chervonenkis, 1971], [Mendelson, 2003], [Devroye et al., 1996]):

THEOREM 3 Vapnik and Chervonenkis. Uniform convergence

P(sup_{f ∈ F} |R(f) − Remp(f)| > ε) → 0 as n → ∞,   (12)

for all ε > 0, is a necessary and sufficient condition for consistency of empirical risk minimization with respect to F.

In Section 4.2 we gave an example where we considered the set of all possible functions, and showed that learning was impossible. The dependence of learning on the underlying set of functions has now returned in a different guise: the condition of uniform convergence crucially depends on the set of functions for which it must hold. Intuitively, it seems clear that the larger the function space F, the larger sup_{f ∈ F} |R(f) − Remp(f)|. Thus, the larger F, the larger P(sup_{f ∈ F} |R(f) − Remp(f)| > ε), and the more difficult it is to satisfy the uniform law of large numbers. That is, for larger function spaces F consistency is "harder" to achieve than for smaller function spaces.
This abstract characterization of consistency as a uniform convergence property, whilst theoretically intriguing, is not all that useful in practice. The reason is that it seems hard to find out whether the uniform law of large numbers holds for a particular function class F. Therefore, we next address whether there are properties of function spaces which ensure uniform convergence of risks.

5 CAPACITY CONCEPTS AND GENERALIZATION BOUNDS
So far, we have argued which property of the function space determines whether the principle of empirical risk minimization is consistent, i.e., whether it will work "in the limit." This property was referred to as uniform convergence. To make statements about what happens after seeing only finitely many data points (which in reality will always be the case), we need to take a closer look at this convergence. It will turn out that this will provide us with bounds on the risk, and it will also provide insight into which properties of function classes determine whether uniform convergence can take place. To this end, let us take a closer look at the subject of Theorem 3: the probability

P(sup_{f ∈ F} |R(f) − Remp(f)| > ε).   (13)
Two tricks are needed along the way: the union bound and the method of symmetrization by a ghost sample.
5.1 The union bound
The union bound is a simple but convenient tool to transform the standard law of large numbers for individual functions into a uniform law of large numbers over a set of finitely many functions. Suppose the set F consists of finitely many functions, that is F = {f1, f2, ..., fm}. Each of the functions fi ∈ F satisfies the standard law of large numbers in the form of the Chernoff bound, that is

P(|R(fi) − Remp(fi)| ≥ ε) ≤ 2 exp(−2nε²).   (14)
Now we want to transform these statements about the individual functions fi into a uniform law of large numbers. To this end, note that we can rewrite:

P(sup_{f ∈ F} |R(f) − Remp(f)| ≥ ε)
= P(|R(f1) − Remp(f1)| ≥ ε or |R(f2) − Remp(f2)| ≥ ε or ... or |R(fm) − Remp(fm)| ≥ ε)   (15)
≤ Σ_{i=1}^{m} P(|R(fi) − Remp(fi)| ≥ ε)
≤ 2m exp(−2nε²)   (16)
Let us go through these calculations step by step. The first equality comes from the way the supremum is defined. Namely, the supremum over certain expressions is larger than ε if at least one of the expressions is larger than ε, which leads to the statements with the "or" combinations. The next step uses a standard tool from probability theory, the union bound. The union bound states that the probability of a union of events (that is, events coupled with "or") is smaller than or equal to the sum of the individual probabilities. Finally, the last step is a simple application of the Chernoff bound of Eq. (14) to each of the terms in the sum.

From left to right, the statements in Eq. (16) show us how to convert the Chernoff bound for individual functions fi into a bound which is uniform over a finite number m of functions. As we can see, the difference between the Chernoff bound for the individual functions and the right hand side of (16) is just a factor m. If the function space F is fixed, this factor can be regarded as a constant, and the term 2m exp(−2nε²) still converges to 0 as n → ∞. Hence, the empirical risk converges to the true risk uniformly over F as n → ∞. That is, we have proved that empirical risk minimization over a finite set F of functions is consistent with respect to F.

We next describe a trick used by Vapnik and Chervonenkis to reduce the case of an infinite function class to the case of a finite one. It consists of introducing what is sometimes called a ghost sample. It will enable us to replace the factor m in (16) by more general capacity measures that can be computed for infinite function classes.
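As a sanity check, the uniform deviation over a small finite class can be simulated. The sketch below (the class, the distribution, and all parameters are our own illustrative choices) uses m = 9 threshold classifiers on Uniform[0, 1] with labels as in (9), and compares the frequency of a large uniform deviation with the union bound 2m exp(−2nε²):

```python
import math
import random

rng = random.Random(2)

def label(x):
    return -1 if x < 0.5 else 1

thresholds = [i / 10 for i in range(1, 10)]  # a finite class of m = 9 functions

def true_risk(t):
    # risk of x -> (-1 if x < t else 1) when the Bayes threshold is 0.5
    return abs(t - 0.5)

def emp_risk(t, xs):
    return sum(((-1 if x < t else 1) != label(x)) for x in xs) / len(xs)

n, eps, reps = 1000, 0.1, 1000
hits = 0
for _ in range(reps):
    xs = [rng.random() for _ in range(n)]
    sup_dev = max(abs(true_risk(t) - emp_risk(t, xs)) for t in thresholds)
    hits += sup_dev >= eps
observed = hits / reps
union_bound = 2 * len(thresholds) * math.exp(-2 * n * eps ** 2)
print(observed, union_bound)  # both are tiny for this n and eps
```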
5.2 Symmetrization

Symmetrization is an important technical step towards using capacity measures of function classes. Its main purpose is to replace the event sup_{f ∈ F} |R(f) − Remp(f)| by an alternative event which can be computed solely on a given sample. Assume we are given a sample (Xi, Yi)i=1,...,n. We now introduce a new sample called the ghost sample. This ghost sample is just another sample (X′i, Y′i)i=1,...,n which is also drawn iid from the underlying distribution and which is independent of the first sample. It is called a ghost sample because we do not need to physically draw this sample in practice. It is just a mathematical tool, that is, we play "as if we had a second sample". Of course, given the ghost sample we can also compute the empirical risk of a function with respect to this ghost sample, and we will denote this risk by R′emp(f). With the help of the ghost sample, one can now prove the following simple statement:

LEMMA 4 Vapnik and Chervonenkis. For nε² ≥ 2, we have

P(sup_{f ∈ F} |R(f) − Remp(f)| > ε) ≤ 2 P(sup_{f ∈ F} |Remp(f) − R′emp(f)| > ε/2).   (17)
Here, the first P refers to the distribution of an iid sample of size n, while the second one refers to the distribution of two samples of size n (the original sample and the ghost sample), that is the distribution of iid samples of size 2n. In the
latter case, Remp measures the empirical loss on the first half of the sample, and R′emp on the second half.

Although we do not prove this lemma, it should be fairly plausible: if the empirical risks on two independent n-samples are close to each other, then they should also be close to the true risk. This lemma is called the symmetrization lemma. Its name refers to the fact that we now look at an event which depends in a symmetric way on a sample, now of size 2n. The main purpose of this lemma is that the quantity R(f), which cannot be computed on a finite sample, has been replaced by the quantity R′emp(f), which can be computed on a finite sample.

Now let us explain what the symmetrization lemma is used for. In the last section we have seen how to bound the probability of uniform convergence (13) in terms of a probability of an event referring to a finite function class. The crucial observation is now that even if F contains infinitely many functions, the number of different ways it can classify a training set of n sample points is finite. Namely, on any given point of the training sample, a function can only take the values −1 or +1. On a sample of n points X1, . . . , Xn, a function can act in at most 2^n different ways: it can choose each Yi as −1 or +1. This has a very important consequence. Even if a function class F contains infinitely many functions, there are at most 2^n different ways those functions can classify the points of a finite sample of n points. This means that if we consider the term

sup_{f ∈ F} |Remp(f) − R′emp(f)|

then the supremum effectively only runs over a finite function class. To understand this, note that two functions f, g ∈ F which take the same values on the given sample have the same empirical risk, that is Remp(f) = Remp(g). The analogous statement holds for the ghost sample and R′emp. Hence, all functions f, g which coincide both on the sample and the ghost sample will lead to the same term |Remp(f) − R′emp(f)|. Thus, the only functions we need to consider to compute the supremum are the at most 2^{2n} functions we can obtain on sample and ghost sample together. Hence, we can replace the supremum over f ∈ F by the supremum over a finite function class with at most 2^{2n} functions. Note that this step is only possible due to the symmetrization lemma. If we had considered the term sup_{f ∈ F} |R(f) − Remp(f)|, the argument from above would not hold, as the value of R(f) depends not only on the values of f on the sample.

In the following we now want to show how the insights gained in the symmetrization step can be used to derive a first capacity measure of a function class.
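The symmetrization lemma can also be checked empirically. The sketch below (again with an illustrative threshold class and parameters of our own choosing; note that nε² = 2 here, the boundary of the lemma's condition) estimates both sides of (17) by Monte Carlo and verifies that the left-hand frequency does not exceed twice the right-hand one:

```python
import random

rng = random.Random(3)

def label(x):
    return -1 if x < 0.5 else 1

thresholds = [i / 10 for i in range(1, 10)]

def true_risk(t):
    return abs(t - 0.5)

def emp_risk(t, xs):
    return sum(((-1 if x < t else 1) != label(x)) for x in xs) / len(xs)

n, eps, reps = 50, 0.2, 2000
lhs_hits = rhs_hits = 0
for _ in range(reps):
    xs = [rng.random() for _ in range(n)]        # original sample
    xs_ghost = [rng.random() for _ in range(n)]  # independent ghost sample
    lhs_hits += max(abs(true_risk(t) - emp_risk(t, xs))
                    for t in thresholds) > eps
    rhs_hits += max(abs(emp_risk(t, xs) - emp_risk(t, xs_ghost))
                    for t in thresholds) > eps / 2
lhs, rhs = lhs_hits / reps, rhs_hits / reps
print(lhs, 2 * rhs)  # the left-hand frequency is far below twice the right-hand one
```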
5.3 The shattering coefficient
For the purpose of bounding (13), Lemma 4 implies that the function class F is effectively finite: restricted to the 2n points appearing on the right hand side of (17), it has at most 2^{2n} elements. This is because only the values of the functions
on the sample points and the ghost sample points count. The number of effectively different functions can be smaller than 2^{2n}, however. For example, it could be the case that F does not contain a single function which takes the value +1 on the first training point. We now want to formalize this. Let Zn := ((X1, Y1), . . . , (Xn, Yn)) be a given sample of size n. Denote by |F_{Zn}| the cardinality of F when restricted to {X1, . . . , Xn}, that is, the number of functions from F that can be distinguished by their values on {X1, . . . , Xn}. Let us, moreover, denote the maximal number of functions that can be distinguished in this way as N(F, n), where the maximum runs over all possible choices of samples, that is

N(F, n) = max{ |F_{Zn}| : X1, ..., Xn ∈ X }.

The quantity N(F, n) is referred to as the shattering coefficient of the function class F with respect to sample size n. It has a particularly simple interpretation: it is the number of different outputs (Y1, . . . , Yn) that the functions in F can achieve on samples of a given size n. In other words, it measures the number of ways in which the function space can separate the patterns into two classes. Whenever N(F, n) = 2^n, there exists a sample of size n on which all possible separations can be achieved by functions of the class F. If this is the case, the function space is said to shatter n points. Note that because of the maximum in the definition of N(F, n), shattering means that there exists a sample of n patterns which can be separated in all possible ways; it does not mean that this applies to all possible samples of n patterns.

The shattering coefficient is a capacity measure of a function class, that is, it measures the "size" of a function class in some particular way. Note that if a function class F contains very many functions, then the shattering coefficient tends to be larger than for a function class which contains only very few functions.
However, the shattering coefficient is more subtle than simply counting the number of functions in a class. It only counts the number of functions in relation to the samples we are interested in. The following section will now finally show how to use the shattering coefficient to derive a generalization bound for empirical risk minimization on infinite function classes F.
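For intuition, the shattering coefficient of a simple class can be counted by brute force. The sketch below (our own illustrative example, not from the text) enumerates the distinct labelings that threshold classifiers of the form x ↦ sign(x − t) induce on a sample of n points; there are only n + 1 of them, far fewer than 2^n:

```python
import random

def labelings(xs, thresholds):
    """Distinct label vectors that threshold classifiers induce on the sample."""
    return {tuple(1 if x >= t else -1 for x in xs) for t in thresholds}

rng = random.Random(4)
n = 10
xs = sorted(rng.random() for _ in range(n))
# candidate thresholds: below all points, between each adjacent pair, above all;
# any other threshold induces the same labeling as one of these
cand = [xs[0] - 1] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1]
distinct = labelings(xs, cand)
print(len(distinct), 2 ** n)  # n + 1 = 11 distinct labelings vs 2^10 = 1024
```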
5.4 Uniform convergence bounds

Given an arbitrary, possibly infinite function class, we now want to take a look at the right hand side of (17). We consider a sample of 2n points, that is a set Z2n, where we interpret the first n points as the original sample and the second n points as the ghost sample. The idea is now to replace the supremum over F by the supremum over the set F_{Z2n}, use that the set F_{Z2n} contains at most N(F, 2n) ≤ 2^{2n} different functions, then apply the union bound on this finite function set and then the Chernoff bound. This leads to a bound like (16), with N(F, 2n) playing the role of m. Essentially, those steps can be written down as follows:
P(sup_{f ∈ F} |R(f) − Remp(f)| > ε)
≤ 2 P(sup_{f ∈ F} |Remp(f) − R′emp(f)| > ε/2)   (due to symmetrization)
= 2 P(sup_{f ∈ F_{Z2n}} |Remp(f) − R′emp(f)| > ε/2)   (only functions in F_{Z2n} are important)
≤ 2 N(F, 2n) exp(−nε²/4)   (F_{Z2n} contains at most N(F, 2n) functions, independently of Z2n; use the union bound argument and Chernoff)

So all in all we see that

P(sup_{f ∈ F} |R(f) − Remp(f)| > ε) ≤ 2 N(F, 2n) exp(−nε²/4).   (18)
Now we can use this expression to draw conclusions about the consistency of empirical risk minimization. Namely, ERM is consistent for function class F if the right hand side of this expression converges to 0 as n → ∞. Let us look at a few examples. First of all, consider a case where the shattering coefficient N(F, 2n) is considerably smaller than 2^{2n}, say N(F, 2n) ≤ (2n)^k for some constant k (this means that the shattering coefficient grows polynomially in n). Plugging this into the right hand side of (18), we get

2 N(F, 2n) exp(−nε²/4) = 2 · (2n)^k · exp(−nε²/4) = 2 exp(k · log(2n) − nε²/4).

Here we can see that for every ε > 0, the whole expression converges to 0 as n → ∞. From this we can conclude that whenever the shattering coefficient N(F, 2n) only grows polynomially with n, then empirical risk minimization is consistent with respect to F. On the other hand, consider the case where we use the function class Fall. It is clear that this class can classify each sample in every possible way, that is N(F, 2n) = 2^{2n} for all values of n. Plugging this into the right hand side of (18), we get

2 N(F, 2n) exp(−nε²/4) = 2 · 2^{2n} · exp(−nε²/4) = 2 exp(n(2 log(2) − ε²/4)).

We can immediately see that this expression does not tend to 0 when n → ∞, that is we cannot conclude consistency for Fall. Note that we cannot directly
conclude that ERM using Fall is inconsistent, either. The reason is that (18) only gives an upper bound on P(sup_{f ∈ F} |R(f) − Remp(f)| > ε), that is, it only provides a sufficient condition for consistency, not a necessary one. However, with more effort one can also prove necessary statements. For example, a necessary and sufficient condition for consistency of ERM is that log N(F, n)/n → 0 (cf. [Mendelson, 2003]; for related theorems also see [Vapnik and Chervonenkis, 1971; Vapnik and Chervonenkis, 1981], or Section 12.4 of [Devroye et al., 1996]). The proof that this condition is necessary is more technical, and we omit it. In the case of the examples above, this condition immediately gives the desired results: if N(F, n) is polynomial, then log N(F, n)/n → 0. On the other hand, for Fall we always have N(F, n) = 2^n, thus log N(F, n)/n = n log(2)/n = log(2), which does not converge to 0. Thus, ERM using Fall is not consistent.
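The different behavior of the two cases is easy to tabulate. The following sketch (helper names are ours) evaluates the logarithm of the right hand side of (18) (working in log-space avoids numerical overflow in the 2^{2n} case) for a polynomially growing shattering coefficient and for Fall:

```python
import math

EPS = 0.1

def log_bound_poly(n, k=1, eps=EPS):
    # log of 2 * (2n)^k * exp(-n * eps^2 / 4);
    # e.g. the threshold class has N(F, 2n) = 2n + 1, i.e. k = 1
    return math.log(2) + k * math.log(2 * n) - n * eps ** 2 / 4

def log_bound_all(n, eps=EPS):
    # log of 2 * 2^(2n) * exp(-n * eps^2 / 4) for the class of all functions
    return math.log(2) + n * (2 * math.log(2) - eps ** 2 / 4)

for n in [100, 10000, 100000]:
    print(n, log_bound_poly(n), log_bound_all(n))
# the polynomial case tends to minus infinity (bound -> 0),
# the 2^(2n) case diverges (bound is vacuous)
```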
5.5 Generalization bounds

It is sometimes useful to rewrite (18) "the other way round". That is, instead of fixing ε and then computing the probability that the empirical risk deviates from the true risk by more than ε, we specify the probability with which we want the bound to hold, and then get a statement which tells us how close we can expect the risk to be to the empirical risk. This can be achieved by setting the right hand side of (18) equal to some δ > 0, and then solving for ε. As a result, we get the statement that with probability at least 1 − δ, any function f ∈ F satisfies

R(f) ≤ Remp(f) + sqrt( (4/n) (log(2 N(F, 2n)) − log(δ)) ).   (19)

In the same way as above, we can use this bound to derive consistency statements. For example, it is now obvious that empirical risk minimization is consistent for function class F if the term log(2 N(F, 2n))/n converges to 0 as n → ∞. Again, this is for example the case if N(F, 2n) only grows polynomially with n.

Note that the bound (19) holds for all functions f ∈ F. On the one hand, this is a strength of the bound, as it holds in particular for the function fn minimizing the empirical risk, which is what we wanted. Moreover, many learning machines do not truly minimize the empirical risk, and the bound thus holds for them, too. However, one can also interpret it as a weakness, since by taking into account more information about the function we are interested in, one could hope to get more accurate bounds.

Let us try to get an intuitive understanding of this bound. It tells us that if both Remp(f) and the square root term are small simultaneously, then we can guarantee that with high probability, the risk (i.e., the error on future points that we have not seen yet) will be small. This sounds like a surprising statement; however, it
does not claim anything impossible. If we use a function class with relatively small N(F, n), i.e., a function class which cannot "explain" many possible labelings, and then notice that using a function of this class we can nevertheless explain data sampled from the problem at hand, then it is likely that this is not a coincidence, and that we have captured some essential aspects of the problem. If, on the other hand, the problem is too hard to learn from the given amount of data, then we will find that in order to explain the data (i.e., to achieve a small Remp(f)), we need a function class which is so large that it can basically explain anything, and thus the square root term will be large.

Note, finally, that whether a problem is hard to learn is entirely determined by whether we can come up with a suitable function class, and thus by our prior knowledge of it. Even if the optimal function is subjectively very complex, if our function class contains that function, and few or no other functions, we are in an excellent position to learn.

There exists a large number of bounds similar to (18) and its alternative form (19). Differences occur in the constants, both in front of the exponential and in its exponent. The bounds also differ in the exponent of ε (see [Devroye et al., 1996; Vapnik, 1998] and references therein) and in the way they measure capacity. We will not elaborate on those issues.
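The inversion that leads to (19) can be verified numerically. The sketch below (with an illustrative finite class of m = 100 functions, so that N(F, 2n) ≤ m, and δ = 0.05, both our own choices) computes the square-root term and checks that plugging it back into the right hand side of (18) recovers exactly δ:

```python
import math

def conf_term(n, N, delta):
    # the square-root term of (19): sqrt((4/n) * (log(2N) - log(delta)))
    return math.sqrt(4 * (math.log(2 * N) - math.log(delta)) / n)

m, delta = 100, 0.05  # finite class: N(F, 2n) <= m
for n in [100, 1000, 10000]:
    eps = conf_term(n, m, delta)
    rhs = 2 * m * math.exp(-n * eps ** 2 / 4)  # right hand side of (18)
    print(n, eps, rhs)  # eps shrinks like 1/sqrt(n); rhs equals delta
```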
5.6 The VC dimension
Above we have formulated the generalization bounds in terms of the shattering coefficient N(F, n). The downside of these bounds is that the shattering coefficients are usually difficult to evaluate. However, there exists a large number of different capacity concepts, with different advantages and disadvantages. We now want to introduce the most well known one, the so-called VC dimension (named after Vapnik and Chervonenkis). Its main purpose is to characterize the growth behavior of the shattering coefficient using a single number.

We say that a sample Zn of size n is shattered by the function class F if the function class can realize any labeling on the given sample, that is |F_{Zn}| = 2^n. The VC dimension of F, denoted by VC(F), is now defined as the largest number n such that there exists a sample of size n which is shattered by F. Formally,

VC(F) = max{ n ∈ ℕ : |F_{Zn}| = 2^n for some Zn }.

If the maximum does not exist, the VC dimension is defined to be infinity. For many examples of function classes and their VC dimensions see for example [Kearns and Vazirani, 1994] or [Anthony and Biggs, 1992]. A beautiful combinatorial result, proved independently by several people ([Sauer, 1972], [Shelah, 1972], [Vapnik and Chervonenkis, 1971]), characterizes the growth behavior of the shattering coefficient and relates it to the VC dimension:

LEMMA 5 Vapnik, Chervonenkis, Sauer, Shelah. Let F be a function class with
finite VC dimension d. Then

N(F, n) ≤ Σ_{i=0}^{d} (n choose i)

for all n ∈ ℕ. In particular, for all n ≥ d we have

N(F, n) ≤ (en/d)^d.
The importance of this statement lies in the last fact. If n ≥ d, the shattering coefficient behaves like a polynomial function of the sample size n. This is a very remarkable result: once we know that the VC dimension of a function class F is finite, we already know that the shattering coefficients grow polynomially with n. By the results of the last section this implies the consistency of ERM. Note that we also have a statement in the other direction. If the VC dimension is infinite, then for each n there exists some sample which can be shattered by F, that is N(F, n) = 2^n. For this case we have already seen above that ERM is not consistent. Together, we obtain the following important characterization:

THEOREM 6. Empirical risk minimization is consistent with respect to F if and only if VC(F) is finite.

One important property to note about both the shattering coefficient and the VC dimension is that they do not depend on the underlying distribution P; they only depend on the function class F. On the one hand, this is an advantage, as all generalization bounds derived from those concepts apply to all possible probability distributions. On the other hand, one can also consider this a disadvantage, as the capacity concepts do not take into account particular properties of the distribution at hand. In this sense, those capacity concepts often lead to rather loose bounds.
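The VC dimension of a simple class can be determined by brute force. The following sketch (our own construction) considers classifiers that predict +1 inside an interval [a, b] and −1 outside; it verifies that two points can be shattered but three cannot (the labeling (+1, −1, +1) is unachievable), so this class has VC dimension 2, and that the number of achievable labelings on three points, 7, matches the Sauer-Shelah bound Σ_{i≤2} (3 choose i) = 7 exactly:

```python
def interval_labeling(xs, a, b):
    return tuple(1 if a <= x <= b else -1 for x in xs)

def achievable_labelings(xs):
    """All labelings realizable by indicator-of-an-interval classifiers on xs.

    Restricting endpoints to the sample points plus one point outside each end
    is enough: an interval's behavior on the sample only depends on where its
    endpoints fall relative to the sample points.
    """
    cand = sorted(xs)
    points = [cand[0] - 1] + cand + [cand[-1] + 1]
    return {interval_labeling(xs, a, b) for a in points for b in points}

def shatters(xs):
    return len(achievable_labelings(xs)) == 2 ** len(xs)

print(shatters([1.0, 2.0]))        # True: two points can be shattered
print(shatters([1.0, 2.0, 3.0]))   # False: (+1, -1, +1) is not achievable
```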
5.7 Rademacher complexity

A different concept to measure the capacity of a function space is the Rademacher complexity. In contrast to the shattering coefficient and the VC dimension, it does depend on the underlying probability distribution, and it usually leads to much sharper bounds than both of them. The Rademacher complexity is defined as follows. Let σ1, σ2, ... be independent random variables which attain the two values +1 and −1 with probability 0.5 each (such random variables are sometimes called Rademacher variables, hence the name "Rademacher complexity"). For example, they could be the results of repeatedly tossing a fair coin. We formally define the Rademacher complexity R(F) of a function space F as

R(F) := E sup_{f ∈ F} (1/n) Σ_{i=1}^{n} σi f(Xi).   (20)
This expression looks complicated, but it has a nice interpretation. For the moment, consider the values σi as fixed, and interpret σi as a label of the point Xi. As both σi and f(Xi) only take the values +1 or −1, the product σi f(Xi) takes the value +1 if σi = f(Xi), and −1 if σi ≠ f(Xi). As a consequence, the sum on the right hand side of Equation (20) will be large if the labels f(Xi) coincide with the labels σi on many data points. This means that the function f "fits well" to the labels σi: if the labels σi were the correct labels, f would have a small training error Remp. Now taking into account the supremum, we look not only at one function f, but at all functions f ∈ F. We can see that sup_{f ∈ F} Σ_{i=1}^{n} σi f(Xi) is large if there exists a function in F which fits well to the given sequence σi of labels. Finally, recall that the labels σi are supposed to be random variables. We can consider them as random labels on the data points Xi. As we take the expectation over both the data points and the random labels, the overall Rademacher complexity is high if the function space F is able to "fit well" to random labels. This intuition makes sense: a function space has to be pretty large to be able to fit all kinds of random labels on all kinds of data sets. In this sense, the Rademacher complexity measures how "complex" the function space is: the higher R(F), the larger the complexity of F.

From a mathematical point of view, the Rademacher complexity is convenient to work with. One can prove generalization bounds of the following form: with probability at least 1 − δ,

R(f) ≤ Remp(f) + 2 R(F) + sqrt( log(1/δ) / (2n) ).
Rademacher complexities have some advantages over the classical capacity concepts such as the VC dimension. Most notably, the bounds obtained by Rademacher complexities tend to be much sharper than the ones obtained by the classical tools. The proof techniques are different from the ones explained above, but we do not want to go into details here. For literature on Rademacher complexity bounds, see for example [Mendelson, 2003], [Bousquet et al., 2003] or [Boucheron et al., 2005] and references therein.
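The definition (20) can be approximated by Monte Carlo: draw random signs, compute the supremum correlation over the class, and average. The sketch below (our own setup) does this for the threshold class on a sorted sample of n points, whose n + 1 realizable labelings make the supremum easy to enumerate; for the class of all functions, by contrast, some function reproduces any sign pattern exactly, so the supremum is always 1:

```python
import random

rng = random.Random(5)
n, reps = 50, 2000

# labelings realizable by thresholds x -> sign(x - t) on a sorted sample of
# n points: -1 on the first k points, +1 on the rest, for k = 0, ..., n
# (the actual point positions do not matter, only their ordering)
labelings = [tuple(-1 if i < k else 1 for i in range(n)) for k in range(n + 1)]

def sup_correlation(sigmas):
    return max(sum(s * l for s, l in zip(sigmas, lab)) / n for lab in labelings)

est_thresh = 0.0
for _ in range(reps):
    sigmas = [rng.choice((-1, 1)) for _ in range(n)]
    est_thresh += sup_correlation(sigmas)
est_thresh /= reps
print(est_thresh)  # well below 1: thresholds cannot fit random labels well
```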
5.8 Large margin bounds
Finally, we would like to introduce another type of capacity measure of function classes which is more specialized than the general combinatorial quantities introduced above. Consider the special case where the data space consists of points in the two-dimensional space ℝ², and where we want to separate the classes by a straight line. Given a set of training points and a classifier fn which can perfectly separate them, we define the margin of the classifier fn as the smallest distance of any training point to the separating line fn (cf. Figure 5). Similarly, a margin can be defined for general linear classifiers in the space ℝ^d of arbitrary dimension d. It can be proved that the VC dimension of a class Fρ of linear functions which all
Figure 5. Margin of a linear classifier: the crosses depict training points with label +1, the circles training points with label −1. The straight line is the linear classifier fn, and the dashed line shows the margin. The width ρ of the margin is depicted by the arrow.

have margin at least ρ can essentially be bounded in terms of the ratio of the radius R of the smallest sphere enclosing the data points to the margin ρ, that is

VC(Fρ) ≤ min{ d, 4R²/ρ² } + 1

(cf. Theorem 5.1 in [Vapnik, 1995]). That is, the larger the margin ρ of the functions in the class Fρ, the smaller is its VC dimension. Thus, one can use the margin of classifiers as a capacity concept. One of the most well-known classifiers, the support vector machine (SVM), builds on this result. See [Schölkopf and Smola, 2002] for a comprehensive treatment. An example of a generalization bound involving the large margin is as follows (for a more precise statement see for example Theorem 7.3 in [Schölkopf and Smola, 2002]):

THEOREM 7 Large margin bound. Assume the data space lies inside a ball of radius R in ℝ^d. Consider the set Fρ of linear classifiers with margin at least ρ. Assume we are given n training examples. Denote by ν(f) the fraction of training examples which have margin smaller than ρ or which are wrongly classified by some classifier f ∈ Fρ. Then, with probability at least 1 − δ, the true error of any f ∈ Fρ can be bounded by

R(f) ≤ ν(f) + sqrt( (c/n) ( (R²/ρ²) log²(n) + log(1/δ) ) )
where c is some universal constant.
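The margin and the VC bound above can be illustrated numerically. The following sketch uses made-up data points; the particular classifier and the choice of centering the enclosing sphere at the origin are simplifying assumptions for illustration only.

```python
import numpy as np

def margin(w, b, X):
    """Smallest distance of any point in X to the hyperplane w·x + b = 0."""
    return np.min(np.abs(X @ w + b) / np.linalg.norm(w))

def vc_bound(R, rho, d):
    """Vapnik's bound VC(F_rho) <= min(d, 4R^2/rho^2) + 1 for linear
    classifiers with margin at least rho on data inside a sphere of radius R."""
    return min(d, 4 * R**2 / rho**2) + 1

# Invented 2-d sample, perfectly separated by the line x1 + x2 = 0
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
w, b = np.array([1.0, 1.0]), 0.0

rho = margin(w, b, X)                  # margin of this particular classifier
R = np.max(np.linalg.norm(X, axis=1))  # radius of a sphere (centered at 0) enclosing the data
print(rho, vc_bound(R, rho, d=2))
```

Note how the bound is capped by the dimension d: a large margin can only reduce the VC dimension below d + 1, never increase it.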
5.9 Other generalization bounds and capacity concepts

Above we have introduced a few capacity concepts for function classes, such as the shattering coefficient, the VC dimension, and the Rademacher complexity. In the
Statistical Learning Theory
literature, there exist many more capacity concepts, and introducing all of them is beyond the scope of this overview. However, we would like to point out the general form which most generalization bounds take. Usually, those bounds are composed of three different terms and have a form like: with probability at least 1 − δ,

R(f) ≤ Remp(f) + capacity(F) + confidence(δ).

That is, one can bound the true risk of a classifier by its empirical risk, some capacity term which in the simplest case only depends on the underlying function class, and a confidence term which depends on the probability with which the bound should hold. Note that by nature, all bounds of this form are worst case bounds: as the bound holds for all functions in the class F, the behavior of the bound is governed by the "worst" or "most badly behaved" function in the class. This point is often used to criticize this approach to statistical learning theory, as natural classifiers do not tend to pick the worst function in a class.
6 INCORPORATING KNOWLEDGE INTO THE BOUNDS
In the previous section we presented the standard approach to derive risk bounds using uniform laws of large numbers. The bounds are agnostic in the sense that they do not make any prior assumptions on the underlying distribution P . In many cases, however, it is desirable to be able to take into account some prior knowledge we might have about our problem. There are several good reasons for doing so. First, the bounds we considered above are worst case bounds over all possible distributions. That is, their behavior might be governed by completely unnatural distributions which would never occur in practice (such as distributions on Cantor sets, for example). In this sense, the bounds are overly pessimistic. As we often believe that “real” distributions have certain regularity aspects, we might improve the results obtained above by making use of this regularity. A more fundamental reason for incorporating prior knowledge is the no free lunch theorem, which will be discussed in Section 8. In a nutshell, the no free lunch theorem states that learning is impossible unless we make assumptions on the underlying distribution. Thus it seems a good idea to work out ways to state prior assumptions on distributions, and to build them into our learning framework. In this section we want to point out a few ways of doing so.
6.1 How to encode prior knowledge in the classical approach
Let us step back and try to figure out where prior knowledge enters a learning problem: 1. Via the formalization of the data space X, for example via its topology. We might construct a distance function or a similarity function which
tells us how "similar" different input values in X are. A learning algorithm then tries to assign similar output labels to similar input values. The topology of X is one of the most important places to encode prior knowledge, not only for theory but in particular for practical applications.

2. Via the elements of F. Here we encode our assumptions on how a useful classifier might look. While in theory this is an important way to encode prior assumptions, it does not play such a big role in practice. The reason is that it is not so easy to formulate prior assumptions in terms of the function space. In applications, people thus often use one out of a handful of standard function spaces, and do not consider this as the main place to encode prior assumptions. An exception is the family of Bayesian approaches, see below.

3. Via the loss function. This function encodes what the "goal" of learning should be. While the loss functions we mentioned above are pretty straightforward, there are many ways to come up with more elaborate loss functions. For example, we can weight errors on some data points more heavily than on others. Or we can weight different kinds of errors differently. For example, in many problems the "false positives" and "false negatives" have different costs associated with them. In the example of spam filtering, the cost of accidentally labeling a spam email as "not spam" is not so high (the user can simply delete this mail). However, the cost of labeling a "non-spam" email as "spam" can be very high (the mail might have contained important information for the user, but got deleted without the user's knowledge).

4. Via assumptions on the underlying probability distributions. So far, the classical approach presented above was agnostic in the sense that it did not make any assumptions on the underlying probability distribution. That is, no matter what probability distribution P generated our data, the generalization bounds from above apply.
Due to this property they are often called "worst case bounds", as they even hold for the worst and most unlikely distribution one can imagine. This worst case approach has often been criticized as being overly pessimistic. The accusation is that the behavior of the results is governed by artificial, pathological probability distributions which would never occur in practice, and thus the results do not have much meaning for the "real world". To circumvent this, one could make assumptions on the underlying probability distributions. For example, one could assume that the distribution is "nice" and "smooth", that the labels are not overly noisy, and so on. Such assumptions can lead to a dramatic improvement of the learning guarantees, as is discussed in Section 7.4 in connection with fast learning rates. On the other hand, making assumptions on the distributions contradicts the agnostic approach towards learning we wanted to take in the beginning. It
is often perceived as problematic to make such assumptions, because after all one does not know whether those assumptions hold for "real world" data, and guarantees proved under those assumptions could be misleading in cases where the assumptions are violated. In some sense, all the quantities mentioned above do enter the classical generalization bounds: the loss function is directly evaluated in the empirical risk of the bound, and the topology of X and the choice of F are both taken into account when computing the capacity measure of F (for example the margin of functions). However, the capacity measures are often perceived as a rather cumbersome and crude way to incorporate such assumptions, and the question is whether there are other ways to deal with knowledge. In this section we now want to focus on a few advanced techniques, all of which try to improve the framework above from different points of view. The focus of attention is on how to incorporate more "knowledge" into the bounds than just counting functions in F: either knowledge we have a priori, or knowledge we can obtain a posteriori.
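The cost-sensitive loss sketched in point 3 above (the spam example) can be written down in a few lines. The concrete cost values below are invented for illustration; only the asymmetry between the two error types matters.

```python
# Cost-sensitive 0-1 loss for spam filtering.  Convention (an assumption for
# this sketch): y = +1 means "spam", y = -1 means "not spam".  Labeling good
# mail as spam (a false positive) is penalized ten times more heavily than
# letting a spam mail through (a false negative); the values 10 and 1 are
# arbitrary illustration choices.

def asymmetric_loss(y_true, y_pred, cost_fp=10.0, cost_fn=1.0):
    if y_true == y_pred:
        return 0.0
    # predicted "spam" although the mail was not spam: a false positive
    return cost_fp if y_pred == +1 else cost_fn

def empirical_risk(labels, predictions):
    losses = [asymmetric_loss(y, p) for y, p in zip(labels, predictions)]
    return sum(losses) / len(losses)

labels      = [+1, +1, -1, -1]
predictions = [+1, -1, +1, -1]   # one false negative, one false positive
print(empirical_risk(labels, predictions))   # (0 + 1 + 10 + 0)/4 = 2.75
```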
6.2 PAC-Bayesian bounds
The classical SLT approach has two features which are important to point out. The capacity term in the generalization bound always involves some quantity which measures the “size” of the function space F. However, this quantity usually does not directly depend on the complexity of the individual functions in F, it rather counts how many functions there are in F. Moreover, the bounds do not contain any quantity which measures the complexity of an individual function f itself. In this sense, all functions in the function space F are treated the same: as long as they have the same training error, their bound on the generalization error is identical. No function is singled out in any special way. This can be seen as an advantage or a disadvantage. If we believe that all functions in F are similarly well suited to fit a certain problem, then it would not be helpful to introduce any “ordering” between them. However, often this is not the case. We already have some “prior knowledge” which we accumulated in the past. This knowledge might tell us that some functions f ∈ F are much more likely to be a good classifier than others. The Bayesian approach is one way to try to incorporate such prior knowledge into statistical inference. The general idea is to introduce some prior distribution π on the function space F. This prior distribution expresses our belief about how likely a certain function is to be a good classifier. The larger the value π(f ) is, the more confident we are that f might be a good function. The important point is that this prior will be chosen before we get access to the data. It should be selected only based on background information or prior experience. It turns out that this approach can be effectively combined with the classic SLT framework. For example, one can prove that for a finite function space F, with
probability at least 1 − δ,

R(f) ≤ Remp(f) + √( (log(1/π(f)) + log(1/δ)) / (2n) )
where π(f) denotes the value of the prior on f. This is the simplest PAC-Bayesian bound. The name comes from the fact that it combines the classical bounds (sometimes called PAC bounds, where "PAC" stands for "probably approximately correct") and the Bayesian framework. Note that the right hand side does not involve a capacity term for F, but instead "punishes" individual functions f according to their prior likelihood π(f). The more unlikely we believed f to be, the smaller π(f), and the larger the bound. This mechanism shows that of two functions with the same empirical risk on the training data, one prefers the one with the higher prior value π(f). For background reading on PAC-Bayesian bounds, see for example Section 6 of [Boucheron et al., 2005] and references therein.
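As a minimal numerical sketch of the bound above (with an invented prior and invented training errors), one can watch how a smaller prior value π(f) inflates the bound:

```python
import math

# Simplest PAC-Bayesian bound for a finite function class:
#   R(f) <= Remp(f) + sqrt((log(1/pi(f)) + log(1/delta)) / (2n)).
# All numeric values below are invented for illustration.

def pac_bayes_bound(remp, prior_f, n, delta=0.05):
    return remp + math.sqrt((math.log(1 / prior_f) + math.log(1 / delta)) / (2 * n))

n = 1000
# Two functions with identical empirical risk but different prior belief:
b_likely   = pac_bayes_bound(remp=0.10, prior_f=0.5,   n=n)   # we believed in f
b_unlikely = pac_bayes_bound(remp=0.10, prior_f=0.001, n=n)   # we did not
print(b_likely, b_unlikely)
```

With the same training error, the a priori more plausible function receives the smaller guarantee on its true risk.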
6.3 Luckiness approach
There is one important aspect of the bounds obtained in the classical approach (and in the PAC-Bayesian approach as well): all quantities in the capacity term of the bound have to be evaluated before getting access to the data. The mathematical reason for this is that capacity measures are not treated as random quantities; they are considered to be deterministic functions of a fixed function class. For example, the large margin bound in Section 5.8 has to be parsed as follows: "if we knew that the linear classifier we are going to construct has margin ρ, then the generalization bound of Theorem 7 would hold". From this statement, it becomes apparent that there is a problem: in practice, we do not know in advance whether a given classifier will have a certain margin; this will strongly depend on the data at hand. At first glance, this problem sounds superficial, and many people in the machine learning community are not even aware of it. It is closely related to the question of what prior assumptions we want to encode in the learning problem before seeing actual data, and which properties of a problem should only be evaluated a posteriori, that is, after seeing the data. While the PAC-Bayesian framework deals with a priori knowledge, we still need a framework to deal with "a posteriori" knowledge. This problem is tackled in the "luckiness framework", first introduced in [Shawe-Taylor et al., 1998] and then considerably extended in [Herbrich and Williamson, 2002]. On a high level, the idea is as follows. Instead of proving bounds where the capacity term just depends on the underlying function class, one would like to allow this capacity to depend on the actual data at hand. Some samples are considered to be very "lucky", while others are "unlucky". The rationale is that for some samples it is easy to decide on a good function from the given function class, while the same is difficult for other samples.
As a result, one can prove generalization bounds which do depend on the underlying sample in terms of its "luckiness value".
Figure 6. Luckiness approach

We would like to illustrate the basic idea with a small toy example. Suppose our data space X is just the interval [0, 1], and define the functions fa as

fa(x) = −1 if x ≤ a, +1 if x > a.

As function space we use the space F := {fa | a ∈ [0, 1]}. Moreover, assume that the true underlying distribution can be classified by a function in F with 0 error, that is, we do not have any label noise. Now consider two particular samples, see Figure 6 for an illustration. In the first sample, the two data points which are closest to the decision boundary have distance d1, which is fairly large compared to the corresponding distance d2 in the second sample. It is clear that the true decision boundary has to lie somewhere in the interval of length d1 (d2, respectively). The uncertainty about the position of the decision boundary is directly related to the lengths d1 and d2, respectively: we can be confident that the error on Sample 2 will be much smaller than the error on Sample 1. In this sense, Sample 2 is more "lucky" than Sample 1. Now consider two classifiers f1 and f2, constructed using Sample 1 and Sample 2 respectively, which both have 0 training error. Classic generalization bounds such as the ones in Section 5 assign exactly the same risk value to both classifiers: they have the same training error and use the same function space F. This is where the luckiness approach comes in. It assigns a "luckiness value" to all samples, and then derives a generalization bound which not only depends on the training error and the capacity of the underlying function class, but also on the luckiness value of the sample at hand. The technical details are fairly complicated, and we refrain from introducing them here. Let us just remark that in the example above, the luckiness approach would attribute a smaller risk to Sample 2 than to Sample 1. For more material on how this works exactly, see [Shawe-Taylor et al., 1998] and [Herbrich and Williamson, 2002].
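The toy example above can be replayed in code: for the threshold classifiers fa, the set of thresholds consistent with a perfectly separable sample is exactly the gap between the two classes, and its width plays the role of the (un)luckiness value. The concrete sample points below are invented for illustration.

```python
# Threshold classifiers f_a on [0, 1]: label -1 for x <= a, label +1 for x > a.
# For a separable sample, every threshold between the rightmost -1 point and
# the leftmost +1 point has zero training error, so the width of this gap
# measures our uncertainty about the true decision boundary.

def uncertainty_interval(xs, ys):
    """Width of the set of thresholds a consistent with the labeled sample."""
    left  = max(x for x, y in zip(xs, ys) if y == -1)  # rightmost -1 point
    right = min(x for x, y in zip(xs, ys) if y == +1)  # leftmost +1 point
    return right - left

# Sample 1: wide gap d1 (unlucky), Sample 2: narrow gap d2 (lucky)
d1 = uncertainty_interval([0.05, 0.1, 0.7, 0.8],        [-1, -1, +1, +1])
d2 = uncertainty_interval([0.05, 0.1, 0.45, 0.55, 0.8], [-1, -1, -1, +1, +1])
print(d1, d2)
```

Any threshold inside the gap fits the training data equally well, so the narrow gap of Sample 2 pins down the boundary much more precisely.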
Also note that similar approaches have been developed in classical statistics under the name "conditional confidence sets"; see [Kiefer, 1977].
7 THE APPROXIMATION ERROR AND BAYES CONSISTENCY
In the previous sections we have investigated the standard approach to bound the estimation error of a classifier. This is enough to achieve consistency with respect
to a given function class F. In this section, we now want to look at the missing piece towards Bayes-consistency: the approximation error. Recall that the estimation error was defined as R(fn ) − R(fF ) and the approximation error as R(fF ) − R(fBayes ), cf. Section 2.4. In order to achieve Bayes-consistency, both terms have to vanish when n → ∞. We have seen above that for the estimation error to converge to 0 we have to make sure that the function space F has a reasonably small capacity. But this now poses a problem for the approximation error: if the function space F has a small capacity, this means in particular that the space F is considerably smaller than the space Fall of all functions. Thus, if we fix the function class F and the Bayes classifier fBayes is not contained in F, then the approximation error might not be 0. There are only two ways to solve this problem. The first one is to make assumptions on the functional form of the Bayes classifier. If fBayes ∈ F for some known function space F with small capacity, we know that the approximation error is 0. In this case, Bayes-consistency reduces to consistency with respect to F, which can be achieved by the methods discussed in the last section. However, if we do not want to make assumptions on the Bayes classifier, then we have to choose a different construction.
7.1 Working with nested function spaces

In this construction, we will not only consider one function class F, but a sequence F1, F2, ... of function spaces. When constructing a classifier on n data points, we will do this based on the function space Fn. The trick is now that the space Fn should become more complex the larger the sample size n is. The standard construction is to choose the spaces Fn such that they form an increasing sequence of nested function spaces, that is F1 ⊂ F2 ⊂ F3 ⊂ .... The intuition is that we start with a simple function space and then slowly add more and more complex functions to the space. If we are now given a sample of n data points, we are going to pick our classifier from the space Fn. If we want this classifier to be Bayes-consistent, there are two things we need to ensure:

1. The estimation error has to converge to 0 as n → ∞. To this end, note that for each fixed n we can bound the estimation error by one of the methods of Section 5. This bound decreases as the sample size increases, but it increases as the complexity term increases. We now have to make sure that the overall estimation error is still decreasing, that is, the complexity term must not dominate the sample size term. To ensure this, we have to make sure that the complexity of Fn does not grow too fast as the sample size increases.

2. The approximation error has to converge to 0 as n → ∞. To this end, we need to ensure that eventually, for some large n, each function of Fall is either contained in Fn, or can be approximated by a function from Fn. We are going to discuss below how this can be achieved.
An example of how those two points can be stated in a formal way is the following theorem, adapted from Theorem 18.1 of [Devroye et al., 1996]:

THEOREM 8. Let F1, F2, ... be a sequence of function spaces, and consider the classifiers

fn = argmin_{f ∈ Fn} Remp(f).

Assume that for any distribution P the following two conditions are satisfied:

1. The VC dimensions of the spaces Fn satisfy VC(Fn) · log(n)/n → 0 as n → ∞.

2. R(fFn) → R(fBayes) as n → ∞.

Then the sequence of classifiers fn is Bayes-consistent.

Let us try to understand this theorem. We are given a sequence of increasing function spaces Fn. For each sample of size n, we pick the function in Fn which has the lowest empirical risk. This is our classifier fn. If we want this to be consistent, two conditions have to be satisfied. The first condition says that the complexity of the function classes, as measured by the VC dimension, has to grow slowly. For example, if we choose the function spaces Fn such that VC(Fn) ≈ n^α for some α ∈ (0, 1), then the first condition is satisfied because

VC(Fn) · (log n)/n ≈ n^α (log n)/n = (log n)/n^(1−α) → 0.

However, if we choose VC(Fn) ≈ n (that is, α = 1 in the above calculation), then this is no longer the case:

VC(Fn) · (log n)/n ≈ log n → ∞.

The second condition of the theorem simply states that the approximation error has to converge to 0, but the theorem does not give any insight into how to achieve this. But as we discussed above, it is clear that the latter can only be achieved for an increasing sequence of function classes.
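The first condition of Theorem 8 can also be checked numerically. The sketch below evaluates VC(Fn) · log(n)/n for the two growth regimes discussed above; the concrete values of n and α are arbitrary.

```python
import math

# Numeric illustration of the first condition of Theorem 8: if VC(F_n) grows
# like n^alpha with alpha in (0, 1), then VC(F_n) * log(n) / n -> 0, whereas
# for alpha = 1 the quantity equals log(n) and diverges.

def complexity_term(n, alpha):
    """VC(F_n) * log(n) / n under the growth assumption VC(F_n) = n^alpha."""
    return (n ** alpha) * math.log(n) / n

for n in (10**2, 10**4, 10**6):
    print(n, complexity_term(n, alpha=0.5), complexity_term(n, alpha=1.0))
```

For α = 0.5 the printed values shrink towards 0 as n grows, while for α = 1.0 they keep increasing.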
7.2 Regularization
An implicit way of working with nested function spaces is the principle of regularization. Instead of minimizing the empirical risk Remp(f) and then expressing the generalization ability of the resulting classifier fn using some capacity measure of the underlying function class F, one can pursue a more direct approach: one directly minimizes the so-called regularized risk

Rreg(f) = Remp(f) + λΩ(f).

Here, Ω(f) is the so-called regularizer. This regularizer is supposed to punish overly complex functions. For example, one often chooses a regularizer which
punishes functions with large fluctuations, that is one chooses Ω(f ) such that it is small for functions which vary slowly, and large for functions which fluctuate a lot. Or, as another example, for linear classifiers one can choose Ω(f ) as the inverse of the margin of a function (recall the definition of a margin in Section 5.8). The λ in the definition of the regularized risk is a trade-off constant. It “negotiates” between the importance of Remp (f ) and of Ω(f ). If λ is very large, we take the punishment induced by Ω(f ) very seriously, and might prefer functions with small Ω(f ) even if they have a high empirical risk. On the other hand, if λ is small, the influence of the punishment decreases, and we merely choose functions based on their empirical risks. The principle of regularization consists in choosing the classifier fn that minimizes the regularized risk Rreg . Many of the widely-used classifiers can be cast into the framework of regularization, for example the support vector machine (see [Sch¨ olkopf and Smola, 2002], for details). To prove Bayes-consistency of regularized classifiers one essentially proceeds as outlined in the subsection above: for some slowly increasing sequence ω1 , ω2 , ... we consider nested function spaces Fω1 , Fω2 , ... , where each Fωi contains all functions f with Ω(f ) ≤ ωi . Eventually, if i is very large, the space Fωi will approximate the space Fall of all functions. For consistency, one has to take the constant λ to 0 as n → ∞. This ensures that eventually, for large n we indeed are allowed to pick functions from a space close to Fall . On the other hand, the constant λ must not converge to 0 too fast, otherwise we will already start overfitting for small values of n (as with a small constant λ, one essentially ignores the regularizer and consequently performs something close to ERM over a very large set of functions). 
A paper which carefully goes through all those steps for the example of the support vector machine is [Steinwart, 2005]. Note that there is one important conceptual difference between empirical risk minimization and regularization. In regularization, we have a function Ω which measures the "complexity" of an individual function f. In ERM, on the other hand, we never look at complexities of individual functions, but only at the complexity of a function class. The latter, however, is more a measure of capacity, that is, a measure of the "number of functions" in F, and only indirectly a measure of how complex the individual functions in the class are. From an intuitive point of view, the first approach is often easier to grasp, as the complexity of an individual function is a more intuitive concept than the capacity of a function class.
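As an illustration of regularized risk minimization, the following sketch minimizes Remp(f) + λΩ(f) for a linear model, assuming (as one concrete and common choice, not the only one) the squared loss as a surrogate and the regularizer Ω(f) = ‖w‖². For this choice the minimizer has a closed form. The data are synthetic.

```python
import numpy as np

# Regularized least squares: minimizing (1/n) * ||Xw - y||^2 + lam * ||w||^2
# has the closed-form solution w = (X^T X + lam * n * I)^{-1} X^T y.

def regularized_fit(X, y, lam):
    n, d = X.shape
    return np.linalg.solve(X.T @ X + lam * n * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=50))  # labels driven by the first feature

w_small = regularized_fit(X, y, lam=1e-3)  # weak punishment: fits the data closely
w_large = regularized_fit(X, y, lam=1e3)   # strong punishment: w pushed towards 0
print(np.linalg.norm(w_small), np.linalg.norm(w_large))
```

Increasing λ shrinks the norm of the solution, which is exactly the "negotiation" between empirical risk and regularizer described above.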
7.3 Achieving zero approximation error

The theorem above shows the general principle of how we can achieve Bayes-consistency. However, the theorem simply postulated as its second condition that the approximation error should converge to 0. How can this be achieved in practice? It turns out that there are many situations where this is not so difficult. Essentially, we have to make sure that each function of Fall is either contained in Fn for some large n, or that it can be approximated arbitrarily well by a function from Fn. The
area in mathematics dealing with such kinds of problems is called approximation theory, but for learning theory purposes simple approximation results are often enough (for more sophisticated ones see for example [Cucker and Zhou, 2007]). The only technical problem we have to solve is that we need a statement of the following form: if two functions are "close" to each other, then their corresponding risk values are also "close". Such statements are often quite easy to obtain. For example, it is straightforward to see that if f is a binary classification function (i.e., with f(x) ∈ {±1}) and g is an arbitrary (measurable) function, and the L1-distance between f and g is less than δ, then so is their difference in 0-1-risk, i.e., P(f(x) ≠ sgn(g(x))) < δ. This means that in order to prove that the approximation error of a function space F is smaller than δ, we just have to know that every function in Fall can be approximated up to δ in the L1-norm by functions from F. Results of this kind are abundant in the mathematics literature. For example, if X is a bounded subset of the real numbers, it is well known that one can approximate any measurable function on this set arbitrarily well by a polynomial. Hence, we could choose the spaces Fn as spaces of polynomials with degree at most dn, where dn slowly grows with n. This is enough to guarantee convergence of the approximation error.
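The polynomial approximation argument can be illustrated numerically: the sketch below fits least-squares polynomials of growing degree to a made-up bounded target on [0, 1] and evaluates a discretized L1 error. The target function and the grid are illustration choices only.

```python
import numpy as np

# Approximate a bounded, measurable target on [0, 1] by least-squares
# polynomials of increasing degree and watch the (discretized) L1 error shrink.

x = np.linspace(0, 1, 2001)
target = np.sin(6 * x) + (x > 0.5)   # smooth part plus a jump

def l1_error(degree):
    poly = np.polynomial.Polynomial.fit(x, target, degree)  # least-squares fit
    return np.mean(np.abs(target - poly(x)))                # discretized L1 distance

errors = [l1_error(d) for d in (1, 5, 15)]
print(errors)
```

The error decreases as the degree grows, mirroring the role of the slowly growing degrees dn in the nested spaces Fn.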
7.4 Rates of convergence
We would like to point out one more fundamental difference between estimation and approximation error: the uniform rates of convergence. In a nutshell, a "rate of convergence" gives information about how "fast" a quantity converges. In our case, the rate says something about how large n has to be in order to ensure that an error is smaller than a certain quantity. Such a statement could be: in order to guarantee (with high probability) that the estimation error is smaller than 0.1 we need at least n = 1000 sample points. Usually it is the case that rates of convergence depend on many parameters and quantities of the underlying problem. In the learning setting, a particularly important question is whether the rate of convergence also depends on the underlying distribution. And this is where estimation error and approximation error differ. For the estimation error, there exist rates of convergence which hold independently of the underlying distribution P. This is important, as it tells us that we can give convergence guarantees even though we do not know the underlying distribution. For the approximation error, however, it is not possible to give rates of convergence which hold for all probability distributions P. This means that a statement like "with the nested function classes (Fn)n we need at least n = 1000 to achieve approximation error smaller than 0.01" could only be made if we knew the underlying probability distribution. One can prove that unless one makes further assumptions on the true distribution, for any fixed sequence (Fn)n of nested function spaces the rate of convergence of the approximation error can be arbitrarily slow. Even though we might know that it eventually converges to 0 and we obtain consistency, there is no way we could specify what "eventually" really means.
It is important to note that statements about rates of convergence strongly depend on the underlying assumptions. For example, above we have already pointed out that even under the assumption of independent sampling, no uniform rate of convergence for the approximation error exists. A similar situation occurs for the estimation error if we weaken the sampling assumptions. If we no longer require that the samples Xi be independent of each other, then the situation changes fundamentally. While for "nearly independent sampling" (such as stationary α-mixing processes) it might still be possible to recover results similar to the ones presented above, as soon as we leave this regime it can become impossible to achieve such results. (For the specialists: even in the case of stationary ergodicity, we can no longer achieve universal consistency, see [Nobel, 1999].) For more discussion see also [Steinwart et al., 2006] and references therein. On the other hand, if we strengthen our sampling assumptions, we can even improve the rates of convergence. If we assume that the data points are sampled independently, and if we make a few assumptions on the underlying distribution (in particular on the label noise), then the uniform rates of convergence of the estimation error can be improved dramatically. There is a whole branch of learning theory which deals with this phenomenon, usually called "fast rates". For an overview see Section 5.2 of [Boucheron et al., 2005]. Finally, note that both consistency and rates of convergence deal with the behavior of an algorithm as the sample size tends to infinity. Intuitively, consistency is a "worst case statement": it says that ultimately, an algorithm will give a correct solution and does not make systematic errors. Rates of convergence, on the other hand, make statements about how "well behaved" an algorithm can be. Depending on prior assumptions, one can compare different algorithms in terms of their rates of convergence.
This can lead to insights into which learning algorithms might be better suited to which situations, and it might help us to choose a particular algorithm for a certain application on which we have prior knowledge.
8 NO FREE LUNCH THEOREM
Up to now, we have reported many positive results in statistical learning theory. We have formalized the learning problem, defined the goal of learning (to minimize the risk), specified which properties a classifier should have (consistency), and devised a framework to investigate these properties in a fundamental way. Moreover, we have seen that there exist different ways to achieve consistent classifiers (k-nearest neighbor, empirical risk minimization), and of course there exist many more ways to achieve consistency. Now it is natural to ask which of the consistent classifiers is "the best" classifier. Let us first try to rephrase our question in a more formal way. Under the assumption of independent sampling from some underlying, but unknown probability distribution, is there a classifier which "on average over all probability distributions" achieves better results than any other classifier? Can we compare classifiers pairwise, that is, compare whether a classifier A is better than classifier
Figure 7. No free lunch theorem: Classifier 1 depicts a general purpose classifier. It performs moderately on all kinds of distributions. Classifier 2 depicts a more specialized classifier which has been tailored towards particular distributions. It behaves very well on those distributions, but worse on other distributions. According to the no free lunch theorem, the average performance of all classifiers over all distributions, that is, the area under the curves, is identical for all classifiers.
B, on average over all distributions? The reason why we consider statements "on average over all distributions" lies in the fact that we do not want to make any assumption on the underlying distribution. Thus it seems natural to study the behavior of classifiers on any possible distribution. Regrettably, those questions have negative answers, which are usually stated in the form of the so-called "no free lunch theorem". A general proof of this theorem appeared in [Wolpert and Macready, 1997] and [Wolpert, 2001]. A simpler and more accessible version for finite data spaces has been published by [Ho and Pepyne, 2002]. For versions with a focus on convergence rates, see Section 7 of [Devroye et al., 1996]. For ease of understanding, consider the following simplified situation: assume that our input space X only consists of a finite set of points, that is X = {x1, ..., xm} for some large number m. Now consider all possible ways to assign labels to those data points, that is, we consider all possible probability distributions on X. Given some small set of training points (Xi, Yi)i=1,...,n we use some fixed classification rule to construct a classifier on those points, say the kNN classifier. Now consider all points of X which have not been training points and call this set of points the test set. Of course there exists a label assignment P1 on the test set for which the classifier makes no error at all, namely the assignment which has been chosen by the classifier itself. But there also exists some label assignment P2 on which the classifier makes the largest possible error, namely the inverse of the assignment constructed by the classifier. In the same way, we can see that essentially for any given error R, we can construct a probability distribution on X such that fn has error R on the test set. The same reasoning applies to any other classifier (for more precise reasoning, cf. [Ho and Pepyne, 2002]).
Thus, averaged over all possible probability distributions on X, all classifiers fn will achieve the same test error: whenever there is a distribution where the classifier performs well, there is a corresponding “inverse” distribution on the test set, on which the classifier performs poorly. In particular, on average over all probability distributions, no classifier can be better than random guessing on the test set! This is a very strong result, and it touches the very foundations of machine learning. Does it in fact say that learning is impossible? Well, the answer is “it all depends on the underlying assumptions”. The crucial argument exploited above is that we take the average over all possible probability distributions, and that all probability distributions are considered to be “equally likely” (in Bayesian language, we choose a uniform prior over the finite set of distributions). Those distributions also include cases where the labels are assigned to points “without any system”. For example, somebody could construct a probability distribution over labels by tossing a coin for each data point and deciding on its true label by the outcome of the toss. It seems plausible that in such a scenario it does not help to know the labels on the training points — they are completely independent of the labels of all other points in the space. In such a scenario, learning is impossible. The only chance for learning is to exclude such artificial cases. We need to ensure that there is some inherent mechanism by which we can use the training labels to generalize successfully to test labels. Formally, this means that we have to restrict the space of probability distributions under consideration. Once we make such restrictions, the no free lunch theorem breaks down. Restrictions can come in various forms. For example, we could assume that the underlying distribution has a “nice density” and a “nice function η”.
Or we can assume that there is a distance function on the space and that the labels depend in some “continuous” way on the distance, that is, points which are close to each other tend to have similar labels. If we make such assumptions, it is possible to construct classifiers which exploit them (for example, the kNN classifier exploits distance structure). Those classifiers will then perform well on data sets for which the assumptions are satisfied. Of course, the no free lunch theorem still holds, which means that there will be some other data sets where such a classifier fails miserably. However, those will be data sets which come from distributions where the assumptions are grossly violated. And in those cases it makes complete sense that a classifier which relies on those assumptions does not stand a chance any more. The no free lunch theorem is often depicted by a simple figure, see Figure 7. The figure shows the performance of two different classifiers (where, intuitively, the performance of a classifier is high if it achieves close to the Bayes risk). The x-axis depicts the space of all probability distributions. Classifier 1 represents a general-purpose classifier. It performs moderately well on all kinds of distributions. Classifier 2 depicts a more specialized classifier which has been tailored towards particular distributions. It performs very well on those distributions, but worse on others. According to the no free lunch theorem, the average performance, that is, the area under the two curves, is the same for both classifiers.
Statistical Learning Theory
A question which often comes up in the context of no free lunch theorems is how those theorems fit together with the consistency theorems proved above. For example, we have seen in Section 3 that the k-nearest neighbor classifier is universally consistent, that is, it is consistent for any underlying probability distribution P. Is there a contradiction between the no free lunch theorem and the consistency statements? The solution to this apparent paradox lies in the fact that the consistency statements only treat the limit case n → ∞. In the example with the finite data space above, note that as soon as the sample size is so large that we have essentially sampled each point of the space at least once, a classifier which memorizes the training data will not make any mistakes any more. Similar, though somewhat more involved, statements also hold for infinite data spaces. Thus, no free lunch theorems make statements about some finite sample size n, whereas consistency considers the limit n → ∞. In the finite example above, note that unless we know the number m of points in the data space, there is no way we could give any finite sample guarantee for a classifier. If we have already seen half of the data points, then the classifier will perform better than if we have only seen 1/100 of all points. But of course, there is no way we can tell this from a finite sample. A formal way of stating this is as follows: THEOREM 9 Arbitrarily close to random guessing. Fix some ε > 0. For every n ∈ ℕ and every classifier fn there exists a distribution P with Bayes risk 0 such that the expected risk of fn is larger than 1/2 − ε. This theorem formally states what we have already hinted at above: we can always construct a distribution such that, based on a finite sample of fixed size, a given classification rule is no better than random guessing. This and several other versions of the no free lunch theorem can be found in Section 7 of [Devroye et al., 1996].
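The flavor of Theorem 9 can be illustrated numerically: fix a small sample size n on a large finite space with a deterministic labeling (so the Bayes risk is 0) that has no exploitable structure, and observe that a 1-nearest-neighbor rule stays close to random guessing. All sizes and the labeling below are illustrative choices.

```python
# Sketch of the "arbitrarily close to random guessing" phenomenon: a fixed,
# structureless, deterministic labeling on a large finite space defeats a
# 1-NN rule that only ever sees a small sample of size n.
import random

random.seed(2)
m, n = 1000, 10
truth = {x: random.choice([-1, 1]) for x in range(m)}   # deterministic labels

def nn_predict(x, sample):
    """Predict with the label of the nearest sampled point."""
    nearest = min(sample, key=lambda s: abs(s - x))
    return truth[nearest]

risks = []
for _ in range(300):
    sample = random.sample(range(m), n)                 # n labeled training points
    wrong = sum(nn_predict(x, sample) != truth[x]
                for x in range(m) if x not in sample)
    risks.append(wrong / (m - n))

print(sum(risks) / len(risks))   # near 1/2, although the Bayes risk is 0
```

With n fixed and m large, each test point's nearest sampled neighbor carries a label that is independent of the test point's own label, so the risk hovers around 1/2.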
The no free lunch theorem is one of the most important theorems in statistical learning. It tells us that in order to learn successfully, with guarantees on the behavior of the classifier, we need to make assumptions about the underlying distribution. This fits very nicely with the insights we gained in Section 5. There we have seen that in order to construct consistent classifiers, we need to make assumptions about the underlying space F of functions one uses. In practice, it makes sense to combine those statements: first restrict the space of probability distributions under consideration, and then use a small function class which is able to model the distributions in this restricted class.
9 MODEL BASED APPROACHES TO LEARNING

Above we have introduced the standard framework of statistical learning theory. It has been established as one of the main building blocks for analyzing machine learning problems and algorithms, but of course it is not the only approach for doing so. In this section we would like to mention a few other ways to view and analyze machine learning problems. In particular, we will focus on methods
which deviate from the model-free approach, that is, which do make assumptions about the underlying distribution.
9.1 The principle of minimum description length

The classical SLT looks at learning very much from the point of view of underlying function classes. The basic idea is that good learning guarantees can be obtained if one uses simple function classes. The simplicity of a function space can be measured by one of many capacity measures such as covering numbers, the VC dimension, Rademacher complexity, and so on. The minimum description length (MDL) approach is based on a different notion of “simplicity”. The concept of simplicity used in MDL is closely related to the literal meaning of the word: an object is called “simple” if it can be described by a “short description”, that is, if one only needs a small number of bits to describe the object. In the context of MDL, objects can be a sequence of data, a function, or a class of functions. As an example, consider the following function: “The function f : [0, 1] → {−1, +1} takes the value −1 on the interval [0, 0.3] and +1 on ]0.3, 1].” For a function of the class f : [0, 1] → {−1, +1}, this is a rather compact description. On the other hand, consider a function g : [0, 1] → {−1, +1} which takes the values +1 and −1 at random positions on [0, 1]. In order to describe this function, the best we can do is to compile a table of input and output values:

X      g(X)
0.01    1
0.02   -1
0.03   -1
0.04    1
...    ...
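The difference between the two functions can be made concrete with a toy encoder. The sketch below (sequence lengths and the encoding scheme are illustrative choices) run-length encodes tables of function values: the piecewise constant f compresses to two runs, while the random g hardly compresses at all.

```python
# Toy version of the "short description" idea: run-length encode a sequence of
# ±1 function values. A function that is constant on long stretches gets a
# short code; a function with coin-toss values barely compresses.
import random

def run_length_encode(values):
    """Encode a sequence as a list of [value, run length] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

n = 1000
# "Simple" function: -1 on the first 30% of the domain, +1 afterwards.
simple = [-1] * 300 + [1] * 700
# "Random" function: labels chosen by coin toss at every point.
random.seed(1)
noisy = [random.choice([-1, 1]) for _ in range(n)]

print(len(run_length_encode(simple)))  # 2 runs: a very short description
print(len(run_length_encode(noisy)))   # about n/2 runs: hardly any compression
```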
It is obvious that compared to the code for f, the code for g will be extremely long (even if we ignore the issue that a function on [0, 1] cannot simply be described by a table with countably many entries). The question is now how we can formulate such a concept of simplicity in mathematical terms. A good candidate in this respect is the theory of data coding and data compression. A naive way to encode a function (on a finite domain) is simply to provide a table of input and output values. This can be done for any function, and is considered our baseline code. Now, given a function f, we try to encode it more efficiently. To do this, we need to discover certain “regularities” in the function values. Such a regularity might be that 100 consecutive entries in the function table have the same value, say +1. Then, instead of writing +1 in 100 entries of the table, we can say something like: “the following 100 entries have value +1”. This is much shorter than the naive approach. The MDL framework tries to make use of such insights for learning purposes. The general intuition is that learning can be interpreted as finding regularities in the data. If we have to choose between two functions f and g which have a similar training error, then we should always prefer the function which can be described by a shorter code. This is an explicit way to follow Occam's razor, which in machine
learning is usually interpreted as “models should not be more complex than is necessary to explain the data”. Omitting all technical details, we would just like to point out one possible way to turn the MDL idea into a concrete algorithm for learning. Assume we are given a function space F and some training points. One can now try to pick the function f ∈ F which minimizes the following expression: L(f) + L(Training Data | f). Let us explain those two terms. L(f) stands for the length of the code needed to encode the function f in the given function class F. This is some absolute length which is only influenced by the function itself, and not by the data at hand. We do not go into the details of how such a code can be obtained, but the general idea is as described above: the shorter the code, the simpler the function. Note that in constructing this code, the function space F might also play a role. For example, one could just encode f by saying “take the 23rd function in space F”. In this code, the length L(f) would just depend on the ordering of the functions in F. If the function we are looking for occurs early in this ordering, it has a short code (“function 23”), but if it occurs late, it has a longer code (“function 4341134”). The term L(Training Data | f) denotes the length of the code needed to express the given training data with the help of the function f. The idea here is simple: if f fits the training data well, it is trivial to describe the training data. We can simply use the function f to compute all the labels of the training points. In this case, the code is very short: it just contains the instruction “apply f to the input to obtain the correct output”. On the other hand, if the function f does not fit the data so well, it will make some misclassifications on the training points. To recover the labels of the misclassified training points, we thus need to add some further information to the code.
If, say, training points X2, X7, and X14 are misclassified by f, the code for the data might now look as follows: “apply f to all input points to compute the labels. Then flip the labels of training points X2, X7, and X14”. It is clear that the more errors f makes on the given training points, the longer this code will get. Intuitively, the two terms L(f) and L(Training Data | f) play roles similar to some of the quantities in classical SLT. The term L(Training Data | f) corresponds to the training error the function f makes on the data. The term L(f) measures the complexity of the function f. In this sense, the sum of both looks familiar: we sum the training error and some complexity term. One difference to the classical SLT approach is that the complexity term is not only computed based on the underlying function class F, but can depend on the individual function f. The approach outlined above has often been criticized as rather arbitrary: a function that has a short description (small L(f)) under one encoding method may have a long description (large L(f)) under another. How should we decide which description method to use? There are various answers to this question. The most common one is the idea of universal coding. Here codes are associated with
classes of models rather than individual classifiers. There exist several ways to build universal codes. One of them is as follows. We decompose the function class F into subsets F1 ⊂ F2 ⊂ .... Then we encode the elements of each subset with a fixed-length code which assigns each member of Fi the same code length. For example, if Fi is finite and has N elements, one may encode each member f ∈ Fi using log N bits. Or, if the function class Fi is infinite, one can resort to concepts like the VC dimension to encode the functions in the class. As in the finite case, each element f ∈ Fi will be encoded with the same length, and this length turns out to be related to the VC dimension of Fi. Many other ways are possible. In general, we define the “coding complexity of Fi” as the smallest uniform code length one can achieve for encoding the functions f ∈ Fi. It is uniquely defined and is “universal” in the sense that it does not rely on any particular coding scheme. Given a function f ∈ Fi, we can now go ahead and encode our data in several steps: we first encode the index i of the function class, then use the uniform code described above to encode which element of Fi the function f is, and finally code the data with the help of f. The code length then becomes L(i) + L(f | f ∈ Fi) + L(Training Data | f). Note that for the first term L(i) one usually chooses some uniform code for the integers, which is possible as long as we are dealing with finitely many function classes Fi. Then this term is constant and does not play any role any more. The middle term L(f | f ∈ Fi) is identical for all f ∈ Fi. Hence, it does not distinguish between different functions within the class Fi, but only between functions which come from different classes Fi. Finally, the last term describes how well the given function f explains the given training data. The goal of this approach is now to choose the hypothesis f which minimizes this code length.
In this formulation one can see that the MDL approach is not that far from standard statistical learning theory approaches. For example, if we are just given one fixed function class F (and do not split it further into smaller sets Fi), the code length essentially depends on some complexity measure of F plus a term explaining how well f fits the given data. In this setting, one can prove learning bounds which show that MDL learns about as fast as the classical methods. One can also see that the approach is closely related to classical statistical learning approaches based on compression coefficients [Vapnik, 1995]. Moreover, in more advanced MDL approaches one may want to assign different code lengths to different elements of F. This approach is then closely related to the PAC-Bayesian approach (cf. Section 6.2). Finally, it is worth mentioning that under certain assumptions, MDL can be performed in a consistent way. That is, in the limit of infinitely many data points, the approach can find the correct model for the data. Of course there are many details to take care of in the MDL approach, but this goes beyond the scope of the current paper. We refer to the monographs [Grünwald, 2007] and [Rissanen, 2007] for a comprehensive treatment of the MDL approach and related concepts.
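Omitting all coding subtleties, the two-part selection rule L(f) + L(Training Data | f) can be sketched for a tiny class of threshold functions. The particular coding scheme below (a uniform code over the class, plus log2 of the number of points per flipped label) is an illustrative choice, not the canonical MDL code; the data are illustrative as well.

```python
# Minimal two-part MDL model selection sketch over a tiny function class of
# threshold functions f_t(x) = +1 if x >= t else -1. Each candidate is scored
# by L(f) + L(data | f): log2 |F| bits to name f, plus log2(number of points)
# bits per misclassified training point (the cost of naming the flipped labels).
import math

thresholds = [0.1 * i for i in range(11)]          # the function class F
data = [(0.05, -1), (0.15, -1), (0.25, -1), (0.40, 1),
        (0.55, 1), (0.70, 1), (0.85, 1), (0.95, 1)]

def f(t, x):
    return 1 if x >= t else -1

def description_length(t):
    l_f = math.log2(len(thresholds))               # bits to name f within F
    mistakes = sum(f(t, x) != y for x, y in data)
    l_data = mistakes * math.log2(len(data))       # bits to flip each wrong label
    return l_f + l_data

best = min(thresholds, key=description_length)
print(best)   # a threshold near 0.3 separates the data with zero extra bits
```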
9.2 Bayesian methods
The two traditional schools in statistics are the frequentist and the Bayesian one. Here we would like to briefly recall their approaches to inference and discuss their applications to machine learning. Before we do this, let us introduce the basic setting and some notation. Traditionally, statistical inference is performed in a model-based framework. As opposed to the agnostic approach taken in SLT, we assume that the underlying probability distribution comes from some particular class of probability distributions P. This class is usually indexed by one (or several) parameters, that is, it has the form P = {Pα | α ∈ A}. Here, A is a set of parameter values, and Pα a distribution. For example, one could consider the class of normal distributions, indexed by their means and variances. The standard frequentist approach to statistics. Here, the main goal is to infer, from some given sample of points, the correct parameter of the underlying distribution. Once the parameter is known, it is easy to perform tasks such as classification (for example by using the Bayes classifier corresponding to the estimated distribution). The most important quantity used in the frequentist approach to statistics is the likelihood of the parameter α, denoted by P(data | α). This term tells us the probability that the given sample is generated under the assumption that α is the correct parameter of the distribution. It is used as an indicator of how well the parameter α “fits” the data: if it is unlikely that the data occurs under the distribution Pα1, but more likely under Pα2, then one would tend to prefer parameter α2 over α1. This already hints at the way parameters are inferred from the data in the frequentist setting: the maximum likelihood (ML) approach. Here we choose the “best” parameter α̂ by maximizing the data likelihood, that is, α̂(data) = argmax_{α∈A} P(data | α),
where we have used the notation α̂(data) to indicate that α̂ depends on the data. It is important to note that the likelihood does not make any statement about the “probability that a parameter α is correct”. All inference about α happens indirectly, by estimating the probability of the data given a parameter. The same is also true if we want to make confidence statements. It is impossible to make a statement like “the probability that α is correct is larger than something”. First of all, we cannot even formulate such a statement mathematically: we simply do not have a probability distribution over parameters. If we want to make confidence statements in the traditional approach, this has to be done in a somewhat peculiar way. A confidence statement looks as follows: let I(data) ⊆ A be some interval (or, more generally, subset) of parameters which has been constructed from the data. For example, I could be a symmetric interval of a certain width around an estimated parameter value α̂(data), say I = [α̂(data) − c, α̂(data) + c] for some constant c. The set I is called a 95% confidence interval
if Pα(α ∈ I) ≥ 95%, which is shorthand for Pα( data : α ∈ [α̂(data) − c, α̂(data) + c] ) ≥ 95%. Again, it is important to point out that the random quantity in this statement is I, not α. The statement only says that if α happened to be the true parameter, and I is the confidence set we come up with when looking at the data, then in 95% of all samples the true parameter α will lie in I. It does not say that with 95% probability, α is the correct parameter! This is one of the reasons why the frequentist approach is sometimes perceived as unsatisfactory. It only provides a rather indirect way to perform inference, and confidence statements are hard to grasp. The Bayesian approach to statistics. The Bayesian framework is an elegant way to circumvent some of the problems of the frequentist one, in particular the indirect mechanism of inference provided by the ML framework. From a technical point of view, the main difference between the Bayesian and the frequentist approach is that the Bayesian approach introduces a “prior” distribution on the parameter space. That is, we define some distribution P(α) which for each parameter α encodes how likely we find it that this is a good parameter to describe our problem. The important point is that this prior distribution is defined before we get to see the data points from which we would like to learn. It should just encode our past experience or any other kind of prior knowledge we might have. Now assume we are given some data points. As in the frequentist approach, one can compute the likelihood term P(data | α). Combining it with the prior distribution P(α), we can compute the so-called posterior distribution P(α | data). Up to a normalizing constant independent of α, the posterior is given by the product of the prior and the likelihood term, that is P(α | data) ∝ P(α) P(data | α). (21) One can say that the posterior distribution arises by “updating” the prior belief using the data we actually have at hand.
As opposed to the frequentist approach, this posterior is now indeed interpreted as the probability that α is the correct parameter, given that we have observed the data and that the prior distribution is “correct”. Given the posterior distribution, there are two main principles for performing inference based on it. The first one is to use the maximum a posteriori (MAP) estimator: α̂ = argmax_{α∈A} P(α | data). (22)
Here we come up with one fixed value of α and can then use it to perform the inference we are interested in, much as we can use the ML estimator in the frequentist approach. The second, fully Bayesian way is to perform the inference based on every parameter α, but to weight each result by the posterior probability that α is correct.
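On a discrete grid of parameters, both the ML estimator and the MAP estimator of (22) reduce to a few lines. The sketch below (the Bernoulli data, the grid, and the prior are illustrative choices) contrasts the two: the prior pulls the MAP estimate away from the raw empirical frequency.

```python
# Discrete-grid sketch contrasting the frequentist ML estimate with the
# Bayesian MAP estimate (posterior ∝ prior × likelihood), for Bernoulli data.
data = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]             # 7 ones out of 10
grid = [i / 100 for i in range(1, 100)]            # candidate parameters α

def likelihood(alpha):
    """P(data | α) for independent Bernoulli observations."""
    p = 1.0
    for y in data:
        p *= alpha if y == 1 else 1 - alpha
    return p

# Frequentist: maximize the likelihood alone.
alpha_ml = max(grid, key=likelihood)

# Bayesian: a prior mildly favoring balanced coins; an unnormalized posterior
# suffices for the argmax in (22).
prior = {a: a * (1 - a) for a in grid}
alpha_map = max(grid, key=lambda a: prior[a] * likelihood(a))

print(alpha_ml)    # 0.7, the empirical frequency
print(alpha_map)   # below 0.7: the prior pulls the estimate toward 0.5
```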
What are the advantages and disadvantages of the Bayesian approach? One advantage is that the Bayesian approach leads to simple, intuitive statements about the results of a learning algorithm. As opposed to making complicated confidence statements like the ones in the standard SLT approach or the traditional frequentist approach to statistics, in the end one has statements like “with probability 95% we selected the correct parameter α”. This comes at a price, though. The most vehement objection to the Bayesian approach is the introduction of the prior itself. The prior does influence our results, and by selecting different priors one can obtain very different results on the same data set. It is sometimes stated that the influence of the prior is benign and that in the limit the prior is “washed out” by the data, as indicated by certain consistency results for the Bayesian approach [Berger, 1985]. However, this argument is somewhat misleading. On a finite data set, the whole point of the prior is that it should bias our inference towards solutions that we consider more likely. So it is perhaps appropriate to say that the Bayesian approach is a convenient method for updating the beliefs about solutions which we had before taking the data into account. Among Bayesian practitioners it is generally accepted that even though priors are “wrong”, most of the time they are quite useful, in that Bayesian averaging over parameters leads to good generalization behavior. One point in favor of working with prior distributions is that they are a nice tool for invoking assumptions about the underlying problem in a rather explicit way. In practice, whether or not we should apply Bayesian methods thus depends on whether we are able to encode our prior knowledge in the form of a distribution over solutions. On the other hand, SLT is arguably more explicit about model complexity, an issue which in the Bayesian framework is a little harder to spot.
As an example of how model complexity is dealt with in the Bayesian framework, consider the posterior distribution (21) and the MAP problem (22). We rewrite this problem by taking the negative logarithm, so that the product becomes a sum and the maximization problem becomes a minimization problem: argmin_{α∈A} ( − log P(α) − log P(data | α) ).
(Note that we are being a bit sloppy here: applying the logarithm can be interpreted as a re-parameterization of the problem, and one can show that MAP solutions are not invariant under such re-parameterizations.) Now we are given a sum of two terms which have an interpretation similar to the approaches we have seen so far. One could say that the second term, − log P(data | α), describes the model fit to the data, that is, it plays a role similar to the training error Remp in the standard SLT approach, or to the quantity L(data | f) in the minimum description length framework. Continuing with this analogy, the first term, − log P(α), would then correspond to the “complexity” of the model. But how is the prior probability related to complexity? The general idea is as follows. It is a simple fact from coding theory that there exist far fewer “simple” models than “complicated” models. For example, a “simple” model class described by one parameter α ∈
{1, 2, 3, ..., 100} contains 100 models, whereas a slightly less simple model class using two such parameters already contains 100² = 10000 models. That is, the number of models increases dramatically with the complexity of the model class. Now assume we want to assign a prior distribution to the parameters. For finite model classes as above, a prior distribution simply assigns a probability value to each of the models in the class, and those values have to sum to 1. In the case of the first model class mentioned above, we need to assign 100 numbers to the parameters α. If we do the same for the second class, we have 10000 probabilities to assign. Note that in order to make them sum to 1, the individual prior values per parameter tend to be much smaller in the second case than in the first, simply because the second case has many more parameters. On a higher level, the consequence is that prior probabilities assigned to elements of “complex” model classes tend to be much lower than prior probabilities for elements of “simple” model classes. Hence, the negative logarithm of the prior is high for complex model classes and low for simple ones. All in all, as in the standard SLT approach, the Bayesian framework implicitly deals with overfitting by looking at the trade-off between data fit and model complexity. The literature on general Bayesian statistics is huge. A classic is [Cox, 1961], which introduces the fundamental axioms that allow beliefs to be expressed using probability calculus. [Jaynes, 2003] puts those axioms to work and addresses many practical and philosophical concerns. Another general treatment of Bayesian statistics can be found in [O'Hagan, 1994]. A gentle introduction to Bayesian methods for machine learning can be found in [Tipping, 2003]; a complete monograph on machine learning with a strong focus on Bayesian methods is [Bishop, 2006].

10 THE VC DIMENSION, POPPER'S DIMENSION, AND THE NUMBER OF PARAMETERS
We have seen that for statistical learning theory, it is crucial that one controls the capacity of the class of hypotheses from which one chooses the solution of a learning process. The best known measure of capacity is a combinatorial quantity termed the VC dimension. [Corfield et al., 2005] have pointed out that the VC dimension is related to Popper’s notion of the dimension of a theory. Popper’s dimension of a theory is defined as follows (for a discussion of the terms involved, we refer to [Popper, 1959; Corfield et al., 2005]): If there exists, for a theory t, a field of singular (but not necessarily basic) statements such that, for some number d, the theory cannot be falsified by any d-tuple of the field, although it can be falsified by certain (d + 1)-tuples, then we call d the characteristic number of the theory with respect to that field. All statements of the field whose degree of composition is less than d, or equal to d, are then compatible with the theory, and permitted by it, irrespective of their content.
[Popper, 1959, p. 130]
[Corfield et al., 2005] argue that although this definition sounds similar to the VC dimension, there is a crucial difference which could either be attributed to an error on Popper's side, or to a difference between statistical learning and “active” learning: in Popper's definition, it is enough to find one (d + 1)-tuple that falsifies a theory. E.g., the “theory” that two classes of data can be separated by a hyperplane could be falsified by three collinear points labeled “+1”, “−1”, and “+1”. In the definition of the VC dimension, on the other hand, it is required that no (d + 1)-tuple of points can be shattered, i.e., for every (d + 1)-tuple of points there exists some labeling which falsifies the hypothesis class, the hyperplanes, say. The VC dimension of separating hyperplanes in ℝ^n is n + 1, while the Popper dimension of separating hyperplanes is always 2, independent of n. Whilst one could fix this by adding some noise to the points in the Popper dimension, thus ruling out the existence of non-generic configurations, it may be more interesting to relate the difference between the two definitions to the fact that for Popper, the scientist trying to perform induction is actively looking for points, or experiments, that might be able to falsify the current hypothesis, while Vapnik and Chervonenkis devised their capacity measure to characterize the generalization error of a learning procedure where the incoming measurement points are generated randomly according to a certain probability distribution. Interestingly, Popper also discusses the link between his capacity measure and the number of parameters in an algebraic characterization of a hypothesis class [Corfield et al., 2005], stating that “[...] the number of freely determinable parameters of a set of curves by which a theory is represented is characteristic for the degree of falsifiability [...]” (cited after [Corfield et al., 2005]).
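The claim that hyperplanes in ℝ^n have VC dimension n + 1 can be checked by brute force for n = 2. In the sketch below, the point configurations are illustrative and the separability test scans candidate directions normal to point differences, which suffices for these small configurations but is not a general-purpose algorithm: three points in general position are shattered, while the “XOR” labeling of four points cannot be realized by any line.

```python
# Toy check of the VC dimension of lines in the plane (n + 1 = 3): every
# labeling of three points in general position is linearly separable, while
# for four points the XOR labeling is not.
from itertools import product

def separable(points, labels):
    """True if some line strictly separates the +1 points from the -1 points."""
    pos = [p for p, y in zip(points, labels) if y == 1]
    neg = [p for p, y in zip(points, labels) if y == -1]
    if not pos or not neg:
        return True
    # Candidate normal directions: perpendiculars to all point differences.
    candidates = []
    for (x1, y1), (x2, y2) in [(a, b) for a in points for b in points if a != b]:
        dx, dy = x1 - x2, y1 - y2
        candidates += [(-dy, dx), (dy, -dx)]
    for wx, wy in candidates:
        proj = lambda pts: [wx * x + wy * y for x, y in pts]
        if max(proj(neg)) < min(proj(pos)) or max(proj(pos)) < min(proj(neg)):
            return True
    return False

three = [(0, 0), (1, 0), (0, 1)]
shattered = all(separable(three, labels) for labels in product([-1, 1], repeat=3))
print(shattered)                            # True: all 8 labelings are realizable

four = [(0, 0), (1, 0), (0, 1), (1, 1)]
print(separable(four, [1, -1, -1, 1]))      # False: the XOR labeling fails
```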
From the point of view of statistical learning theory, Popper here falls into the same trap as classical statistics sometimes does. In learning theory, it is not the number of parameters but the capacity which determines the generalization ability. Whilst the number of parameters sometimes coincides with, e.g., the VC dimension (the above-mentioned hyperplanes being an example), there are also important cases where this is not true. For instance, it has been pointed out that the class of thresholded sine waves on ℝ, parameterized by a single real frequency parameter, has infinite VC dimension [Vapnik, 1995]. We conclude with [Corfield et al., 2005] that Popper identified, at an astonishingly early point in time, some of the crucial aspects of what constitutes the capacity of a class of hypotheses. If he had had the full developments of statistical learning theory at his disposal, he might have been able to utilize them to address certain other shortcomings of his approach, in particular by using bounds of statistical learning theory to make statements about certain notions of reliability or generalization ability of theories.
11 CONCLUSION
At first glance, methods for machine learning are impressive in that they automatically extract certain types of “knowledge” from empirical data. The above discussion, however, has shown that in fact none of this knowledge is created from scratch. In the Bayesian view of machine learning, the data only serves to update one's prior — we start with a probability distribution over hypotheses, and end up with a somewhat different distribution that reflects what we have seen in between. For a subjective Bayesian, learning is thus nothing but an update of one's beliefs which is consistent with the rules of probability theory. Statements regarding how well the inferred solution works are generally not made, nor are they necessary — for an orthodox Bayesian. In the framework of statistical learning theory, on the other hand, we start with a class of hypotheses, and use the empirical data to select one hypothesis from the class. One can show that if the data generating mechanism is benign, then we can assert that the difference between the training error and the test error of a hypothesis from the class is small. “Benign” here can take different guises; typically it refers to the fact that there is a stationary probability law that independently generates all individual observations, but other assumptions (e.g., on properties of the law) can also be incorporated. The class of hypotheses plays a role analogous to the prior; however, it does not need to reflect one's beliefs. Rather, the statements that we obtain are conditional on that class, in the sense that if the class is bad (in the sense that the “true” function cannot be approximated within the class, or in the sense that there is no “true” function, e.g., because the data is completely random), then the result of our learning procedure will be unsatisfactory in that the upper bounds on the test error will be too large.
Typically, either the training error will be too large, or the confidence term, which depends on the capacity of the function class, will be too large. It is appealing, however, that statistical learning theory generally avoids metaphysical statements about aspects of the “true” underlying dependency, and instead achieves precision by referring to the difference between training and test error. While the above are the two main theoretical schools of machine learning, there are other variants, some of which we have briefly mentioned in this article. Importantly, none of them gets away without making assumptions, and learning is never a process that starts from a tabula rasa and automatically generates knowledge.
ACKNOWLEDGMENTS

We would like to thank Wil Braynen, David Corfield, Peter Gruenwald, Joaquin Quinonero Candela, Ingo Steinwart, and Bob Williamson for helpful comments on the manuscript.
Statistical Learning Theory
BIBLIOGRAPHY

[Anthony and Biggs, 1992] M. Anthony and N. Biggs. Computational Learning Theory. Cambridge University Press, 1992.
[Berger, 1985] J. O. Berger. Statistical Decision Theory and Bayesian Analysis. Springer Verlag, New York, 1985.
[Bishop, 2006] C. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[Boucheron et al., 2005] S. Boucheron, O. Bousquet, and G. Lugosi. Theory of classification: A survey of some recent advances. ESAIM: Probability and Statistics, 9:323–375, 2005.
[Bousquet et al., 2003] O. Bousquet, S. Boucheron, and G. Lugosi. Introduction to statistical learning theory. In O. Bousquet, U. von Luxburg, and G. Rätsch, editors, Advanced Lectures on Machine Learning, pages 169–207. Springer, Berlin, 2003.
[Chernoff, 1952] H. Chernoff. A measure of asymptotic efficiency of tests of a hypothesis based on the sum of observations. Annals of Mathematical Statistics, 23:493–507, 1952.
[Corfield et al., 2005] D. Corfield, B. Schölkopf, and V. Vapnik. Popper, falsification, and the VC-dimension. Technical Report TR-145, Max Planck Institute for Biological Cybernetics, 2005.
[Cox, 1961] R. T. Cox. The Algebra of Probable Inference. Johns Hopkins University Press, Baltimore, 1961.
[Cucker and Zhou, 2007] F. Cucker and D.-X. Zhou. Learning Theory: An Approximation Theory Viewpoint. Cambridge University Press, 2007.
[Devroye et al., 1996] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, New York, 1996.
[Grünwald, 2007] P. Grünwald. The Minimum Description Length Principle. MIT Press, Cambridge, MA, 2007.
[Herbrich and Williamson, 2002] R. Herbrich and R. C. Williamson. Learning and generalization: Theoretical bounds. In Michael Arbib, editor, Handbook of Brain Theory and Neural Networks, 2002.
[Ho and Pepyne, 2002] Y. C. Ho and D. L. Pepyne. Simple explanation of the no-free-lunch theorem and its implications. Journal of Optimization Theory and Applications, 115(3):549–570, 2002.
[Hoeffding, 1963] W. Hoeffding. Probability inequalities for sums of bounded random variables. J. Amer. Statist. Assoc., 58:13–30, 1963.
[Jaynes, 2003] E. T. Jaynes. Probability Theory: The Logic of Science. Cambridge University Press, Cambridge, 2003.
[Kearns and Vazirani, 1994] M. Kearns and U. Vazirani. An Introduction to Computational Learning Theory. MIT Press, Cambridge, MA, 1994.
[Kiefer, 1977] J. Kiefer. Conditional confidence statements and confidence estimators. Journal of the American Statistical Association, 72(360):789–808, 1977.
[Mendelson, 2003] S. Mendelson. A few notes on statistical learning theory. In Advanced Lectures on Machine Learning, volume LNCS 2600, pages 1–40. Springer, 2003.
[Nobel, 1999] A. Nobel. Limits to classification and regression estimation from ergodic processes. The Annals of Statistics, 27(1):262–273, 1999.
[O’Hagan, 1994] A. O’Hagan. Bayesian Inference, volume 2B of Kendall’s Advanced Theory of Statistics. Arnold, London, 1994.
[Popper, 1959] K. Popper. The Logic of Scientific Discovery. Hutchinson, 1959. (Translation of Logik der Forschung, 1934.)
[Rissanen, 2007] J. Rissanen. Information and Complexity in Statistical Modeling. Springer, New York, 2007.
[Sauer, 1972] N. Sauer. On the density of families of sets. J. Combinatorial Theory (A), 13:145–147, 1972.
[Schölkopf and Smola, 2002] B. Schölkopf and A. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.
[Shawe-Taylor et al., 1998] J. Shawe-Taylor, P. Bartlett, R. Williamson, and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5):1926–1940, 1998.
[Shelah, 1972] S. Shelah. A combinatorial problem: stability and orders for models and theories in infinitary languages. Pacific Journal of Mathematics, 41:247–261, 1972.
[Steinwart et al., 2006] I. Steinwart, D. Hush, and C. Scovel. Learning from dependent observations. Technical Report LA-UR-06-3507, Los Alamos National Laboratory, 2006.
[Steinwart, 2005] I. Steinwart. Consistency of support vector machines and other regularized kernel classifiers. IEEE Transactions on Information Theory, 51(1):128–142, 2005.
[Stone, 1977] C. J. Stone. Consistent nonparametric regression (with discussion). Annals of Statistics, 5:595–645, 1977.
[Tipping, 2003] M. Tipping. Bayesian inference: An introduction to principles and practice in machine learning. In O. Bousquet, U. von Luxburg, and G. Rätsch, editors, Advanced Lectures on Machine Learning, pages 41–62. Springer, 2003.
[Vapnik and Chervonenkis, 1971] V. Vapnik and A. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16:264–280, 1971.
[Vapnik and Chervonenkis, 1981] V. Vapnik and A. Chervonenkis. The necessary and sufficient conditions for the uniform convergence of averages to their expected values. Teoriya Veroyatnostei i Ee Primeneniya, 26(3):543–564, 1981.
[Vapnik, 1995] V. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag, New York, 1995.
[Vapnik, 1998] V. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.
[Wolpert and Macready, 1997] D. Wolpert and W. G. Macready. No free lunch theorems for optimization. IEEE Trans. Evolutionary Computation, 1(1):67–82, 1997.
[Wolpert, 2001] D. Wolpert. The supervised learning no-free-lunch theorems. In Proc. 6th Online World Conf. on Soft Computing in Industrial Applications, 2001.
FORMAL LEARNING THEORY IN CONTEXT

Daniel Osherson and Scott Weinstein

INTRODUCTION

One version of the problem of induction is how to justify hypotheses in the face of data. Why advance hypothesis A rather than B — or in a probabilistic context, why attach greater probability to A than B? If the data arrive as a stream of observations (distributed through time) then the problem is to justify the associated stream of hypotheses. Several perspectives on this problem have been developed, including Bayesianism [Howson and Urbach, 1993] and belief-updating [Hansson, 1999]. These are broad families of approaches; the citations are meant just as portals. Another approach is to attempt to justify the present choice of hypothesis by situating it in a strategy with good long-term prospects. Such is the idea behind Formal Learning Theory, which will be discussed in what follows. We’ll see that it is naturally grouped with the somewhat older concept of a confidence interval.

THE CHARACTER OF FORMAL LEARNING THEORY

Formal Learning Theory is a collection of theorems about games of the following character.

Players: You (the reader) and us (the authors).
Game pieces: The set {0, 1, . . .} of natural numbers, denoted N.
Your goal: Guess which subset of N we have in mind.
Our goal: Pick a subset of N that you’ll fail to guess.
Rules: First, all parties agree to a family C of nonempty subsets of N that are legal choices. We then pick nonempty S ∈ C and an ω-sequence e that orders S.1 We reveal e to you one member at a time. At stage n of this process (that is, once you’ve seen e0, e1 · · · en), you announce a guess Tn about the identity of S.
Who wins: If your guess Tn = S for cofinitely many n then you win.2 Otherwise, we win.

1 The ω-sequence e is just a total function from N onto S. It includes every member of S and no more (repetitions allowed).
2 In other words, you win just in case there are only finitely many n ∈ N such that Tn ≠ S.
Handbook of the History of Logic. Volume 10: Inductive Logic. Volume editors: Dov M. Gabbay, Stephan Hartmann and John Woods. General editors: Dov M. Gabbay and John Woods. © 2011 Elsevier B.V. All rights reserved.
Come on. Let’s play. Take C 1 = {N − {x} | x ∈ N}. This is the family of subsets of N that are missing just one number, e.g., the set of positive integers. We’ve chosen S ∈ C 1 and also an ω-ordering e on S. We’ll now begin to reveal e. You must guess after each number. e0 = 2. Go ahead and guess. e1 = 0. Go ahead. e2 = 1. Guess. e3 = 4. Again. The game never stops so we interrupt it in order to make an observation, namely: There is a winning strategy available to you. If at each stage you guess N − {x0 }, where x0 is the least number not yet revealed then you are sure to win no matter which S ∈ C 1 and which ω-sequence e for S we choose. This is easy to verify. On the other hand, suppose that N is added to the game so that our legal choices at the start are C 2 = {N} ∪ {N − {x} | x ∈ N}. Then it can be demonstrated that there is no winning strategy available to you. To clarify this claim, let Σ be the set of finite sequences of natural numbers (conceived as potential data available at any finite stage of the game). Call a mapping of Σ into C 2 a strategy. For example, the strategy described above is the mapping from σ ∈ Σ to N − {x0 } where x0 is the least number that does not appear in σ. This strategy is not a guaranteed winner for C 2 since it fails if we choose N at the outset. You would then keep changing hypotheses forever, never settling down to N on any ω-sequence over N we use to generate data. More generally, it can be shown [Jain et al., 1999] that in our expanded game no strategy is a guaranteed winner; every strategy can be led to failure by some choice of S ∈ C 2 and some ω-sequence e for S. Our games look a little like scientific inquiry. Nature chooses a reality S from a collection C that is constrained by established theory. The data are revealed to the scientist in some order e. Success consists in ultimately stabilizing on S. 
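The text contains no code, but the winning strategy for the C 1 game is concrete enough to simulate. The following sketch is our own illustration (the function names are ours, not the authors’): guess N − {x0}, where x0 is the least number not yet revealed.

```python
# Illustrative sketch of the winning strategy for the C1 game: the hidden
# set is N - {hole}, and at each stage we guess N - {least unseen number}.

def least_missing(seen):
    """Least natural number not in `seen`."""
    n = 0
    while n in seen:
        n += 1
    return n

def play(stream, rounds):
    """Feed `rounds` items of `stream` (an enumeration of N - {hole}) to the
    strategy; return the sequence of guessed holes."""
    seen, guesses = set(), []
    for i in range(rounds):
        seen.add(stream[i])
        guesses.append(least_missing(seen))  # guess N - {least unseen}
    return guesses

# Enumerate N - {5} in increasing order; once 0..4 have appeared, the guess
# locks onto the true hole 5 and never changes again.
hole = 5
stream = [n for n in range(100) if n != hole]
guesses = play(stream, 50)
print(guesses[-1])  # 5
```

Once every number below the hole has been revealed, the guess stabilizes to the true hole, which is exactly the cofinite agreement the game demands. (A sufficiently adversarial enumeration merely delays, but cannot prevent, this stabilization.)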
More realism comes from limiting members of C to effectively enumerable subsets of N, named via the programs that enumerate them. Scientists can then be interpreted as computable functions from data to such names. In the same spirit, data-acquisition may be rendered a less passive affair by allowing the scientist to query Nature about particular members of N. Also, the success criterion can be relaxed or tightened, the data can be partially corrupted in various ways, the computational power of the scientist can be bounded, efficient inquiry can be required, scientists can be allowed to work in teams, and so forth, for a great variety of paradigms that have been analyzed. In each case, at issue is the existence of strategies that guarantee success. [Jain et al., 1999] offer a survey of results. Instead of subsets of N, the objects of inquiry can be grammars over a formal language (as in [Gold, 1967]), relational structures [Martin and Osherson, 1998], or arbitrary collections of data-streams [Kelly, 1996]. These developments often appear under the rubric Formal Learning Theory.
The entire field stems from three remarkable papers. [Putnam, 1979] introduced the idea of a computable strategy for converting data into conjectures about a hidden recursive function (the data are increasing initial segments of the function’s graph). He proved the non-existence of strategies that guarantee success, and contrasted his results with the ambitious goals for inductive logic announced in [Carnap, 1950]. For example, Putnam demonstrated that no recursive function T extrapolates every recursive, zero-one valued function. T extrapolates such a function b just in case for cofinitely many n, T(b[n]) = bn+1. Here, b[n] is the initial segment of b of length n and bn+1 is the next value. Putnam deploys a diagonal argument to establish this limitative result. In particular, given a recursive extrapolator T, define a recursive function b by course-of-values recursion as follows: bn+1 = 1 − T(b[n]). It is clear that T fails to correctly extrapolate the next value of b on any initial segment whatsoever. On the other hand, as Putnam observes, given any recursive extrapolator T and any recursive function b, there is a recursive extrapolator T′ which correctly extrapolates every function T does and b as well. In particular, for every natural number n and every finite sequence s of length n, we let

T′(s) = bn+1 if s = b[n], and T′(s) = T(s) otherwise.

It is clear that T′ is recursive, and extrapolates b along with all the functions T extrapolates. The relevance of Formal Learning Theory to the acquisition of language by infants was revealed by [Gold, 1967]. Among other facts, Gold demonstrated that no strategy guarantees success in stabilizing to an arbitrarily chosen finite-state (“Type 3”) grammar on the basis of a presentation of its strings. It follows immediately that the same is true for all levels of the Chomsky hierarchy [Chomsky, 1959] of increasingly inclusive formal grammars.
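Putnam’s diagonal construction above is concrete enough to execute. The following sketch is our own illustration (majority_T is merely an arbitrary example of an extrapolator, not from the text): it builds the diagonal function b against a given extrapolator and confirms that the extrapolator errs at every stage.

```python
# Putnam's diagonal argument, sketched: for ANY extrapolator T (a function
# from a finite 0-1 sequence to a predicted next value), the function b
# defined by b_{n+1} = 1 - T(b[n]) defeats T on every initial segment.

def diagonal(T, length):
    """Build the first `length` values of the diagonal function b."""
    b = []
    for _ in range(length):
        b.append(1 - T(tuple(b)))  # next value is whatever T did NOT predict
    return b

# An arbitrary example extrapolator: predict the majority bit seen so far.
def majority_T(segment):
    return 1 if sum(segment) * 2 > len(segment) else 0

b = diagonal(majority_T, 20)
# By construction, T mispredicts at every stage.
errors = sum(majority_T(tuple(b[:n])) != b[n] for n in range(20))
print(errors)  # 20
```

Nothing about majority_T matters here: swap in any other 0-1 valued predictor and the count of errors stays maximal, which is the whole force of the diagonal.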
In particular, Gold showed that no strategy is successful on a collection that includes an infinite language I = {w0, w1, . . .} and all of its finite subsets. For, suppose we are given a strategy L that succeeds on each of the finite languages Jn = {w0, . . . , wn}. We may then construct an enumeration e of I on which L fails to stabilize. Our construction proceeds by stages; at each stage n, we specify a finite initial segment sn of e. We begin the construction by letting s0 be the empty sequence. At each stage n + 1 we recursively specify a finite sequence sn+1 with the following properties:

• sn+1 extends sn,
• every item in sn+1 is a member of Jn,
• wn appears in sn+1, and
• L(sn+1) is a grammar for Jn.
Suppose we have constructed sn and we are at stage n + 1. For m > 0, let rm be the finite sequence obtained by extending sn with a sequence of wn’s of length m. Since L succeeds on Jn, there is m > 0 such that L(rm) is a grammar for Jn. We let sn+1 be rm for the least such m. Now, if we let en be the nth entry of sn+1, then it is not hard to verify that e is an enumeration of I on which L fails to stabilize. Finally, [Blum and Blum, 1975] introduced new techniques to prove unexpected theorems about paradigms close to Putnam’s and Gold’s. The concepts presented in their work were subsequently mined by a variety of investigators. Among their discoveries is the surprising fact that there is a collection F of total recursive 0-1 valued functions such that a computable strategy can achieve Gold-style success on F, but no computable strategy can successfully extrapolate all functions in F.3 An example of such a collection is the set of “self-describing” 0-1 valued total recursive functions. A total recursive function b is self-describing just in case the least n such that b(n) = 1 is the index of a Turing machine that computes b. Gold-style success on a total recursive function b consists in stabilization to an index for (a Turing machine that computes) b when presented with an enumeration of the argument–value pairs (0, b(0)), (1, b(1)), . . .. It is easy to see that there is a computable strategy guaranteeing Gold-style success on every self-describing total recursive function. It is a remarkable consequence of Kleene’s Recursion Theorem4 that for every total recursive 0-1 valued function a, there is a self-describing total recursive 0-1 valued function b such that a and b differ on only finitely many arguments.
It follows at once that there is no recursively enumerable set of natural numbers X such that

• for every self-describing total recursive 0-1 valued function b, there is an n ∈ X such that n is an index for b, and
• for every n ∈ X, n is an index for a total recursive function.

On the other hand, [Blum and Blum, 1975] show that if F is a collection of total recursive functions and there is a computable strategy that successfully extrapolates every function in F, then there is a recursively enumerable set of natural numbers X such that

• for every b ∈ F, there is an n ∈ X such that n is an index for b, and
• for every n ∈ X, n is an index for a total recursive function.

It follows immediately that the self-describing functions can be learned Gold-style by a single recursive strategy but not extrapolated by any single recursive function. Rather than present Formal Learning Theory in further detail, we rely on the examples given above to communicate its flavor. They illustrate a fundamental feature of virtually all paradigms embraced by the theory. Even when success can be guaranteed, at no stage do the data imply the correctness of the latest

3 This result was also obtained independently by [Barzdin and Freivalds, 1972].
4 See [Rogers, 1967] for a proof and applications of this theorem.
hypothesis. In the case of C 1, for every data-set e0 · · · en and every hypothesis T that could be issued in response, there is T′ ∈ C 1 distinct from T and an ω-sequence e′ for T′ that begins with e0 · · · en. Intuitively, your current data don’t exclude the possibility that the “hole” in N occurs beyond the one cited in your current hypothesis. In this sense, Formal Learning Theory concerns methods for arriving at truth non-demonstratively, and thus belongs to Inductive Logic. Because the data never imply the correctness of the current conjecture, a successful guessing strategy warrants confidence in the strategy but not in any hypothesis produced by it. Using the rule described earlier, for example, you are justified in expecting to stabilize to the correct member of C 1. But Formal Learning Theory offers no warrant for ever suspecting that stabilization is underway. There might be external reasons for such a feeling, e.g., information about an upper bound on the “hole” that your opponent (in this case, the authors) is likely to choose. But information of this kind is foreign to Formal Learning Theory, which only offers various kinds of reliable methods for some games — along with proofs of the nonexistence of reliable methods for other games. To underline the separation between long-term performance and warrant for individual conjectures, consider the following guessing policy for C 1.

• On the first datum, guess N − {0}.
• Never change an hypothesis unless it is contradicted by your data.
• When contradiction arrives, if the number of times you’ve changed hypotheses is even, guess N − {x0} where x0 is the least number not yet encountered; otherwise, guess N − {x1} where x1 is the second least number not yet encountered.

The new and old guessing policies enjoy the same guarantee of correct stabilization in the C 1 game.
But if the common guarantee provided warrant for the conjectures of one strategy, it would seem to provide equal warrant for the conjectures of the other, yet the conjectures will often be different! Indeed, for any potential conjecture at any stage of the game, there is a guessing strategy with guaranteed correct stabilization that issues the conjecture in question. They can’t all be warranted. The remaining discussion compares Formal Learning Theory to the statistical theory of confidence intervals initiated by [Neyman, 1937] before the advent of Formal Learning Theory. (See [Salsburg, 2002, Ch. 12] for the history of Neyman’s idea.) To begin, we rehearse well-known arguments that confidence intervals offer global performance guarantees without provision for evaluating specific hypotheses — in much the sense just indicated for Formal Learning Theory. Then we’ll attempt to situate both theories in the logic of hypothesis acceptance.
CONFIDENCE INTERVALS

The theory of confidence intervals shares with Formal Learning Theory the goal of revealing a hidden reality on the basis of data that do not deductively imply the correct answer. Let us attempt to isolate the kind of performance guarantee associated with confidence intervals, and distinguish such guarantees from “confidence” about the specific reality behind one’s current data. We focus on a simple case, building on the discussion in [Baird, 1992, §10.5].

(1) Urn problem: Suppose that an urn is composed of balls numbered from 1 to L (no gaps, no repeats), with L ≥ 2. The urn is sampled with replacement k ≥ 2 times. What can be inferred about L on this basis?

Let XL,k be the set of possible samples of k balls that can be drawn from the urn with L balls. We think of such samples as ordered sequences. It is clear that XL,k is finite, and that its members have uniform probability of appearing [namely, (1/L)^k]. Let f be a mapping of ∪{XL,k | L, k ≥ 2} into the set of intervals of the form [i, j], where i, j are positive integers (i ≤ j). We think of f as attempting to construct an interval containing L on the basis of a sample. Success at this enterprise is quantified as follows.

(2) Definition: Let r ∈ (0, 1) be given. Call f r-reliable just in case for every L, k ≥ 2, L ∈ f(x) for at least 100r% of the x ∈ XL,k.

The definition embodies the kind of performance guarantee that we hope to associate with a given mapping f from data ∪{XL,k | L, k ≥ 2} into finite intervals (hypotheses about L). Of course, for a given level of reliability, narrower intervals are more informative than wider ones.
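Definition (2) can be checked exhaustively for small L and k. The sketch below uses a toy interval rule of our own devising — f(x) = [max(x), 2·max(x)], not a rule discussed in the text — and verifies by brute force that it is 0.75-reliable for k = 2 over a small range of L.

```python
# Exhaustive check of Definition (2) for a toy rule (our own example):
# f(x) = [max(x), 2*max(x)]. For k = 2, L lies in f(x) iff max(x) >= L/2,
# which happens for at least 75% of the samples in X_{L,2}.

from itertools import product

def reliability(f, L, k):
    """Fraction of samples x in X_{L,k} with L in f(x)."""
    samples = list(product(range(1, L + 1), repeat=k))  # all L**k sequences
    hits = sum(1 for x in samples if f(x)[0] <= L <= f(x)[1])
    return hits / len(samples)

f = lambda x: (max(x), 2 * max(x))
worst = min(reliability(f, L, 2) for L in range(2, 15))
print(worst >= 0.75)  # True
```

The enumeration of X_{L,k} mirrors the definition exactly: samples are ordered sequences, drawn with replacement, each with probability (1/L)^k.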
One method for constructing confidence intervals for L

Let us fix r ∈ (0, 1) and consider a specific r-reliable function fr. Let Xk = ∪L≥2 XL,k (this is the set of all potential samples of size k). For x ∈ Xk, we write max(x) for the largest member of x. It is easy to verify:

(3) Fact: Let L0 ≥ 2 and 1 ≤ m ≤ L0 be given. For all L > L0, the proportion of samples y from XL,k with max(y) ≤ m is less than or equal to (m/L0)^k.

For x ∈ Xk, define fr(x) = [max(x), L0] where:

(4) L0 is the least integer greater than or equal to max(x) such that for all L > L0, the proportion of samples y from XL,k with max(y) ≤ max(x) is less than or equal to 1 − r.

That such an L0 exists for each x ∈ Xk (and can be calculated from x) follows from Fact (3). To show that fr is r-reliable, let L, k ≥ 2 and x ∈ XL,k be given. Then L ∉ fr(x) iff L ∉ [max(x), L0] where L0 satisfies (4). Since L ≥ max(x), L ∉ [max(x), L0] iff L > L0, which implies that x ∈ A = {y ∈ XL,k | max(y) ≤ max(x)}. But by (4), Prob(A) ≤ 1 − r.
(5) Example: Suppose r = .95, and let D symbolize the appearance of balls numbered 61 through 90 in thirty consecutive draws. Then max(D) = 90. To form a confidence interval using the .95-reliable function f.95 defined by (4), we seek the least L0 such that the probability of drawing thirty balls labeled 90 or less from an L0-urn is no greater than 5%. By Fact (3), L0 is the least integer satisfying:

(90/L0)^30 < .05.

Calculation reveals that L0 = 100. Hence f.95(D) = [90, 100].
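The calculation in Example (5) is easy to reproduce. The following sketch is our own; it simply applies the bound from Fact (3) and searches for the least L0.

```python
# Reproduce Example (5): find the least L0 >= max(x) such that
# (max(x)/L0)**k <= 1 - r, with max(x) = 90, k = 30, r = .95.

def least_L0(max_x, k, r):
    L0 = max_x
    while (max_x / L0) ** k > 1 - r:
        L0 += 1
    return L0

L0 = least_L0(90, 30, 0.95)
print(L0)  # 100
```

Indeed (90/99)^30 ≈ .057 > .05 while (90/100)^30 ≈ .042 < .05, confirming L0 = 100.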
Confidence and confidence intervals

Parallel to our discussion of Formal Learning Theory, we now consider the relation between r-reliability and the warrant for particular intervals. Suppose the urn confronts us with D of Example (5). Does the fact that f.95(D) = [90, 100] justify 95% confidence that L ∈ [90, 100]? In other words, if Prob subj represents a person’s subjective assessment of chance, should the following hold?

(6) Prob subj (L ∈ f.95(D) | D is the draw) ≥ .95.
Inasmuch as reliability in the sense of Definition (2) concerns fractions of XL,k (the finite set of k-length data that can emerge from an L-urn) whereas (6) concerns justified belief about a particular urn and draw, it is not evident how the former impinges on the latter [Hacking, 2001]. Indeed, (6) seems impossible to defend in light of the existence of a different .95-reliable function h.95 with the property that

(7) h.95(D) ∩ f.95(D) = ∅.

We exhibit h.95 shortly. To see the relevance of (7), observe that whatever reason r-reliability provides for embracing (6) extends equally to:

(8) Prob subj (L ∈ h.95(D) | D is the draw) ≥ .95.
If Prob subj is coherent, (6) and (8) entail:5

Prob subj (L ∈ f.95(D) ∧ L ∈ h.95(D) | D is the draw) ≥ .90.

But the latter judgment is incoherent in light of (7). At least one of (6), (8) must therefore be abandoned; by symmetry, it seems that both should.

5 Here we use: for any two statements p and q, coherence entails that Prob subj (p ∧ q) ≥ Prob subj (p) + Prob subj (q) − 1. The proof is elementary.
An alternative method for constructing confidence intervals for L

To specify hr, we rely on the following well-known facts [Ross, 1988], writing X̄ for the arithmetical mean of sample X.

(a) For a uniform distribution over {1, . . . , L}, the mean μ = (L + 1)/2, and the variance σ2 = (L2 − 1)/12.

(b) (Chebyshev) For any sample X of size n drawn independently and identically from a distribution with mean μ and variance σ2,

Prob( |X̄ − μ| ≥ kσ/√n ) ≤ k−2, for all k > 0.

It follows that if an independent sample X of size n is drawn from an urn with highest numbered ball L then

Prob( |X̄ − (L + 1)/2| ≥ k√(L2 − 1)/√(12n) ) ≤ k−2, for all k > 0,

hence, since √(L2 − 1) ≤ L,

Prob( |X̄ − (L + 1)/2| ≥ kL/√(12n) ) ≤ k−2, for all k > 0,

so,

Prob( |X̄ − (L + 1)/2| < kL/√(12n) ) > 1 − k−2, for all k > 0.

Algebraic manipulation with k = √(1/.05) yields:

(9) Prob( √(3n)(2X̄ − 1)/(√(3n) + 4.47) < L < √(3n)(2X̄ − 1)/(√(3n) − 4.47) ) > .95.
Define h.95(X) to be the interval specified by (9) (with integer values). Then h.95 is .95-reliable. Setting X = D = 61 · · · 90, as in Example (5), we have h.95(D) = [101, 284], verifying (7) inasmuch as f.95(D) = [90, 100]. The method based on (9) allows us to articulate another argument against allowing confidence intervals to determine subjective probability. Even though h.95 is .95-reliable, calculation confirms the following fact.

(10) Let E = 61 · · · 90, 400 (that is, E is D with 400 added to the end). Then h.95(E) = [116, 319].

That is, h.95(E) does not include the highest ball observed (400). Thus, E visibly belongs to the small set of samples on which h.95 is inaccurate, and it would be folly to believe: Prob subj (L ∈ h.95(E) | E is the draw) ≥ .95.
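The intervals reported for D and E can be reproduced from (9). The sketch below is our own reconstruction and assumes outward integer rounding (floor of the lower bound, ceiling of the upper).

```python
# A reconstruction of the Chebyshev-based interval (9), with k = sqrt(20)
# (approximately 4.47) and outward integer rounding.

import math

def h95(sample):
    n = len(sample)
    xbar = sum(sample) / n
    root = math.sqrt(3 * n)
    k = math.sqrt(20)  # sqrt(1/.05), printed as 4.47 in (9)
    lo = root * (2 * xbar - 1) / (root + k)
    hi = root * (2 * xbar - 1) / (root - k)
    return (math.floor(lo), math.ceil(hi))

D = list(range(61, 91))  # the thirty draws 61..90
E = D + [400]            # D with 400 appended
print(h95(D))  # (101, 284)
print(h95(E))  # (116, 319): excludes the observed ball 400
```

Note how the anomalous draw 400 raises the sample mean only modestly, so the resulting interval [116, 319] fails to cover a ball that was visibly drawn, illustrating point (10).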
COMPARISON

Consider the urn problem (1) from the perspective of Formal Learning Theory. Let S be the strategy of guessing L to be the highest numbered ball seen in the data so far. Then, if balls are drawn forever, S will stabilize to the correct conjecture with unit probability. How does S compare with the function f.95 defined earlier for confidence intervals? S appears to ignore useful information that is exploited by f.95, namely, the independent and uniform character of the draws composing the current data set. But once a given draw is made, it is hard to see the relevance of this information to current belief. For fixed L, all samples of the same size have the same probability prior to the draw, and just one sample has all the probability afterwards (namely, the sample observed). The interval produced by f.95 is then either correct or incorrect, thus has probability either 1 or 0, just as for S’s conjecture. To equate subjective confidence in the interval’s accuracy with the fraction of potential samples on which f.95 succeeds is to overlook the change in sampling distribution consequent to the draw; it was once uniform but now is concentrated on the observed data. (This point is elaborated in [Hacking, 2001].) Ignoring the transition when forging personal confidence leads to incoherent judgment, as seen in the previous section. But once the current data decouple belief about L from the performance guarantees of f.95 and S, it seems just as legitimate to issue conjectures based on one as the other. Instead of attempting to interpret the merit of f.95 and S in terms of personal probabilities (or the utility of announcing their counsel), let us explore the idea that they both embody attractive policies for hypothesis acceptance. The distinction between acceptance and personal probabilities (belief) has been explained diversely by different authors [Popper, 1959; Cohen, 1992; Maher, 1993; Levi, 1967; Kaplan, 1998]. We rely here on just a few observations.
Belief appears to be largely involuntary at any given moment. Although you can influence the beliefs that enter your mind over time (e.g., by reading certain newspapers rather than others), you cannot alter your current beliefs by a mere act of will. Acceptance, on the other hand, requires choice. Thus, it might be difficult to refrain from strong belief that a recently purchased lottery ticket will lose. But you haven’t thereby accepted this thesis (if you did, you would throw the ticket away).6 Acceptance is assumed to be categorical inasmuch as it rests on a definite selection among alternative theories even if confidence in each is graded and the choice is provisional. Naturally, beliefs are often a factor in theory acceptance. But they are not always decisive since there may be other factors, such as the impression that one theory is more interesting than another or the suspicion that announcing a certain theory will promote inquiry leading to a better one. We suggest that Formal Learning Theory helps to evaluate policies for accepting

6 This kind of example is discussed more thoroughly in [Maher, 1993, §6.2.1]. A revealing analogy (offered in [Cohen, 1992, Ch. 2]) compares belief to (mere) desire, and acceptance to consent or acquiescence.
hypotheses. For example, it informs us that the strategy described earlier for C 1 stabilizes to the truth on every presentation of data whereas no strategy has this property for C 2. More refined criteria allow comparison of the plurality of successful policies for C 1 — in terms of efficient use of data, for example, or the computational resources needed for implementation. Even so, Formal Learning Theory does not designate a uniquely best strategy, and it is cold comfort in defending any particular conjecture in response to data. The theory nevertheless seems relevant to the orderly adoption of hypotheses. Neyman defended the theory of confidence intervals from the same perspective, albeit without the “acceptance” terminology (see [Baird, 1992, §10.6] for illuminating discussion). Similarly to Formal Learning Theory, alternative strategies for constructing intervals can be shown to be reliable (as seen in the last section), so the theory is silent about the correct response to a particular sample. And as before, supplementary criteria may be invoked to compare different strategies, for example, concerning the width of constructed intervals or the possibility of issuing an interval contradicted by the available data [illustrated by (10)]. Both theories clarify the options of an agent who seeks method in her acceptance of hypotheses. In the case of the urn, it remains an extra-systematic choice whether to guess L exactly or just bracket it (or both in parallel). Likewise, no uniquely best choice emerges from the set of recommended strategies of each kind. But it may be hoped that new criteria will eventually come to light, allowing increasingly refined evaluation of policies for hypothesis selection. The logic of acceptance would thereby be extended beyond the contributions already due to confidence intervals and Formal Learning Theory.
CONCLUSION

All this leaves untouched a fundamental question: since we’re each destined to issue but finitely many hypotheses, why rely on a guessing strategy whose performance is guaranteed only in the limit? Formal Learning Theory does not resolve the matter, but perhaps it facilitates articulation of the issue, preparing the way for subsequent clarification. Whatever its normative contribution may turn out to be, Formal Learning Theory also has a potential descriptive role in characterizing inductive practice. A given person (suitably idealized) implements a function from finite data sets to hypotheses about the data’s provenance. Characterizing the functions that people typically implement could provide insight into the scope and limits of human cognition by revealing the class of empirical problems that such a system can solve in principle. Focussing the issue on children could likewise yield fresh perspectives on development, for example, by delineating the collection of natural languages (i.e., systems of communication learnable by infants).7

7 To illustrate, the infant is unlikely to remember at any given time more than a few of the utterances that she has encountered. This design feature can be shown to influence the collection
Formal Learning Theory in Context
717
Formal Learning Theory serves as catalyst to this enterprise. It connects classes of inductive strategies to the empirical problems for which they are adapted. It thereby also suggests the kinds of problems that might pose insuperable obstacles to human inquiry. BIBLIOGRAPHY [Baird, 1992] D. Baird. Inductive Logic: Probability and Statistics. Prentice Hall, Englewood Cliffs NJ, 1992. [Barzdin and Freivalds, 1972] Ja. M. Barzdin and R. V. Freivalds. On the prediction of general recursive functions. Soviet Math. Dokl., 13:1224–1228, 1972. [Blum and Blum, 1975] L. Blum and M. Blum. Toward a mathematical theory of inductive inference. Information and Control, 28:125–155, 1975. [Carnap, 1950] Rudolph Carnap. The Logical Foundations of Probability. University of Chicago Press, Chicago IL, 1950. [Chomsky, 1959] Noam Chomsky. On certain formal properties of grammars. Information and Control, 2:137–167, 1959. [Cohen, 1992] L. J. Cohen. Belief & Acceptance. Oxford University Press, Oxford, UK, 1992. [Gold, 1967] E. M. Gold. Language identification in the limit. Information and Control, 10: 447–474, 1967. [Hacking, 2001] I. Hacking. An Introduction to Probability and Inductive Logic. Cambridge University Press, Cambridge UK, 2001. [Hansson, 1999] S. O. Hansson. A Textbook of Belief Updating. Kluwer, Dordrecht, 1999. [Howson and Urbach, 1993] C. Howson and P. Urbach. Scientific Reasoning: The Bayesian Approach. Open Court Publishing Company, Peru, Illinois, 1993. [Jain et al., 1999] Sanjay Jain, Daniel Osherson, James Royer, and Arun Sharma. Systems that Learn. M.I.T. Press, Cambridge MA, 2nd edition, 1999. [Kaplan, 1998] M. Kaplan. Decision Theory as Philospohy. Cambridge University Press, Cambridge UK, 1998. [Kelly, 1996] Kevin T. Kelly. The Logic of Reliable Inquiry. Oxford University Press, 1996. [Levi, 1967] I. Levi. Gambling with Truth. MIT Press, Cambridge MA, 1967. [Maher, 1993] P. Maher. Betting on Theories. Cambridge University Press, Cambridge, UK, 1993. 
[Martin and Osherson, 1998] Eric Martin and Daniel Osherson. Elements of Scientific Inquiry. MIT Press, Cambridge MA, 1998. Revised edition: www.princeton.edu/∼osherson/IL/ILpage.htm. [Neyman, 1937] Jerzy Neyman. Outline of a theory of statistical estimation based on the classical theory of probability. Philosophical Transactions of the Royal Society, CCXXXVI(A): 333–380, 1937. [Osherson et al., 1986] D. Osherson, M. Stob, and S. Weinstein. Systems that Learn. MIT Press, 1986. [Popper, 1959] K. Popper. The Logic of Scientific Discovery. Hutchinson, London, 1959. [Putnam, 1979] H. Putnam. Probability and confirmation. In Mathematics, Matter, and Method: Philosophical Papers, Volume I. Cambridge University Press, Cambridge, 1979. [Rogers, 1967] H. Rogers. Theory of Recursive Functions and Effective Computability. McGrawHill, New York, 1967. [Ross, 1988] S. Ross. A First Course in Probability, 3rd Edition. Macmillan, New York City, 1988. [Salsburg, 2002] David Salsburg. The Lady Tasting Tea: How Statistics Revolutionized Science in the Twentieth Century. Owl Books, 2002.
of languages that are potentially learnable, and to interact with other design features (such as the computer simulability of the infant’s guessing strategy). For discussion, see [Osherson et al., 1986].
MECHANIZING INDUCTION

Ronald Ortner and Hannes Leitgeb

In this chapter we will deal with “mechanizing” induction, i.e. with ways in which theoretical computer science approaches inductive generalization. In the field of Machine Learning, algorithms for induction are developed. Depending on the form of the available data, the nature of these algorithms may be very different. Some of them combine geometric and statistical ideas, while others use classical reasoning based on logical formalisms. However, we are not so much interested in the algorithms themselves as in the philosophical and theoretical foundations they share. Thus, in the first of two parts, we will examine different approaches and inductive assumptions in two particular learning settings. While many machine learning algorithms work well on a lot of tasks, the interpretation of the learned hypothesis is often difficult. For instance, an algorithm may, perhaps surprisingly, be able to determine the gender of the author of a given text with about 80 percent accuracy [Argamon and Shimoni, 2003], yet it takes some extra effort for a human to understand on the basis of which criteria the algorithm does so. In this respect the advantage of approaches using logic is obvious: if the output hypothesis is a formula of predicate logic, it is easy to interpret. However, if decision trees or algorithms from the area of inductive logic programming are based purely on classical logic, they suffer from the fact that most universal statements fail to hold for exceptional cases, and classical logic does not offer any convenient way of representing statements which are meant to hold in the “normal case”. Thus, in the second part we will focus on approaches to Nonmonotonic Reasoning that try to handle this problem. Both Machine Learning and Nonmonotonic Reasoning have been partially anticipated by work in philosophy of science and philosophical logic.
At the same time, recent developments in theoretical computer science are expected to trigger further progress in philosophical theories of inference, confirmation, theory revision, learning, and the semantics and pragmatics of conditionals. We hope this survey will contribute to this kind of progress by building bridges between computational, logical, and philosophical accounts of induction.

1 MACHINE LEARNING AND COMPUTATIONAL LEARNING THEORY

1.1 Introduction
Handbook of the History of Logic. Volume 10: Inductive Logic. Volume editors: Dov M. Gabbay, Stephan Hartmann and John Woods. General editors: Dov M. Gabbay and John Woods. © 2011 Elsevier BV. All rights reserved.

Machine Learning is concerned with algorithmic induction. Its aim is to develop algorithms that are able to generalize from a given set of examples. This is quite
a general description, and Machine Learning is a wide field. Here we will confine ourselves to two exemplary settings, viz. concept learning and sequence prediction. In concept learning, the learner observes examples taken from some instance space X together with a label that indicates for each example whether it has a certain property. The learner’s task then is to generalize from the given examples to new, previously unseen examples or to the whole instance space X. As each property of objects in X can be identified with the subset C ⊆ X of objects that have the property in question, this concept C can be considered as a target concept to be learned.

EXAMPLE 1. Consider an e-mail program that allows the user to classify incoming e-mails into various (not necessarily disjoint) categories (e.g. spam, personal, about a certain topic, etc.). After the user has done this for a certain number of e-mails, the program shall be able to do this classification automatically.

Sequence prediction works without labels. The learner observes a finite sequence over an instance set (alphabet) X and has to predict its next member.

EXAMPLE 2. A stock broker has complete information about the price of a certain company share in the past. Her task is to predict the development of the price in the future.

In the following, we will consider each of the two mentioned settings in detail. Concerning concept learning, we would also like to refer to the chapter on Statistical Learning Theory by von Luxburg and Schölkopf in this volume, which deals with similar questions in a slightly different setting.
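The sequence-prediction task of Example 2 can be caricatured by a minimal frequency-based guesser. This is our own illustrative sketch, not a method from the text: the Markov-style rule and the context length are arbitrary inductive assumptions, chosen merely to make the need for such assumptions concrete.

```python
from collections import Counter

def predict_next(sequence, context_len=1):
    """Predict the next symbol as the most frequent successor of the current
    context (a toy Markov-style guess; one possible inductive assumption
    among many, with no claim to being the 'right' one)."""
    context = tuple(sequence[-context_len:])
    successors = Counter(
        sequence[i + context_len]
        for i in range(len(sequence) - context_len)
        if tuple(sequence[i:i + context_len]) == context
    )
    if not successors:                   # unseen context: fall back to
        successors = Counter(sequence)   # overall symbol frequencies
    return successors.most_common(1)[0][0]

print(predict_next(list("ababab")))  # 'a': so far, 'b' was always followed by 'a'
```

A different context length (or a different fallback rule) encodes a different inductive assumption and may yield a different prediction on the same data.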
1.2 Concept Learning

The Learning Model

We start with a detailed description of the learning model. Given a basic set of instances X, the learner’s task is to learn a subset C ⊆ X, called a concept. Learning such a target concept C means learning the characteristic function 1C on X. That is, for each x ∈ X the learner shall be able to predict whether x is in C or not.

EXAMPLE 3. For learning the concept “cow” one may e.g. consider X to be the set of all animals, while C would be the set of all cows. The concept would be learned if the learner is able to tell of each animal whether it is a cow.

In order to enable the learner to learn a concept C, she is provided with some training examples, that is, some instances taken from X together with the information whether each of these is in C or not. Thus the learner’s task is to generalize from such a set of labeled training examples

⟨x1, 1C(x1)⟩, ⟨x2, 1C(x2)⟩, . . . , ⟨xn, 1C(xn)⟩,
with xi ∈ X, to a hypothesis h : X → {0, 1}. If the learner’s hypothesis coincides with 1C, she has successfully learned the concept C. A special case of this general setting is the learning of Boolean functions, where the task is to learn a function f(p1, p2, . . . , pn) of Boolean variables pi that takes values in {0, 1}. Obviously, any Boolean function can be represented by a formula of propositional logic (and vice versa) if the values of the variables pi and the value of the function f are interpreted as truth values. Each training example for the learner consists of an assignment of values from {0, 1} to the variables pi together with the respective value of f. The task of the learner is to identify the function f. As each assignment of values to the pi uniquely corresponds to a vector from X := {0, 1}n, learning a Boolean function f is the same as learning the concept of all vectors x in X for which f(x) = 1.

No-Free-Lunch Theorems

Unfortunately, the space of possible concepts is the whole power set 2X, so that without any further inductive assumptions learning is an impossible task (except for the trivial case in which each instance in X is covered by the training examples), cf. [Mitchell, 1990]. In his introduction to Machine Learning, Mitchell [1997, p. 23] postulates the following inductive learning hypothesis as the desired inductive assumption: “Any hypothesis found to approximate the target function well over a sufficiently large set of training examples will also approximate the target function well over other unobserved examples.” Although this is of course what we want to have, it does not really help us in achieving it. Given only the training examples, each possible hypothesis in 2X which correctly classifies the training examples seems equally likely. On the other hand, the machine learning literature provides a host of algorithms, ranging from decision trees to neural networks and support vector machines (for an introduction to all this see e.g.
[Mitchell, 1997]), that seem to work well on a lot of problems (with some theoretical results complementing the picture). How do these algorithms resolve the problem of induction? Indeed, each algorithm (at least implicitly) defines its own hypothesis class. For a given set of training examples the algorithm will output a hypothesis. By doing this, the algorithm obviously has to prefer this hypothesis to all other possible hypotheses. As remarked before, the training sample helps only to a limited extent, as there are a lot of consistent hypotheses in 2X which classify the training examples correctly. Thus each learning algorithm is biased towards some hypotheses in 2X in order to be able to make a decision at all. On the other hand, if all hypotheses in 2X are equally likely, no learning algorithm will be able to perform better than another in general. This is basically the content of the so-called “no-free-lunch theorems” for supervised learning, of which there exist various versions [Rao et al., 1995; Schaffer, 1994; Wolpert, 1996b; Wolpert, 1996a]. For a discussion see also [von Luxburg and Schölkopf, 2009].
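The averaging claim behind the no-free-lunch theorems can be checked directly on a toy instance space. The sketch below is our own illustration (the four-element space, the split into seen and unseen instances, and the constant-1 guesser standing in for an arbitrary fixed learner are all assumptions for the demonstration): averaged over all 2^|X| target concepts, accuracy on the unseen instances is exactly that of a coin flip.

```python
from itertools import product

# Tiny instance space; the learner sees labels only for `train`.
X = [0, 1, 2, 3]
train = [0, 1]   # instances whose labels the learner observes
test = [2, 3]    # previously unseen instances

def learner(labeled_train, x):
    """An arbitrary fixed guessing rule (here: always predict 1).
    Any other fixed rule yields the same average below."""
    return 1

# Average accuracy on the unseen points over ALL 2^|X| target concepts.
total, count = 0.0, 0
for labels in product([0, 1], repeat=len(X)):
    target = dict(zip(X, labels))
    labeled_train = [(x, target[x]) for x in train]
    correct = sum(learner(labeled_train, x) == target[x] for x in test)
    total += correct / len(test)
    count += 1

print(total / count)  # 0.5 — random-guess performance
```

Replacing `learner` by any rule that depends on `labeled_train` does not change the average: for every target concept it gets right on the unseen points, there is a mirror-image concept it gets wrong.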
THEOREM 4 No-free-lunch, [Wolpert, 2001]. Averaged over all possible target concepts, the performance of any learning algorithm on previously unseen test examples is that of a random guesser.

Thus, in order to save the possibility of learning, one either has to accept that each learning problem can only be tackled by suitable algorithms which, on the other hand, will fail on other learning problems. Alternatively, one may adopt an inductive assumption that denies that all possible target concepts in 2X are equally likely.

ASSUMPTION 5 [Giraud-Carrier and Provost, 2005]. The process that presents us with learning problems induces a non-uniform probability distribution on possible target concepts C ⊆ 2X.

Actually, this is not enough, as one also has to rule out all probability distributions on target concepts where good performance on some problems is exactly counterbalanced by poor performance on all other problems; see [Rao et al., 1995] for details. Most learning algorithms exploit the fact that the instance set X is not unstructured but (as e.g. in the common case where X ⊆ Rn) provides a suitable distance metric d : X × X → R which can be used as a kind of similarity measure between instances. The inductive assumption in this case is that two similar (i.e. close) instances will have the same label. Thus, if two images in an object recognition problem differ in only a few pixels they will get the same label, if two e-mails differ only in some letters one will be spam only if the other is as well, and so on.1 Although this is a very natural assumption, note that it won’t be true for all learning problems whatsoever, as the following example shows.

EXAMPLE 6. Consider the two chess positions in the following diagram, with Black to move in both of them. While the two positions will be considered to
be very close by any ordinary distance measure (as there is only a single white pawn placed differently), the first position is clearly in favor of White due to the enormous material advantage, while in the second diagram Black can checkmate immediately by moving the bishop to h4. Of course, it may well be that there is some other natural metric on the space of chess positions that works well, yet it is by no means clear what such a metric may look like. Thus, from a practical point of view, for many learning problems it is important to find in advance a proper representation space together with a suitable metric for the data.

In the favorable case where there is a suitable distance metric on X, the only difficulty is to determine the boundary between positively and negatively labeled instances. Unfortunately, in general this boundary will not be uniquely determined by the training examples. Thus, the availability of a distance metric does not really solve our original problem, so that again any algorithm must specify some preference relation. (Accordingly, in Section 2 on Nonmonotonic Reasoning we will see that the standard semantics for nonmonotonic logics is based on preference relations over worlds or states.)

Occam’s Razor

A common preference relation on the whole hypothesis space is to prefer, in the spirit of Occam’s razor, simple hypotheses over complicated ones. Thus — to stay with the previous example — when choosing a boundary between positive and negative training examples, e.g. a hyperplane is preferred over a non-differentiable surface. Especially in the presence of noise (i.e. when the labels of the training data may be wrong with some probability), Occam’s razor is often used to avoid the danger of overfitting the training data, that is, to avoid choosing a hypothesis that perfectly fits the training data but is very complex and hence often does not generalize well.

1 This observation leads to the simple k-nearest neighbor algorithm, which classifies an instance according to the labels of the k nearest training examples. This algorithm not only works well in practice, it also has some advantageous theoretical properties. See [von Luxburg and Schölkopf, 2009].
There has been some discussion on the validity of Occam’s razor (and of the more or less synonymous overfitting avoidance) in the machine learning community as well.2 While Occam’s razor often remains a rather vague principle, there are some theoretical results (some of which will be mentioned below) and attempts to clarify what exactly Occam’s razor in machine learning is. Thus, it has been argued [Domingos, 1998] that the term “Occam’s razor” is actually used for two different principles in the machine learning literature.

2 For a related discussion concerning the trade-off between estimation error and approximation error see [von Luxburg and Schölkopf, 2009].

POSTULATE 7 Occam’s first razor. Given two models with the same error on the whole instance space X, choose the simpler one.

POSTULATE 8 Occam’s second razor. Given two models with the same error on the training sample, choose the simpler one.

While it is evidently easier to argue for Occam’s first razor (although its validity is not clear either), only the second razor is of any use in machine learning. However, finding convincing arguments for this latter version is obviously more difficult. Basically, there are two ways of arguing for a theoretical justification of Occam’s second razor. First, there are some theoretical results from so-called PAC learning which can loosely be interpreted as support for Occam’s second razor (see the section on PAC learning below). Second, there is the Bayesian argument which serves as a basis for a lot of learning algorithms, the best-known of which is the MDL (minimum description length) principle introduced by Rissanen [1978].3

POSTULATE 9 MDL Principle. Choose the model that minimizes the total number of bits needed to encode the model and the data (given the model).4

There are some theoretical results that support the claim that Occam’s second razor in the realization of the MDL principle is indeed the best strategy in almost all cases [Vitányi and Li, 2000]. However, these results consider an idealized MDL principle that (due to the use of Kolmogorov complexity [Li and Vitányi, 1997])5 is uncomputable in practice. On the other hand, although approximations of an idealized MDL approach are often successful in practice, the empirical success of Occam’s second razor is controversial, too [Webb, 1996; Schaffer, 1993]. Of course, part of the reason for this is that practical MDL approaches (just as any other learning algorithm) cannot evade the no-free-lunch results mentioned earlier. That MDL has particular problems when the training sample size is small (so that the chosen hypothesis fits the data, but is too simple) is neither surprising nor a real defect of the approach: with insufficient training data provided, complex models are likely to fail as well.

Metalearning

Some people try to evade the no-free-lunch theorems by lifting the problem to a meta-level (see e.g. [Baxter, 1998; Vilalta et al., 2005]). Thus, instead of solving all problems with the same algorithm, it is attempted to assign each problem a suitable algorithm.
Of course, from the theoretical point of view this does not help, as the no-free-lunch theorems obviously also hold for any meta-algorithm that consists of several individual algorithms. However, from the practical point of view this approach makes sense. In particular, it is certainly useful to apply an algorithm that is able to make use of any additional assumptions one has about the learning problem at hand. A similar theory of induction that prefers local induction over global induction has recently been proposed in [Norton, 2003]. After these general considerations we turn to more theoretical models of learning, together with some results that have been achieved.

3 Actually, MDL does not see itself as a Bayesian method. For a discussion of the relation of MDL to Bayesian methods (and a general introduction to MDL) see Chapters 1 and 17 of [Grünwald, 2007]. More detailed descriptions of MDL as well as Bayesian approaches can also be found in [von Luxburg and Schölkopf, 2009].
4 As such, this basic idea of MDL does not look Bayesian at all, as there seem to be no probabilities involved. Indeed, these come into play by the observation that there is a correspondence between encodings and probability distributions (cf. Section 1.3, in particular footnote 20).
5 We do not give any details here but refer to the section on Solomonoff’s theory of induction below, which is closely related to idealized MDL.
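As a toy rendering of the MDL principle (Postulate 9), one can compare two-part code lengths of competing hypotheses for a binary sample. The model-cost figures below are illustrative assumptions of ours, not part of any canonical MDL coding scheme: the data bits are the negative log-likelihood, and the model bits are a rough charge for writing the model down.

```python
from math import log2

def code_length(model_bits, p, data):
    """Two-part code: bits to state the model plus -log2 likelihood of the
    data given the model (a toy version of the MDL principle; `model_bits`
    is an assumed, illustrative cost of encoding the model)."""
    eps = 1e-9
    p = min(max(p, eps), 1 - eps)
    return model_bits + sum(-log2(p if x else 1 - p) for x in data)

data = [1] * 45 + [0] * 5          # 50 observations, mostly ones

# h1: fair coin, essentially free to state (1 bit, by assumption).
l1 = code_length(1, 0.5, data)
# h2: fitted bias 45/50; we charge log2(51) bits to transmit the count.
l2 = code_length(log2(51), 45 / 50, data)

print(l1, l2)  # the fitted model's total is shorter despite its model cost
```

On a sample that is genuinely 50/50, the comparison reverses: the fitted bias buys almost no data bits, so its extra model cost makes it lose, which is the overfitting-avoidance effect the text describes.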
PAC Learning

The simplest way to make learning feasible is to consider a setting where the space of possible concepts is restricted to a certain concept class known to the learner.

DEFINITION 10. Let X be an arbitrary (possibly infinite) set of instances. Then any subset C of the power set 2X of X is called a concept class over X.

If the learner knows the concept class C, she will choose her hypothesis h from H = {1C | C ∈ C}. Thus our first assumption to make learning possible in this framework will be the following.

ASSUMPTION 11. The learner has access to a set of possible hypotheses H ⊂ 2X that also contains a hypothesis corresponding to the target concept.

Of course, it will depend on the size of the concept class and the given training examples to what extent Assumption 11 will be helpful to the learner.

EXAMPLE 12. Assume X = {a, b, c, . . . , z} and C = {{a, b}, {b, c}}. Then the learner will be able to identify a target concept taken from C if and only if either a or c is among the training examples.

EXAMPLE 13. Let X = {a, b, c} and C = {{a}, {b}, {c}, {a, b}, {b, c}}. It is easy to check that, unlike in Example 12, two distinct training examples are needed in any case in order to identify a target concept taken from C.

In general, the number of distinct training examples that are necessary (in the best as well as in the worst case) in order to identify a concept will depend on the combinatorial structure of the concept class.

DEFINITION 14. For Y ⊆ X we set C ∩ Y := {C ∩ Y | C ∈ C}. Such a subset Y ⊆ X is said to be shattered by C if C ∩ Y = 2Y.

If Y ⊆ X is shattered by a concept class C, then it is easy to see that in the worst case learning a concept in C will take |Y| distinct training examples. Thus, the following definition provides an important combinatorial parameter of a concept class.

DEFINITION 15. The VC-dimension6 of a concept class C ⊆ 2X is the cardinality of a largest Y ⊆ X that is shattered by C.
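For small, finite classes, Definitions 14 and 15 are directly computable. The following brute-force sketch is our own illustration; it checks shattering by enumerating traces C ∩ Y and recovers the VC-dimensions of the classes from Examples 12 and 13.

```python
from itertools import combinations

def shatters(concept_class, Y):
    """Does concept_class (a list of sets) shatter Y, i.e. realize every
    subset of Y as some trace C ∩ Y (Definition 14)?"""
    traces = {frozenset(C & Y) for C in concept_class}
    return len(traces) == 2 ** len(Y)

def vc_dimension(concept_class, X):
    """Cardinality of a largest shattered subset of X (Definition 15)."""
    for k in range(len(X), -1, -1):
        if any(shatters(concept_class, set(Y)) for Y in combinations(X, k)):
            return k
    return 0

# Example 12: C = {{a,b},{b,c}} shatters only {a} and {c}.
C12 = [{'a', 'b'}, {'b', 'c'}]
print(vc_dimension(C12, 'abc'))   # 1

# Example 13: C = {{a},{b},{c},{a,b},{b,c}} shatters {a, b}.
C13 = [{'a'}, {'b'}, {'c'}, {'a', 'b'}, {'b', 'c'}]
print(vc_dimension(C13, 'abc'))   # 2
```

The enumeration over all subsets makes this exponential in |X|, which is harmless here but also a reminder that the VC-dimension is a combinatorial quantity, not something one computes for realistically sized classes this way.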
Subsequent results will emphasize the significance of the VC-dimension. For a more detailed account of why the VC-dimension matters see [von Luxburg and Schölkopf, 2009].

EXAMPLE 16. The concept class given in Example 12 has VC-dimension 1, as only the sets {a} and {c} are shattered. In Example 13 we find that C shatters {a, b} and has VC-dimension 2.

REMARK 17. In the agnostic learning setting (see the respective section below), where the learner does not know the concept class from which the target concept is taken, the learner has to choose a hypothesis class H by herself. If there is no hypothesis in H that suits the training examples, the hypothesis class H can be considered to be falsified by the examples. The number of examples that is necessary in the worst case to falsify a class H of VC-dimension d is simply d + 1. This can be easily verified, as the examples in a set Y shattered by H cannot falsify H, independent of their labels. Hence, the VC-dimension can be used to measure the degree of falsifiability of hypothesis classes, as noted in [Corfield et al., 2005]. Popper [1969, Chapter VI] had similar ideas, yet his measure for the degree of falsifiability in general does not coincide with the VC-dimension [Corfield et al., 2005]. See also the discussion in [von Luxburg and Schölkopf, 2009].

6 VC stands for Vapnik-Chervonenkis. Vapnik and Chervonenkis [1971] were, together with [Sauer, 1972] and [Shelah, 1972], the first to consider the VC-dimension of concept classes.

Example 12 shows that if two concepts are close to each other (i.e. their characteristic functions coincide on almost all instances in X), then finding the target concept may take a lot of training examples (in the worst case). In the PAC framework this problem is handled by weakening the learner’s task: she need not identify the target concept; it is sufficient to approximate it. That is, the learner’s hypothesis shall be correct on most instances in X. As the learner usually will not be able to choose the training examples by herself,7 one assumes that the training examples are drawn independently according to some fixed probability distribution P on X that is unknown to the learner. The learning success will of course depend on the concrete sample presented to the learner, so that the number of training examples the learner needs in order to approximate the target concept well will be a random variable depending on the distribution P. Thus, with some (usually small) probability the training examples may not be representative for the target concept (e.g. if the same example is repeatedly drawn from X).
Hence, it is reasonable to demand from the learner to approximate the target concept only with high probability, that is, to learn probably approximately correct, which is what PAC stands for. However, learning still may be impossible if the distribution P has support8 on a proper subset Y ⊂ X, so that some instances will never be sampled. Thus, one measures the performance of the learner’s hypothesis not uniformly over the whole instance space X, but according to the same distribution P that also generates the training examples. That is, the error of the learner’s hypothesis h : X → {0, 1} with respect to a target concept C and a distribution P on X is defined as

erC,P(h) := P({x | h(x) ≠ 1C(x)}).

One may consider this as the error expected for a randomly drawn test example, where one makes the inductive assumption that this test example is drawn according to the same distribution P that generated the training sample.

ASSUMPTION 18. The training as well as the test examples are drawn from the instance set X according to an unknown but fixed distribution P.

7 See however the active learning setting described below.
8 The support of a probability distribution is basically the set of instances that have positive probability.
Summarizing, a learner PAC learns a concept class if for given ε, δ > 0 there is a number m = m(ε, δ) of training examples that is sufficient to approximate the target concept with high probability. That is, with probability at least 1 − δ the output hypothesis has error smaller than ε. More precisely, this leads to the following definition.9

DEFINITION 19. A concept class C ⊆ 2X is called PAC learnable if for all ε, δ ∈ (0, 1) there is an m = m(ε, δ), such that for all probability distributions P on X and all C ∈ C: when learning C from m examples, the output hypothesis h has error erC,P(h) > ε with probability smaller than δ (with respect to the m examples drawn independently according to P and labeled by C).

This framework was introduced by Valiant [1984]. For a comparison of this learning model to alternative models see [Haussler et al., 1991], or also [Wolpert, 1995], where the PAC learning model is embedded into a Bayesian framework. A lot of earlier work in computational learning dealt with learning a target concept in the limit (i.e. when the number of training examples goes to infinity). Research in this direction (with some links to recursion theory) goes back to [Gold, 1967]. For an overview see [Angluin and Smith, 1983]; [Osherson and Weinstein, 2009] also deals with that approach. Valiant [1984, p. 1142] remarks that an interesting consequence of his learning model is that even when a population has successfully learned a concept based on the same underlying probability distribution, there may still be significant differences in the learned concept. In particular, examples that appear only with very small probability are irrelevant for learning, so that “thought experiments and logical arguments involving unnatural hypothetical situations may be meaningless activities.” It is a natural question to ask which concept classes are PAC learnable. Obviously, this will also depend on the learning algorithm. Choosing e.g.
a stupid algorithm that even misclassifies most of the training examples will obviously prevent learning. Thus, one often turns attention to consistent learners that always choose a hypothesis h that is consistent with the training sample,10 i.e. for a target concept C and training examples x1, . . . , xn one has h(xi) = 1C(xi) for 1 ≤ i ≤ n.11 Such consistent learners are then able to PAC learn finite concept

9 Usually, there are also some considerations about the run-time complexity of an algorithm that PAC learns a concept class. For now, we will neglect this for the sake of simplicity, and come back to the question of efficient PAC learning when discussing Occam algorithms and polynomial learnability below.
10 While at first sight it may look foolish to consider hypotheses that are not consistent, this certainly makes sense if there is noise in the training data. Furthermore, as mentioned by Kelly [2004b], when also considering questions of computability, it may happen that the restriction to computable consistent hypotheses prevents learning; see also [Osherson et al., 1988; Kelly and Schulte, 1995].
11 As we assume that the learner has access to the concept class C from which the target concept is taken, it is obvious that there is a consistent hypothesis h ∈ H = {1C | C ∈ C}.
classes, where the necessary number of examples can be shown to depend on the size of the concept class.

THEOREM 20 [Haussler, 1988]. Any consistent learning algorithm needs O((1/ε)(log(1/δ) + log |C|)) examples for PAC learning any finite concept class C.

More generally, not the absolute size but the VC-dimension of a concept class turns out to be the critical parameter. Thus, concept classes of finite VC-dimension are PAC learnable, and the number of necessary examples for learning can be upper bounded using the VC-dimension as follows.12

THEOREM 21 [Blumer et al., 1989]. Any consistent learning algorithm needs O((1/ε)(log(1/δ) + d log(1/ε))) examples for PAC learning any well-behaved13 concept class of VC-dimension d.

For finite concept classes this is slightly worse than the result of Theorem 20, as the VC-dimension of a finite class C may take values up to log2 |C|, see [Blumer et al., 1989]. Of course, particular learning algorithms may PAC learn concept classes using fewer examples. Here is e.g. an alternative bound for the (consistent) 1-inclusion graph algorithm of [Haussler et al., 1994].

THEOREM 22 [Haussler et al., 1994]. The 1-inclusion graph learning algorithm needs O((d/ε) log(1/δ)) examples for PAC learning any well-behaved concept class of VC-dimension d.

The following lower bound shows that the dependence on the VC-dimension is necessary.

THEOREM 23 [Ehrenfeucht et al., 1989]. Let C be an arbitrary concept class of VC-dimension d. Then there is a probability distribution P such that any consistent learner needs Ω((1/ε)(d + log(1/δ))) examples for PAC learning C.

In particular this means that it is impossible to PAC learn concept classes of infinite VC-dimension,14 so that it becomes an interesting question which concept classes have finite VC-dimension. For certain concept classes the VC-dimension can be easily calculated. Thus, the concept class of all (open or closed) intervals on the real line R has VC-dimension 2.
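The claim that intervals on R have VC-dimension 2 can be verified mechanically for given point sets: a closed interval realizes a labeling exactly when the positively labeled points are contiguous, so the labeling (1, 0, 1) of three ordered points is never realizable. A small sketch of our own:

```python
from itertools import product

def interval_realizable(points, labels):
    """Can some closed interval [a, b] assign exactly these 0/1 labels to
    these (sorted) real points? True iff the 1-labeled points are contiguous."""
    ones = [x for x, l in zip(points, labels) if l == 1]
    if not ones:
        return True   # pick an interval containing none of the points
    lo, hi = min(ones), max(ones)
    # every point inside [lo, hi] must also be labeled 1
    return all(l == 1 for x, l in zip(points, labels) if lo <= x <= hi)

def shattered_by_intervals(points):
    return all(interval_realizable(points, labels)
               for labels in product([0, 1], repeat=len(points)))

print(shattered_by_intervals([0.2, 0.8]))        # True: VC-dimension >= 2
print(shattered_by_intervals([0.2, 0.5, 0.8]))   # False: labeling 1,0,1 fails
```

Since every two-point set is shattered and no three-point set is, the VC-dimension of intervals is exactly 2, as stated above.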
The class of axis-parallel rectangles in Rn has VC-dimension 2n. Half-spaces as well as balls in Rn have VC-dimension n + 1. Further concept classes with finite VC-dimension can be found e.g. in [Vapnik and Chervonenkis, 1974], [Dudley, 1984], [Wenocur and Dudley, 1981], [Assouad, 1983], or [Haussler and Welzl, 1987]. Examples of concept classes with infinite VC-dimension are finite unions of intervals or the interiors of Jordan curves in R2.

12 Although the results presented by von Luxburg and Schölkopf [2009] concern a slightly different setting, they give some good intuition for why finiteness is important and why a combinatorial parameter like the VC-dimension matters even if hypothesis classes and instance space are infinite or even continuous.
13 Usually, one has to impose some modest measure-theoretic assumptions when the considered concept classes are infinite, cf. Appendix A1 of [Blumer et al., 1989].
14 See however the paragraph on polynomial learnability below.
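The sample-size bounds of Theorems 20–23 can be illustrated empirically. Below is a hedged simulation of our own (the target interval [0.3, 0.7], the uniform distribution P, and the sample sizes are illustrative choices): a consistent learner for intervals, the tightest interval containing the positive examples, has estimated error erC,P(h) that shrinks as the number m of training examples grows.

```python
import random

random.seed(0)

# Target concept: the interval [0.3, 0.7] on X = [0, 1].
def target(x):
    return 1 if 0.3 <= x <= 0.7 else 0

def tightest_interval(sample):
    """A consistent learner for the class of closed intervals: return the
    smallest interval containing all positively labeled examples."""
    pos = [x for x, label in sample if label == 1]
    if not pos:
        return (1.0, 0.0)   # empty interval: predict 0 everywhere
    return (min(pos), max(pos))

def pac_trial(m, n_test=10_000):
    """Draw m training examples from the uniform distribution P, learn,
    and estimate er_{C,P}(h) on a fresh test sample from the same P."""
    sample = [(x, target(x)) for x in (random.random() for _ in range(m))]
    lo, hi = tightest_interval(sample)
    test = [random.random() for _ in range(n_test)]
    errors = sum((1 if lo <= x <= hi else 0) != target(x) for x in test)
    return errors / n_test

print(pac_trial(10), pac_trial(1000))  # second figure is typically far smaller
```

The tightest-fit hypothesis can only err by under-covering the target interval, and the uncovered probability mass shrinks roughly like 1/m, in line with the O((1/ε)(d + log(1/δ))) flavor of the bounds above.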
Mechanizing Induction
729
In spite of the negative result implied by Theorem 23 it is not hopeless to learn such classes. Algorithms as well as theoretical results may make use of a suitable parametrization of the domain in order to achieve useful results about what is called polynomial learnability (Definition 25 below) in [Blumer et al., 1989]. For details see the discussion of Occam algorithms below. For particular classes with finite VC-dimension, beside the PAC bound of Theorem 21 often alternative or sharper, sometimes even optimal bounds can be derived. Thus (in view of Theorem 23) optimal PAC bounds of O((1/ε)(d + log(1/δ))) can be derived e.g. for axis-parallel hyperrectangles in Rd [Auer et al., 1998] or classes with certain combinatorial structure [Auer and Ortner, 2007]. However, some of these bounds only hold for special algorithms, while for particular consistent algorithms sharper lower bounds than given in Theorem 23 can be shown. Thus, consider e.g. the concept class CX,d := {C ⊆ X | |C| ≤ d} of all subsets of X of size at most d. This simple concept class has VC-dimension d, and it is PAC learnable only from Ω((1/ε)(d log(1/ε) + log(1/δ))) examples by the learning algorithm which chooses a largest consistent hypothesis from CX,d [Auer and Ortner, 2007]. It is notable that the algorithm that chooses a smallest consistent hypothesis needs only O((1/ε)(d + log(1/δ))) examples to PAC learn CX,d [Auer and Ortner, 2007]. This can be seen as a theoretical justification of Occam's razor, as choosing a simple (which in this case means small) hypothesis provides better PAC bounds than choosing a more complex (i.e. larger) hypothesis. Occam Algorithms and Polynomial Learnability In fact, there are also other theoretical justifications of Occam's razor. Consider a concept class C (of arbitrary, possibly infinite VC-dimension) together with some kind of complexity measure on the concepts in C.
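The contrast between the smallest and a largest consistent hypothesis from CX,d described above can also be observed empirically. In the following toy simulation (our own illustration; the values of |X|, d, m and the choice of a uniform distribution are arbitrary), the "largest" algorithm pads the observed positive examples with unseen points up to size d, while the "smallest" algorithm outputs the observed positives only:

```python
import random

def simulate(n_points=1000, d=20, m=200, trials=200, seed=0):
    """Average error of the smallest vs. a largest consistent hypothesis
    from C_{X,d} (subsets of X with at most d elements); all parameter
    values are illustrative choices."""
    rng = random.Random(seed)
    err_small = err_large = 0.0
    for _ in range(trials):
        target = set(rng.sample(range(n_points), d))
        sample = [rng.randrange(n_points) for _ in range(m)]
        positives = {x for x in sample if x in target}
        seen = set(sample)
        h_small = positives  # smallest consistent hypothesis
        pad = [x for x in range(n_points) if x not in seen]
        h_large = positives | set(pad[:d - len(positives)])  # a largest one
        # error = mass of the symmetric difference under the uniform P
        err_small += len(h_small ^ target) / n_points
        err_large += len(h_large ^ target) / n_points
    return err_small / trials, err_large / trials

small, large = simulate()
print(f"avg error, smallest consistent hypothesis: {small:.4f}")
print(f"avg error, largest consistent hypothesis:  {large:.4f}")
```

With these parameters the padded hypothesis incurs a noticeably larger average error than the minimal one, matching the asymmetry of the two sample complexity bounds.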
It would be a straightforward realization of Occam's razor to demand to choose a hypothesis consistent with the training examples which has smallest complexity. However, as it turns out that computing such a hypothesis may be (NP-)hard (cf. e.g. the example given in [Blumer et al., 1989]), one confines oneself to the more modest requirement to choose a hypothesis that is significantly simpler than the training sample. This can be made precise as follows. Let C^A_{s,m} be the effective hypothesis space15 of an algorithm A that is presented with m training examples of a concept of complexity ≤ s. DEFINITION 24. A learning algorithm A is an Occam algorithm for a concept class C with complexity measure s : C → Z+, if there is a polynomial p(s) and a constant α ∈ [0, 1), such that for all s, m ≥ 1, the VC-dimension of C^A_{s,m} is upper bounded by p(s)m^α. It can be shown that Occam algorithms satisfy a learnability condition that is similar to PAC learning. DEFINITION 25. A concept class C ⊆ 2^X with given complexity measure
15 This is the set of all hypotheses that may be the outcome of the algorithm.
730
Ronald Ortner and Hannes Leitgeb
s : C → Z+ is called polynomially learnable, if for all ε, δ ∈ (0, 1) there is an m = m(ε, δ, s), such that for all probability distributions P on X and all C ∈ C with s(C) ≤ s: when learning C from m examples, the output hypothesis h has error er_{C,P}(h) > ε with probability smaller than δ with respect to the m examples drawn independently according to P and labeled by C. Thus, unlike in the original PAC setting, polynomial learning of more complex concepts (in terms of the given complexity measure s) is allowed to take more examples. THEOREM 26 [Blumer et al., 1989]. Any concept class for which there is an Occam algorithm is polynomially learnable. Theorem 26 is a generalization of a similar theorem of [Blumer et al., 1987]. Sample complexity bounds as given for PAC learning can be found in [Blumer et al., 1989]. As already mentioned in [Blumer et al., 1989], Theorem 26 can be considered as showing a relationship between learning and data compression. If an algorithm is able to compress the training data (as Occam algorithms do), it is capable of learning. Interestingly, there are also some results that indicate some validity of the converse implication [Board and Pitt, 1990; Li et al., 2003]. The general idea of connecting learnability and compressibility is the basis of Solomonoff's theory of induction, which will be discussed in Section 1.3 below. In this direction, the definition of Occam algorithms has been adapted using the notion of Kolmogorov complexity in [Li and Vitányi, 1997] and [Li et al., 2003], resulting in improved complexity bounds. Agnostic Learning and Efficient PAC Learning Obviously, the setting introduced above is not realistic in that usually the learner has no idea what the possible concepts are that may label the training examples. Thus, in general the learner has no access to a concept class that contains the target concept, so that Assumption 11 does not hold. Learning in this restricted setting is called agnostic.
In the agnostic learning model introduced by Haussler [1992] no assumptions about the labels of the examples are made (i.e. there need not be a target concept according to which examples are labeled). Instead of a concept class from which the target concept and the learner's hypothesis are taken, there is only a set of possible hypotheses H from which the learner chooses. As it is not clear a priori whether the learner's hypothesis class is suitable for the problem at hand, the learner's performance is measured not with respect to a perfect label prediction (which may be impossible to achieve with an unsuitable hypothesis space H), but with respect to the best hypothesis in H. That way, an analogous definition of PAC learning (Definition 19 above) can be given. Thus, as above it is assumed that the distribution that produces the training examples is also used for measuring the performance of the learner's hypothesis, that is, Assumption 18 still holds.
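For a finite hypothesis class, the optimization problem at the heart of this model — pick the hypothesis with the fewest misclassifications on the labeled sample — can be solved by direct search. A minimal sketch (the threshold class and the sample values below are invented purely for illustration):

```python
def erm(hypotheses, sample):
    """Return the hypothesis in H with the fewest misclassifications on the
    labeled sample (the optimization problem underlying agnostic learning),
    found here by brute-force search over a finite H."""
    def empirical_errors(h):
        return sum(1 for x, y in sample if h(x) != y)
    return min(hypotheses, key=empirical_errors)

# Toy hypothesis class: threshold functions on the real line.
thresholds = [lambda x, t=t: 1 if x >= t else 0 for t in [0.0, 0.5, 1.0, 1.5]]

# A noisy sample: no hypothesis in H is perfect, so the benchmark is the
# best hypothesis in H rather than perfect label prediction.
sample = [(0.2, 0), (0.4, 0), (0.6, 1), (0.8, 1), (1.2, 1), (0.7, 0)]
best = erm(thresholds, sample)
print("misclassifications of best hypothesis:",
      sum(1 for x, y in sample if best(x) != y))  # -> 1
```

The hardness results discussed next concern exactly this optimization step for richer, structured hypothesis classes, where brute-force enumeration is infeasible.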
Haussler [1992] has shown that in order to achieve positive results on agnostic PAC learnability with respect to some hypothesis class H, it is sufficient to solve the optimization problem of finding, for any finite set of labeled examples, the hypothesis h ∈ H with the minimal number of misclassifications on these examples. Unfortunately, this optimization problem is computationally hard for many interesting hypothesis classes. Further, this also concerns efficient (i.e. polynomial time) PAC learning [Kearns et al., 1994; Feldman, 2008]. Consequently, there have been some negative results on efficient agnostic PAC learning of halfspaces [Höffgen et al., 1995] or conjunctions of literals [Kearns et al., 1994].16 Similar negative results about efficient non-agnostic PAC learning go back to [Pitt and Valiant, 1988]. These latter results also show that results about efficient (non-)learnability depend on the chosen representation of the concept class. However, for suitable hypothesis classes there are also some positive results for agnostic PAC learning, see e.g. [Kearns et al., 1994; Maass, 1994; Auer et al., 1995]. Online Learning, Transduction and Active Learning The discussed models are only a small part of the machine learning and computational learning theory literature. In this section, we would like to indicate the existence of other interesting models not mentioned above. For a general overview of computational learning theory models and results see [Angluin, 1992]. The e-mail example (Example 1) shows some peculiarities that we have not considered so far. First, the test examples the program has to classify are not present all at once but have to be classified one after another. This is called online learning. This form of learning may have advantages as well as disadvantages. On the one hand, the learner does not have the distribution of the test examples available to draw any conclusions from.
On the other hand, if an example is misclassified, the user may intervene and correct the mistake, so that the program gets additional information. For more about online learning see e.g. [Blum, 1998]. A related special feature of Example 1 is that, as there is always only a single example to classify, it is not necessary for the program to generate a hypothesis for the whole space of possible e-mails. It is sufficient to classify each incoming e-mail individually. Of course, this can be done by first generating a global hypothesis from which one infers the label of the example in question. However, as Vapnik [1998, p. 477] put it: “When solving a problem of interest, do not solve a more general problem as an intermediate step. Try to get the answer that you really need 16 These results usually hold only relative to an unsolved problem of complexity theory, i.e. they are valid provided that the complexity classes NP and RP do not coincide. RP is a superclass of P that contains all decision problems which can be solved in randomized polynomial time, i.e. in polynomial time by a probabilistic Turing machine. If NP = RP, then it is easy to give an efficient learning algorithm.
but not a more general one. (...) Do not estimate a function if you only need to estimate its values at given points. (Try to perform direct inference rather than induction.)” Thus, it is quite natural to try to find the correct label of the single example directly. This has been termed transduction (contrary to induction, which generates general rules) by Vapnik [1995].17 Although from a conceptual point of view there may be little difference between transduction and induction (after all, if I know how to get a label for each single instance, I automatically also have an inductive rule), practically it may be easier to get the label for a single instance than to label the whole instance space. Another possibility for our e-mail program may be that it asks the user for the label of an e-mail that is difficult to classify. This is called active learning or query learning. Usually, this is not considered in an online model (as our e-mail example), but the learner is allowed to choose examples from the instance space by herself and query their labels. As a consequence, the learner may concentrate on “interesting” examples which contain more information, so that she will sometimes be able to learn concepts with fewer examples than in ordinary PAC learning.18 Geometrically, these interesting examples usually lie very close to the boundary that corresponds to the learner's current hypothesis, which shall separate the positively labeled from the negatively labeled examples. For an overview of active learning results see e.g. [Angluin, 1992; Freund et al., 1997; Angluin, 2004].
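The advantage of actively chosen queries can be made concrete for threshold concepts on a discretized line: querying labels at the boundary of the current version space identifies the target with logarithmically many queries, whereas passive learning needs many more random examples. A toy sketch (all names and parameter values are our own illustrative choices):

```python
def learn_threshold(label, n):
    """Actively learn the threshold t of the concept {x : x >= t} on
    {0, ..., n-1} by binary search: each membership query is placed at the
    boundary of the current version space [lo, hi]."""
    queries = 0
    lo, hi = 0, n  # invariant: the threshold t lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        queries += 1
        if label(mid):       # mid is a positive example, so t <= mid
            hi = mid
        else:                # mid is a negative example, so t > mid
            lo = mid + 1
    return lo, queries

n = 1_000_000
t_true = 123_456
t_hat, q = learn_threshold(lambda x: x >= t_true, n)
print(t_hat, "found with", q, "queries")  # about log2(n) = 20 queries
assert t_hat == t_true and q <= 20
```

With passive, randomly drawn examples the learner would instead need a number of examples on the order of 1/ε to pin the threshold down to accuracy ε.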
1.3 Sequence Prediction Learning Setting In concept learning, the labels provide the learner with a certain pattern among the training examples. If the learner has to discover such patterns by herself one speaks of unsupervised learning (as opposed to supervised learning). There are various unsupervised learning settings such as data mining, where the learner tries to extract useful information from large data sets. Here we want to consider an online setting, where the learner’s task is to predict the next entry in a finite sequence of observations. Thus, the learner is presented with a sequence of observations 17 Actually, already Carnap [1950] distinguished between various forms of inductive inference, two of which are universal inference and singular predictive inference, the latter corresponding to what Vapnik calls transduction. 18 There are also similar settings where the learner need not identify a concept, but has to check whether the concept at hand has a certain property. This property testing scenario is in particular natural when dealing with graphs, that is, when X is the set of all possible edges of a graph with given vertex set V , and the concept class is a set of graphs with common vertex set V . For the relation between property testing and learning see [Goldreich et al., 1998]. There is also some literature on query learning of graphs, as in the graph setting beside membership (i.e. edge) queries there are other natural queries one may consider (e.g. edge count, shortest path etc.). See e.g. [Alon and Asodi, 2005; Angluin and Chen, 2008; Bouvel et al., 2005].
x1, x2, . . . , xt over some instance set (alphabet) X and has to predict the next observation xt+1. After her prediction the true value is revealed. REMARK 27. Note that this simple setting may deal with seemingly far more general problems. Consider e.g. transductive online concept learning from training examples (z1, y1), . . . , (zn, yn) where the zi are taken from some instance space and the labels yi ∈ {0, 1} are given according to some target concept. The learner's task is to classify some zn+1. This can be encoded as a sequence prediction problem where the task is to predict the next entry in the sequence z1, y1, . . . , zn, yn, zn+1. The learner's performance when predicting an x̂t+1 ∈ X is usually evaluated by ℓ(x̂t+1, xt+1) for a suitable loss function ℓ : X × X → R that measures the distance between the true value xt+1 and the prediction x̂t+1. A natural loss function that works for arbitrary X is obtained by setting ℓ(x, x′) := 0 if x = x′ and 1 otherwise. If X ⊆ R, another common loss function is the squared error (1) ℓ(x, x′) := (x − x′)². In any case, loss functions are usually chosen so that the learner will try to minimize the loss. More generally, the learner will not predict a single instance in X but e.g. a probability distribution on X. Thus, in general it is assumed that the learner makes at time t + 1 a decision dt+1 taken from some decision space D. The loss function then maps each pair (dt+1, xt+1) ∈ D × X to R. Similar to the case of concept learning, without further assumptions the learner will stand little chance to learn a given sequence. Even in the simplest, binary case with X = {0, 1} each prediction of the learner can be thwarted by a suitable sequence (see [Dawid, 1985]).19 The strategy to make concept learning possible has been twofold. On the one hand, one assumes that not all concepts are equally likely (Assumption 5), on the other hand one restricts the space of possible hypotheses (which e.g.
in the PAC learning setting was done by giving the learner access to a concept class that contains the target concept). While in the setting of (PAC) concept learning both of these measures are taken, in sequence prediction either assumption leads to a different framework. Thus, we either assume that the sequence is generated by an unknown probability distribution (the probabilistic setting), or we consider a fixed sequence and restrict the possible hypotheses for prediction (deterministic setting). Thus, there is some duality between these two settings concerning the assumptions made. This duality is particularly strong when the chosen loss function is the self-information loss function (sometimes also called log-loss function) (2) ℓ(d, x) = − log2 d(x), 19 Another argument of this kind with emphasis on computability has been produced by Putnam [1963] in order to criticize Carnap's inductive logic. Putnam showed that no computable prediction algorithm will work on all computable sequences (cf. also the discussion in [Kelly, 2004b]).
where d is a probability distribution on X and x ∈ X. Beside some technical advantages of this function, it also establishes a relation between prediction and coding, as − log2 d(x) gives the ideal (binary) code length of x with respect to the probability distribution d.20 For details on the duality between probabilistic and deterministic setting under the self-information loss function see [Merhav and Feder, 1998], which also gives a general overview of sequence prediction results. In the probabilistic setting, good prediction performance need not be measured by a loss function. If the learner's decision is a probability distribution on X that shall approximate the real underlying distribution, there are various requirements the learner's probability distribution should have in order to be regarded as good. Dawid's prequential analysis [Dawid, 1984; Dawid and Vovk, 1999] deals with these questions of testing statistical models. Interestingly, this question again is closely related to the self-information loss function (see [Merhav and Feder, 1998]). In the following, we pick two particularly interesting topics, on the one hand Solomonoff's theory of induction in the probabilistic setting, and on the other hand, prediction with expert advice in the deterministic setting. Solomonoff's Theory of Induction We have already met the idea that learning is related to compression (see the part on Occam algorithms above), which leads to the application of information theoretic ideas to learning. Ray Solomonoff's theory of induction [Solomonoff, 1964a; Solomonoff, 1964b] reduces prediction to data compression. The idea is summarized in the following postulate, which is evidently an implementation of Occam's razor that identifies simplicity with compressibility. POSTULATE 28. Given a (finite) sequence σ over an alphabet X, predict the x ∈ X that minimizes the difference between the length of the shortest program that outputs σx (i.e.
the sequence σ followed by x) and the length of the shortest program that outputs σ.21 There seems to be a fundamental problem with this postulate, as it looks as if it depended on the chosen programming language. However, as was shown independently by Solomonoff [1964a], Kolmogorov [1965], and Chaitin [1969], asymptotically the length of two equivalent computer programs in different universal programming languages differs by at most an additive constant (stemming from the length of a compiler that translates one language into the other). Given this invariance theorem it makes sense to consider the length K(σ) of the shortest program that outputs a sequence σ, which is basically the Kolmogorov complexity 20 That is, if one wants to encode words over the alphabet X where the probability of a letter x ∈ X is d(x), then an optimal binary encoding (to keep the average word length as small as possible) will assign the letter x a code of length about − log2 d(x), see [Shannon, 1948] and [Rissanen, 1976]. 21 Actually, one may also predict more than a single element, so that in general x may be a (finite) sequence over X as well.
of the sequence.22 It can be shown that prediction under Postulate 28, which chooses x so that K(σx) − K(σ) is minimized, works well in the limit for a large majority of sequences, provided that the sequence is binary (i.e. X = {0, 1}) and the underlying distribution generating the sequence satisfies some benign technical assumptions. THEOREM 29 [Vitányi and Li, 2000]. Let P be a distribution on the set {0, 1}∞ of all possible infinite sequences over {0, 1} that generates an infinite sequence ω.23 Assume that P is a recursive measure24 and that ω is a P-random25 sequence. Then the x that maximizes P(x|σ) minimizes K(σx) − K(σ) with P-probability converging to 1, as the length of σ tends to infinity. Solomonoff was not only interested in prediction. His motivation was to determine the degree of confirmation26 that a sequence σ is followed by x. Thus, the aim is to obtain a respective probability distribution on all possible continuations of the sequence σ. Solomonoff uses a Bayesian approach to achieve this. For the general problem of choosing a suitable prior probability distribution, a universal distribution U (which is also closely related to the notion of Kolmogorov complexity) is defined which prefers simple continuations of σ and exhibits some favorable properties [Solomonoff, 1978].27 In particular, the universal distribution converges fast to the real underlying probability distribution (under similar assumptions as in Theorem 29). THEOREM 30 (Gács) [Li and Vitányi, 1997]. Given that P is a positive recursive measure over {0, 1}∞ that generates a P-random binary infinite sequence, U(x|σ)/P(x|σ) → 1 with P-probability 1, when the length of σ tends to infinity. Moreover, the sum over the expected squared errors is basically bounded by
22 Actually
there are various (Kolmogorov) complexity variants that coincide up to an additive constant (that depends on the sequence σ). For the sake of simplicity we are not going to distinguish them here. See Section 4.5 of [Li and Vitányi, 1997] for details. 23 Note that the assumption made in this setting is different from the PAC learning setting. Whereas in PAC learning it is assumed that each single example is drawn according to a fixed distribution, here the distribution is over all possible infinite sequences, which is actually more general. 24 This is a modest assumption on the computability of the distribution function, see Chapter 4 of [Li and Vitányi, 1997]. 25 A sequence σ = σ1σ2 . . . is P-random if sup_n U(σ1 . . . σn)/P(σ1 . . . σn) < ∞, where U is the universal prior distribution (cf. below). In the set of all infinite sequences, the P-random sequences have P-measure 1, that is, almost all considered sequences will be P-random, see Section 4.5 of [Li and Vitányi, 1997]. 26 A student of Carnap, Solomonoff explicitly refers to Carnap's [1950]. For more about Carnap's role see [Solomonoff, 1997]. 27 It has been argued [Kelly, 2004a] that Solomonoff's theory of induction only provides a circular argument for Occam's razor, as the chosen prior already prefers short descriptions. However, this neglects that the prior distribution itself is a good predictor, as the subsequent results show.
the Kolmogorov complexity K(·) of the underlying distribution [Li and Vitányi, 1997].28 THEOREM 31 [Solomonoff, 1978]. If P is a recursive measure over {0, 1}∞ that generates a binary sequence, then

Σ_n Σ_{|σ|=n−1} P(σ) (U(0|σ) − P(0|σ))² ≤ (K(P) ln 2) / 2,
where |σ| denotes the length of the sequence σ. Unfortunately, neither Kolmogorov complexity nor the universal prior distribution is computable [Li and Vitányi, 1997]. Thus, while Solomonoff's framework of algorithmic probability may offer a theoretical solution to the problem of induction, it cannot be directly applied to practical problems. However, on the other hand there are some principal theoretical limitations on the computability of prediction algorithms, cf. e.g. [Putnam, 1963]29 and [V'yugin, 1998].30 In fact, it has been argued that there is a strong analogy between uncomputability and the problem of induction [Kelly, 2004c]. Another problem is that the constant of the invariance theorem mentioned above will in general be quite large, so that for short sequences the theoretical results are worthless and the approach may not work well in practice. In spite of (or maybe because of) these two deficiencies, Solomonoff's work has ignited a lot of research that on the one hand improved on the theoretical results [Li and Vitányi, 1997; Hutter, 2001; Hutter, 2004; Hutter, 2007], while on the other hand, many practical approaches can be considered as approximations to his uncomputable algorithm. In particular, the MDL approach mentioned in Section 1.2 (see Postulate 9) emanated from Solomonoff's work. For a closer comparison of the two frameworks see Chapter 17 of [Grünwald, 2007]. Prediction with Expert Advice In the deterministic setting, where the underlying sequence that shall be predicted is considered to be fixed, it will be necessary to compete with the best hypothesis in a confined hypothesis space, as it is impossible to compete with perfect prediction in general (similarly to the no-free-lunch theorem). On the other hand, it is obviously futile to predict deterministically. For each deterministic prediction there is a sequence on which the prediction will be wrong (and, more generally, will maximize the loss function).
Thus, the learner has to maintain a probability distribution on the possible predictions. Note that this probability distribution is used for randomization. In contrast, in the probabilistic setting, the learner often uses a probability distribution to approximate the real underlying distribution. 28 For the definition of the Kolmogorov complexity of a distribution we refer to Chapter 4 of [Li and Vitányi, 1997]. 29 Cf. footnote 19. 30 For further theoretical limitations due to Gödel's incompleteness results see [Legg, 2006].
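The futility of deterministic prediction mentioned above can be demonstrated directly: an adversary that always reveals the opposite of the forecast makes any deterministic predictor err in every single round, while a fair-coin predictor errs on the resulting fixed sequence only about half the time. A toy sketch (the majority predictor is an arbitrary illustrative choice):

```python
import random

def adversarial_sequence(predict, T):
    """Binary sequence chosen by an adversary that always reveals the
    opposite of the deterministic forecast: the predictor errs every round."""
    history, mistakes = [], 0
    for _ in range(T):
        guess = predict(history)
        truth = 1 - guess      # the adversary flips the prediction
        mistakes += 1          # guess != truth holds by construction
        history.append(truth)
    return history, mistakes

def majority(history):
    """An (arbitrary) deterministic predictor: the majority bit seen so far."""
    return 1 if 2 * sum(history) > len(history) else 0

T = 1000
seq, mistakes = adversarial_sequence(majority, T)
print("deterministic predictor:", mistakes, "mistakes in", T, "rounds")

# The same fixed sequence costs a fair-coin predictor only about T/2 mistakes.
rng = random.Random(0)
coin_mistakes = sum(1 for x in seq if rng.randrange(2) != x)
print("randomized predictor:   ", coin_mistakes, "mistakes in", T, "rounds")
```

The argument applies to any deterministic predictor, since the adversary only needs to see the forecast before fixing the next bit of the sequence.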
Usually (cf. also the concept learning setting), the set of hypotheses will be large in order to guarantee that there is a hypothesis that predicts the sequence well. For more about this setting see [Merhav and Feder, 1998]. Here we will consider the setting where the hypothesis space H is finite. The hypotheses in H are usually referred to as experts that serve as reference forecasters. We assume that the experts' predictions are taken from the same decision space D the learner chooses her prediction from. (In special cases D may equal X.) Note that as it is not known how the experts determine their predictions, there is no assumption about this. The learner may use these experts' advice to determine her own prediction. Note that in general the learner's prediction will not coincide with any of the experts' predictions: all the experts may suggest a deterministic prediction, while we have already seen that it only makes sense for the learner to predict randomly according to some distribution. The learner's goal is to compete with the best expert. That is, the learner will suffer a loss of ℓ(dt, xt) at time t for her decision dt ∈ D. Similarly, at time t each expert E ∈ H has loss ℓ(d^E_t, xt) for his decision d^E_t. Competing with the best expert then means that the learner will try to keep the regret with respect to the best expert

R_T := Σ_{t=1}^T ℓ(dt, xt) − min_{E∈H} Σ_{t=1}^T ℓ(d^E_t, xt)
as low as possible. Surprisingly, under some mild technical assumptions, one can show that for some learning algorithms the average regret (over time) tends to 0 for each individual sequence, when T approaches infinity. THEOREM 32 [Auer et al., 2002; Cesa-Bianchi and Lugosi, 2006]. Assume that the decision set D is a convex subset of Rn and consider some expert set H. Further let ℓ be a loss function that takes values only in the interval [0, 1] and is convex in the first argument,31 i.e. for each x ∈ X, λ ∈ [0, 1] and d, d′ ∈ D:

ℓ(λd + (1 − λ)d′, x) ≤ λℓ(d, x) + (1 − λ)ℓ(d′, x).

Then the regret R_T of the exponentially weighted forecasting algorithm (as specified on pp. 14 and 17 in [Cesa-Bianchi and Lugosi, 2006]) can be bounded as

R_T ≤ √(T ln |H| / 2) + √(ln |H| / 8)

for each T > 0 and each individual sequence over X. What is remarkable about this theorem is that it does not need any inductive assumptions. The reason why the theorem holds for any sequence is that, intuitively speaking, by considering the loss with respect to the best expert only the difference to this best expert matters, so that the underlying sequence in some
31 The convexity condition holds e.g. for the square loss function (1) and the logarithmic loss function (2).
sense is not important anymore. Of course, practically the theorem only has impact if there is at least one expert whose predictions are good enough to keep the loss with respect to the underlying sequence low. Increasing the number of experts |H| to guarantee this, however, deteriorates the bound. For more results in the expert advice setting see [Cesa-Bianchi and Lugosi, 2006], which also deals with applications to game theory.
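A minimal version of the exponentially weighted forecaster of Theorem 32 is easy to state. The sketch below is our own simplification of the algorithm in Cesa-Bianchi and Lugosi [2006]; the constant experts, the random outcome sequence, and the learning-rate tuning are illustrative assumptions, not taken from the text. It checks the regret bound under the squared loss:

```python
import math
import random

def ewa(expert_preds, outcomes, loss):
    """Exponentially weighted average forecaster (minimal sketch): predict
    the weighted mean of the experts' predictions, then downweight each
    expert exponentially in the loss it has accumulated."""
    n, T = len(expert_preds), len(outcomes)
    eta = math.sqrt(8 * math.log(n) / T)  # a standard tuning of the rate
    cum_loss = [0.0] * n
    learner_loss = 0.0
    for t in range(T):
        weights = [math.exp(-eta * L) for L in cum_loss]
        total = sum(weights)
        d = sum(w * expert_preds[e][t] for e, w in enumerate(weights)) / total
        learner_loss += loss(d, outcomes[t])
        for e in range(n):
            cum_loss[e] += loss(expert_preds[e][t], outcomes[t])
    return learner_loss, min(cum_loss)

sq = lambda d, x: (d - x) ** 2  # squared loss, in [0, 1] for d, x in [0, 1]

rng = random.Random(1)
T = 500
outcomes = [rng.random() for _ in range(T)]
experts = [[0.0] * T, [0.5] * T, [1.0] * T]  # three constant forecasters

L, L_best = ewa(experts, outcomes, sq)
regret = L - L_best
bound = math.sqrt(T * math.log(3) / 2) + math.sqrt(math.log(3) / 8)
print(f"regret = {regret:.2f}, bound of Theorem 32 = {bound:.2f}")
```

Since the bound of Theorem 32 holds for every individual sequence, the comparison above would come out the same way for any choice of outcomes, not just the random one used here.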
2 NONMONOTONIC REASONING
2.1 Introduction In order to cope successfully with the real world, AI applications need to reproduce patterns of everyday commonsense reasoning. As theoretical computer scientists began to realize in the late 1970s, such patterns of inference are hard, if not impossible, to formalize in standard first-order logic. New proof-theoretic and semantic mechanisms were sought after by which conclusions could be inferred in all “normal” cases in which the premises were true, thus trying to capture the way in which human agents fill knowledge gaps by means of default assumptions, in particular, conditional default assumptions of an ‘if. . .then. . .’ form. EXAMPLE 33. Assume you want to describe what happens to your car when you turn the ignition key: ‘If the ignition key is turned in my car, then the car starts.’ seems to be a proper description of the situation. But how shall we represent this claim in a first-order language? The standard way of doing it, according to classical AI, would be in terms of universal quantification and material implication, i.e. by means of something of the form ∀t(ϕ[t] → ψ[t]), where ϕ, ψ are complex formulas, t is a variable for points of time, and → is the material conditional sign. But what if the gas tank is empty? You had better improve your description by adding a formalization of ‘. . .and the gas tank is not empty’ to ϕ. However, the resulting statement could still be contradicted by a potato that is clogging the tail pipe, or by a failure of the battery, or by an extra-terrestrial blocking your engine, and so forth. The possible exceptions to universally quantified material conditionals are countless, heterogeneous, and unclear. Nevertheless we seem to be able to communicate and reason rationally with the original information ‘If the ignition key is turned in my car, then the car starts.’, and the same should be true of intelligent computers. How are human agents able to circumvent this problem?
The key to an answer is to understand that we do not actually take ‘If the ignition key is turned in my car, then the car starts.’ as expressing that at any point of time it is not the case that the ignition key is turned and the car does not start — after all, what is negated here might indeed be the case in exceptional circumstances — but rather that, normally, given that the ignition key is turned at a time, the car starts. Instead of trying to enumerate a possibly indefinite class of exceptions, we tacitly or explicitly qualify ‘If the ignition key is turned in my car, then the car starts.’ as saying
something about normal or likely circumstances, whatever these circumstances may look like. As a consequence, the logic of such everyday if-then claims differs from the logic of (universally quantified) material conditionals in first-order logic. In particular, while Monotonicity (or Strengthening of the Antecedent), i.e. the rule leading from ϕ → ψ to ϕ ∧ ρ → ψ, is logically valid for material → (whether in the scope of universal quantifiers or not), the acceptance of the conditional ‘If Tweety is a bird, then [normally] Tweety is able to fly.’ does not seem to rationally necessitate the acceptance of any of the following conditionals: ‘If Tweety is a penguin bird, then [normally] Tweety is able to fly.’; ‘If Tweety is a dead bird, then [normally] Tweety is able to fly.’; ‘If Tweety is a bird with his feet set in concrete, then [normally] Tweety is able to fly.’. So computer scientists found themselves in need of expressing formally if-then statements on the basis of which computers should be able to draw justified inferences about everyday matters, but where these statements do not logically obey Monotonicity; hence their speaking of ‘nonmonotonic’ conditionals or inference. This is the subject matter of Nonmonotonic Reasoning, without doubt one of the most vibrant areas of theoretical computer science in the last 30 years. Nonmonotonic reasoning systems become inductive reasoners in the sense of the Machine Learning part of this article by the following move: assume the complete information that a database contains is the factual information ϕ1, . . . , ϕm together with the conditional information α1 ⇒ β1, . . . , αn ⇒ βn, where ‘⇒’ is a new conditional connective which expresses ‘if. . . then normally . . .’. From the conditionals that are stored in the database the reasoning system now aims to derive some further conditionals the antecedents of which exhaust the complete factual information in the database, i.e. conditionals of the form ϕ1 ∧ . . .
∧ ϕm ⇒ ψ For every conditional that can be derived in this way, the factual information ψ is inferred by the system. Since the conditionals involved only express what holds in normal circumstances, this is an inductive inference from ϕ1 , . . . , ϕm to ψ under the tacit assumption that the reasoning system does not face an abnormal situation. The antecedents of the conditionals which the system aims to derive have to consist of the total factual information that is accessible to the system, as it would be invalid to strengthen weaker antecedents by means of the Monotonicity rule. On the methodological side, the main question to be answered at this point is: Which
740
Ronald Ortner and Hannes Leitgeb
rules of inference may the reasoner apply in order to derive further “normality conditionals” from its given “normality conditionals”? We are going to deal with this question in detail below. Here are some brief pointers to the literature: although Ginsberg [1987] is outdated as a collection of articles, it still proves to be useful if one wants to see where Nonmonotonic Reasoning derives from historically. Brewka, Dix, and Konolige [1997] and Makinson [2005] give excellent and detailed overviews of Nonmonotonic Reasoning. Schurz and Leitgeb [2005] is an informative compendium of papers dealing with some of the more empirical and philosophical aspects of Nonmonotonic Reasoning; for more references to psychological investigations into nonmonotonic reasoning see Oaksford, Chater, and Hahn [2009]. Finally, the Stanford Encyclopedia of Philosophy includes two nice entries on “Non-monotonic Logic” and “Defeasible Reasoning” which can be accessed online.
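The inductive move described above can be sketched in a few lines of Python. The representation (atoms as strings, conditionals as antecedent-set/consequent pairs) and all names are our own illustration, not KLM machinery; no derivation rules are applied yet, so a fact ψ is inferred only when a stored conditional's antecedent exactly exhausts the total factual information:

```python
# Toy sketch of the inductive move: psi is inferred exactly when a stored
# conditional whose antecedent matches the *total* factual information has
# psi as its consequent. Antecedents may not be weakened: no Monotonicity.

def inferred_facts(facts, conditionals):
    """facts: a set of atoms; conditionals: pairs (antecedent_atoms, consequent)."""
    total = frozenset(facts)
    return {psi for (ante, psi) in conditionals if frozenset(ante) == total}

conditionals = [({"key_turned"}, "car_starts"),
                ({"key_turned", "battery_dead"}, "car_silent")]
print(inferred_facts({"key_turned"}, conditionals))                  # {'car_starts'}
print(inferred_facts({"key_turned", "battery_dead"}, conditionals))  # {'car_silent'}
```

Note that adding the abnormality fact `battery_dead` withdraws the earlier conclusion `car_starts`, which is exactly the nonmonotonic behaviour at issue.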
2.2 Nonmonotonic Reasoning: The KLM Approach

There are, broadly speaking, two approaches to formalizing statements such as ‘If the ignition key is turned in my car, then the car starts.’ or ‘If Tweety is a bird, then Tweety is able to fly.’ in terms of nonmonotonic conditionals: either exceptional circumstances are represented explicitly as those which contradict certain explicitly made claims, or they are left implicit, simply by not mentioning them at all. The paradigm case of the first type of formalization is default logic (see Reiter [1980]), in which e.g. the Tweety case is handled by a so-called default rule which expresses: if you know that Tweety is a bird, and nothing you know is inconsistent with Tweety being able to fly, then you are allowed to conclude that Tweety is able to fly. Such consistency-based approaches dominated the scene in the 1980s. According to the other approach, which goes back to Shoham [1987] but is exemplified most famously by the KLM approach, i.e. Kraus, Lehmann, and Magidor [1990], and which really took off in the 1990s, the conditional ‘If Tweety is a bird, then Tweety is able to fly.’ is left unchanged syntactically, but the ‘if’-‘then’ connective that it contains is understood as: in the most normal or preferred circumstances in which Tweety is a bird, it is the case that Tweety is able to fly. In the following, we will concentrate exclusively on the second, preferential approach, which turned out to be the dominant one as far as the logical aspects of nonmonotonic reasoning are concerned. Although the KLM account was anticipated by theories in philosophical logic, philosophy of language, and inductive logic, as we will highlight in later sections, it is still widely unknown outside of computer science. So summarizing its main achievements proves to be useful even though the original sources (mainly KLM [1990] and Lehmann and Magidor [1992]) are themselves clear, self-contained and extensive.
The KLM approach also led to new logical treatments of inference in neural networks, which we will discuss briefly as well.
Conditional Theories (Nonmonotonic Inference Relations)

We are now going to deal with various systems of nonmonotonic reasoning which have been introduced by KLM [1990]. In contrast with KLM, we will not present these systems in terms of so-called inference or consequence relations, i.e. as binary relations |∼ on a propositional language L (cf. Makinson [1994] and [1989]), but rather, more syntactically minded, as conditional theories, i.e. as sets of conditionals closed under rules of nonmonotonic reasoning. So instead of saying that α |∼ β, we will say that α ⇒ β ∈ TH⇒, where TH⇒ is a theory of conditionals, and ⇒ is a new conditional connective which we will use to express nonmonotonic conditionals. In this way, it will be easier to compare the logical systems of nonmonotonic reasoning with systems of conditional logic studied in other areas. Furthermore, calling the relations |∼ consequence relations typically leads to confusion on the side of philosophers: these are not meant to be relations of logical consequence; rather, they have a methodological status similar to that of theories, i.e. they are meant to support plausible inferences within some intended domain of application. But mainly this is all just a matter of presentation; conditional theories in our sense can still be viewed as being nothing but inference relations.

L will always be some language of propositional logic that is based on finitely many propositional variables, with connectives ¬, ∧, ∨, →, ↔, ⊤ (for tautology), and ⊥ (for contradiction). L⇒ will be the set of all formulas of the form α ⇒ β for α, β ∈ L, with ⇒ being the new nonmonotonic conditional sign. Note that L⇒ allows neither for nestings of nonmonotonic conditionals nor for the application of propositional operators to nonmonotonic conditionals.
Finally, whenever we refer to a theory TH→ (rather than TH⇒), we mean a deductively closed set of formulas in L; each such set is going to entail deductively a set of material conditionals. We will always consider our conditional theories TH⇒ of “soft” or “defeasible” conditionals, such as bird ⇒ fly, as extending “hard” material conditionals, such as penguin → bird, which are entailed by some given theory TH→; the corresponding notion of ‘extending’ is made precise by the rules of Left Equivalence and Right Weakening stated below. We leave open at this point the question whether the formulas of L ought to be regarded as open formulas or rather as sentences, and whether formulas of the forms α → β and α ⇒ β ought to be regarded as tacitly quantified in some way or not. We will return to this point later. In our presentation of systems of nonmonotonic logic, we will follow the (more detailed) presentation given by Leitgeb [2004], part III.

DEFINITION 34.
1. A conditional C-theory extending TH→ is a set TH⇒ ⊆ L⇒ with the property that for all α ∈ L it holds that α ⇒ α ∈ TH⇒ (Reflexivity) and which is closed under the following rules:
(a) TH→ ⊢ α ↔ β, α ⇒ γ / β ⇒ γ (Left Equivalence)
(b) TH→ ⊢ α → β, γ ⇒ α / γ ⇒ β (Right Weakening)
(c) α ∧ β ⇒ γ, α ⇒ β / α ⇒ γ (Cautious Cut)
(d) α ⇒ β, α ⇒ γ / α ∧ β ⇒ γ (Cautious Monotonicity)

(In each rule, the formulas before the slash are the premises and the formula after it is the conclusion.)
We refer to the axiom scheme and the rules above as the system C (see KLM [1990], pp. 176–180). The rules are to be understood as follows: e.g. by Cautious Cut, if α ∧ β ⇒ γ ∈ TH⇒ (where propositional connectives such as ∧ always bind more strongly than ⇒) and α ⇒ β ∈ TH⇒, then α ⇒ γ ∈ TH⇒.
2. A conditional C-theory TH⇒ (extending whatever set TH→) is consistent iff ⊤ ⇒ ⊥ ∉ TH⇒.
3. A conditional CL-theory TH⇒ extending TH→ is a conditional C-theory extending TH→ which is also closed under the following rule:
α0 ⇒ α1, α1 ⇒ α2, . . . , αj−1 ⇒ αj, αj ⇒ α0 / αr ⇒ αr′ (Loop)
(r, r′ are arbitrary members of {0, . . . , j}). We refer to C+Loop as the system CL (see KLM [1990], p. 187).
4. A conditional P-theory TH⇒ extending TH→ is a conditional CL-theory extending TH→ which is closed under the additional rule:
α ⇒ γ, β ⇒ γ / α ∨ β ⇒ γ (Or)
We refer to CL+Or as the system P (see KLM [1990], pp. 189–190; there it is also shown that Loop can actually be derived from the other rules in P).
5. A conditional R-theory TH⇒ extending TH→ is a conditional P-theory extending TH→ which has the following property (this is a so-called non-Horn condition; see Makinson [1994], Section 4.1, for further details):
If α ⇒ γ ∈ TH⇒ and α ⇒ ¬β ∉ TH⇒, then α ∧ β ⇒ γ ∈ TH⇒ (Rational Monotonicity).
We refer to P+Rational Monotonicity as the system R (see Lehmann and Magidor [1992], pp. 16–48). Each of these rules is meant to apply for arbitrary α, β, γ, α0, α1, . . . , αj ∈ L.
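As a small illustration of how the rules of C operate, here is a semantic sketch of our own (all names and the representation are hypothetical, not part of the KLM formalism): propositions are modelled as sets of worlds compatible with a hard theory TH→ containing penguin → bird, and Cautious Monotonicity and Cautious Cut become simple operations on antecedent/consequent pairs:

```python
from itertools import product

# Worlds = valuations compatible with the hard knowledge TH→: penguin → bird.
atoms = ("bird", "penguin", "fly")
worlds = [dict(zip(atoms, bits)) for bits in product([False, True], repeat=3)
          if not (bits[1] and not bits[0])]

def prop(pred):
    """The proposition expressed by pred: the set of worlds where it holds."""
    return frozenset(i for i, w in enumerate(worlds) if pred(w))

bird, penguin = prop(lambda w: w["bird"]), prop(lambda w: w["penguin"])
no_fly = prop(lambda w: not w["fly"])

def cautious_monotonicity(c1, c2):
    # from alpha => beta and alpha => gamma infer alpha AND beta => gamma
    (a1, beta), (a2, gamma) = c1, c2
    assert a1 == a2, "both premises need the same antecedent"
    return (a1 & beta, gamma)

def cautious_cut(c1, c2):
    # from alpha AND beta => gamma and alpha => beta infer alpha => gamma
    (ab, gamma), (a, beta) = c1, c2
    assert ab == a & beta, "first antecedent must be the conjunction"
    return (a, gamma)

# Supra-Classicality (derivable in C): TH→ entails penguin → bird, so penguin => bird.
c_pb, c_pnf = (penguin, bird), (penguin, no_fly)        # the latter taken as given
c_pbnf = cautious_monotonicity(c_pb, c_pnf)             # penguin AND bird => not-fly
assert cautious_cut(c_pbnf, c_pb) == (penguin, no_fly)  # back to penguin => not-fly
```

Since Left Equivalence holds automatically when propositions are represented as world-sets, only the remaining rules need explicit operations in this sketch.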
REMARK 35.
• It is easy to see that a conditional C-theory TH⇒ is consistent iff TH⇒ is non-trivial, i.e. TH⇒ ≠ L⇒ (use Right Weakening and Cautious Monotonicity).
• If a conditional C-theory TH⇒ extending TH→ is consistent, then TH→ is consistent as well, i.e. TH→ ⊬ ⊥ (use Reflexivity and Right Weakening).

Cumulativity, i.e. Cautious Cut and Cautious Monotonicity taken together, was suggested by Gabbay [1984] as a valid closure property of plausible reasoning. The stronger system P, which extends Cumulativity by a rule for disjunction, has become the standard system of nonmonotonic logic and can be proved sound and complete with respect to many different semantics of nonmonotonic logic (some of them are collected in Gabbay, Hogger, and Robinson [1994]; see also Gärdenfors and Makinson [1994], Chapter 4.3 in Fuhrmann [1997], Benferhat, Dubois, and Prade [1997], Benferhat, Saffiotti, and Smets [2000], Goldszmidt and Pearl [1996], Pearl and Goldszmidt [1997], Halpern [2001b]). Below, we are going to deal with the most influential semantics for the logical systems introduced above: the preferential semantics of KLM [1990].

Derivable Rules

LEMMA 36. (KLM [1990], pp. 179–180) The following rules are derivable in C, i.e. if the premises of the following rules are members of a conditional C-theory TH⇒, then one can prove that the same holds for their conclusions:
1. α ⇒ β, α ⇒ γ / α ⇒ β ∧ γ (And)
2. α ⇒ β, β ⇒ α, α ⇒ γ / β ⇒ γ (Equivalence)
3. α ⇒ (β → γ), α ⇒ β / α ⇒ γ (Modus Ponens in the Consequent)
4. α ∨ β ⇒ α, α ⇒ γ / α ∨ β ⇒ γ
5. TH→ ⊢ α → β / α ⇒ β (Supra-Classicality)
LEMMA 37. (KLM [1990], p. 191) The following rules are derivable in P:
1. α ∧ β ⇒ γ / α ⇒ (β → γ) (S)
2. α ∧ β ⇒ γ, α ∧ ¬β ⇒ γ / α ⇒ γ (D)
By means of any of the semantics for these systems of nonmonotonic logic, it is easy to prove that neither Contraposition nor Transitivity nor Monotonicity (for ⇒) is derivable in any of them. The following examples from everyday reasoning show that this is exactly as it ought to be:

EXAMPLE 38.
• If a is a human, then normally a is not a diabetic.
  If a is a diabetic, then normally a is not human. ??? (Contraposition)
• If a is from Munich, then normally a is a German.
  If a is a German, then normally a is not from Munich.
  If a is from Munich, then normally a is not from Munich. ??? (Transitivity)
• If a is a bird, then normally a is able to fly.
  If a is a penguin bird, then normally a is able to fly. ??? (Monotonicity)
Derivability of Conditionals from Conditional Knowledge Bases

The notion of derivability of a conditional from a set of conditionals (in AI terms: from a conditional knowledge base) is defined in analogy with derivability for formulas of classical propositional logic, with the exception of the system R.

DEFINITION 39. Let KB⇒ ⊆ L⇒:
1. A C-derivation (rel. to TH→) of ϕ ⇒ ψ from KB⇒ is a finite sequence α1 ⇒ β1, . . . , αk ⇒ βk where αk = ϕ, βk = ψ, and for all i ∈ {1, . . . , k} at least one of the following conditions is satisfied:
• αi ⇒ βi ∈ KB⇒.
• αi ⇒ βi is an instance of Reflexivity.
• αi ⇒ βi is the conclusion of one of the rules of C, such that the conditional premises of that rule are among {α1 ⇒ β1, . . . , αi−1 ⇒ βi−1}, and in the case of Left Equivalence and Right Weakening the derivability conditions concerning TH→ are satisfied.
2. KB⇒ ⊢_C^{TH→} ϕ ⇒ ψ (ϕ ⇒ ψ is C-derivable rel. to TH→ from KB⇒) iff there is a C-derivation of ϕ ⇒ ψ rel. to TH→ from KB⇒.
3. Ded_C^{TH→}(KB⇒) = {ϕ ⇒ ψ | KB⇒ ⊢_C^{TH→} ϕ ⇒ ψ} (the conditional C-closure of KB⇒ rel. to TH→).
4. ϕ ⇒ ψ is C-provable (rel. to TH→) iff ∅ ⊢_C^{TH→} ϕ ⇒ ψ.

Analogous concepts can be introduced for the systems CL and P.

REMARK 40.
1. As in the case of deductive derivability, it follows that
(a) KB⇒ ⊆ Ded_C^{TH→}(KB⇒).
(b) If KB⇒ ⊆ KB′⇒, then Ded_C^{TH→}(KB⇒) ⊆ Ded_C^{TH→}(KB′⇒).
(c) Ded_C^{TH→}(Ded_C^{TH→}(KB⇒)) = Ded_C^{TH→}(KB⇒).
2. TH⇒ is a conditional C-theory extending TH→ iff Ded_C^{TH→}(TH⇒) = TH⇒. Since Ded_C^{TH→}(Ded_C^{TH→}(KB⇒)) = Ded_C^{TH→}(KB⇒), Ded_C^{TH→}(KB⇒) is a conditional C-theory extending TH→ for arbitrary KB⇒. In particular, Ded_C^{TH→}(∅) (the set of formulas which are C-provable rel. to TH→) is a conditional C-theory extending TH→.
3. Ded_C^{TH→}(KB⇒) is the smallest conditional C-theory extending TH→ which contains KB⇒.

EXAMPLE 41. Assume KB⇒ to consist of
1. bird ⇒ fly,
2. penguin ⇒ ¬fly.
Suppose TH→ contains
3. penguin → bird.
By an application of Supra-Classicality to 3, one can derive
4. penguin ⇒ bird.
Applying Cautious Monotonicity to 4 and 2 yields
5. penguin ∧ bird ⇒ ¬fly.
This can be interpreted as follows: since by conditional 3 the penguin information is at least as specific as mere bird information, conditional 2 overrides conditional 1: penguin birds are derived to be unable to fly.

In the case of R, derivability has to be defined differently due to the presence of a non-Horn rule, i.e. Rational Monotonicity:
DEFINITION 42.
1. KB⇒ ⊢_R^{TH→} ϕ ⇒ ψ (ϕ ⇒ ψ is R-derivable rel. to TH→ from KB⇒) iff ϕ ⇒ ψ is a member of ⋂{TH⇒ | TH⇒ ⊇ KB⇒, TH⇒ is a conditional R-theory extending TH→}.
2. Ded_R^{TH→}(KB⇒) = {ϕ ⇒ ψ | KB⇒ ⊢_R^{TH→} ϕ ⇒ ψ} (the conditional R-closure of KB⇒ rel. to TH→).
3. ϕ ⇒ ψ is R-provable (rel. to TH→) iff ∅ ⊢_R^{TH→} ϕ ⇒ ψ.

Ded_R^{TH→} satisfies the same closure conditions as stated above. In particular, note that the deductive closure operator of each of these systems of nonmonotonic logic is monotonic (see 1b in Remark 40 above); so these logics are nonmonotonic only in the sense that they are logical systems for conditionals which are not monotonic with respect to their antecedents, i.e. which do not obey Monotonicity as a logically valid rule. In other words, the term ‘nonmonotonic reasoning’ is ambiguous: it can refer to ‘inference by means of nonmonotonic conditionals’ (this is what we have considered so far), or to ‘nonmonotonic deductive closure/entailment’ (this is what we will deal with in Subsection 2.2 below), or to both.

Obviously, the system C is weaker than CL in terms of derivability, and the system CL is weaker than P, where the weaker-than relations in question are not reversible. More surprisingly, P and R are equally strong in terms of derivability:

THEOREM 43. (See Lehmann and Magidor [1992], pp. 24f, for the semantic version of this result.) KB⇒ ⊢_P^{TH→} α ⇒ β iff KB⇒ ⊢_R^{TH→} α ⇒ β.

With respect to provability, i.e. derivability from the empty knowledge base, all of the systems dealt with above turn out to be equally strong, and in fact just as strong as classical logic (if ⇒ is replaced by →).

Preferential Semantics

Next we follow KLM [1990] and Lehmann and Magidor [1992] by introducing preferential or ranked models for conditional theories, where the intended interpretation of the preference relations or rankings that are part of such models is in terms of “degrees of normality”.
(Such ranked models are also closely related to Spohn’s [1987] so-called ordinal conditional functions or ranking functions.) For each of these types of models, we presuppose a non-empty set W of possible worlds which we consider to be given antecedently and which we think of as representing an agent’s “hard” knowledge. Each world w ∈ W is assumed to stand in a standard satisfaction relation ⊨ with respect to the formulas of L.
DEFINITION 44.
1. A cumulative model M is a triple ⟨S, l, ≺⟩ with
(a) a non-empty set S of so-called “states”,
(b) a labeling l : S → 2^W \ {∅} of states,
(c) a normality “order”, or preference relation, ≺ ⊆ S × S between states; if s1 ≺ s2, we say that s1 is more normal than s2 (note that ≺ is not necessarily a strict order relation);
(d) such that M satisfies the Smoothness Condition (see below).
2. Factual formulas α ∈ L are made true by states s ∈ S in the following way: s |≡ α iff ∀w ∈ l(s): w ⊨ α (in such a case we also say that s is an α-state).
3. For every α ∈ L let α̂ = {s ∈ S | s |≡ α}.
4. For every α ∈ L: s ∈ α̂ is minimal in α̂ iff ¬∃s′ ∈ α̂: s′ ≺ s.
5. The Smoothness Condition: every state that makes α true is either itself most normal among the states which make α true, or there is a more normal state that makes α true and which is also most normal among the states that make α true; i.e.:
∀α ∈ L, ∀s ∈ α̂: s is minimal in α̂, or ∃s′ ≺ s such that s′ is minimal in α̂.
6. Relative to a cumulative model M = ⟨S, l, ≺⟩, we can define:
M ⊨ α ⇒ β iff ∀s ∈ S: if s is minimal in α̂, then s |≡ β
(i.e.: the most normal states among those that make α true also make β true, or: normal α are β).
7. Let TH⇒(M) = {α ⇒ β | M ⊨ α ⇒ β}: TH⇒(M) is the conditional theory corresponding to M.
8. α ⇒ β is cumulatively valid iff for every cumulative model M: M ⊨ α ⇒ β.
9. Let M be a cumulative model: M ⊨ KB⇒ iff for every α ⇒ β ∈ KB⇒ it holds that M ⊨ α ⇒ β.
10. We say that KB⇒ ⊨c α ⇒ β (KB⇒ cumulatively entails α ⇒ β) iff for every cumulative model M: if M ⊨ KB⇒, then M ⊨ α ⇒ β.
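Definition 44 can be turned into a tiny model checker. The following sketch is our own illustration (with hypothetical names); for brevity, each state is labelled by a single world, as in the preferential models introduced below:

```python
# M ⊨ α ⇒ β iff every state minimal among the α-states is a β-state.

def minimal(states, prec):
    """States with no strictly preferred state among the given ones."""
    return {s for s in states if not any((t, s) in prec for t in states)}

def satisfies(model, alpha, beta):
    """model = (states, label, prec); alpha, beta are predicates on worlds."""
    states, label, prec = model
    a_states = {s for s in states if alpha(label[s])}
    return all(beta(label[s]) for s in minimal(a_states, prec))

# Tweety: state 0 (a normal bird) is preferred to state 1 (a penguin).
label = {0: {"bird": True, "penguin": False, "fly": True},
         1: {"bird": True, "penguin": True, "fly": False}}
model = ({0, 1}, label, {(0, 1)})
assert satisfies(model, lambda w: w["bird"], lambda w: w["fly"])
assert not satisfies(model, lambda w: w["penguin"], lambda w: w["fly"])
```

Note how Monotonicity fails in this model: bird ⇒ fly holds while penguin ⇒ fly does not, even though every penguin-world in the model is a bird-world.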
The additional types of models we will study are:

DEFINITION 45.
1. A cumulative-ordered model M is a cumulative model ⟨S, l, ≺⟩ such that ≺ is a strict partial order, i.e. irreflexive and transitive.
2. A preferential model M is a cumulative-ordered model ⟨S, l, ≺⟩ such that ∀s ∈ S: l(s) is a singleton, i.e. l(s) = {w} for some w ∈ W.
3. A ranked model M is a preferential model ⟨S, l, ≺⟩ where for some k ∈ N there is a surjective mapping rk : S → {0, . . . , k}, such that for all s1, s2 ∈ S: s1 ≺ s2 iff rk(s1) < rk(s2) (rk(s) is called the ‘rank’ of s under rk).

[Figure: a typical ranked model with three layers of states of equal rank; within the set of α-states, min(α̂) marks the minimal ones.]

This model consists of three layers of worlds of equal rank. Within the set of α-states, the minimal ones are singled out, as they are taken to minimize “abnormality”; if these minimal α-states are all β-states, then α ⇒ β is considered satisfied by the model. For each of these classes of models, the corresponding notions of satisfaction, determined conditional theories, validity (cumulative-ordered-valid, preferentially valid, rank-valid), and entailment (⊨co, ⊨p, ⊨r, i.e. cumulative-ordered-entails, preferentially entails, rank-entails) can be introduced in analogy with the case of cumulative models. The definition of ranked models in Lehmann and Magidor [1992] is actually more complex than ours, but our definition is equivalent for the case of a finite set W of worlds, and it is certainly handier. Obviously, the various kinds of entailment defined above come with strictly increasing strength, except for (see Lehmann and Magidor [1992]):

THEOREM 46. KB⇒ preferentially entails α ⇒ β iff KB⇒ rank-entails α ⇒ β.

As for validity, all notions of validity corresponding to the classes of models defined above coincide; indeed, they coincide with validity for material conditionals α → β.

REMARK 47. {α ⇒ β ∈ L⇒ | α → β ∈ TH→(W)} is a conditional theory of any of the defined types.
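For ranked models, satisfaction reduces to a comparison of ranks. The following sketch (ours, purely illustrative) checks whether the least-rank α-states are all β-states:

```python
# Ranked-model sketch: each state carries a numerical rank ("degree of
# abnormality"); α ⇒ β holds iff the α-states of least rank are all β-states.

def rank_satisfies(states, alpha, beta):
    """states: list of (rank, world) pairs; alpha, beta: predicates on worlds."""
    a_states = [(r, w) for (r, w) in states if alpha(w)]
    if not a_states:
        return True                # no α-state: the conditional holds vacuously
    least = min(r for (r, w) in a_states)
    return all(beta(w) for (r, w) in a_states if r == least)

states = [(0, {"bird": True,  "penguin": False, "fly": True}),
          (1, {"bird": True,  "penguin": True,  "fly": False}),
          (2, {"bird": False, "penguin": False, "fly": False})]
assert rank_satisfies(states, lambda w: w["bird"], lambda w: w["fly"])
assert not rank_satisfies(states, lambda w: w["bird"] and w["penguin"],
                          lambda w: w["fly"])
```

The three tuples play the role of the three layers in the diagram above: the lower the rank, the more normal the state.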
KLM [1990] and Lehmann and Magidor [1992] show the following soundness and completeness properties of systems of nonmonotonic logic with respect to preferential semantics:
THEOREM 48.
1. (KLM [1990], pp. 184–185) TH⇒ ⊆ L⇒ is a consistent conditional C-theory extending TH→ iff there is a cumulative model M based on the set W of worlds satisfying TH→, such that TH⇒ = TH⇒(M).
2. (KLM [1990], p. 189) TH⇒ ⊆ L⇒ is a consistent conditional CL-theory extending TH→ iff there is a cumulative-ordered model M based on the set W of worlds satisfying TH→, such that TH⇒ = TH⇒(M).
3. (KLM [1990], p. 196) TH⇒ ⊆ L⇒ is a consistent conditional P-theory extending TH→ iff there is a preferential model M based on the set W of worlds satisfying TH→, such that TH⇒ = TH⇒(M).
4. (Lehmann and Magidor [1992], pp. 21–23) TH⇒ ⊆ L⇒ is a consistent conditional R-theory extending TH→ iff there is a ranked model M based on the set W of worlds satisfying TH→, such that TH⇒ = TH⇒(M).

THEOREM 49. Let TH→ be the set of formulas satisfied by every world in the given set W of worlds. It holds:
1. KB⇒ ⊢_C^{TH→} α ⇒ β iff KB⇒ ⊨c α ⇒ β.
2. KB⇒ ⊢_CL^{TH→} α ⇒ β iff KB⇒ ⊨co α ⇒ β.
3. KB⇒ ⊢_P^{TH→} α ⇒ β iff KB⇒ ⊨p α ⇒ β.
4. KB⇒ ⊢_R^{TH→} α ⇒ β (iff KB⇒ ⊢_P^{TH→} α ⇒ β) iff KB⇒ ⊨r α ⇒ β.
Nonmonotonic Deductive Closure/Entailment

As we have seen in Subsection 2.2, deductive closure in nonmonotonic logic as understood above is actually monotonic, and by the results in the last subsection the same is true of the relations of logical entailment introduced above, i.e.: if KB⇒ ⊨ α ⇒ β and KB⇒ ⊆ KB′⇒, then also KB′⇒ ⊨ α ⇒ β (with ⊨ being one of these entailment relations). However, there are also some strengthenings of logical entailment which are nonmonotonic even in the entailment sense:
E.g. Lehmann and Magidor’s [1992] rational closure operator (which is virtually identical to Pearl’s [1990] so-called system Z) strengthens entailment by demanding truth preservation not in every ranked model in which a given conditional knowledge base is satisfied, but only in those ranked models which maximize cautiousness and normality, in a sense that is made precise in Lehmann and Magidor [1992]. The maximum entropy approach of Goldszmidt, Morris, and Pearl [1993] and Lehmann’s [1995] lexicographic entailment are further methods of nonmonotonic closure.

Some Complexity Considerations

While some of the consistency-based approaches to nonmonotonic reasoning, according to which exceptions to conditional defaults are stated explicitly (recall the introductory part of Subsection 2.2), do have nice implementations in terms of PROLOG or logic programs, nonmonotonic reasoning in the preferential KLM style is implemented in very much the same manner as standard systems of modal logic, and most of the complexity considerations concerning the latter (see standard textbooks on modal logic) carry over to the former. Let ϕ be the conjunction of all entries in a (finite) factual knowledge base. It is to be decided whether ϕ ⇒ ψ is entailed by the conditional knowledge base in one of the senses explained. One can show that this decision problem is co-NP-complete for preferential entailment and hence just as hard as the unsatisfiability decision problem for propositional logic (see Lehmann and Magidor [1992], p. 16). However, as Lehmann and Magidor prove as well, the decision problem is polynomial in the case of Horn assertions. Lehmann and Magidor [1992], p. 41, show that the decision procedure for rational closure is essentially as complex as the satisfiability problem for propositional logic. An excellent overview of such results can be found in Eiter and Lukasiewicz [2000].
One of the lessons to be drawn from these results is this: while progress in Nonmonotonic Reasoning has added to the expressive power of symbolic knowledge representation, it has not correspondingly increased the inferential power of symbolic reasoning mechanisms, as no ways have been found to improve their computational efficiency significantly.

The Interpretation of Conditionals Reconsidered

Computer scientists rarely address the question of what the exact interpretation of default conditionals of the form ϕ ⇒ ψ ought to be. In particular, the following two sets of locutions are often not distinguished properly: on the one hand,
• if ϕ then normally ψ
• if ϕ then it is very likely that ψ
and, on the other,
• normal ϕ are ψ
• by far most of the ϕ are ψ

In the first set, ϕ and ψ are to be replaced by sentences such as ‘Tweety is a bird.’ and ‘Tweety is able to fly.’, whereas in the second set ϕ and ψ are to be substituted by generics such as ‘birds’ and ‘flyers’ (or, in a more formalized context, by open formulas such as ‘x is a bird.’ and ‘x is able to fly.’). In the first set, ⇒ is a sentential operator, while in the second set it is actually a generalized quantifier. (See van Benthem [1984] for more on this correspondence between conditionals and quantifiers; see Peters and Westerståhl [2006] for an extensive treatment of generalized quantifiers.) As far as preferential semantics is concerned, in the second case the set W of “possible worlds” is not so much a set of possible worlds but rather a universe of “possible objects” which are ordered by the normality of their occurrence. Accordingly, if a member of the first set were intended to express something probabilistic, then the probability measure in question should be a subjective probability measure by which rational degrees of belief are attributed to propositions, whereas in the case of the members of the second set, the corresponding probability measure should be a statistical one by which (limit) percentages are attributed to properties. On both interpretations, the systems of nonmonotonic logic studied above are valid, but the application of these systems to actual reasoning tasks is still sensitive to the intended interpretation of ϕ ⇒ ψ.
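On the statistical reading, ‘by far most of the ϕ are ψ’ can be approximated over a finite domain by a high conditional frequency. The sketch below is our own illustration; in particular, the threshold 0.95 is an arbitrary choice, not part of any of the theories discussed:

```python
# Statistical reading of "by far most phi are psi": the proportion of
# psi-instances among the phi-instances exceeds a (stipulated) threshold.

def mostly(domain, phi, psi, threshold=0.95):
    phis = [x for x in domain if phi(x)]
    if not phis:
        return True                         # vacuous case: no phi-instances
    return sum(1 for x in phis if psi(x)) / len(phis) >= threshold

# 99 ordinary birds and one penguin:
birds = [{"penguin": False, "fly": True}] * 99 + [{"penguin": True, "fly": False}]
assert mostly(birds, lambda b: True, lambda b: b["fly"])              # most birds fly
assert mostly(birds, lambda b: b["penguin"], lambda b: not b["fly"])  # penguins do not
```

On the subjective reading, by contrast, the number in question would be a degree of belief rather than a frequency, so no such counting procedure applies.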
2.3 Bridges
We now turn to formalisms and theories which are, in a sense to be explained, closely related to Nonmonotonic Reasoning.

The Bridge to the Logic of Counterfactuals

Amongst conditionals in natural language, the following distinction is usually made (this famous example is due to Ernest Adams):
1. If Oswald had not killed Kennedy, then someone else would have.
2. If Oswald did not kill Kennedy, then someone else did.
not seem to logically imply “If it rains and I am in prison, I will give you an umbrella.’, nor does the subjunctive ‘If it rained, I would give you an umbrella.’ seem to logically imply “If it rained and I were in prison, I would give you an umbrella.’. Accordingly, add e.g. ‘. . .and Kennedy in fact survived all attacks on his life.’ to the antecedent of ‘If Oswald did not kill Kennedy, then someone else did.’ and the resulting conditional does not seem acceptable anymore. Therefore, philosophical logicians started to investigate new logical systems in which Monotonicity or Strengthening of the Antecedent is not logically valid. For a nice and recent introduction into this topic, presented from the viewpoint of the philosophy of language, see Bennett [2003]. We will consider the logic of indicative conditionals in our subsection on Probabilistic Logic below, but for now we are going to focus on subjunctive conditionals. D. Lewis [1973a], [1973b] famously introduced a semantics for subjunctive conditionals which we will state more or less precisely (compare Stalnaker’s related semantics in Stalnaker [1991]). Reconsider ‘If Oswald had not killed Kennedy, then someone else would have.’: according to Lewis, this counterfactual says something — in this case: something false — about the world: If the world had been such that Oswald had not killed Kennedy, but otherwise it would have been as similar as possible to what our actual world is like, then someone else would have killed Kennedy in that world. However, if we consider all the possible worlds in which Oswald did not kill Kennedy, and if we focus just on those worlds among them which are maximally similar to our actual world, then it seems we only end up with worlds in which no one killed Kennedy at all — that is exactly why we tend to think that ‘If Oswald had not killed Kennedy, then someone else would have’ is false. Now let us make this intuition about subjunctive conditionals formally precise. 
Lewis’ semantics does so by introducing the following “ingredients”: • We focus on a language L that is closed under the following syntactic rules: – If A is in L, then ¬A is in L. – If A is in L and B is in L, then (A ∨ B) is in L. – If A is in L and B is in L, then (A ∧ B) is in L. – If A is in L and B is in L, then (A → B) is in L. – If A is in L and B is in L, then (A > B) is in L. (The last clause is for subjunctive conditionals.) • We choose a non-empty set W , which we call the set of possible worlds. • We assume that we can “measure” the closeness or similarity of worlds to any world w in W . Formally, this can be done by assuming that for every world w there is a sphere system Sw of “spheres” around w, i.e. a class Sw of subsets of W , such that the following two conditions are satisfied:
– {w} is a sphere in Sw,
– if X and Y are spheres in Sw, then either X is a subset of Y or Y is a subset of X.
(Lewis considers further conditions on systems of spheres, but we restrict ourselves just to the most relevant ones.)

[Figure: a system of nested spheres around the world w.]

Intuitively, such a sphere system is meant to express:
– If X is a sphere in Sw and w′ is a member of X, then w′ is closer or more similar to w than all those worlds in W that are not in X.
– If w′ is not a member of any sphere around w (formally: w′ is not a member of the union ⋃Sw of all spheres around w), then w′ is not possible relative to w.
• Finally, we consider a mapping V that maps each formula A in L and each world w in W to a truth value in {0, 1} according to the following semantic rules:
– V(¬A, w) = 1 if and only if V(A, w) = 0.
– V(A ∨ B, w) = 1 if and only if V(A, w) = 1 or V(B, w) = 1.
– V(A ∧ B, w) = 1 if and only if V(A, w) = 1 and V(B, w) = 1.
– V(A → B, w) = 1 if and only if V(A, w) = 0 or V(B, w) = 1.
– The truth condition for subjunctive conditionals: V(A > B, w) = 1 if and only if either of the following two conditions is satisfied:
∗ There is no A-world in ⋃Sw, i.e. for all worlds w′ in ⋃Sw: V(A, w′) = 0.

[Figure: no sphere around w contains an A-world.]
∗ There is a sphere X in Sw such that (i) for at least one world w′ in X it holds that V(A, w′) = 1, and (ii) for all worlds w′ in X it holds that V(A → B, w′) = 1.

[Figure: the least A-permitting sphere around w lies entirely within the B-worlds.]

• Summing up: we call ⟨W, (Sw)w∈W, V⟩ a (Lewis-)spheres model for subjunctive conditionals if and only if all of the conditions above are satisfied. (By means of ‘(Sw)w∈W’ we denote the family of sphere systems for worlds w in W within a given spheres model.)
• We call a formula ϕ in L logically true (according to the spheres semantics) if and only if ϕ is true at every world in every spheres model. Accordingly, an argument P1, . . . , Pn ∴ C is called logically valid (equivalently: C follows logically from P1, . . . , Pn) if and only if (P1 ∧ . . . ∧ Pn) → C is logically true.

Given further constraints on such models, the truth condition for subjunctive conditionals can be simplified:
• ⟨W, (Sw)w∈W, V⟩ satisfies the Limit Assumption if and only if for every world w in W, and for every A in L for which ⋃Sw contains at least one A-world, it holds that there is a least sphere X in Sw that includes a world w′ for which V(A, w′) = 1 is the case. (‘Least’ implies that every sphere that is a proper subset of X does not contain any A-world at all.)
• If ⟨W, (Sw)w∈W, V⟩ satisfies the Limit Assumption, then Lewis’ truth condition for A > B reduces to: V(A > B, w) = 1 if and only if either of the following two conditions is satisfied:
– There is no A-world in ⋃Sw, i.e. for all w′ in ⋃Sw: V(A, w′) = 0.
– If Xleast is the least A-permitting sphere, i.e. the least sphere X in Sw for which it is the case that for some world w′ in X it holds that V(A, w′) = 1, then for all worlds w′ in Xleast it is the case that V(A → B, w′) = 1. (In words: B holds at all closest A-worlds.)

Obviously, Lewis’ sphere systems with the Limit Assumption are very similar to ranked models with their Smoothness Assumption that we have discussed above,
and indeed one can be viewed as a notational variant of the other (modulo some minor differences such as the existence of a unique “most normal” world in Lewis’ semantics). Accordingly, the satisfaction clause for counterfactuals A > B in the one case mimics the satisfaction clause for nonmonotonic conditionals α ⇒ β in the other. Lewis [1973a], [1973b] showed the following soundness and completeness result:
THEOREM 50. The system VC of conditional logic (see below) is sound and complete with respect to the spheres semantics for subjunctive conditionals.
• Rules of VC:
1. Modus Ponens (for →)
2. Deduction within subjunctive conditionals: for any n ≥ 1, from (B1 ∧ . . . ∧ Bn ) → C infer ((A > B1 ) ∧ . . . ∧ (A > Bn )) → (A > C)
3. Interchange of logical equivalents
• Axioms of VC:
1. Truth-functional tautologies
2. A > A
3. (¬A > A) → (B > A)
4. (A > ¬B) ∨ (((A ∧ B) > C) ↔ (A > (B → C)))
C1 Weak Centering: (A > B) → (A → B)
C2 Centering: (A ∧ B) → (A > B)
‘V’ stands for ‘Variably strict’, which reflects that one can think of subjunctive conditionals as strict conditionals □(A → B), but with variably strict degrees of necessity; ‘C’ is short for the ‘Centering axioms’ C1 and C2 (or semantically for assuming that {w} is a sphere in Sw ). Unsurprisingly, if the logical consequence relation is restricted to counterfactuals of the form A > B with A and B not containing >, i.e. if only the so-called “flat” fragment of Lewis’ logic for counterfactuals is considered, the system P of nonmonotonic logic re-emerges. So, on the formal level, the main difference between the logic of counterfactuals and nonmonotonic logic turns out to be a syntactic one: while the former allows for nestings of conditionals and also for the application of propositional connectives to conditionals, the latter does not.
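Under the Limit Assumption, the truth condition for A > B can be made concrete with a small executable sketch. The encoding of worlds as labels, propositions as sets of worlds, and the example model are our own illustration, not part of Lewis's apparatus:

```python
def counterfactual_true(spheres, A, B, w):
    """V(A > B, w) = 1 iff no sphere around w contains an A-world (vacuous case),
    or all A-worlds in the least A-permitting sphere are B-worlds."""
    nested = sorted(spheres[w], key=len)   # spheres are nested, so size orders them
    for X in nested:
        if X & A:                          # least A-permitting sphere found
            return (X & A) <= B            # every A-world in X must be a B-world
    return True                            # vacuous case: no A-world in any sphere

# A toy model: w1 is closer to w0 than w2 is.
spheres = {"w0": [frozenset({"w0"}),
                  frozenset({"w0", "w1"}),
                  frozenset({"w0", "w1", "w2"})]}
A = {"w1", "w2"}   # antecedent true at w1 and w2
B = {"w1"}         # consequent true only at w1
print(counterfactual_true(spheres, A, B, "w0"))   # True: the closest A-world is a B-world
```

Here B holds at the closest A-world w1, so A > B comes out true at w0, even though the farther A-world w2 is not a B-world.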
The Bridge to Belief Revision

AGM [1985] — short for: Alchourrón, Gärdenfors, Makinson — and Gärdenfors [1988] have developed a now well-established theory of belief revision, i.e. a theory which states and justifies rationality constraints on how agents ought to revise their beliefs in the light of new information. In this theory a belief state of an agent is considered as a set of formulas, i.e. the set of formulas the agent believes to be true. Furthermore, agents are assumed to be rational in the sense that their belief sets are deductively closed. So we have:
• Belief set G: a deductively closed set of formulas
Now an agent is taken to receive some new evidence, where evidence is regarded as being given by a formula again:
• Evidence A: a formula
Formally, the agent’s revision of her belief set G on the basis of her new evidence A is supposed to lead to a new belief set that is denoted by ‘G ∗ A’:
• Revised belief set G ∗ A: a deductively closed set of formulas
The corresponding function ∗, which maps a pair of a formula and a set of formulas to a further set of formulas, is called the “revision operator”. How is this revision process considered to take place? In principle, there are two possible cases to consider:
• (Consistency Case) If A is consistent with G, then it is rational for the agent to simply add A to G, whence G ∗ A will presumably be simply G ∪ {A} together with all of its logical consequences.
• (Inconsistency Case) If A is inconsistent with G, then in order to revise G by A the agent has to give up some of her beliefs; she does so rationally, or so Quine, Gärdenfors, and others have argued, if she follows a principle of minimal mutilation, i.e. she gives up as few of her old beliefs as possible. This guiding idea does not necessarily determine G ∗ A uniquely, but it yields rational constraints on the belief revision operator ∗ which can be stated as axioms.
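The Consistency Case can be illustrated semantically, with belief sets represented by their sets of models. This encoding, and all names in it, are our own illustrative choices, not AGM's official syntactic presentation:

```python
from itertools import product

# Worlds over two atoms, each world encoded as the frozenset of atoms true in it.
atoms = ["p", "q"]
worlds = [frozenset(a for a, v in zip(atoms, vals) if v)
          for vals in product([True, False], repeat=len(atoms))]

def models(pred):
    """The proposition (set of worlds) satisfying a predicate."""
    return {w for w in worlds if pred(w)}

def revise_consistent(G, A):
    """Consistency Case only: if A is consistent with G, the revised belief set
    is G plus A -- semantically, the intersection of their model sets."""
    if G & A:
        return G & A
    raise NotImplementedError("the Inconsistency Case needs fallback spheres (Grove)")

G = models(lambda w: "p" in w)     # the agent believes p
A = models(lambda w: "q" in w)     # new evidence q, consistent with G
assert revise_consistent(G, A) == models(lambda w: {"p", "q"} <= w)
```

The revised model set is exactly the set of worlds satisfying both p and q, i.e. the semantic counterpart of the deductive closure of G ∪ {A}.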
Such axioms have indeed been suggested in the theory of belief revision, and they have been studied systematically in the last two decades; they are usually referred to as the ‘AGM axioms’. We will not state these axioms here, since their standard presentation is too far removed from nonmonotonic inference relations or conditionals: for more information see the references given above (moreover, Hansson [1999] is a recent textbook on belief revision). But even independently of the exact details of the axiomatic treatment of belief revision, it is clear that belief revision operators may be expected not to be monotonic in view of the Inconsistency case from above: if A logically implies B, then G ∗ A is by no means guaranteed to be logically stronger than, i.e. a superset of G ∗ B.
As Grove [1988] has shown, belief revision operators can be characterized semantically by a sphere semantics that is “almost” like Lewis’ sphere semantics for subjunctive conditionals and which is more or less identical to the ranked model semantics for nonmonotonic conditionals. Without going into the formal details, this is the main idea: Whereas in Lewis’ semantics the innermost sphere of a sphere system around a world w contains exactly one world, namely w (Centering), Grove’s sphere systems are not “around” particular worlds at all, and consequently the innermost sphere of a sphere system might contain more than one world. Indeed, the set of formulas which are true in all worlds of the innermost sphere is regarded as the “original” unrevised belief set G in a sphere system:

[Figure: a system of nested spheres whose innermost sphere determines the belief set G]

Intuitively, such a sphere system is meant to express:
• If X is a sphere and w is a member of X, then w is more plausible as a candidate for being the actual world than all those worlds that are not in X. The spheres themselves correspond to epistemic “fallback positions” that are supposed to kick in if new evidence contradicts the current belief set G.
• If w is not a member of any sphere, then w is not regarded as epistemically possible.
Alternatively, one can use a graphical representation along the lines of ranked models: Instead of proper spheres, one has layers or ranks again; the lowest layer corresponds to the innermost sphere, while taking the union of the lowest layer with the second layer from below corresponds to the next larger sphere, and so forth. G is the set of formulas that are true in all worlds which are members of the lowest layer; G ∗ A is the set of formulas which are satisfied by all those worlds that have minimal rank among the worlds that satisfy A:

[Figure: a ranked-layers representation in which G occupies the lowest layer and G ∗ A is determined by the minimal A-worlds]

For every such sphere system S in Grove’s sense, i.e.
every class S of subsets of W satisfying Lewis’ assumptions on spheres except for the “centeredness on worlds” assumption, a corresponding belief revision operator ∗S can be defined in much the same way as the truth conditions for subjunctive conditionals are
determined in Lewis’ semantics and as the satisfaction conditions for nonmonotonic conditionals are stated in preferential semantics: B ∈ G ∗S A if and only if either of the following two conditions is satisfied:
• There is no A-world in the union ⋃S of spheres in S, i.e. for all worlds w′ in ⋃S: A is false in w′.
• There is a sphere X in S, such that (i) for at least one world w′ in X it holds that A is true in w′, and (ii) for all worlds w′ in X it holds that: A → B is true in w′.
Grove [1988] proved the following theorem:
THEOREM 51.
• For every sphere system S in Grove’s sense (with W being the set of all truth value assignments over L), the corresponding operator ∗S is a belief revision operator, i.e. it satisfies the AGM axioms.
• For every belief revision operator ∗ satisfying the AGM axioms, there is a sphere system S in Grove’s sense (with W being the set of all truth value assignments over L), such that for all A, B ∈ L: B ∈ G ∗ A iff B ∈ G ∗S A
Clearly, the semantics of belief revision operators, Lewis’ semantics of subjunctive conditionals, and the ranked model semantics of nonmonotonic logic share a lot of formal structure. Accordingly, there are translation results from belief revision into nonmonotonic logic (as well as Lewis’ logic) and vice versa; see e.g. Gärdenfors and Makinson [1994]. Expressions of the form B ∈ G ∗ A are mapped thereby to expressions of the form α ⇒ β ∈ TH⇒. However, the intended philosophical interpretations of these logical frameworks differ of course: In particular, counterfactuals are meant to express something ontic, belief revision operators are meant to be epistemic, and nonmonotonic conditionals are best regarded as open to both understandings.
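The definition of ∗S can be turned into a short executable sketch. Worlds, propositions, and the sample sphere system below are our own toy encoding (propositions are handled as sets of worlds, so the revised belief set is returned as a proposition rather than a set of formulas):

```python
def revise(spheres, A):
    """Grove-style revision sketch: spheres is a nested list of world-sets,
    innermost first; A is a proposition (set of worlds). The revised belief
    proposition consists of the A-worlds in the least A-permitting sphere."""
    for X in spheres:          # epistemic fallback positions, from innermost out
        if X & A:
            return X & A       # closest A-worlds: minimal mutilation
    return set()               # A epistemically impossible: absurd belief state

spheres = [{"w0"}, {"w0", "w1"}, {"w0", "w1", "w2"}]
G = spheres[0]                 # unrevised beliefs: the innermost sphere
A = {"w1", "w2"}               # evidence inconsistent with G
print(revise(spheres, A))      # {'w1'}: the fallback sphere supplies the closest A-world
```

Note how the Inconsistency Case is handled: since no world in G satisfies A, the operator falls back to the smallest sphere that does contain an A-world.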
The Bridge to Probabilistic Logic

Since the nonmonotonicity phenomenon was already well known in probability theory — a conditional probability P (Y |X) being high does not entail the conditional probability P (Y |X ∩ Z) being high — it is not surprising that some of the modern accounts of nonmonotonic conditionals turn out to rely on a probabilistic
semantics. Let us go back to Adams’ example of an indicative conditional, i.e. ‘If Oswald did not kill Kennedy, then someone else did.’ According to Adams [1975], asserting such an indicative conditional aims at expressing that one’s subjective conditional probability of ‘Someone other than Oswald killed Kennedy.’ given that ‘Oswald did not kill Kennedy.’ is high. Adams famously developed a non-truth-conditional semantics along these lines, which we will sketch below. A more recent introduction to this type of probabilistic logic is given by Adams [1998]; Pearl [1988] nicely builds on, and extends, Adams’ original theory. Let L be the language of propositional logic. We state a probabilistic semantics for two types of formulas: (i) formulas A, B, C, D, E, F, . . . of L, (ii) formulas of the form B ⇒ C, where B and C are members of L (so we disregard again both nestings of conditionals and propositional constructions from conditionals). The formulas in (ii) are meant to represent indicative conditionals. By a probability measure on L we mean the following:
DEFINITION 52. A probability measure on L is a function P with the following properties:
1. P : L → [0, 1], i.e.: P maps each sentence in L to a real number x, such that 0 ≤ x ≤ 1.
2. For all A, B ∈ L: If A is logically equivalent to B, then P (A) = P (B).
3. For all A, B ∈ L: If A ∧ B ⊢ ⊥, then P (A ∨ B) = P (A) + P (B). That is: If two sentences are inconsistent with each other, then the probability of their disjunction equals the sum of their probabilities.
4. For all A ∈ L: If A is logically true, then P (A) = 1.
(The axioms are not meant to be independent of each other.)
Additionally, conditional probabilities can be introduced by means of the so-called “Ratio Formula”:
• For all A ∈ L with P (A) > 0,
P (B|A) = P (B ∧ A)/P (A)
where the ‘P ’ on the left-hand side denotes the conditional probability measure that belongs to, or corresponds to, the unconditional probability measure that is denoted by ‘P ’ on the right-hand side. Now we consider arguments of either of the following two forms:
(I) Premises A1 , . . . , Am and B1 ⇒ C1 , . . . , Bn ⇒ Cn ; conclusion D (a factual formula).
(II) Premises A1 , . . . , Am and B1 ⇒ C1 , . . . , Bn ⇒ Cn ; conclusion E ⇒ F (a conditional formula).

According to Adams’ semantics, such arguments are called probabilistically valid if and only if for all infinite sequences P1 , P2 , P3 , . . . of subjective probability measures on L the following is the case: if each of
Pi (A1 ), . . . , Pi (Am ), Pi (C1 |B1 ), . . . , Pi (Cn |Bn )
tends to 1 for i → ∞, then Pi (D) (in the first case) or Pi (F |E) (in the second case) tends to 1 for i → ∞,
where if Pi (ϕ) = 0 then Pi (ψ|ϕ) is regarded to be equal to 1.
REMARK 53. It is possible to omit this last extra clause if conditional probability measures — so-called Popper functions — are used from the start, rather than having conditional probabilities determined by absolute probabilities through the standard ratio formula. More about this may be found in McGee [1994], Hájek [2003], and Halpern [2001a].
So put in a slogan: An argument is valid according to Adams’ probabilistic semantics if and only if the more certain the premises, the more certain the conclusion. Adams [1975] showed the following soundness and completeness result:
THEOREM 54. The following list of rules is sound and complete with respect to probabilistic validity:
• Supraclassicality: in case A logically implies B, infer A ⇒ B.
• Left Logical Equivalence: in case A is logically equivalent with A′, from A ⇒ B infer A′ ⇒ B.
• Trivial Antecedent 1: from ⊤ ⇒ A infer A, where ⊤ is any propositional tautology.
• Trivial Antecedent 2: from A infer ⊤ ⇒ A, where ⊤ is any propositional tautology.
• Cautious Monotonicity: from A ⇒ B and A ⇒ C infer A ∧ B ⇒ C.
• Cautious Cut: from A ⇒ B and A ∧ B ⇒ C infer A ⇒ C.
• Disjunction: from A ⇒ C and B ⇒ C infer A ∨ B ⇒ C.
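That full Monotonicity is not among these rules can be checked with concrete numbers. The following toy measure (our own illustrative example, with the familiar birds-and-penguins reading) makes P (C|A) high while P (C|A ∧ B) is zero:

```python
# A toy probability measure over assignments to (bird, penguin, flies):
# the numbers are our own illustration.
P = {
    (True,  False, True ): 0.90,   # ordinary flying bird
    (True,  True,  False): 0.05,   # penguin
    (True,  False, False): 0.01,   # injured non-flying bird
    (False, False, False): 0.04,   # no bird at all
}

def prob(pred):
    return sum(p for w, p in P.items() if pred(w))

def cond(pred, given):
    """Conditional probability via the Ratio Formula; 1.0 if P(given) = 0."""
    pg = prob(given)
    return prob(lambda w: pred(w) and given(w)) / pg if pg > 0 else 1.0

bird    = lambda w: w[0]
penguin = lambda w: w[1]
flies   = lambda w: w[2]

print(cond(flies, bird))                                # 0.9375: high
print(cond(flies, lambda w: bird(w) and penguin(w)))    # 0.0: Monotonicity fails
```

So ‘bird ⇒ flies’ is acceptable on Adams' reading while ‘bird ∧ penguin ⇒ flies’ is not, which is exactly the nonmonotonic behaviour the rules above are designed to respect.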
These rules are to be understood in the way that if one has derived the premises of any of these rules from a set of factual or conditional assumptions, then one may also derive the conclusion of that rule from the same set of assumptions. Once again, one can show that neither Contraposition nor Transitivity nor Monotonicity is probabilistically valid. Indeed, Adams’ logic of indicative conditionals is nothing else but the system P of nonmonotonic logic that has been discussed above. This probabilistic style of doing nonmonotonic reasoning has become quite prominent in the meantime (see e.g. Lukasiewicz [2002]) and connects Nonmonotonic Reasoning to an area that is sometimes referred to as ‘Uncertain Reasoning’ (see Paris [1994] for an excellent introduction to all formal aspects of uncertain reasoning). As Adams himself has observed (see also Snow [1999]), an equivalent probabilistic semantics can be given in terms of so-called “probabilistic orders of magnitude” which replace the qualitative degrees of normality in preferential semantics. (See Schurz [2001] for a philosophical investigation into the conceptual differences between qualitative and statistical notions of normality.) Lehmann and Magidor [1992], pp. 48–53, suggest a probabilistic semantics for their system R in terms of probability measures which allow for nonstandard real number values. See McGee [1994], Hawthorne [1996], Bamber [2000], Halpern [2001a], and Arló-Costa and Parikh [2005] for further probabilistic accounts of conditionals, nonmonotonic inference relations, and even nonmonotonic deductive closure or entailment.

The Bridge to Neural Network Semantics

Interpreted dynamical systems — the paradigm instances of which are artificial neural networks that come with a logical interpretation — may also be used to
yield a semantics for nonmonotonic conditionals. Here are some relevant references: d’Avila Garcez, Lamb, and Gabbay [2008] give a general overview of connectionist non-classical logics, including connectionist (i.e. neural networks-related) nonmonotonic logic, as well as lots of references to their own original work. Balkenius and Gärdenfors [1991], Blutner [2004], and Leitgeb [2001], [2004], [2005] are important primary references. The main idea behind all of these theories is that if classical logic is replaced by some system of nonmonotonic reasoning, then a logical description or characterization of neural network states and processes becomes possible. The following exposition will introduce Leitgeb’s approach, which yields a neural network semantics for KLM-style systems; the presentation will follow the more detailed introduction to neural network semantics for conditionals in Leitgeb [2007]. The goal is to complement the typical description of neural networks as dynamical systems by one according to which cognitive dynamical systems have beliefs, draw inferences, and so forth. Hence, the task is to associate states and processes of cognitive dynamical systems with formulas. Here is what we will presuppose: We deal with discrete dynamical systems with a set S of states. On S, a partial order ≤ is defined, which we will interpret as an ordering of the amount of information that is carried by states; so s ≤ s′ will mean: s′ carries at least as much information as s does. We will also assume that for every two states s and s′ there is a uniquely determined state sup(s, s′) which (i) carries at least as much information as s, which also (ii) carries at least as much information as s′, and which (iii) is the state with the least amount of information among all those states for which (i) and (ii) hold. Formally, such a state sup(s, s′) is the supremum of s and s′ in the partial order ≤.
Finally, an internal next-state function is defined for the dynamical system, where this next-state function is meant to be insensitive to possible external inputs to the system; we will introduce inputs only in the subsequent step. In this way, we get what is called an ‘ordered discrete dynamical system’ in Leitgeb [2005]:
DEFINITION 55. An ordered discrete dynamical system is a triple S = ⟨S, ns, ≤⟩, such that:
1. S is a non-empty set (the set of states).
2. ns : S → S (the internal next-state function).
3. ≤ ⊆ S × S is a partial order (the information ordering) on S, such that for all s, s′ ∈ S there is a supremum sup(s, s′) ∈ S with respect to ≤.
In case an artificial neural network is used, the information ordering on its states, i.e. on its possible patterns of activation, can be defined according to the following idea: the more the nodes are activated in a state, the more information
the state carries. Accordingly, sup(s, s′) would be defined as the maximum of the activation patterns that correspond to s and s′; in such a case one might also speak of sup(s, s′) as the “superposition of the states s and s′”. (But note that this is just one way of viewing neural networks as ordered systems.) The internal dynamics of the network would be captured by the next-state mapping ns that is determined by the pattern of edges in the network. Next, we add external inputs, which are regarded as being represented by states s∗ ∈ S and which are considered to be fixed for a sufficient amount of time. The state transition mapping Fs∗ can then be defined by taking both the internal next-state mapping and the input s∗ into account: The next state of the system is given by the superposition of s∗ with the next internal state ns(s), i.e.:
Fs∗ (s) := sup(s∗ , ns(s))
The dynamics of our dynamical systems is thus determined by iteratively applying Fs∗ to the initial state. Fixed points sstab of Fs∗ are regarded as the “answers” which the system gives to s∗, as is common procedure in neural network computation. Note that in general there may be more than just one such stable state for the state transition mapping Fs∗ that is determined by the input s∗ (and by the given dynamical system), and there may also be no stable state at all for Fs∗: in the former case, there is more than just one “answer” to the input, in the latter case there is no “answer” at all. The different stable states may be reached by starting the computation in different initial states of the overall system. Now formulas can be assigned to the states of an ordered discrete dynamical system. These formulas are supposed to express the content of the information that is represented by these states. For this purpose, we fix a propositional language L. The assignment of formulas to states is achieved by an interpretation mapping I.
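The iteration of Fs∗ towards a stable "answer" state can be sketched in a few lines. States are sets of activated nodes, the information ordering is set inclusion, and sup is union; all names and the single toy rule below are our own assumptions, one simple instance of viewing networks as ordered systems:

```python
# Internal dynamics: activating node 'a' activates node 'b' on the next step.
RULES = {"a": {"b"}}

def ns(s):
    """Internal next-state map determined by the network's edges."""
    out = set()
    for node in s:
        out |= RULES.get(node, set())
    return frozenset(out)

def F(s_star, s):
    """State transition with a clamped input: superpose s* with ns(s)."""
    return frozenset(s_star | ns(s))

def answer(s_star, s=frozenset(), max_steps=100):
    """Iterate F until a fixed point -- the system's 'answer' -- is reached."""
    for _ in range(max_steps):
        nxt = F(s_star, s)
        if nxt == s:
            return s
        s = nxt
    return None   # no stable state found within the bound

print(answer(frozenset({"a"})))   # the answer state contains both a and the inferred b
```

Fed the input state {a}, the system stabilizes on {a, b}: the clamped input keeps a active while the internal dynamics adds b, which is the sense in which the stable state "answers" the input.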
If ϕ is a formula in L, then I(ϕ) is the state that carries exactly the information that is expressed by ϕ, i.e. neither less nor more than what is expressed by ϕ. So we presuppose that for every formula in L there is a uniquely determined state the total information of which is expressed by that formula. If expressed in terms of belief, we can say that in the state I(ϕ) all the system believes is that ϕ, i.e. the system only believes ϕ and all the propositions which are contained in ϕ from the viewpoint of the system. (This relates to Levesque’s [1990] modal treatment of the ‘all I know’ operator.) We will not demand that every state necessarily receives an interpretation but just that every formula in L will be the interpretation of some state. Furthermore, not just any assignment whatsoever of states to formulas will be allowed; rather, we will additionally assume certain postulates to be satisfied which will guarantee that I is compatible with the information ordering that was imposed on the states of the system beforehand. An ordered discrete dynamical system together with such an interpretation mapping is called an ‘interpreted ordered system’ (cf. Leitgeb [2005]). This is the definition in detail:
DEFINITION 56. An interpreted ordered system is a quadruple SI = ⟨S, ns, ≤, I⟩, such that:
1. ⟨S, ns, ≤⟩ is an ordered discrete dynamical system.
2. I : L → S (the interpretation mapping) is such that the following postulates are satisfied:
(a) Let THI = {ϕ ∈ L | for all ψ ∈ L: I(ϕ) ≤ I(ψ)}: then it is assumed that for all ϕ, ψ ∈ L: if THI ⊢ ϕ → ψ, then I(ψ) ≤ I(ϕ).
(b) For all ϕ, ψ ∈ L: I(ϕ ∧ ψ) = sup(I(ϕ), I(ψ)).
(c) For every ϕ ∈ L: there is an I(ϕ)-stable state.
(d) There is an I(⊤)-stable state sstab , such that I(⊥) ≤ sstab does not hold.
We say that SI satisfies the uniqueness condition if for every ϕ ∈ L there is precisely one I(ϕ)-stable state. E.g., postulate 2b expresses that the state that belongs to a conjunctive formula ϕ ∧ ψ ought to be the supremum of the two states that are associated with the two conjuncts ϕ and ψ: this is the cognitive counterpart of the proposition expressed by a conjunctive sentence being the supremum of the propositions expressed by its two conjuncts in the partial order of logical entailment. For a detailed justification of all the postulates, see Leitgeb [2005]. Finally, we define what it means for a nonmonotonic conditional to be satisfied by an interpreted ordered system. We say that a system satisfies ϕ ⇒ ψ if and only if whenever the state that is associated with ϕ is fed into the system as an input, i.e. whenever the input represents a total belief in ϕ, the system will eventually end up believing ψ in its “answer states”, i.e. the state that is associated with ψ is contained in all the states which are stable with respect to this input. Collecting all such conditionals ϕ ⇒ ψ which are satisfied by the system, we get what we call the ‘conditional theory’ that corresponds to the system.
DEFINITION 57. Let SI = ⟨S, ns, ≤, I⟩ be an interpreted ordered system:
1. SI ⊨ ϕ ⇒ ψ iff for every I(ϕ)-stable state sstab : I(ψ) ≤ sstab .
2. TH⇒ (SI ) = {ϕ ⇒ ψ | SI ⊨ ϕ ⇒ ψ} (the conditional theory corresponding to SI ).
Leitgeb [2005] proves the following soundness and completeness theorem:
THEOREM 58.
• Let SI = ⟨S, ns, ≤, I⟩ be an interpreted ordered system which satisfies the uniqueness condition: Then TH⇒ (SI ) is a consistent conditional C-theory extending THI .
• Let TH⇒ be a consistent conditional C-theory extending a given classical theory TH→ : It follows that there is an interpreted ordered system SI = ⟨S, ns, ≤, I⟩, such that TH⇒ (SI ) = TH⇒ , THI ⊇ TH→ , and SI satisfies the uniqueness condition.
These results can be extended in various directions. In particular, some interpreted ordered systems can be shown to have the property that each of their states s may be decomposed into a set of substates si which can be ordered in a way such that the dynamics for each substate si is determined by the dynamics for the substates s1 , s2 , . . . , si−1 at the previous point of time. Such systems are called ‘hierarchical’ in Leitgeb [2005]. We will not go into any details, but one can prove soundness and completeness theorems for such hierarchical interpreted systems and the system CL. In Leitgeb [2004] further soundness and completeness theorems are proved for more restricted classes of interpreted dynamical systems and even stronger logical systems for nonmonotonic conditionals in the KLM tradition. As it turns out, if artificial neural networks with an information ordering are extended by an interpretation mapping along the lines explained above, then they are special cases of interpreted ordered systems; moreover, if the underlying artificial neural network consists of layers of nodes, such that the layers are arranged hierarchically and all connections between nodes run only from one layer to the next one, then the corresponding interpreted ordered system is a hierarchical one. Thus, various systems of nonmonotonic logic are sound and complete with respect to various types of neural network semantics. However, so far these results only cover the short-term dynamics of neural networks that is triggered by external input and for which the topology of edges and the distribution of weights over the edges within the network are taken to be rigid. The long-term dynamics of networks given e.g.
by supervised learning processes which operate on sequences of input-output pairs is still beyond any logical treatment that is continuous with KLM-style nonmonotonic reasoning. So, the inductive logic of learning, rather than inference, within neural networks is still an open research problem (see Leitgeb [2007] for a detailed statement of this research agenda).

The Bridge to Philosophy of Science

In traditional general philosophy of science, the nonmonotonicity phenomenon is well known from inductive logic and the theory of statistical explanation. In order to cope with it, Carnap introduced his “requirement of total evidence”: an inductive argument should only be applied by an agent if its premises comprise the agent’s total knowledge; in the nonmonotonic reasoning context we saw this principle at work already in the introduction to Section 2. Hempel improved Carnap’s rule by the related “rule of maximal specificity”; for a discussion of both rules see Stegmüller [1969], Chapter IX; for more on Carnap and Hempel see Zabell
[2009] and Sprenger [2009]. In the meantime, progress in Nonmonotonic Reasoning has started to feed back into philosophy of science. E.g.: Flach [2004] argues that the same logics that govern valid commonsense inferences can be interpreted as logics for scientific induction, i.e. for data constituting incomplete and uncertain evidence for empirical hypotheses. His formal account of scientific confirmation relations is modelled after the KLM approach to nonmonotonic inference relations. Schurz [2002] suggests taking system P of nonmonotonic logic to be the logic of ceteris paribus laws in science, i.e. laws that are meant to hold only in normal or standard conditions. More such bridges to philosophy of science may be expected to emerge.

ACKNOWLEDGMENTS

Ronald Ortner and Hannes Leitgeb would like to thank each other.

BIBLIOGRAPHY

[Adams, 1975] E. W. Adams. The Logic of Conditionals. D. Reidel, Dordrecht, 1975.
[Adams, 1998] E. W. Adams. A Primer of Probability Logic. CSLI Publications, Stanford, 1998.
[Alchourrón, Gärdenfors and Makinson, 1985] C. E. Alchourrón, P. Gärdenfors, and D. Makinson. On the logic of theory change: Partial meet contraction and revision functions. Journal of Symbolic Logic, 50:510–530, 1985.
[Alon and Asodi, 2005] N. Alon and V. Asodi. Learning a hidden subgraph. SIAM Journal on Discrete Mathematics, 18(4):697–712, 2005.
[Angluin and Chen, 2008] D. Angluin and J. Chen. Learning a hidden graph using O(log n) queries per edge. Journal of Computer and System Sciences, 74(4):546–556, 2008.
[Angluin and Smith, 1983] D. Angluin and C. H. Smith. Inductive inference: Theory and methods. ACM Computing Surveys, 15(3):237–269, 1983.
[Angluin, 1992] D. Angluin. Computational learning theory: Survey and selected bibliography. In Proceedings of the Twenty Fourth Annual ACM Symposium on Theory of Computing (STOC), 4-6 May 1992, Victoria, British Columbia, Canada, pages 351–369. ACM, 1992.
[Angluin, 2004] D. Angluin. Queries revisited.
Theoretical Computer Science, 313(2):175–194, 2004.
[Argamon and Shimoni, 2003] S. Argamon and A. R. Shimoni. Automatically categorizing written texts by author gender. Literary and Linguistic Computing, 17:401–412, 2003.
[Arló-Costa and Parikh, 2005] H. Arló-Costa and R. Parikh. Conditional probability and defeasible inference. Journal of Philosophical Logic, 34:97–119, 2005.
[Assouad, 1983] P. Assouad. Densité et dimension. Université de Grenoble. Annales de l’Institut Fourier, 33(3):233–282, 1983.
[Auer and Ortner, 2007] P. Auer and R. Ortner. A new PAC bound for intersection-closed concept classes. Machine Learning, 66(2–3):151–163, 2007.
[Auer et al., 1995] P. Auer, R. C. Holte, and W. Maass. Theory and applications of agnostic PAC-learning with small decision trees. In A. Prieditis and S. J. Russell, editors, Machine Learning, Proceedings of the Twelfth International Conference on Machine Learning (ICML), Tahoe City, California, USA, July 9-12, 1995, pages 21–29. Morgan Kaufmann, 1995.
[Auer et al., 1998] P. Auer, P. M. Long, and A. Srinivasan. Approximating hyper-rectangles: Learning and pseudorandom sets. Journal of Computer and System Sciences, 57(3):376–388, 1998.
[Auer et al., 2002] P. Auer, N. Cesa-Bianchi, and C. Gentile. Adaptive and self-confident on-line learning algorithms. Journal of Computer and System Sciences, 64(1):48–75, 2002.
[d’Avila Garcez, Lamb, and Gabbay, 2008] A. S. d’Avila Garcez, L. C. Lamb, and D. M. Gabbay. Neural-Symbolic Cognitive Reasoning. Cognitive Technologies, Springer, Berlin, 2008.
[Balkenius and Gärdenfors, 1991] C. Balkenius and P. Gärdenfors. Nonmonotonic inferences in neural networks. In J. A. Allen, R. Fikes, and E. Sandewall, editors, Principles of Knowledge Representation and Reasoning, pages 32–39. Morgan Kaufmann, San Mateo, 1991.
[Bamber, 2000] D. Bamber. Entailment with near surety of scaled assertions of high conditional probability. Journal of Philosophical Logic, 29:1–74, 2000.
[Baxter, 1998] J. Baxter. Theoretical models of learning to learn. In S. Thrun and L. Pratt, editors, Learning to Learn, pages 71–94. Kluwer Academic Publishers, Norwell, MA, USA, 1998.
[Benferhat, Dubois, and Prade, 1997] S. Benferhat, D. Dubois, and H. Prade. Nonmonotonic reasoning, conditional objects and possibility theory. Artificial Intelligence, 92:259–276, 1997.
[Benferhat, Saffiotti, and Smets, 2000] S. Benferhat, A. Saffiotti, and P. Smets. Belief functions and default reasoning. Artificial Intelligence, 122:1–69, 2000.
[Bennett, 2003] J. Bennett. A Philosophical Guide to Conditionals. Clarendon Press, Oxford, 2003.
[van Benthem, 1984] J. van Benthem. Foundations of conditional logic. Journal of Philosophical Logic, 13:303–349, 1984.
[Blum, 1998] A. Blum. On-line algorithms in machine learning. In A. Fiat and G. J. Woeginger, editors, Online Algorithms, The State of the Art, Lecture Notes in Computer Science, volume 1442, pages 306–325. Springer, 1998.
[Blumer et al., 1987] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth. Occam’s razor. Information Processing Letters, 24(6):377–380, 1987.
[Blumer et al., 1989] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM, 36(4):929–965, 1989.
[Blutner, 2004] R. Blutner. Nonmonotonic inferences and neural networks. Synthese, 142:143–174, 2004.
[Board and Pitt, 1990] R. A. Board and L. Pitt. On the necessity of Occam algorithms.
In Proceedings of the Twenty Second Annual ACM Symposium on Theory of Computing (STOC), 14-16 May 1990, Baltimore, Maryland, USA, pages 54–63. ACM, New York, NY, USA, 1990.
[Bouvel et al., 2005] M. Bouvel, V. Grebinski, and G. Kucherov. Combinatorial search on graphs motivated by bioinformatics applications: A brief survey. In D. Kratsch, editor, Graph-Theoretic Concepts in Computer Science, 31st International Workshop, WG 2005, Metz, France, June 23-25, 2005, Revised Selected Papers, Lecture Notes in Computer Science, volume 3787, pages 16–27. Springer, 2005.
[Brewka, 1996] G. Brewka. Principles of Knowledge Representation. CSLI Publications, Stanford, 1996.
[Brewka, 1997] G. Brewka, J. Dix, and K. Konolige. Nonmonotonic Reasoning: An Overview. CSLI Publications, Stanford, 1997.
[Carnap, 1950] R. Carnap. Logical Foundations of Probability. University of Chicago Press, Chicago, 1950.
[Carnap, 1962] R. Carnap. The aim of inductive logic. In E. Nagel, P. Suppes, and A. Tarski, editors, Logic, Methodology and Philosophy of Science, pages 303–318. Stanford University Press, Stanford, 1962.
[Cesa-Bianchi and Lugosi, 2006] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, Cambridge, 2006.
[Chaitin, 1969] G. J. Chaitin. On the length of programs for computing finite binary sequences: statistical considerations. Journal of the ACM, 16:145–159, 1969.
[Corfield et al., 2005] D. Corfield, B. Schölkopf, and V. Vapnik. Popper, falsification and the VC-dimension. Technical Report 145, Max Planck Institute for Biological Cybernetics, Department of Empirical Inference, Tübingen, Germany, November 2005.
[Dawid and Vovk, 1999] A. P. Dawid and V. G. Vovk. Prequential probability: principles and properties. Bernoulli, 5(1):125–162, 1999.
[Dawid, 1984] A. P. Dawid. Present position and potential developments: Some personal views. Statistical theory. The prequential approach. Journal of the Royal Statistical Society. Series A.
General, 147(2):278–292, 1984. [Dawid, 1985] A. P. Dawid. Comment on the impossibility of inductive inference. Journal of the American Statistical Association, 80(390):340–341, 1985.
Ronald Ortner and Hannes Leitgeb
[Domingos, 1998] P. Domingos. Occam’s two razors: The sharp and the blunt. In R. Agrawal, P. E. Stolorz, and G. Piatetsky-Shapiro, editors, Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (KDD-98), August 27-31, 1998, New York City, New York, USA, pages 37–43. AAAI Press, 1998. [Dudley, 1984] R. M. Dudley. A course on empirical processes. In École d’Été de Probabilités de Saint-Flour XII-1982. Lecture Notes in Mathematics 1097. Springer, New York, 1984. [Ehrenfeucht et al., 1989] A. Ehrenfeucht, D. Haussler, M. J. Kearns, and L. G. Valiant. A general lower bound on the number of examples needed for learning. Information and Computation, 82(3):247–261, 1989. [Eiter and Lukasiewicz, 2000] T. Eiter and T. Lukasiewicz. Default reasoning from conditional knowledge bases: Complexity and tractable cases. Artificial Intelligence, 124:169–241, 2000. [Feldman, 2008] V. Feldman. Hardness of proper learning. In Ming-Yang Kao, editor, Encyclopedia of Algorithms. Springer, 2008. [Flach, 2004] P. A. Flach. Logical characterizations of inductive learning. In D. M. Gabbay and R. Kruse, editors, Handbook of Defeasible Reasoning and Uncertainty Management Systems, volume 4, pages 155–196. Kluwer, Dordrecht, 2004. [Freund et al., 1997] Y. Freund, H. S. Seung, E. Shamir, and N. Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28(2–3):133–168, 1997. [Fuhrmann, 1997] A. Fuhrmann. An Essay on Contraction. CSLI Publications, Stanford, 1997. [Gabbay, 1984] D. M. Gabbay. Theoretical foundations for non-monotonic reasoning in expert systems. In K. R. Apt, editor, Logics and Models of Concurrent Systems, pages 439–458. Springer, Berlin, 1984. [Gabbay, Hogger, and Robinson, 1994] D. M. Gabbay, C. J. Hogger, and J. A. Robinson, editors. Handbook of Logic in Artificial Intelligence and Logic Programming. Volume 3, Clarendon Press, Oxford, 1994. [Gärdenfors, 1988] P. Gärdenfors. Knowledge in Flux.
The MIT Press, Cambridge, Mass., 1988. [Gärdenfors and Makinson, 1994] P. Gärdenfors and D. Makinson. Nonmonotonic inference based on expectations. Artificial Intelligence, 65:197–245, 1994. [Ginsberg, 1987] M. L. Ginsberg, editor. Readings in Nonmonotonic Reasoning. Morgan Kaufmann, Los Altos, 1987. [Giraud-Carrier and Provost, 2005] C. G. Giraud-Carrier and F. J. Provost. Toward a justification of meta-learning: Is the no free lunch theorem a show-stopper? In Proceedings of the ICML-2005 Workshop on Meta-learning, pages 12–19, 2005. [Gold, 1967] E. M. Gold. Language identification in the limit. Information and Control, 10(5):447–474, 1967. [Goldreich et al., 1998] O. Goldreich, S. Goldwasser, and D. Ron. Property testing and its connection to learning and approximation. Journal of the ACM, 45(4):653–750, 1998. [Goldszmidt, Morris, and Pearl, 1993] M. Goldszmidt, P. Morris, and J. Pearl. A maximum entropy approach to nonmonotonic reasoning. Pattern Analysis and Machine Intelligence, 15:220–232, 1993. [Goldszmidt and Pearl, 1996] M. Goldszmidt and J. Pearl. Qualitative probabilities for default reasoning, belief revision, and causal modeling. Artificial Intelligence, 84:57–112, 1996. [Grove, 1988] A. Grove. Two modellings for theory change. Journal of Philosophical Logic, 17:157–170, 1988. [Grünwald, 2007] P. D. Grünwald. The Minimum Description Length Principle. MIT Press, Cambridge, MA, USA, 2007. [Hájek, 2003] A. Hájek. What conditional probability could not be. Synthese, 137:273–323, 2003. [Halpern, 2001a] J. Halpern. Lexicographic probability, conditional probability, and nonstandard probability. In J. van Benthem, editor, Proceedings of the Eighth Conference on Theoretical Aspects of Rationality and Knowledge, pages 17–30. Morgan Kaufmann, Ithaca, NY, 2001. [Halpern, 2001b] J. Halpern. Plausibility measures: A general approach for representing uncertainty.
In Proceedings of the 17th International Joint Conference on AI (IJCAI 2001), pages 1474–1483. Morgan Kaufmann, Ithaca, NY, 2001. [Hansson, 1999] S. O. Hansson. A Textbook of Belief Dynamics. Kluwer, Dordrecht, 1999. [Haussler and Welzl, 1987] D. Haussler and E. Welzl. ε-nets and simplex range queries. Discrete & Computational Geometry, 2(2):127–151, 1987.
Mechanizing Induction
[Haussler et al., 1991] D. Haussler, M. Kearns, N. Littlestone, and M. K. Warmuth. Equivalence of models for polynomial learnability. Information and Computation, 95(2):129–161, 1991. [Haussler et al., 1994] D. Haussler, N. Littlestone, and M. K. Warmuth. Predicting {0, 1}-functions on randomly drawn points. Information and Computation, 115(2):248–292, 1994. [Haussler, 1988] D. Haussler. Quantifying inductive bias: AI learning algorithms and Valiant’s learning framework. Artificial Intelligence, 36(2):177–221, 1988. [Haussler, 1992] D. Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100(1):78–150, 1992. [Hawthorne, 1996] J. Hawthorne. On the logic of nonmonotonic conditionals and conditional probabilities. Journal of Philosophical Logic, 25:185–218, 1996. [Höffgen et al., 1995] K.-U. Höffgen, H.-U. Simon, and K. S. Van Horn. Robust trainability of single neurons. Journal of Computer and System Sciences, 50(1):114–125, 1995. [Hutter, 2001] M. Hutter. New error bounds for Solomonoff prediction. Journal of Computer and System Sciences, 62(4):653–667, 2001. [Hutter, 2004] M. Hutter. Universal Artificial Intelligence: Sequential Decisions based on Algorithmic Probability. Springer, Berlin, 2004. [Hutter, 2007] M. Hutter. On universal prediction and Bayesian confirmation. Theoretical Computer Science, 384(1):33–48, 2007. [Kearns et al., 1994] M. J. Kearns, R. E. Schapire, and L. Sellie. Toward efficient agnostic learning. Machine Learning, 17(2-3):115–141, 1994. [Kelly and Schulte, 1995] K. T. Kelly and O. Schulte. The computable testability of theories making uncomputable predictions. Erkenntnis, 43(1):29–66, 1995. [Kelly, 2004a] K. T. Kelly. Justification as truth-finding efficiency: how Ockham’s razor works. Minds and Machines, 14(4):485–505, 2004. [Kelly, 2004b] K. T. Kelly. Learning theory and epistemology. In I. Niiniluoto, J. Woleński, and M.
Sintonen, editors, Handbook of Epistemology, pages 183–204. Kluwer Academic Publishers, Dordrecht, 2004. [Kelly, 2004c] K. T. Kelly. Uncomputability: the problem of induction internalized. Theoretical Computer Science, 317(1-3):227–249, 2004. [Kolmogorov, 1965] A. N. Kolmogorov. Three approaches to the definition of the concept “quantity of information”. Problemy Peredači Informacii, 1(vyp. 1):3–11, 1965. [Kraus, Lehmann, and Magidor, 1990] S. Kraus, D. Lehmann, and M. Magidor. Nonmonotonic reasoning, preferential models and cumulative logics. Artificial Intelligence, 44:167–207, 1990. [Legg, 2006] S. Legg. Is there an elegant universal theory of prediction? In J. L. Balcázar, P. M. Long, and F. Stephan, editors, Algorithmic Learning Theory, 17th International Conference, ALT 2006, Barcelona, Spain, October 7-10, 2006, Proceedings, Lecture Notes in Computer Science, volume 4264, pages 274–287. Springer, 2006. [Lehmann and Magidor, 1992] D. Lehmann and M. Magidor. What does a conditional knowledge base entail? Artificial Intelligence, 55:1–60, 1992. [Lehmann, 1995] D. Lehmann. Another perspective on default reasoning. Annals of Mathematics and Artificial Intelligence, 15:61–82, 1995. [Leitgeb, 2001] H. Leitgeb. Nonmonotonic reasoning by inhibition nets. Artificial Intelligence, 128:161–201, 2001. [Leitgeb, 2004] H. Leitgeb. Inference on the Low Level. An Investigation into Deduction, Nonmonotonic Reasoning, and the Philosophy of Cognition. Kluwer, Dordrecht, 2004. [Leitgeb, 2005] H. Leitgeb. Interpreted dynamical systems and qualitative laws: From inhibition networks to evolutionary systems. Synthese, 146:189–202, 2005. [Leitgeb, 2007] H. Leitgeb. Neural network models of conditionals: An introduction. In X. Arrazola, J. M. Larrazabal et al., editors, Proceedings of the First ILCLI International Workshop on Logic and Philosophy of Knowledge, Communication and Action, LogKCA-07, pages 191–223. University of the Basque Country Press, Bilbao, 2007.
[Levesque, 1990] H. Levesque. All I know: A study in autoepistemic logic. Artificial Intelligence, 42:263–309, 1990. [Lewis, 1973a] D. Lewis. Counterfactuals and comparative possibility. Journal of Philosophical Logic, 2:418–446, 1973. [Lewis, 1973b] D. Lewis. Counterfactuals. Basil Blackwell, Oxford, 1973. [Li and Vitányi, 1997] M. Li and P. Vitányi. An Introduction to Kolmogorov Complexity and Its Applications. Springer, New York, second edition, 1997.
[Li et al., 2003] M. Li, J. Tromp, and P. Vitányi. Sharpening Occam’s razor. Information Processing Letters, 85(5):267–274, 2003. [Lukasiewicz, 2002] T. Lukasiewicz. Nonmonotonic probabilistic logics between model-theoretic probabilistic logic and probabilistic logic under coherence. In S. Benferhat and E. Giunchiglia, editors, Proceedings of the 9th International Workshop on Non-Monotonic Reasoning, NMR 2002, pages 265–274. Toulouse, 2002. [Maass, 1994] W. Maass. Efficient agnostic PAC-learning with simple hypotheses. In Proceedings of the Seventh Annual ACM Conference on Computational Learning Theory (COLT 1994), July 12-15, 1994, New Brunswick, NJ, USA, pages 67–75. ACM, 1994. [Makinson, 1989] D. Makinson. General theory of cumulative inference. In M. Reinfrank et al., editors, Non-Monotonic Reasoning, Lecture Notes on Artificial Intelligence, volume 346, pages 1–18. Springer, Berlin, 1989. [Makinson, 1994] D. Makinson. General patterns in nonmonotonic reasoning. In D. M. Gabbay, C. J. Hogger, and J. A. Robinson, editors, Handbook of Logic in Artificial Intelligence and Logic Programming, volume 3, pages 35–110. Clarendon Press, Oxford, 1994. [Makinson, 2005] D. Makinson. Bridges from Classical to Nonmonotonic Logic. Texts in Computing, volume 5. College Publications, London, 2005. [McGee, 1994] V. McGee. Learning the impossible. In E. Eells and B. Skyrms, editors, Probability and Conditionals. Belief Revision and Rational Decision, pages 177–199. Cambridge University Press, Cambridge, 1994. [Merhav and Feder, 1998] N. Merhav and M. Feder. Universal prediction. IEEE Transactions on Information Theory, 44(6):2124–2147, 1998. [Mitchell, 1990] T. M. Mitchell. The need for biases in learning generalizations. Technical Report CBM-TR-117, Rutgers Computer Science Department, May 1980. Reprinted in J. W. Shavlik and T. G. Dietterich, editors, Readings in Machine Learning. Morgan Kaufmann, 1990. [Mitchell, 1997] T. M. Mitchell. Machine Learning. McGraw-Hill, 1997.
[Norton, 2003] J. D. Norton. A material theory of induction. Philosophy of Science, 70:647–670, 2003. [Oaksford, Chater, and Hahn, 2009] M. Oaksford, N. Chater, and U. Hahn. Inductive Logic and Empirical Psychology. This volume, 2009. [Osherson et al., 1988] D. N. Osherson, M. Stob, and S. Weinstein. Mechanical learners pay a price for Bayesianism. Journal of Symbolic Logic, 53(4):1245–1251, 1988. [Osherson and Weinstein, 2009] D. N. Osherson and S. Weinstein. Formal Learning Theory in Context. This volume, 2009. [Paris, 1994] J. Paris. The Uncertain Reasoner’s Companion – A Mathematical Perspective. Cambridge Tracts in Theoretical Computer Science, volume 39. Cambridge University Press, Cambridge, 1994. [Pearl, 1988] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, San Mateo, 1988. [Pearl, 1990] J. Pearl. System Z: a natural ordering of defaults with tractable applications to nonmonotonic reasoning. In R. Parikh, editor, Proceedings of the Third Conference on Theoretical Aspects of Reasoning About Knowledge, pages 121–135. Morgan Kaufmann, San Mateo, 1990. [Pearl, 1997] J. Pearl and M. Goldszmidt. Probabilistic Foundations of Reasoning with Conditionals. In [Brewka, 1997], pages 33–68. [Peters and Westerståhl, 2006] S. Peters and D. Westerståhl. Quantifiers in Language and Logic. Oxford University Press, Oxford, 2006. [Pitt and Valiant, 1988] L. Pitt and L. G. Valiant. Computational limitations on learning from examples. Journal of the ACM, 35(4):965–984, 1988. [Popper, 1969] K. R. Popper. Logik der Forschung. Mohr, third edition, 1969. [Putnam, 1963] H. Putnam. ‘Degree of confirmation’ and inductive logic. In P. A. Schilpp, editor, The Philosophy of Rudolf Carnap, The Library of Living Philosophers, volume 11, pages 761–783. Open Court, La Salle, Illinois, 1963. Reprinted in Mathematics, Matter and Method. Philosophical Papers, volume 1, pages 270–292. Cambridge University Press, Cambridge, 1975. [Rao et al., 1995] R. Bharat Rao, D. F.
Gordon, and W. M. Spears. For every generalization action, is there really an equal and opposite reaction? In A. Prieditis and S. J. Russell, editors,
Machine Learning, Proceedings of the Twelfth International Conference on Machine Learning (ICML 1995), Tahoe City, California, USA, July 9-12, 1995, pages 471–479. Morgan Kaufmann, 1995. [Reiter, 1980] R. Reiter. A logic for default reasoning. Artificial Intelligence, 13:81–132, 1980. [Rissanen, 1976] J. J. Rissanen. Generalized Kraft inequality and arithmetic coding. IBM Journal of Research and Development, 20(3):198–203, 1976. [Rissanen, 1978] J. J. Rissanen. Modelling by the shortest data description. Automatica, 14:465–471, 1978. [Sauer, 1972] N. W. Sauer. On the density of families of sets. Journal of Combinatorial Theory. Series A, 13:145–147, 1972. [Schaffer, 1993] C. Schaffer. Overfitting avoidance as bias. Machine Learning, 10(2):153–178, 1993. [Schaffer, 1994] C. Schaffer. A conservation law for generalization performance. In W. W. Cohen and H. Hirsh, editors, Machine Learning, Proceedings of the Eleventh International Conference (ICML 1994), Rutgers University, New Brunswick, NJ, USA, July 10-13, 1994, pages 259–265. Morgan Kaufmann, 1994. [Schurz, 2001] G. Schurz. What is ‘normal’? An evolution-theoretic foundation of normic laws and their relation to statistical normality. Philosophy of Science, 68:476–497, 2001. [Schurz, 2002] G. Schurz. Ceteris paribus laws: Classification and deconstruction. Erkenntnis, 57:351–372, 2002. [Schurz and Leitgeb, 2005] G. Schurz and H. Leitgeb, editors. Non-Monotonic and Uncertain Reasoning in the Focus of Paradigms of Cognition. Special volume of Synthese, 146/1–2, 2005. [Shannon, 1948] C. E. Shannon. A mathematical theory of communication. The Bell System Technical Journal, 27:379–423, 623–656, 1948. [Shelah, 1972] S. Shelah. A combinatorial problem; stability and order for models and theories in infinitary languages. Pacific Journal of Mathematics, 41:247–261, 1972. [Shoham, 1987] Y. Shoham. A semantical approach to nonmonotonic logics. In J. P.
McDermott, editor, Proceedings of the Tenth International Joint Conference on Artificial Intelligence, pages 388–392. Morgan Kaufmann, San Mateo, 1987. [Snow, 1999] P. Snow. Diverse confidence levels in a probabilistic semantics for conditional logics. Artificial Intelligence, 113:269–279, 1999. [Solomonoff, 1964a] R. J. Solomonoff. A formal theory of inductive inference. I. Information and Control, 7:1–22, 1964. [Solomonoff, 1964b] R. J. Solomonoff. A formal theory of inductive inference. II. Information and Control, 7:224–254, 1964. [Solomonoff, 1978] R. J. Solomonoff. Complexity-based induction systems: comparisons and convergence theorems. IEEE Transactions on Information Theory, 24(4):422–432, 1978. [Solomonoff, 1997] R. J. Solomonoff. The discovery of algorithmic probability. Journal of Computer and System Sciences, 55(1):73–88, 1997. [Spohn, 1987] W. Spohn. Ordinal conditional functions: A dynamic theory of epistemic states. In W. L. Harper and B. Skyrms, editors, Causation in Decision, Belief Change, and Statistics, volume 2, pages 105–134. D. Reidel, Dordrecht, 1988. [Sprenger, 2009] J. Sprenger. Hempel and the Paradoxes of Confirmation. This volume, 2009. [Stalnaker, 1991] R. C. Stalnaker. A theory of conditionals. In N. Rescher, editor, Studies in Logical Theory, American Philosophical Quarterly Monograph Series, volume 2, pages 98–112. Blackwell, Oxford, 1991. [Stegmüller, 1969] W. Stegmüller. Wissenschaftliche Erklärung und Begründung. Springer, Berlin, 1969. [Valiant, 1984] L. G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984. [Vapnik and Chervonenkis, 1971] V. N. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264–280, 1971. [Vapnik and Chervonenkis, 1974] V. N. Vapnik and A. Y. Chervonenkis. Theory of Pattern Recognition (in Russian).
Nauka, Moscow, 1974. [Vapnik, 1995] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1995.
[Vapnik, 1998] V. N. Vapnik. Statistical Learning Theory. Wiley, New York, 1998. [Vilalta et al., 2005] R. Vilalta, C. G. Giraud-Carrier, and P. Brazdil. Meta-learning. In O. Maimon and L. Rokach, editors, The Data Mining and Knowledge Discovery Handbook, pages 731–748. Springer, 2005. [Vitányi and Li, 2000] P. M. B. Vitányi and M. Li. Minimum description length induction, Bayesianism, and Kolmogorov complexity. IEEE Transactions on Information Theory, 46(2):446–464, 2000. [von Luxburg and Schölkopf, 2009] U. von Luxburg and B. Schölkopf. Statistical Learning Theory: Models, Concepts, and Results. This volume, 2009. [V’yugin, 1998] V. V. V’yugin. Non-stochastic infinite and finite sequences. Theoretical Computer Science, 207(2):363–382, 1998. [Webb, 1996] G. I. Webb. Further experimental evidence against the utility of Occam’s razor. Journal of Artificial Intelligence Research, 4:397–417, 1996. [Wenocur and Dudley, 1981] R. S. Wenocur and R. M. Dudley. Some special Vapnik-Chervonenkis classes. Discrete Mathematics, 33(3):313–318, 1981. [Wolpert, 1995] D. H. Wolpert. The relationship between PAC, the statistical physics framework, the Bayesian framework, and the VC framework. In D. H. Wolpert, editor, The Mathematics of Generalization. Proceedings of the SFI/CNLS Workshop on Formal Approaches to Supervised Learning, pages 117–214. Addison-Wesley Publishing, Reading, MA, 1995. [Wolpert, 1996a] D. H. Wolpert. The existence of a priori distinctions between learning algorithms. Neural Computation, 8(7):1391–1420, 1996. [Wolpert, 1996b] D. H. Wolpert. The lack of a priori distinctions between learning algorithms. Neural Computation, 8(7):1341–1390, 1996. [Wolpert, 2001] D. H. Wolpert. The supervised learning no-free-lunch theorems. In Proceedings of the 6th Online World Conference on Soft Computing in Industrial Applications, 2001. [Zabell, 2009] S. Zabell. Carnap and the Logic of Induction. This volume, 2009.
INDEX
abduction, 117, 214, 227, 332 as an eliminative inference, 134 acceptance, 540–543 accidental generalizations, 401, 404, 410 accommodates, 110 accumulation, 419 active learning, 732 actual causation, 364 Adams, E., 752, 759–762 adjustable parameter, 95, 96 admissibility, 515 affirming the consequent, 578, 579 after-effect, 365 Agricola, R., 21 agnostic learning, 726, 730–731 agreement of independent measurement, 104, 105, 111 al-Fārābī, 14–15 Albinus, see Alcinous Alcinous, 8 Alcuin, 17 Alexander of Aphrodisias, 10 algorithmic level, 559 algorithmic probability, see Solomonoff’s theory of induction Allais paradox, 489 ambiguity resolution, 567 ampliative reasoning, 119 analogical reasoning, 223 analogy, 167, 168, 288 analogy by proximity, 289 analogy by similarity, 289 analytic/synthetic-distinction, 209 Anderson, D. R., 142 Anderson, N. H., 559 Apostles, 184 approximately true, 345
Apuleius, 10 argument from ignorance, 604 argumentation, 601 argumentum ad ignorantiam, 602 Aristotle, 2–8, 11–13, 18, 20, 25, 118, 286, 337 Arnauld, A., 30, 31 asymptotic rules, 78, 79 atomistic, 319 attachment-heuristic, 590 attribute symmetry, 280 audience dependence, 604 Avicenna, 15, 16 axioms of confirmation, 268 Backhouse, R., 162 Bacon, F., 22, 24, 25, 27, 35, 338 Bar-Hillel, Y., 313 Barazan, R., 25 Barnes, J., 6 Bateman, B., 162, 171 Bayes, 724, 735 Bayes’s method, 192 Bayes’s rule, 183, 189, 192, 193 Bayes’s theorem, 137, 173, 193, 415, 417, 499, 559, 570, 580 Bayes’s postulate, 273 Bayes, T., 157, 167, 270, 415 Bayesian, 193, 195, 401 Bayesian apparatus for inductive reasoning, 420 Bayesian conditioning, 448 Bayesian confirmation theory, 80–83 Bayesian decision theory, 334 Bayesian inference, 197, 633, 636, 637, 642 Bayesian method, 161, 173, 174, 183 Bayesian model, 576
Bayesian statistics, 161, 270, 291, 583, 632, 633, 635–637, 642, 645, 649 Bayesianism, 104, 112, 113, 169, 173, 197, 707 Bayesians, 312 beam balance, 95, 110 belief, 165, 477, 540–545 belief revision, 539, 579, 756–759 belief-updating, 707 Berkeley, G., 22, 163 Bernier, F., 30 Bertrand’s Paradox, 426, 509, 510 Bertrand, J., 270 bet / bets, 517, 518 betting, 182 method, 179 odds, 268 quotients, 190, 191 scheme, 191 system, 190 binomial distribution, 269 Black, M., 56, 325 Bloomsbury group, 162 Boche, 64–65 Boethius, 12, 18 Bolzano, B., 154, 312 BonJour, L., 56, 63, 70–75, 79, 86 Boole, G., 154–156, 158, 164, 197, 272, 313 Boolean functions, 721, 722 bootstrapping, 94 Borel, E., 178, 179 Brahe, T., 99 Braithwaite, R. B., 163, 180, 186, 277 Brandom, R., 65 Brier score, 442 Brier’s rule, 195 Brier, G., 442 Broad, C. D., 37, 44, 162, 163, 278, 312 Buridan, J., 19 Burks, A., 147 Butler, J., 55
Bylander, T., 149 c-principle, 324 calculus of probability, 158, 166, 171, 193 Calderoni, M., 189 calibration score, 441 Campbell, N., 173, 187, 188 Carabelli, A., 170, 171 Carnap’s c∗ function, 274 Carnap’s m∗ function, 274 Carnap’s symmetry, 191 Carnap, invisible college, 305 Carnap, R., 71, 84, 154, 156, 161, 168–170, 265, 311, 383, 394, 396, 399, 630–632 Carnapian inductive logic, 625, 626, 630–633, 635–637, 644, 645, 649, 650 Cassirer, E., 363 causal, 393 decision theory, 531–534 expected utility, 531–533 causality, 187, 358, 363, 364, 398 causation, 393, 404 cause, 392, 393 cepheid variable star, 86–88 chance, 176, 187, 188, 195, 514, 515 chance/credence principles, 466 choice, 182, 594 Chomsky hierarchy, 709 Chomsky, N., 564 Cicero, 11, 12, 17, 21 circularity, 605 classical, 194 Clifford distance, 343 coding-based approaches, 609 cognitive problem, 335 cognitive progress, 335 cognitive utility, 542 coherence, 183, 189, 190, 268, 419 coherent, 268 colligation of facts, 94–96, 102, 103, 107
combination postulate, 273 combined system, 319 commonsense reasoning, 554 commutativity, 503 of evidence, 451 comparative probability, 164 comparatives, 209 complete, 213 completeness for primitive predicates, 302 complexity, 724, 727, 729–731, 734–736, 750 computability, 736 computational level, 559 computational neuroscience, 560 conceivability, 50 concept class, 725 concept learning, 720–730 concept-ladenness, 94 conception, 93–95, 102, 110 of acceleration, 93 conditional inference, 578 conditional information, 330 conditional knowledge base, 744, 746, 750 conditional logic, 741, 755, 765 conditional probability, 164, 268, 416, 417, 480, 481 conditional theory, 741–743, 745–750, 764–766 conditionalization, 499–502, 504, 513, 514, 516, 518–520, 526 conditions of adequacy, 394 Condorcet, M. J. A. N., Marquis de, 300 confidence, 712 intervals, 711 confirmation, 105, 266, 332, 477, 499, 534–540, 735, 766 bias, 583 function, 267, 268 of universal generalizations, 278 connectionist networks, 109 consensus probability, 172
consilience of induction, 94, 102, 103, 104, 105, 110, 111 consistency, 182–184 consistent, 182, 721, 722, 727–728, 740, 742–743, 749, 756, 759, 765 constituent, 318 containing a single event, 368 context of discovery, 102 continuity, 415, 416 continuum of inductive methods for the sampling of species, 295 convergence theorems, 495, 496 Cook, A., 173 coordination, 358–361, 373 Copernicus, 93, 96, 105 correlation, 417 corroboration, 382 Cottrell, A., 169 countable additivity, 494–497, 505, 516, 518 counterfactual, 393, 400, 401, 752–759 conditions, 403 counterinduction, 56–57, 78, 80 Cournot, A. A., 266, 300 Cox’s theorem, 436 Cox, R. T., 197 credence, 303 credibility, 304 cross-induction, 367, 375, 379, 380 cruelty, 66 crystal ball, 514, 515 cue integration, 597 cumulativity, 743, 747–749 curve fitting, 59–60, 95, 220, 408 Darwin, C., 70 data compression (and learning), 730, 734–736 data selection, 582 Dawid’s theorem, 293 de Finetti, B., 157, 161, 174, 179, 183, 184, 186, 189–197, 265,
312, 373, 433, 626, 631, 636 de Finetti’s representation theorem, 275, 463 De Morgan, A., 154, 156, 157, 177, 278 decision, 477 by sampling, 598 making, 593 theory, 527, 528, 530–533 weights, 596 by-experience, 600 deduction, 45–48, 58, 117 from the phenomena, 220 deductive reasoning, 577 deductive subsumption, 105 deductive systematization, 338 deductive-nomological model of explanation, 129 deductivism, 205 default, 738, 740, 750–751 degree of belief, 156, 157, 164, 166, 171, 172, 174, 177, 180–183, 185, 187, 189, 190, 192, 197, 312, 486–489, 497, 498, 500, 512–514, 517, 523, 539, 541, 543, 544 degree of confidence, 43, 50–54 degree of confirmation, 314, 332 degree of truthlikeness, 345 demonstrative induction, 213 demonstrative reasoning, 48 Dempster-Shafer theory, 484 denying the antecedent, 578 dependence hypothesis, 583 depth, 326 Descartes, R., 22, 25, 27, 28, 140 detachment rule, 334 determinism, 153, 196 diachronicity, 500 Diaconis, P., 287, 452 Diogenes Laertius, 8, 9, 12 direct inference, 314 Dirichlet family of prior distributions, 276, 290
discovery, 113 distance, 343 distributional methods, 569 distributive normal forms, 318 diversity, 574 dogmatism, 419 dominance, 416 Donkin, W. F., 154, 177, 178 Doob Martingale Convergence Theorem, 445 Dummett, M., 64–65 Dutch Book, 172, 190, 268, 496, 499, 513, 516–520 arguments, 433 theorem, 556 dynamic (or evolving) probability, 299 Earman, J., 104 ecological rationality, 606 economy, 140 Edgeworth, F. Y., 270 efficient PAC learning, 731 Einstein, A., 358, 360, 361, 364 eliminative induction, 214 Ellis, R. L., 154, 266 empirical success, 94, 105 empiricism, 104, 337, 571 empiricist philosophy, 94 enthymemes, 206 entrenchment, 401, 405, 406, 408, 409 entropy, 197, 426, 507 enumerative induction, 122 Epagoge meaning of, 3–6 translation of in Latin and Arabic, 11, 14–15 epistemic peer, 514 epistemic rationality, 521, 524 epistemic scoring rule, 440 epistemic syllogism, 230 epistemic utility, 526, 542 epistemic value, 542 equivalence condition, 395 error, 726
estimation of verisimilitude, 345 ethically neutral proposition, 267 Euclid, 31 everyday inference, 554 evidential decision theory, 529–531 evidential expected utility, 529, 531, 533 evolution, 70 Ewens sampling formula, 296 Example (paradeigma), 7, 9, 11, 34 exchangeability, 161, 162, 183, 189, 191–193, 195–197, 271, 323 exchangeable random partitions, 294 exchangeable sequence, 459 expectation degree of, 155 expected cognitive utility, 526 expected epistemic utility, 335, 526 expected utility, 268, 523, 527–530 expected utility maximization, 527, 542 expected utility theory, 594 expected value, 512, 518, 530 experimentation, 329 expert probabilities, 469 explanation as rational expectability, 133 explanatory power, 331 explanatory reasoning, 123 explicandum, 266 explication, 93 explicative reasoning, 118 explicatum, 266 externalism, 66–67 extrapolation, 709 horizontal vs. vertical, 130 fallacies, 601 fallibilism, 334 falsifiability, 726 Fann, K. T., 123 Feigl, H., 74 Fermat, P., 32, 180 fiducial argument, 644, 646–650
Field shift, 453 Field, H., 453 finite additivity, 415 finite continuum of inductive methods, 286 finite de Finetti representation theorem, 284 finite exchangeability, 282 finite rule of succession, 284 finitely additive probability, 268 first induction theorem, 288 Fisher, R. A., 173, 383, 429 Formal Learning Theory, 707 formal learning theory, 347 formal theory of epistemology, 104 Foucher de Careil, A., 36 foundationalism, 61–63 foundations of deductive inference, 358 foundations of mathematics, 557 four-fold pattern, 596, 597 Fréchet, M., 189 Fraassen, B. van, 45, 371 Frankfurt, H., 142 Freedman, D., 287, 462 Frege, G., 313 frequencies, 187, 188, 192 of the frequencies, 281 frequency, 185, 186, 193–195, 490, 491 interpretation of probability, 379 notion of probability, 169 frequentism, 173, 187, 194, 557 frequentist interpretation, 154, 197 frequentist statistics, 421 Friedman, M., 104 Fries, J. F., 266 Gärdenfors, P., 743, 756, 758 Gabbay, D. M., 149 Gaifmann, H., 469 Galavotti, M. C., 153, 180 Galileo, 93, 104 gamma function, 276 Garrett, D., 67–70 Gassendi, P., 29, 31
generalised quantifiers, 588, 591 generalized combined system, 321 Gigerenzer, G., 560 Gillies, D., 165, 166, 171, 172 Glymour, C., 104, 109 Good, I. J., 161, 184, 197, 265 Goodman’s new riddle of induction, 297 Goodman, N., 57–60, 76, 333, 384 graph learning, 732 graphical model, 588 Grosseteste, R., 17, 18 group invariance, 292 grue, 57–60, 76, 298, 486, 508 Hacking, I., 79, 81, 153, 313 Hall, N., 468 Halpern, J. Y., 438 Hanson, N. R., 148 Harman, G., 148 Harper, W. L., 99, 104 Harrod, R. F., 162 Hausdorff moment theorem, 270 Heisenberg, W., 357, 358, 361, 385, 386 Helmholtz, H. von, 562 Hempel’s equivalence condition, 297 Hempel’s paradox, 297 Hempel, C. G., 105, 129, 335, 383, 394, 395, 397, 403, 409 Hertz, P., 364, 377, 379 heuristics, 560 hierarchy of knowledge, 94 hierarchy of successive generalizations, 104 high probability, 332 conclusion, 582 Hilbert, D., 359, 369 Hilpinen, R., 326 Hintikka, J., 279, 311 Hobbes, T., 31, 162 Hosiasson-Lindenbaum, J., 267, 333 Howie, D., 173 Howson, C., 80, 82, 438
Hume’s fork, 49–61 Hume’s problem, 43–88 Hume, D., 1, 10, 19, 35, 43–61, 67–70, 88, 163, 184, 192, 313, 359, 362, 391–393, 398, 399, 407, 410, 411 Humphreys, P., 109 Huygens, C., 180 hypergeometric distributions, 283 hypergeometric probabilities, 283 hypothesis (or hypothetical inference), 125 hypothesis acceptance, 711, 715 hypothetical reasoning, 126 hypothetico-deductive method, 338 hypothetico-deductivism, 104 hypothetico-inductive inferences, 339 Ibn Sina, see Avicenna ignorance priors, 423 imaging, 532, 533 implementational level, 560 incremental information, 330 independent and identically distributed, 283 independence, 192, 196 hypothesis, 583 indeterminate probabilities, 482–485 indeterminism, 196 indicative, 752, 759, 761 induction, 45–47, 58, 117, 157, 167–170, 176, 184, 185, 192, 213 as self-corrective method, 142 by simple enumeration, 24–26, 29 eliminative, 26–27 mathematical, 32 inductive, 167, 175 argument, 168, 169 assumptions, 721–722, 727, 730–731, 734, 738 behaviour, 333 explanation, 331 generalisation, 211 justification of induction, 56
logic, 170, 184, 311, 711 method, 170 reasoning, 571 strength, 601 systematization, 339 inductivism, 174, 338 inference by analogy, 314, 340 inference from signs (semeiosis), 1 inference to the best explanation, 148, 214, 227 inferential asymmetries, 580 infinitely exchangeable sequence, 281 infinitesimal probabilities, 482 infinitesimals, 482, 497, 498 InfoMin, 504, 507 information content, 330 information minimization, 504 informativeness, 589 initial credence function, 304 innate, 406, 409 innateness, 406 instance conformation, 317 instantial relevance, 281 instinct, 67–70 internal scales, 599 interrogative model of inquiry, 350 invariance to subsequence selection, 365 inverse inference, 314 inverse S-shape, 600 inverse square law, 102, 103, 105 James, W., 147, 185 Jaynes, E. T., 197, 425 Jeffrey conditionalization, 178, 386, 497, 499, 501–504, 519 Jeffrey, R. C., 170, 178, 189, 197, 265, 314, 386, 449 Jeffreys, H., 159, 161, 172–177, 197, 278, 312, 427, 640, 642 Jerusalem system, 319 Jevons, W. S., 154, 158, 167, 272, 312 Johnson, W. E., 158–161, 163, 175, 183, 191, 271
Johnson–Carnap continuum, 274
Joyce, J., 440
Judy Benjamin problem, 454
justification, 102, 122
justificationism, 226
K-dimensional system, 324
Kahneman, D., 561
Kaila, E., 312
Kant, I., 118, 358, 363
Kelly, T., 451
Kemeny, J., 268, 313
Kepler’s three laws, 93
Kepler, J., 93, 95–100, 104, 108, 109
Keynes, J. M., 27, 62, 71, 78, 159, 161–172, 175, 178–180, 184, 185, 197, 271, 312, 381
Keynes, J. N., 162
Keynes, M., 163
Kingman, J. F. C., 295
Kleene’s recursion theorem, 710
KLM, see Kraus, Lehmann, Magidor
knowledge, 122
knowledge base, see conditional knowledge base
Kolmogorov complexity, 724, 730, 734, 736
Kolmogorov distance, 504
Kolmogorov existence theorem, 283
Kolmogorov, A. N., 371, 557
Korb, K. B., 109
Kornblith, H., 66
Kowalski, R. A., 149
Kraus, Lehmann, Magidor, 740–744, 747, 749–750, 762, 765–766
Kries, J. von, 270
Kuhn, T., 101
Kuipers, T., 325
Kullback–Leibler entropy, 454
Kyburg, H. E., 334
L’Huilier, S., 284
Lakatos, I., 10, 311
λ-continuum, 315, 316, 486
λ–α-continuum, 322
Lang, M., 86–88
Lange, M., 452
language, 564
language acquisition, 568
Laplace, P.-S. Marquis de, 153, 158, 166, 167, 196, 265, 312, 423
Laplacian demon, 560
law, 404, 410
  of inertia, 93
  of large numbers, 348, 363, 364
law-like generalisations, 129
lawlike, 401, 404, 410
lawlikeness, 402
Leibniz, G. W. F., 29, 33, 35, 153, 304, 313
Levi, I., 334
Lewis, C. I., 378
Lewis, D., 466, 752–758
likelihood function, 418
likelihoodism, 112
limiting frequency interpretation of probability, 361
Lindley’s theorem, 443
Lindley, D., 197, 443
linearity, 416
linguistics, 564
Lipton, P., 148
Locke, J., 2, 6, 22, 29, 33, 163
log-loss, see self-information loss
logic of criticism, 205
logic of discovery, 148, 205
logic of justification, 205
logic programming, 149
logical, 159
  independence, 302
  interpretation, 153, 155, 165
  interpretation of probability, 169
  pragmatics, 315
  probabilities, 71–72
  probability, 161
  relation, 165, 171
logicism, 153, 154, 156, 172, 174, 175, 177, 194, 197
logicist, 162, 169, 178
Los, J., 327
loss aversion, 596
loss function, 734
lottery paradox, 541
Mach, E., 175, 189
MacHale, D., 154
machine learning, 347, 569
Mackie, J. L., 52, 58
Majer, U., 180
Makinson, D., 740–741, 743, 756, 758, 765, 766
Markov exchangeability, 289
Marr, D., 559
matching bias, 586
material rules of inference or inference-licences, 217
mathematical expectation, 181
max-heuristic, 590
maxim of pragmatism, 135
maximum likelihood estimation, 626, 637, 643, 648
McMullin, E., 148
McTaggart, J., 162
MDL, see minimum description length
Mellor, H., 180
mental accounting, 595
mental models, 555
metalearning, 724
method of betting, 179, 180
metric, 504
Mill’s four methods of induction, 108
Mill, J. S., 2, 36, 62, 93–99, 101, 102, 104–108, 111, 163, 266, 338
Miller’s paradox, 299
Miller’s principle, 514
Miller, D., 311
Milton, J., 162
min-heuristic, 590, 591
minimal change, 419
minimum description length, 724, 736
Mises, R. von, 286, 359, 364, 365
mixing measure, 276
modal logic, 328
modal propositions, 19
model selection, 109
modus ponens, 578
modus tollens, 578
Mondadori, M., 314
monotonic logic, 215
monotonicity, 553, 740, 742–744, 746, 750, 752, 759, 761, 764–766
Moore, G. E., 162, 165, 180
multinomial hypothesis, 633–636, 643, 645
multinomial probability, 275
multinomial sampling, 271
Myrvold, W., 104
Nagel, E., 147, 365, 377–381
Natorp, P., 363
natural, 410
  and unnatural kinds, 410
  images, 607
  kinds, 87, 405
  languages, 716
  sampling, 585
necessity, 328
negations paradigm, 585
negative conclusion bias, 581
neural network semantics, 762
New Principle, 469
new riddle of induction, 57–60, 76, 384
Newton, I., 93, 94, 101, 103–105
Neyman, J., 333
Neyman–Pearson hypothesis testing, 626, 637–640, 642, 646, 648
Nicod’s criterion, 297
Nicole, P., 30, 31
Nietzsche, F., 36
Niiniluoto, I., 339
Nizzoli, M., 35
no-free-lunch theorems, 721–722, 736
nomic constituent, 328
non-deductive reasoning, 121
non-demonstrative reasoning, see non-deductive reasoning
nonmonotonicity, 554, 740, 750, 759, 761, 764–766
nonpragmatic vindications of probabilism, 440
normal distribution, 292
normal sequences, 365, 369
normality, 415
normative vs. descriptive, 557
Norton, J. D., 104
novel predictions, 136
numerosity, 574
objective, 165, 185, 186, 188, 191, 493, 494, 500, 537, 539, 540
objective Bayesianism, 423
objective probability, 179, 190, 194
observational errors, 342
Occam’s razor, 35, 140, 723, 724, 729, 734
Ockham, William of, 18, 19, 22
Oddie, G., 343
odds, 488, 517, 521, 542
Okasha, S., 82
old evidence, 535, 536, 538, 539
online learning, 731, 734
operational definitions, 267
optimal data selection, 583
optimality theory, 564
order of evidence, 502
ordinary language dissolution, 63–66
outcome set, 477
outcome space, 478, 518
overdetermination, 104
overfitting, 723
overfitting avoidance, 723
P-entailments, 590
PAC learning, 725–731
packaging assumption, 521, 522
Papineau, D., 66
Pappus, 30
Paradeigma, see Example
paradox of the ravens, 297
paradox of the second ace, 297
paradoxes of confirmation, 333
parallel distributed processing, 109
parametric statistical family, 292
parsing, 566
partial belief, 183, 186
partial entailment, 485, 486
partial exchangeability, 289
partially exchangeable sequence, 462
Pascal’s triangle, 32
Pascal, B., 180, 415
Pearson, E. S., 333
Pearson, K., 175
Peirce’s three stages of inquiry, 143
Peirce, C. S., 117, 185, 186, 188, 313, 371
penalty methods, 195
perception, 554
perceptual system, 606
perceptuo-motor control, 563
perceptuo-motor task, 600
performance guarantee, 712
permutation postulate, 191, 271
personal probabilities, 53–54
physical chance, 491, 492, 514
physical probability, 177
Pietarinen, J., 326
Plato, 2, 3, 9, 165, 359
plausibility, 566
Plotinus, 12
Poincaré, H., 178, 188, 271
Poisson, S. D., 300
Pólya’s Urn, 460
polynomial learnability, 729–730
Popper, K., 44–45, 148, 299, 311, 359, 377–379, 382, 383
Popper–Rényi, 538
Popper–Rényi function, 481, 498, 501
posit, 373, 374, 377, 382, 384
positive instantial relevance, 316
positive relevance, 332
positivists, 359
posits, 368
possibility, 50, 328
poverty of the stimulus, 568, 570
practical syllogism, 231
pragmatic, 402, 405, 406, 410, 581
  justification of induction, 74–80
  rationality, 521, 524, 527
pragmatism, 134
prediction, 102–105, 110, 625, 630, 635, 636, 639, 644, 645, 650
  with expert advice, 736–738
predictive induction, 125
predictive inference, 314
predictive probabilities, 270
preface paradox, 541
preferences, 267, 488, 489, 522, 523
preferential semantics, 740, 743, 747–751, 758, 762
prequential analysis, 734
preservation, 417, 419
Prevost, P., 284
Price, R., 167
principal principle, 514
principle of causality, 176, 363
principle of cogent reason, 270
principle of finite attainability, 375
principle of indifference, 166, 167, 183, 269, 383, 505–509, 511, 512
principle of induction, 167, 171
principle of insufficient reason, 153, 158, 166, 423, 505
principle of lawful distribution, 363
principle of limited variety, 168, 170
principle of logical omniscience, 299
principle of maximum entropy, 197, 507
principle of plenitude, 329
principle of the uniformity of nature, 55–61, 70, 167
prior probabilities, 80–83
priority heuristic, 597
probabilism, 493, 516, 522–524
probabilistic coherence, 431
probabilistic entailment, 589
probabilistic fallacies, 561
probabilistic logic, 752, 759–762
probability, 11, 50–54, 153–155, 157–167, 169–171, 173–180, 182, 183, 186–195, 312, 509, 555, 594
  calculus, 190
  density, 509
  function, 478
  games of chance, 156
  heuristics model, 588
  logic, 312, 327
  space, 478, 518
  theory, 180
probability1, 266
probability2, 266
probable approximate truth, 346
probable verisimilitude, 346
problem of induction, 43–88
problem of old evidence, 298
problem of the long run, 78
problem of the priors, 423
processing limitations, 597
projectability, 298
propensity, 348, 492
property testing, 732
prospect theory, 595
psycholinguistics, 566
psychologism, 557
psychology, 555
Ptolemy, 13, 96
Putnam, H., 76, 338
Q-predicate, 315
quantum mechanics, 177, 515
quasi-regular, 571
query learning, 732
Quintilian, 11
Ramsey test, 579, 582
Ramsey, F. P., 80–81, 159, 161, 162, 168, 170–172, 174, 176, 177, 179–183, 185–191, 196, 197, 265, 266, 312, 433, 579
randomness, 364, 365
range, 316
Raven paradox, 383, 540
realism, 104
reduced array selection task, 585
reference class, 367–369
reflection, 512, 513, 516, 520
reflective equilibrium, 411
regret, 738
regular divisions, 365
regular measure function, 267
regularity, 498, 500, 516, 520, 536, 538
Reichenbach’s axiom, 316
Reichenbach’s distinction between the context of discovery and the context of justification, 148
Reichenbach, H., 74–80, 147, 193, 312
relevance logic, 215
representation, 606
  theorem, 186, 191, 192, 488, 522–524
representative function, 316
Rescher, N., 180
resemblance, 341
responsiveness to evidence, 419
retroduction, 121
reverse engineering, 556
rhetoric, 13, 20
rigidity, 502
  condition, 581
Ross, Sir David, 6
Royce, J., 147
rudimentary or crude induction, 125
rule and exception, 571
rule of acceptance, 333
rule of succession, 169, 183, 316, 425
Russell, B., 62, 162, 163, 173, 180, 377, 381, 382
Sahlin, N.-E., 184
Salmasius, C., 162
Salmon, W., 44, 47, 57, 62, 63, 65, 74, 77, 79
sample complexity, 728–730
sample space, 627–633, 637–645, 647
sampling of species problem, 294
Savage, L. J., 184, 189, 191, 195, 197, 265, 283, 333, 434
Scheines, R., 109
Schoenberg’s theorem, 292
scientific discovery, 109
Scott, D., 439
selection task, 586
self-describing recursive function, 710
self-information loss, 734
Sellars, W., 87, 128
semantic information, 330
sequence prediction, 732–738
Sextus Empiricus, 8–10, 359
Shimony, A., 440
Sidgwick, H., 162, 163
σ-algebra, 478
similarity, 340, 405, 410
simple, 410
simple enumerative induction, 107
simplest, 408
simplicity, 102–104, 408, 409, 608
  postulate, 174
singular predictive inductive inference, 211
skepticism, 43, 67–70, 82
Skyrms, B., 54, 63, 184, 197, 435
slippery slope, 605
Socrates, 2, 5
Solomonoff’s theory of induction, 734–736
Solomonoff, R., 730, 734
some-not-heuristic, 590
special reflection, 513
sphere semantics, 753–758
Spinoza, B., 22
Spirtes, P., 109
state description, 272, 315
statistical hypothesis, 625, 626, 629, 632, 633, 635–639, 641, 643–646, 648, 649
statistical inductive methods, 167
statistical inference, 109, 160, 192, 193
statistical mechanics, 515
statistical methodology, 175, 176
statistical or quantitative induction, 125
statistical reasoning
  Bayesian approach, 109
  likelihoodist, 109
statistical syllogism, 83–85
statistics, 625, 626, 628–630, 632, 634, 636, 640–642, 645, 646, 648, 649
Stegmüller, W., 324
Stove, D., 52, 58, 83
Strachey, L., 162
straight rule, 75, 359, 362, 364, 376, 377, 380, 385, 387, 508
Strawson, P. F., 63
strictly coherent, 499
structure description, 272, 315
subjective, 165, 185, 186, 493, 494, 500, 537, 539
subjective Bayesianism, 431
subjective interpretation, 178, 186, 187, 189
subjective interpretation of probability, 174, 558
subjective probability, 179, 183, 189–193, 196, 555
subjective theory, 194
  of probability, 190
subjectivism, 153, 155, 157, 171, 172, 174, 175, 177, 183, 189, 194, 195, 197, 487
subjectivist, 178, 179, 184, 192, 194, 195
subjunctive, see counterfactual
substantial information, 330
successive generalization, 105
sufficient statistics, 273
sufficientness postulate, 274, 323
Suppes, P., 325
surprise value, 330
syllogistic reasoning, 586
symmetry, 161, 178, 194
synchronicity, 499
syntactic ambiguity, 566
synthetic a priori, 358, 363, 364, 369
systematic power, 331
tacking problem, 540
target concept, 720, 725
Tarski, A., 377, 380, 381
Thagard, P., 148
The Equation, 578, 582
The Principal Principle, 466
The Theoretician’s Dilemma, 338
theory of communication, 608
theory of probability, 153–155, 157, 159, 163–165, 169, 172, 173, 182, 184, 185
three prisoners paradox, 297
Tichý, P., 343
total evidence, 298
total probability, 417
trade-off, 597
transduction, 732
transition counts, 289
transitivity, 489
transmitted information, 331
truth, 336
  permanently settled belief, 123
truthlikeness, 343
Tuomela, R., 326
Turing, A., 281
two-dimensional continuum, 320
two-envelope paradox, 297
validity formal and semantic, 209 Valla, L., 21 van Fraassen, B., 440 Vapnik, V., 725, 732 variational distance, 504 variety principle of limited, 27 variety of evidence, 323 VC-dimension, 725–726, 728–729 Venn, J., 154, 162, 266 verifiability theory of meaning, 216 verisimilitude, 343 Vico, G., 21 Victorinus, 17 Vienna Circle, 266, 359 visual perception, 562 Vives, J. L., 21 Waismann, F., 163, 267, 312 Wallis, J., 31, 32 “washing out”, 445 Wason’s selection task, 582 weight, 168, 169, 183, 184 Whewell, W., 93–96, 101–109, 111– 113, 338 Whitehead, A. N., 162, 163 Wilkins, J., 304 Williams, D., 83–86 Wittgenstein, L., 64, 163, 168, 180, 184, 185, 267, 312 Woods, J., 149 Woolf, L., 162 Wright, G. H. von, 312 Wrinch, D., 173–175, 278 Zabarella, J., 21, 22, 32, 36 Zabell, S., 153, 161, 452 zero-denominator problem, 482