ASPECTS
OF INDUCTIVE LOGIC
Edited by
JAAKKO HINTIKKA Professor of Philosophy, University of Helsinki and Stanford University
PATRICK SUPPES Professor of Philosophy and Statistics, Stanford University
1966
NORTH-HOLLAND PUBLISHING COMPANY, AMSTERDAM
© North-Holland Publishing Company, Amsterdam, 1966
All rights reserved. No part of this book may be reproduced in any form, by print, photoprint, microfilm or any other means, without written permission from the publisher.
PRINTED IN THE NETHERLANDS
PREFACE
Of the fourteen papers included in this volume, eight were read, in a form not always identical with the present one, at an International Symposium on Confirmation and Information which was held in Helsinki, Finland, from September 30 to October 2, 1965. These are the papers by Black, Törnebohm, Walk, von Wright, Hintikka, Hintikka and Hilpinen, Hintikka and Pietarinen, and Tuomela. The present volume thus in effect incorporates the proceedings of this symposium.

The Helsinki symposium was held under the auspices of the International Union for the History and Philosophy of Science (Division of Logic, Methodology, and Philosophy of Science) and of the Finnish Philosophical Society. It was financially supported by the International Union and by the Finnish Ministry of Education, and helped in other ways by the University of Helsinki. The present volume thus owes an indirect debt to all these institutions. Many of its contributors are undoubtedly also grateful to the other participants of the Symposium whose remarks helped to shape their papers. Among these participants, special mention should perhaps be made of those speakers whose contributions are not published here: Dagfinn Føllesdal (Oslo), Yrjö Reenpää (Helsinki), and Wolfgang Stegmüller (Munich).

The Helsinki symposium was partly inspired by the informal seminar on induction which was held at Stanford University during the Spring term of 1965. This seminar was attended by members of the Stanford Philosophy Department and also by Max Black, who at the time was a Fellow of the nearby Center for Advanced Study in the Behavioral Sciences. The earliest form of several of the papers included in this volume was presented at the Stanford seminar.

The papers included in the present volume do not fall within any single approach to induction and to its logic.
On the contrary, a wide spectrum of views is represented, not only in the sense that the main allegiances of the different authors to the several well-established schools of thought are often somewhat different, but also in the sense that new approaches are suggested and explored in some of the papers. For instance, Ernest Adams explores the
logic of conditionals in its relation to probability much more systematically than has been done before; Suppes exhibits some unexamined connections between the problems of concept formation and the logic of induction; and Peter Krauss and Dana Scott develop in detail the model theory that arises from assigning probabilities rather than truth values to first-order formulas. In some other papers, the concept of information is brought to bear on the logic of induction in a novel fashion. One of them (by Hintikka and Pietarinen) suggests a more hopeful view of the possibility of conceiving of induction in a decision-theoretic spirit as a maximization of certain "epistemic utilities" than earlier attempts in this direction have indicated. In Walk's paper, some "epistemic utilities" different from information are studied and related to the concept of information. The extensions of Carnapian techniques and results outlined in Hintikka's paper may bring out a need of modifying the underlying philosophical viewpoint so as to do fuller justice to the critics of Carnap's earlier work.

The new points of view frequently enable the authors to put earlier work into a fresh perspective. Thus Suppes appraises critically the relevance of the notion of total evidence to probabilistic inference, while Black surveys the paradoxes of confirmation, for which von Wright and Suppes suggest new treatments. Hintikka and Hilpinen report certain new positive results concerning the possibility of essentially probabilistic rules of acceptance.

It is our hope that some of the novel approaches suggested by the different authors, and the results they obtain, will turn out to lead to new and better ways of understanding the subtle and difficult processes of induction.

Stanford, California
March 1966
THE EDITORS
KNOWLEDGE, ACCEPTANCE, AND INDUCTIVE LOGIC

JAAKKO HINTIKKA
University of Helsinki, Helsinki, Finland, and Stanford University, Stanford, California

and

RISTO HILPINEN
University of Jyväskylä, Finland
1. According to a widespread philosophical view, most of our knowledge of empirical matters of fact is logically speaking not absolutely certain, but only probable. This feature is perhaps seen best by considering empirical generalizations and statements concerning future events. It is impossible to verify a universally quantified sentence with factual content in a logically conclusive way (except trivially by exhausting the whole universe). Such sentences may be more or less confirmed by available evidence, but not entailed by it. In the same way, singular predictions concerning future events may in some cases be very probable, but usually not absolutely certain. However, most of us are presumably ready to claim that we know the truth of many generalizations as well as the truth of many statements concerning future events. It thus lies close at hand to try to define knowledge in terms of truth and high probability, and many philosophers have in fact tried to do so. In this paper we shall consider some of the difficulties encountered by these attempts, and propose a partial solution to them. In addition, we shall consider the corresponding difficulties which arise in connection with probabilistic rules of acceptance. Other aspects of the interrelation of knowledge, certainty, probability, and entailment will not be taken up here.

2. Roderick M. Chisholm has defined the concept of knowledge in the following way¹:

(D1) "S knows that h is true" means:
(i) S accepts h,
(ii) S has adequate evidence for h, and
(iii) h is true.

¹ Chisholm [1957] p. 16.
Part (i) of the definition (D1) is ignored here, because it is not essential for those logical aspects of the concept of knowledge in which we are here interested. What we are here interested in are the conditions on which people are justified in making knowledge-claims. Parts (ii) and (iii) of (D1) can be said to define "S is in a position to know that h is true" or simply "h is knowable to S". For the sake of simplicity, we shall in the sequel omit the reference to a person S, because it is immaterial for the purposes of our argument. "Having adequate evidence for h" in part (ii) of (D1) means of course that there is a body of evidence e that gives h support strong enough to make it acceptable. We express this in short by "Ac(h, e)". The acceptability of h makes it rational to believe that h is true². If evidence e makes h acceptable and, in addition, h is true, its truth is knowable on the basis of e. This will be expressed by "K(h, e)". (D1) thus yields another definition of knowledge:

(D2) K(h, e) =df Ac(h, e) & h.
When is evidence e adequate for the acceptance of a hypothesis h? As we suggested earlier, it is tempting to require only that e makes h probable enough. "Probable" here refers of course to logical probability, i.e. to a degree of confirmation. According to the usual probabilistic analysis of empirical knowledge, the acceptability of a proposition can thus be defined as follows:

(D3) Ac(h, e) =df P(h, e) > 1 − ε.

Probability 1 − ε is supposed to be relatively high; in other words, 0 < ε ≤ 0.5. According to Chisholm, ε = 0.5 is sufficient. Similar views have been put forward by Hempel, too³. We shall here leave the question concerning the precise value of ε open. (D3) purports to be a definition of acceptability, and it may be called a putative rule of acceptance for empirical hypotheses. It says that it is reasonable to accept a hypothesis if and only if its degree of confirmation is higher than 1 − ε. The question whether it is reasonable to incorporate such probabilistic rules of acceptance as (D3) into inductive logic has recently been

² The concept of acceptability used here is different from the concept used by Chisholm [1957]. According to Chisholm, a hypothesis h is acceptable for S if and only if S does not have adequate evidence for the contradictory of h. We use the expression "h is acceptable on the basis of e" as a synonym for the expression "e gives adequate evidence for h". We might distinguish these two concepts by calling the concept used by us acceptability in the strong sense, whereas Chisholm uses the word "acceptable" in the weak sense. See pp. 8–9.
³ See Chisholm [1957] p. 28 and Hempel [1962] p. 155.
subject to a great deal of discussion. Carnap and many other modern writers on probability and induction, partly influenced by modern statistical decision theory, have argued against their usefulness⁴. On the other hand, many philosophers of science have stressed the importance of the tentative acceptance of hypotheses in scientific enquiry, e.g. in connection with the "hypothetico-deductive" method⁵. We are not going to discuss the importance of rules of acceptance from the point of view of the methodology of science. Instead, we shall consider the possibility of incorporating a probabilistic rule of acceptance into a system of quantitative inductive logic, because such a rule would in our view be very helpful for the purpose of explicating in terms of inductive probabilities such classificatory expressions used by most philosophers as "empirical knowledge", "practical certainty" and "rational belief".

3. In spite of the intuitive plausibility of definitions (D2) and (D3), they lead to difficulties, as Keith Lehrer and R. C. Sleigh (among others) have recently emphasized⁶. These difficulties are connected with two closure conditions that are usually presented as principles of epistemic logic. The conditions in question are:

(CK1) If K(h_1, e) & K(h_2, e) & ... & K(h_k, e) and if ⊢ (h_1 & h_2 & ... & h_k) ⊃ h_0, then also K(h_0, e).

The condition (CK1) is very natural, because all sentences entailed by a set of true sentences are true, too. In order to know that a proposition is true it should therefore suffice that it be entailed by other propositions that are known to be true, which is just what (CK1) says. Another very obvious condition is

(CK2) The set K = {h_i : K(h_i, e)} is logically consistent.
(CK1) and (CK2) together say that the set K defined in (CK2) is consistent and logically closed. (CK1) and (CK2) are adopted as epistemic principles

⁴ According to one school of thought, inductive reasoning about a proposition h should lead, not to its acceptance or rejection, but to the assignment of a credibility-value, i.e. a degree of confirmation, to the proposition. By using credibility-values it is possible to determine in the usual decision-theoretic way how one should act in each particular situation to maximize one's expected utility. For this kind of conception of inductive inference, see Carnap [1962] pp. 316–317, and Jeffrey [1956].
⁵ See e.g. Kyburg [1965] pp. 301–310, and Popper [1959], e.g. pp. 22, 418–419.
⁶ The difficulties and contradictions that arise in connection with the attempt to define knowledge in terms of truth and high probability have recently been subject to a great deal of discussion. See e.g. Lehrer [1964] and Sleigh [1964].
by Roderick M. Chisholm and Richard M. Martin (in the form (CA1)–(CA2) given below). They are assumed to hold for the concept of knowledge by Jaakko Hintikka in his book Knowledge and Belief⁷. It is easy to see that definitions (D2) and (D3) together with conditions (CK1) and (CK2) give rise to a contradiction, when used without any restrictions. The contradiction in question is sometimes called the lottery paradox⁸. In its simplest form it comes about as follows: Suppose that the following sentences are true:

h_1,  (1)
h_2,  (2)
P(h_1, e) > 1 − ε,  (3)
P(h_2, e) > 1 − ε.  (4)

According to (D2), (D3), and (CK1) sentences (1)–(4) together entail

K(h_1 & h_2, e).  (5)

Because of definitions (D2) and (D3), (5) entails

P(h_1 & h_2, e) > 1 − ε.  (6)

On the other hand, because of the multiplication theorem of probabilities, it is possible that although (1)–(4) are true,

P(h_1 & h_2, e) ≤ 1 − ε,  (7)

which contradicts (6). Conditions (CK1)–(CK2) do not concern the concept of knowledge alone. Carl G. Hempel has put forward similar requirements of consistency and of logical closure as necessary conditions of rationality in the formation of any beliefs⁹. Because "h is acceptable" means just that it is reasonable to believe that h is true, Hempel's conditions can be expressed as follows:

(CA1) If Ac(h_1, e) & Ac(h_2, e) & ... & Ac(h_k, e) and if ⊢ (h_1 & h_2 & ... & h_k) ⊃ h_0, then also Ac(h_0, e).
(CA2) The set A = {h_i : Ac(h_i, e)} is logically consistent.

⁷ See Chisholm [1957] p. 13, Martin [1963] pp. 95–101, and Hintikka [1962] chapters 2 and 3.
⁸ The expression "lottery paradox" has been used by Kyburg [1965] p. 305. The contradiction (1)–(7) is a special case of the lottery paradox. Another form of the contradiction in question is exemplified in formulas (8.1)–(9).
⁹ Hempel [1962] p. 149.
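The way the multiplication theorem generates the contradiction can be checked with a small numerical sketch, using the fair-lottery reading that gives the paradox its name (the values k = 100 and ε = 0.05 below are merely illustrative, not drawn from the text):

```python
# Lottery paradox, numerically: in a fair lottery with k tickets, let the
# hypothesis h_i say "ticket i loses".  Each h_i has probability (k-1)/k,
# which exceeds 1 - eps, so rule (D3) accepts every h_i separately; but
# the conjunction h_1 & ... & h_k says "no ticket wins" and has
# probability 0, so the accepted set violates (CA1)-(CA2).
k = 100
eps = 0.05

p_single = (k - 1) / k        # P(h_i, e) = 0.99 for each ticket i
assert p_single > 1 - eps     # (D3) accepts each h_i separately

p_conjunction = 0.0           # some ticket must win
assert not (p_conjunction > 1 - eps)   # the conjunction is not acceptable
```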
These conditions are very natural and their intuitive justification corresponds closely to the justification of (CK1) and (CK2). In addition, Hempel has stated the requirement of total evidence as a third condition of adequacy for the acceptability of hypotheses. This condition creates of course special problems, but in the present discussion it can be supposed to be satisfied¹⁰. It is clear that any purely probabilistic definition of acceptability, e.g. (D3), will contradict (CA1) and (CA2), because of the multiplication theorem of probabilities. For instance, it is possible that the following k + 1 sentences are true:

P(h_1, e) > 1 − ε,  (8.1)
P(h_2, e) > 1 − ε,  (8.2)
...
P(h_k, e) > 1 − ε,  (8.k)

and

P(~h_1 ∨ ~h_2 ∨ ... ∨ ~h_k, e) > 1 − ε.  (9)
According to (8.1)–(8.k), (9) and (D3), every member of an inconsistent set of sentences {h_1, h_2, ..., h_k, ~h_1 ∨ ~h_2 ∨ ... ∨ ~h_k} is acceptable, which contradicts (CA2). The contradiction in question will come about, although ε is very small, if the number k is sufficiently large.

4. The contradictions discussed in the preceding section show that it is not possible to define acceptability in terms of high probability alone and at the same time require that the concept in question has to satisfy conditions of adequacy as strong as (CA1) and (CA2). Moreover, if we wish to stick to the corresponding epistemic principles (CK1) and (CK2), (D3) will not do as a definition of acceptability. Because of this, Herbert Heidelberger has concluded that it is a mistake to assimilate probability to knowledge; in other words, he has rejected the usual probabilistic analysis of empirical knowledge altogether¹¹. Henry E. Kyburg has tried to tackle the problem in another way, although perhaps not entirely successfully so far. He has replaced (CA1) and (CA2) by weaker conditions and retained in his system of inductive logic a purely probabilistic rule of acceptance¹². Kyburg's studies are not concerned with the concept of knowledge, but with the concept of rational belief. However, it seems to us that although it perhaps is possible to speak of the rationality of beliefs in a way that does not presuppose (CA1) and (CA2), the concept of empirical knowledge in any case has to satisfy (CK1) and (CK2). Nevertheless, the conclusion drawn by Heidelberger is in our view too hasty. What the contradictions and paradoxes show is only that high probability cannot be a sufficient condition of acceptability, if "acceptability" is understood in the strong sense required by (CA1) and (CA2). As we shall show in the sequel, it is possible in certain interesting cases to incorporate in a system of quantitative inductive logic a rule of acceptance which satisfies (CA1) and (CA2). The rule in question is probabilistic, but not purely probabilistic. It is obtained from (D3) by means of a very simple additional condition. Because it fulfils Hempel's requirements, the concept of knowledge defined by means of it and (D2) satisfies (CK1) and (CK2).

5. The problem of the acceptability of hypotheses is especially interesting in the case of general propositions, for such propositions provide clear examples of the probabilistic character of empirical knowledge. Because of the rule (CA1), it will also be possible to justify the acceptability of many singular sentences, if the acceptability of the corresponding generalizations can first be justified. This aspect of the problem is also of special interest from the point of view of the philosophy and methodology of science, where general laws loom especially large. Therefore we shall first consider the acceptability of general sentences, and turn later to singular sentences. Dealing with general propositions by means of a probabilistic rule of acceptance presupposes of course that it is possible to attach probabilities to such sentences in a reasonable way.

¹⁰ Hempel [1962] p. 151.
¹¹ Heidelberger [1963].
¹² Kyburg takes (CA1) to mean not that every logical consequence of the conjunction of the sentences in A belongs to A, but only that every logical consequence of each single element of A belongs to A. He interprets (CA2) in the same way. See Kyburg [1965] p. 307. However, the lottery paradox still arises in the system of inductive logic presented by Kyburg [1961], though in a somewhat peculiar way. See Schick [1963] p. 11.
Carnap's well-known system of inductive logic is of little use here, because according to it all universal sentences with factual content receive negligible degrees of confirmation if the evidence does not contain a relatively large part of the individuals in the whole universe. In particular, in infinite domains of individuals Carnap's confirmation function c* gives all generalizations zero probability. In other words, according to c*, no factual generalization concerning an infinite universe has any credibility whatsoever; on the contrary, we ought to believe that there are all possible kinds of individuals in our universe¹³. These disadvantages cannot be wholly

¹³ The same holds for all the other systems in Carnap's λ-continuum of inductive methods, except for λ = 0. See Carnap [1950] pp. 570–571, and [1952].
eliminated by using the notion of instance-confirmation¹⁴. In discussing the applicability of inductive logic Carnap has stressed its importance for practical decisions. A system based on c* may perhaps be applicable for practical decision-making, because in these decisions it normally suffices to consider only the instance-confirmation of a generalization¹⁵. However, such a system cannot be used as a rational reconstruction of the more theoretical aspects of scientific activity, because of the weakness just mentioned. Jaakko Hintikka has recently constructed a system of inductive logic that seems to give fairly natural degrees of confirmation to general sentences. Our solution of the paradoxes of acceptability makes use of Hintikka's system. It is in principle applicable to all first-order languages. We shall here restrict our attention mainly to the case in which only monadic predicates are used¹⁶.

6. Let us consider a simple language L_k that contains k primitive monadic predicates P_i (i = 1, 2, ..., k). By means of these predicates and propositional connectives it is possible to define exactly K = 2^k different kinds of individuals. These kinds of individuals are specified by certain complex predicates Ct_j (j = 1, 2, ..., K), which we shall call attributive constituents or Ct-predicates. In L_k they are simply Carnap's Q-predicates in a new guise. By specifying of each attributive constituent Ct_j whether it is instantiated or not, it is possible to define 2^K different constituents. Constituents describe all the different kinds of "possible worlds" that can be specified by means of our monadic predicates, quantifiers, and propositional connectives¹⁷. Let us assume that the whole domain of individuals of which we are speaking in L_k contains N individuals. Suppose that we have observed n individuals sampled from the universe in question and that the observed individuals exemplify c different attributive constituents. Let e be a singular sentence that describes our sample. If attributive constituents are numbered in a suitable way, an arbitrary

¹⁴ Hintikka has argued that it is not possible to explain our preferences among generalizations by means of the notion of instance-confirmation. See Hintikka [1965a] pp. 274–288, especially p. 277.
¹⁵ See e.g. Carnap [1950] pp. 571–573.
¹⁶ For the system of inductive logic used here, see Hintikka [1965a].
¹⁷ Constituents, attributive constituents, and distributive normal forms have been characterized in greater detail in a number of papers by Hintikka. See e.g. Hintikka [1953], and [1965b] pp. 47–90.
constituent compatible with evidence e can be written as follows:
(Ex)Ct_i1(x) & (Ex)Ct_i2(x) & ... & (Ex)Ct_ic(x) & ... & (Ex)Ct_iw(x)
& (x)(Ct_i1(x) ∨ Ct_i2(x) ∨ ... ∨ Ct_ic(x) ∨ ... ∨ Ct_iw(x)),  (10)

where c ≤ w ≤ K and where Ct_i1(x), Ct_i2(x), ..., Ct_ic(x) are all the attributive constituents instantiated in our sample. We shall call constituent (10) C_w. In Hintikka's system of inductive logic, a priori probabilities are first distributed among the 2^K constituents. The probability of each constituent is then divided evenly among the state-descriptions that make the constituent in question true. In the simplest case, we may assume that all constituents have received an equal a priori probability 1/2^K. We shall discuss this assumption later, but meanwhile we shall base our calculations on it. A posteriori probabilities, or degrees of confirmation, are given to the constituents by Bayes' well-known formula
P(C_w, e) = P(C_w) P(e, C_w) / Σ_i P(C_i) P(e, C_i),  (11)

where the sum in the denominator is taken over all constituents compatible with the evidence e. For a given value of i, c ≤ i ≤ K, there are (K−c choose i−c) such constituents C_i. If the number of those state-descriptions which make the constituent C_w true, given the evidence e, is expressed by m(C_w), and the corresponding number in the absence of any evidence is expressed by M(C_w), and if all the constituents have an equal a priori probability, the degree of confirmation of C_w with respect to e is

P(C_w, e) = [m(C_w)/M(C_w)] / Σ_i [m(C_i)/M(C_i)].  (12)

The case in which the universe in question is infinite is the easiest to deal with, and it also seems the most interesting from the point of view of inductive logic. Hence we shall assume in the sequel that we are considering an infinite (or at any rate very large) universe. In this case (12) becomes (approximately)

P(C_w, e) = (1/w)^n / Σ_{i=c}^{K} (K−c choose i−c) (1/i)^n.  (13)

It is easy to see that the value of (13) is the greater the smaller w is, and that it assumes its greatest value when w = c. In other words, according to (13), e gives the strongest support to the constituent C_c which says that in the
whole universe there exist only such kinds of individuals as are already exemplified in our evidence e. This result is very plausible from the intuitive point of view, at least if n is large in relation to K. According to (13) the degree of confirmation of C_c is

P(C_c, e) = 1 / {1 + Σ_{i=1}^{K−c} (K−c choose i) (c/(c+i))^n}.  (14)
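The behaviour of (13) and (14) is easy to check numerically. The sketch below recasts (13) in present-day notation; the values K = 4 and c = 2 are merely illustrative and not drawn from the text:

```python
from math import comb

def p_constituent(w, c, K, n):
    """Degree of confirmation (13) of a single constituent C_w of width w,
    given evidence exemplifying c Ct-predicates in a sample of n individuals
    (infinite universe, equal a priori probabilities for constituents)."""
    denom = sum(comb(K - c, i - c) * (1 / i) ** n for i in range(c, K + 1))
    return (1 / w) ** n / denom

K, c = 4, 2
# C_c is the best-confirmed constituent compatible with e ...
assert p_constituent(c, c, K, 50) > p_constituent(c + 1, c, K, 50)
# ... and its confirmation grows towards one as the sample grows, as (14) shows
assert p_constituent(c, c, K, 100) > p_constituent(c, c, K, 10)
assert p_constituent(c, c, K, 200) > 0.999999
```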
Because every consistent general sentence of L_k can be transformed into a distributive normal form, viz. into a disjunction of constituents, the degree of confirmation of any general sentence h is obtained as the sum of the probabilities of the constituents in its normal form, i.e.

P(h, e) = Σ_i P(C_i^(h), e),  (15)

where "C_i^(h)" denotes an arbitrary constituent which occurs in the normal form of h. What formulas (13) and (14) say can also be expressed as follows: Prior to any evidence, each of the 2^K constituents of L_k had an equal probability 1/2^K. Evidence e changes these probabilities in the following way: All constituents incompatible with e now have zero for their a posteriori probability. All constituents of the kind described in (10), with c ≤ w ≤ K, are compatible with e and have thus positive degrees of confirmation with respect to it. Of these constituents, the most highly confirmed is the constituent C_c that says that in the whole universe there are only such kinds of individuals as are already instantiated in experience. In addition to this, it is easy to see from formulas (13) and (14) that P(C_c, e) grows when n grows and that the probabilities of all the constituents C_w with w > c become correspondingly smaller. Moreover, when n grows without limit, P(C_c, e) approaches one, and the probabilities of all the other constituents compatible with e approach the value zero. We might express these results in a concise form as follows:

(L1) lim_{n→∞} P(C_w, e) = 0 when w > c,
(L2) lim_{n→∞} P(C_c, e) = 1.

(L1) entails (L2), because

Σ_{c≤w≤K} P(C_w, e) = 1.  (16)

As in (11), there are in (16) for each w exactly (K−c choose w−c) equal terms.
The two lemmas (L1) and (L2) concerning the behavior of Hintikka's confirmation function are important for our argument, because they imply that the value of P(C_c, e) can be raised arbitrarily close to one by making n large enough. In other words, if we let n become sufficiently large, say n > n_0, the inequality

P(C_c, e) > 1 − ε,  (17)

where 0 < ε ≤ 0.5, holds for all n > n_0. According to (14), (17) is logically equivalent to

1 / {1 + Σ_{i=1}^{K−c} (K−c choose i) (c/(c+i))^n} > 1 − ε.  (18)

(18) can be expressed in a simpler way if we define a new constant ε′:

(D4) ε′ =df ε/(1 − ε),

i.e.

1 − ε = 1/(1 + ε′).  (19)

In virtue of definition (D4), (18) and therefore also (17) are equivalent to

ε′ > Σ_{i=1}^{K−c} (K−c choose i) (c/(c+i))^n.  (20)

As soon as n is large enough to make (20) true, (17) holds, too, because of the equivalence of (17) and (20). According to (20), the critical value n_0 depends only on the values of K and c, provided the value of ε and thus also of ε′ is fixed. A specific value of K is a characteristic of the language L_k used in our generalizations, but the value of c depends on the number of attributive constituents which happen to be exemplified in evidence e. The critical value of n can be made independent of the specific value of c by replacing (20) by a stronger condition

ε′ > max_c [Σ_{i=1}^{K−c} (K−c choose i) (c/(c+i))^n].  (21)

In other words, the critical value of n is computed by using that value of c, 0 ≤ c ≤ K − 1, which makes the right-hand side of (20) as large as possible. (21) entails (20), and therefore it entails (17), too. We now define the critical value of n as follows:
(D5) n_0 =df the largest integer n for which

ε′ ≤ max_c [Σ_{i=1}^{K−c} (K−c choose i) (c/(c+i))^n].

In virtue of (D5), (21) can be expressed in a very simple way:

n > n_0.  (C.Ac)
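Definition (D5) lends itself to direct computation: since the right-hand side of (21) decreases towards zero as n grows, n_0 can be found by a simple search. A sketch (the values K = 4 and ε = 0.1 are illustrative only, not taken from the text):

```python
from math import comb

def rhs20(c, K, n):
    # right-hand side of (20): sum over i of (K-c choose i) * (c/(c+i))^n
    return sum(comb(K - c, i) * (c / (c + i)) ** n
               for i in range(1, K - c + 1))

def n_0(K, eps):
    """Critical sample size of (D5): the largest n for which
    eps' <= max over c (0 <= c <= K-1) of the right-hand side of (20),
    where eps' = eps/(1 - eps) by (D4)."""
    eps_prime = eps / (1 - eps)
    n = 0
    while max(rhs20(c, K, n + 1) for c in range(K)) >= eps_prime:
        n += 1
    return n

K, eps = 4, 0.1
n_crit = n_0(K, eps)
# For every n > n_0 and every c, (20) holds, and hence so does (17):
for c in range(K):
    assert rhs20(c, K, n_crit + 1) < eps / (1 - eps)
```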
The acceptability of a general sentence of the language L_k can now be defined in the following way:

(D.Ac) Ac(h, e) =df (i) P(h, e) > 1 − ε, where 0 < ε ≤ 0.5, and
(ii) n > n_0.

(D.Ac) is a probabilistic definition of acceptability, but it is not purely probabilistic. According to part (ii) of (D.Ac), the number n of individuals included in the sample described by e must exceed n_0 in order for e to make h acceptable. Our definition (D.Ac) is easily shown to satisfy Hempel's conditions (CA1) and (CA2). The proof is as follows: Because of (D5), part (ii) of (D.Ac) is equivalent to (21). (21) entails (17), and therefore (D.Ac) entails (17). Moreover, because of (16), (17) entails
Σ_{w>c} P(C_w, e) < ε.

Because we have assumed that Σ_{w>c} P(C_w, e) ...

... (i) P(h, e) > 1 − ε,
(ii) there is a constituent C_w, 0 ≤ w ≤ K, such that P(C_w, e) > 1 − ε.
Rule (D.Ac+) would satisfy Hempel's conditions just as well as (D.Ac). In the case α = 0, (D.Ac) and (D.Ac+) are equivalent. But if α is large enough to make (31) true, it is possible that a hypothesis is acceptable although n is very small or zero. In this case the hypothesis in question would be acceptable because of the high a priori probability of C_K. However, (D.Ac+) will lead to implausible and unreasonable consequences. If a constituent has a high degree of confirmation on the basis of a very small number of observations, the constituent in question can be only C_K. In other words, only C_K could according to (D.Ac+) be acceptable if evidence e contains few or no observations. C_K says that individuals of all the possible kinds exist in our universe. This implies that no contingent general law, i.e. no general implication (x)(R(x) ⊃ Q(x)) which is not logically true, holds in the universe in question. If a scientist accepted C_K when it has high probability, viz. when n is very small or zero, he would prior to any investigation accept the view that no general implications with factual content hold in our universe. Such a procedure would of course be eminently
pessimistic, because general laws often are just what scientists are looking for. Thus we have to reject the rule (D.Ac+).

8. So far we have considered only the acceptability of general hypotheses. According to the principle (CA1), all singular hypotheses that are substitution instances of acceptable generalizations are acceptable, too. In this section we shall inquire whether there are other kinds of singular propositions that can be accepted without contradicting Hempel's conditions. The singular sentence e that describes our evidence is of course trivially acceptable, because the acceptability of factual hypotheses is decided on the basis of e. In other words, all observational reports that are used in testing hypotheses are assumed to be true. Our simple language L_k forces us here into a considerable oversimplification, because it is not possible to discuss here the reliability of measurements and other questions concerning the acceptability of observation reports. This simplification is perhaps not too serious, however. When we speak of singular hypotheses, we are in the first place interested in singular propositions concerning unobserved individuals. Let us consider a singular hypothesis "A(a_i)", i.e. "An unobserved individual a_i is A", where A is an arbitrary primitive or complex predicate of the language L_k. Any predicate A of L_k can be transformed into a disjunction of attributive constituents. We shall call such a disjunction "the normal form of A". The number of Ct-predicates that occur in the normal form of A is called "the logical width of the predicate A". If "A(a_i)" is a substitution instance of an acceptable generalization, i.e. if "(x)A(x)" is acceptable, the normal form of A contains at least all the attributive constituents Ct_i1, Ct_i2, ..., Ct_ic that are already exemplified in the evidence described by e, and possibly other Ct-predicates as well.
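These notions are purely combinatorial and can be made concrete in a short sketch: with k primitive predicates there are K = 2^k Ct-predicates, one for each way of affirming or negating every primitive predicate; a predicate A corresponds to the set of Ct-predicates entailing it, and the size of that set is its logical width. The encoding below is only an illustration, not part of the authors' apparatus:

```python
from itertools import product

k = 2                                          # primitive predicates P1, P2
cts = list(product([True, False], repeat=k))   # the K = 2**k Ct-predicates

def normal_form(pred):
    """The Ct-predicates (sign patterns for P1, ..., Pk) whose conjunctions
    entail pred; their disjunction is the distributive normal form of pred."""
    return [ct for ct in cts if pred(*ct)]

# Example: A(x) = P1(x) v P2(x) is a disjunction of three Ct-predicates,
# so its logical width is 3 (out of the maximum K = 4).
assert len(cts) == 2 ** k
assert len(normal_form(lambda p1, p2: p1 or p2)) == 3
```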
The probability that an unobserved individual a_i will exemplify an attributive constituent Ct_ij, given the evidence e, is according to Bayes' formula

P(Ct_ij(a_i), e) = Σ_j P(C_j) P[(Ct_ij(a_i) & e), C_j] / Σ_j P(C_j) P(e, C_j),  (32)

where the sums include all constituents C_j compatible with e. In the system of inductive logic with α = 0 and N = ∞ which was described above, the probability that a_i will exemplify an attributive constituent Ct_ij (1 ≤ j ≤ c) that is already exemplified in the evidence e is according to (32)
P(Ct_ij(a_i), e) = Σ_{i=0}^{K−c} (K−c choose i) (1/(c+i))^{n+1} / Σ_{i=0}^{K−c} (K−c choose i) (1/(c+i))^n.  (33)
Accordingly, the probability that a_i will exemplify the predicate A_c(x) = Ct_i1(x) ∨ Ct_i2(x) ∨ ... ∨ Ct_ic(x) is

P(A_c(a_i), e) = c · Σ_{i=0}^{K−c} (K−c choose i) (1/(c+i))^{n+1} / Σ_{i=0}^{K−c} (K−c choose i) (1/(c+i))^n.  (34)
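Formulas (33) and (34) can also be verified numerically; the following sketch checks their limiting behaviour for large n (the values K = 4, c = 2, n = 200 are illustrative only):

```python
from math import comb

def p_old_cell(c, K, n):
    # (33): probability that the next individual exemplifies one particular
    # Ct-predicate already exemplified in the evidence (alpha = 0, N infinite)
    num = sum(comb(K - c, i) * (1 / (c + i)) ** (n + 1)
              for i in range(K - c + 1))
    den = sum(comb(K - c, i) * (1 / (c + i)) ** n
              for i in range(K - c + 1))
    return num / den

K, c, n = 4, 2, 200
p = p_old_cell(c, K, n)
assert abs(p - 1 / c) < 1e-9       # (33) is approximately 1/c for large n
assert abs(c * p - 1) < 1e-9       # (34) = c * (33) approaches one
```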
It is easy to see that for large values of n the value of (33) is approximately 1/c. When n grows without limit, the value of (34) approaches one. Conversely, the probability that a_i will exemplify an attributive constituent different from the c attributive constituents already exemplified in the evidence e differs only very slightly from zero when n is large in comparison with K. Suppose now that "(x)A(x)" is an acceptable generalization. In this case the normal form of A includes the attributive constituents Ct_i1, Ct_i2, ..., Ct_ic plus possibly other attributive constituents as well. However, if the generalization in question is acceptable, the number n must be fairly large, and consequently the probability that a_i will exemplify an attributive constituent Ct_ij with j > c is negligible. Therefore we shall assume, for the sake of simplicity, that A(x) is logically equivalent to Ct_i1(x) ∨ Ct_i2(x) ∨ ... ∨ Ct_ic(x), and that the logical width of A is therefore equal to c. To indicate this we shall attach the subscript "c" to "A". In this case the degree of confirmation of the prediction "A_c(a_i)" is expressed by (34). If "(x)A_c(x)" is acceptable, we ought according to (CA1) to accept not only the prediction that a specified individual, say a_1, is A_c, but also the corresponding prediction concerning any unobserved individual a_i. Consequently, we ought according to (CA1) to accept the proposition that any number r of unexamined individuals will exemplify A_c. The probability of such a hypothesis is given by
P[A_c(a_1) \& A_c(a_2) \& \cdots \& A_c(a_r), e] = c^r \sum_{i=0}^{K-c} \binom{K-c}{i} \left(\frac{1}{c+i}\right)^{n+r} \bigg/ \sum_{i=0}^{K-c} \binom{K-c}{i} \left(\frac{1}{c+i}\right)^{n} \qquad (35)

When r grows without limit, (35) approaches the value

1 \bigg/ \sum_{i=0}^{K-c} \binom{K-c}{i} \left(\frac{c}{c+i}\right)^{n} \qquad (36)

(36) is equal to P(C_c, e). In other words,

(L5) \lim_{r \to \infty} P[A_c(a_1) \& A_c(a_2) \& \cdots \& A_c(a_r), e] = P(C_c, e).
In fact, when r becomes infinite, (35) says just the same thing as C_c, viz. that there are in the whole universe only such kinds of individuals as are already
17
KNOWLEDGE, ACCEPTANCE, AND INDUCTIVE LOGIC
exemplified in the evidence e. If "(x)A_c(x)" is an acceptable generalization, P(C_c, e) is according to (D.Ac) and (L4) higher than 1 − ε. Therefore the acceptance of "A_c(a_i)" will not lead to the lottery paradox, because the conjunction of any number of such propositions always has a degree of confirmation higher than 1 − ε. What happens if "A(a_i)" is not a substitution instance of an acceptable generalization? We shall consider again a situation in which n is large enough to make (C.Ac) true. If "A(a_i)" is not a substitution instance of an acceptable generalization, although n is large, not all attributive constituents Ct_ij, 1 ≤ j ≤ c, occur in the normal form of A. Instead, A may be equivalent to, say, the disjunction of c − d such Ct-predicates. It is of course possible that the normal form of A contains also Ct-predicates not exemplified in the evidence e, but such Ct-predicates can be left out of consideration for the same reason as in the previous case. In other words, let us assume that the logical width of A is c − d, where 1 ≤ d < c, and A(a_i) is logically equivalent to Ct_{i1}(a_i) ∨ … ∨ Ct_{i_{c−d}}(a_i). The degree of confirmation of the prediction in question is
P(A_{c-d}(a_i), e) = (c-d) \sum_{i=0}^{K-c} \binom{K-c}{i} \left(\frac{1}{c+i}\right)^{n+1} \bigg/ \sum_{i=0}^{K-c} \binom{K-c}{i} \left(\frac{1}{c+i}\right)^{n} \qquad (37)
Given suitable values of ε, c and d, it is possible that

P(A_{c-d}(a_i), e) > 1 - \varepsilon. \qquad (38)
In other words, it is possible that a singular hypothesis may have a very high degree of confirmation although it is not a substitution instance of an acceptable generalization. (38) may hold again, not only for a specific individual, say a_1, but for any unobserved individual. However, the degree of confirmation of an arbitrarily long conjunction of such singular hypotheses is never higher than 1 − ε. The probability of such an r-termed conjunction is
P[A_{c-d}(a_1) \& A_{c-d}(a_2) \& \cdots \& A_{c-d}(a_r), e] = (c-d)^r \sum_{i=0}^{K-c} \binom{K-c}{i} \left(\frac{1}{c+i}\right)^{n+r} \bigg/ \sum_{i=0}^{K-c} \binom{K-c}{i} \left(\frac{1}{c+i}\right)^{n} \qquad (39)
(39) is easily seen to approach zero when r grows without limit. The acceptance of A_{c−d}(a_i) will give rise to the lottery paradox, because it is possible that

(40.1) P(A_{c-d}(a_1), e) > 1 - \varepsilon,
(40.2) P(A_{c-d}(a_2), e) > 1 - \varepsilon,
  ⋮
(40.r) P(A_{c-d}(a_r), e) > 1 - \varepsilon,
but, however,

(41) P[A_{c-d}(a_1) \& A_{c-d}(a_2) \& \cdots \& A_{c-d}(a_r), e] < 1 - \varepsilon.

In fact, (39) shows that the value of (41) may be pushed arbitrarily close to zero by choosing r large enough. The formulas (40.1)-(41) violate Hempel's conditions. Since similar results are forthcoming always when A(a_i) is not a substitution instance of an acceptable generalization, we are led to define the acceptability of singular hypotheses in the following way: (D.Ac.sing) A singular hypothesis A(a_i) is acceptable if and only if the generalization (x)A(x) is acceptable. The system of inductive logic which was used in the preceding argument is not particularly well-suited for the study of the probabilities of singular predictions. As formula (33) shows, it does not take into account the observed relative frequencies of the different kinds of individuals. It makes only the rough distinction whether there are observed individuals that exemplify a given attributive constituent or not. The degree of confirmation of the prediction "Ct_ij(a_i)" is thus independent of the number of individuals exemplifying Ct_ij in the evidence e, provided that this number is ≥ 1. In the system of inductive logic in question, one does not learn anything from observed relative frequencies. Our principal results are, however, independent of the weakness just mentioned. Hintikka [1965c] has sketched another system in which these shortcomings are corrected. In this system, the degree of confirmation of a singular prediction "Ct_ij(a_i)" depends on the number n_j of individuals that have exemplified Ct_ij in the evidence e. Also in this system the satisfiability of Hempel's conditions (CA1) and (CA2) can be warranted only by the definition (D.Ac.sing) 20. The definition (D.Ac.sing) shows that Hempel's conditions are very strong indeed. Their strength does not, however, seem quite unreasonable from the point of view of the concept of knowledge. Because (D.Ac.sing) satisfies Hempel's conditions (CA1) and (CA2), it also satisfies the corresponding epistemic principles (CK1) and (CK2).
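The lottery-paradox situation can be made concrete with a small computation. The parameter values below (K = 10, c = 8, d = 1, n = 100, ε = 0.2) are hypothetical choices of ours, picked only so that (38) holds for each singular prediction while the conjunction (39) collapses.

```python
from math import comb

def s(K, c, n):
    # sum_{i=0}^{K-c} C(K-c, i) * (1/(c+i))**n, the common sum of (33)-(39)
    return sum(comb(K - c, i) * (1.0 / (c + i)) ** n for i in range(K - c + 1))

def p37(K, c, d, n):
    # (37): degree of confirmation of a single prediction A_{c-d}(a_i)
    return (c - d) * s(K, c, n + 1) / s(K, c, n)

def p39(K, c, d, n, r):
    # (39): degree of confirmation of the r-termed conjunction
    return (c - d) ** r * s(K, c, n + r) / s(K, c, n)

K, c, d, n, eps = 10, 8, 1, 100, 0.2
print(p37(K, c, d, n) > 1 - eps)        # each singular prediction passes 1 - eps
for r in (1, 10, 100):
    print(r, p39(K, c, d, n, r))        # but the conjunction tends to zero
```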
According to (D2) and (D.Ac.sing), it is illegitimate to use the expression "knowledge" of singular propositions concerning unobserved events unless they can be deduced from acceptable generalizations, a doctrine well-known from the history of philosophy. The concept of knowledge defined by (D2), (D.Ac) and (D.Ac.sing) recalls e.g.

20 For this system, see Hintikka [1965c] pp. 21-30.
the Aristotelian dictum that one can only have knowledge of the general 21.
9. Our results concerning the acceptability of hypotheses are established for a simple language L_k only. Is it possible to generalize these results to richer languages? It is possible to characterize and define attributive constituents, constituents, and distributive normal forms in the whole first-order logic, and in suitable higher-order logics. However, it is much more difficult to extend the principles of inductive logic which we have used in such a way that we can see clearly their consequences. It is especially difficult to construct for a language covering the whole first-order logic a system of inductive logic that would assign probabilities to general sentences in a reasonable way. Independently of the richness of the language considered, if it were possible to show in the language in question that the degree of confirmation of one constituent asymptotically approaches one when more and more individuals are examined, results similar to those obtained in the present paper would follow. This is, however, very questionable as regards the full first-order logic. If there were a finite number of constituents such that the sum of their degrees of confirmation approaches one when n grows, it would also be possible to obtain positive results comparable to those obtained in this paper. In this case, however, the definition of acceptability would be somewhat more complicated than (D.Ac) or (D.Ac.sing). The generalizability of our results thus remains an open question. To some extent even the relevant questions remain unasked. Up to the present time it has been possible to deal with inductive generalizations in languages more complex than L_k by means of rules of acceptance, but not by means of a quantitative concept of confirmation 22. Therefore the problem of the relation between probability and acceptability simply has not yet come up in the case of richer languages 23.
21 This does not of course imply that there are no important differences between the concept of knowledge defined by means of (D2), (D.Ac) and (D.Ac.sing) and the Aristotelian conception of knowledge.
22 For systems of inductive logic based on rules of acceptance, see e.g. Kemeny [1953] and Putnam [1963].
23 This study has been supported by a grant from the Finnish State Council for Humanities (Valtion humanistinen toimikunta). The contribution of the senior author has been facilitated by a Finnish State Fellowship (Valtion apuraha varttuneille tieteenharjoittajille). The work has been carried out independently by the junior author on the basis of suggestions from the senior author.
References
CARNAP, R., 1950, The logical foundations of probability (University of Chicago Press, Chicago; second edition, 1963)
CARNAP, R., 1952, The continuum of inductive methods (University of Chicago Press, Chicago)
CARNAP, R., 1962, The aim of inductive logic, in: Logic, Methodology, and Philosophy of Science, eds. Nagel, Suppes and Tarski (Stanford University Press, Stanford, California) pp. 303-318
CHISHOLM, R. M., 1957, Perceiving: A philosophical study (Cornell University Press, Ithaca, N.Y.)
HEIDELBERGER, H., 1963, Knowledge, certainty, and probability, Inquiry, vol. 6, pp. 245-255
HEMPEL, C. G., 1962, Deductive-nomological versus statistical explanation, in: Minnesota Studies in the Philosophy of Science, vol. 3, eds. H. Feigl and G. Maxwell (University of Minnesota Press, Minneapolis, Minnesota) pp. 98-169
HINTIKKA, J., 1953, Distributive normal forms in the calculus of predicates, Acta Philosophica Fennica, vol. 6
HINTIKKA, J., 1962, Knowledge and belief (Cornell University Press, Ithaca, N.Y.)
HINTIKKA, J., 1965a, Towards a theory of inductive generalization, in: Proc. 1964 Intern. Congress for Logic, Methodology, and Philosophy of Science, ed. Y. Bar-Hillel (North-Holland Publ. Co., Amsterdam) pp. 274-288
HINTIKKA, J., 1965b, Distributive normal forms in first-order logic, in: Formal Systems and Recursive Functions, Proc. Eighth Logic Colloquium, Oxford 1963, eds. J. N. Crossley and M. A. E. Dummett (North-Holland Publ. Co., Amsterdam) pp. 47-90
HINTIKKA, J., 1965c, On a combined system of inductive logic, in: Studia logico-mathematica et philosophica in honorem Rolf Nevanlinna, Acta Philosophica Fennica, vol. 18, pp. 21-30
JEFFREY, R. C., 1956, Valuation and acceptance of scientific hypotheses, Philosophy of Science, vol. 23, pp. 237-246
KEMENY, J. G., 1953, The use of simplicity in induction, Philosophical Review, vol. 62, pp. 391-408
KYBURG, H. E., 1961, Probability and the logic of rational belief (Wesleyan University Press, Middletown, Conn.)
KYBURG, H. E., 1965, Probability, rationality and a rule of detachment, in: Proc. 1964 Intern. Congress for Logic, Methodology, and Philosophy of Science, ed. Y. Bar-Hillel (North-Holland Publ. Co., Amsterdam) pp. 301-310
LEHRER, K., 1964, Knowledge and probability, The Journal of Philosophy, vol. 61, pp. 368-372
MARTIN, R. M., 1963, Intension and decision (Prentice-Hall, Englewood Cliffs, N.J.)
POPPER, K. R., 1959, The logic of scientific discovery (Hutchinson, London)
PUTNAM, H., 1963, Degree of confirmation and inductive logic, in: The Philosophy of Rudolf Carnap, ed. P. A. Schilpp (Open Court Publ. Co., La Salle, Illinois) pp. 761-783
SCHICK, F., 1963, Consistency and rationality, The Journal of Philosophy, vol. 60, pp. 5-19
SLEIGH, R. C., 1964, A note on knowledge and probability, The Journal of Philosophy, vol. 61, p. 478
CONCEPT FORMATION AND BAYESIAN DECISIONS*
PATRICK SUPPES
Stanford University, Stanford, California
1. Introduction. The primary aim of this paper is to examine and develop some relations between decision theory and recent work on concept formation by learning theorists. Some of the ground rules of this investigation perhaps need to be stated at the very beginning. Let me first try to make clear how I conceive in a general way the relation between decisions and concept formation. If we examine the structure of decision theory as expounded, for example, in the excellent book of Savage [1954], we find that there is really no place for the formation of new concepts by the decision maker. The theory is conceived in such a way that the decision maker has a probability distribution over all possible states of the world and a utility function over all possible future histories of the universe. As observations are made or experiments performed by the decision maker, the information received is brought into his formal decision framework by appropriate modifications of initial probabilities as new conditional probabilities. The important thing I wish to emphasize is that the theory provides no place for the decision maker to acquire a new concept on the basis of new information received. The theory is static in the sense that it is assumed the decision maker has a fixed conceptual apparatus available to him throughout time. There are, it seems to me, two important ways in which concept formation enters into the making of actual decisions. The first kind of modification in the decision structure that may be introduced by concept formation is a relatively straightforward refinement or at least modification of the initial partition of the possible states of nature by the consideration of additional concepts. The consideration of these additional concepts is almost always brought about by the reception of a cue or stimulus resulting from some new observation.
The essential thing however in this kind of modification is that the concepts newly introduced are already a part of the conceptual apparatus of the decision maker. It is just that he has not been using them to partition the space of the states of nature until a particularly critical, or new, sort of observation was obtained. Because the new concept brought into focus is actually one already known to the decision maker, it is becoming customary in the psychological literature to call this process concept identification, rather than concept formation, and we shall so in fact refer to it here. Advocates like de Finetti and Savage of Bayesian theory would indeed claim that this first kind of concept formation is already taken care of by considering all the possible states of the world and all possible future histories. From a theoretical standpoint, or at least one theoretical standpoint, there is indeed a good argument to back up this claim. Yet from a more behavioristic viewpoint it is quite unrealistic, for no actual decision maker is able in any genuine way to define an a priori distribution over all possible states of the world or a utility function over all future histories. His powers of discrimination and analysis, even in terms of the empirical data available to him, are inadequate to this task. In actual practice the decision maker is always operating with what Savage has termed a small-world situation. The decision maker operates with a fairly small number of concepts and the partition of the possible states of nature generated by these concepts. I want to emphasize that it is not necessary that the partition itself be finite, for some of the concepts may be conceived by the experimenter as being measured on a continuum. The crucial thing is that the concept space is always finite-dimensional and in fact, the number of dimensions is a relatively small integer. The second way in which concept formation modifies the decision structure is the genuine case of concept formation proper. In this instance the decision maker actually forms a concept he did not previously have in his repertoire.

* The work on this paper was supported by a contract between ARPA, U.S. Department of Defense and the System Development Corporation.
Numerous examples of this kind of concept formation are to be found in the learning experience of anyone. It may be rightly claimed that this kind of concept formation is essential to any major advance in science or technology. In the next section I turn to some simple examples of concept identification and attempt to show how they disturb the simple Bayesian picture of decision making. In the following section I consider some common problems besetting Bayesian and stimulus-sampling learning models. In the final section I sketch a possible line of attack on the structural or combinatorial problems facing any theory of concept formation. I also try to show that Bayesian considerations are not central to the most pressing problems of the theory of concept formation, and that no theory of complex problem solving is possible without an approximate solution to these problems.
Before embarking on these somewhat detailed considerations, I would like to indicate in a general way how the subject matter of this paper relates to the more standard literature of inductive logic. The most important and also the most subtle point centers around the conception of rational behavior back of the general criteria used to evaluate an inductive logic or procedure. In the case of deductive logic the response is simple and clear. The criteria of soundness and completeness make no allowance for an imperfect or limited knower. The inattention to the obvious finite capacity of any actual knower is a simplifying abstraction that makes the mathematical theory of deductive inference a manageable subject in the tradition of classical mathematics. To a large extent the same sort of simplifying abstraction has been assumed in inductive logic, but with an important difference. No adequate inductive criteria corresponding to the deductive criteria of soundness and completeness are as yet available. Bayesian decision theory provides a possible answer, but certainly not one that is as yet uniformly acceptable. In my own judgment the problem of finding such criteria in inductive logic is not as interesting as in deductive logic, because the finite capacity of the learner (when talking about induction it seems more natural to speak of a learner rather than a knower) is central to the fundamental problem of making an induction from a finite sample. Put another way, problems of induction seem continually to run up against massive combinatorial problems that do not play the same essential role in deduction. And once we begin to talk about, say, 10 (2)2 possibilities it is natural to ask about the kind of learner that is going to "look over" possibilities whose number is of this order of magnitude, and then to try to inject some semirealism into the discussion.
Now that it is generally recognized that even the biggest conceivable computers could not attack by brute force methods the combinatorial problems of playing a winning game of chess, for example, the crucial role of concept formation in providing a powerful method of introducing new and necessary structure is more easily made apparent. In addition, the continued concern in the literature of inductive logic with overly simplified, unrealistic problems suggests that there is a useful place for an explicit analysis of why even the relatively powerful Bayesian methods of induction are far too weak to solve most complex problems. From a more general standpoint, then, an objective of this paper is to make a contribution to the analysis of the concept of rationality. The discussions of rationality in the literature of induction or ethics seem to have largely ignored the difficult problems of concept formation that must be faced by any agent that does not have an unlimited memory and unlimited powers of analysis.
2. Some simple examples of concept identification. To illustrate some of the comparisons I want to draw between a Bayesian approach to information processing and decision making, on the one hand, and psychological models of behavior on the other, I shall begin with an experiment that is really too simple to be described as a concept experiment, but because of its very simplicity will be a satisfactory paradigm for the making of certain initial distinctions. I have in mind a simple paired-associate experiment. The task for the subject is to learn to associate each one of a list of nonsense syllables with an appropriate response. In a typical setup the list might consist of twenty nonsense syllables of the form CVC. The responses are given by pressing one of two keys. On a random basis ten of the syllables are assigned to key 1 and ten to key 2. The subject is shown each nonsense syllable in turn, is asked to make a response, and is then shown the correct response by one of several devices, for example, by the illumination of a small light above the correct key. After the subject has proceeded through the list once, he is taken through the list a second time but the order of presentation of the twenty items is randomized. A criterion of learning is set, for example, four times through the list without a mistake. The subject is asked to continue to respond until he satisfies this criterion. The criterion is selected so as to give substantial evidence that the subject has indeed learned the correct association of each stimulus item and its appropriate response; at least this language of association is the one ordinarily used by many psychologists concerned with this type of experiment. Let me describe two simple psychological models for this experiment before discussing the obvious Bayesian model and its defects. The simple stimulus-association model to be applied to the phenomena is the following.
The subject begins the experiment by not knowing the arbitrary association established by the experimenter between individual stimuli and the response keys. He is thus in the unconditioned state U. On each trial there is a constant probability c that he will pass from the unconditioned state to the conditioned state C. It is postulated that this probability c is constant over trials and independent of responses on preceding trials. Once the subject has passed into the conditioned state it is also postulated that he remains there for the balance of the experiment. A simple transition matrix for the model, which is a first-order Markov chain with two states U and C, is the following (rows give the present state, columns the state on the next trial):

          C       U
   C      1       0
   U      c     1 - c
To complete the model for the analysis of experimental data it is also necessary to state what the probabilities of response are in the two states U and C. When the subject is in the unconditioned state, it is postulated that there is a guessing probability p of making a correct response, and that this guessing probability is independent of the trial number and the preceding pattern of responses. When the subject is in the conditioned state, the probability of making a correct response is postulated to be 1. The most striking psychological aspect of the stimulus-association model just described is the all-or-none character it postulates for the learning process. The organism responds with a constant guessing probability until the correct conditioning association is established on an all-or-none basis. From that point on he responds correctly with probability 1. This means that for an individual subject the individual learning curve has the following simple appearance.
[Figure: probability of a correct response plotted against trials for an individual subject; the curve is flat at the guessing level p until conditioning occurs, then jumps to 1.]
The important thing to note about this curve is that it is perfectly flat until conditioning occurs and at that point there is a strong discontinuity. The second psychological model is a linear-incremental model that postulates that the probability of making a correct response increases each time the subject is exposed to the stimulus and is shown the correct response. Let p_n be the probability of a correct response on trial n, and let q_n = 1 − p_n, that is, let q_n be the probability of an incorrect response or error on trial n. The simplest way of formulating this model is in terms of q_n. It is postulated that the following recursion will describe the course of learning: q_{n+1} = αq_n. This linear model can be put within the framework of stimulus-association theory in a rather simple way. Instead of postulating that a single stimulus is being sampled and conditioned in connection with each nonsense syllable displayed, it may be postulated that there are a large number of stimuli being sampled and conditioned. These are simple and reasonable assumptions
about sampling and conditioning. As the number of stimuli becomes quite large, the linear model emerges as an asymptotic limit. (For a detailed derivation of the linear model from stimulus-sampling and conditioning assumptions, see Estes and Suppes [1959].) The learning curve postulated for an individual subject by the linear model looks something like the following.
[Figure: probability of a correct response plotted against trials for an individual subject; a smooth curve rising gradually from the initial level p toward 1.]
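The all-or-none process can be simulated directly, and the simulation also exhibits the agreement of the mean learning curve with the linear model once α is identified with 1 − c. The parameter values below (c = 0.3, p = 0.5, 20000 simulated subjects) are arbitrary choices of ours.

```python
import random

def one_element_error_curve(c, p, n_trials, n_subjects, rng):
    """Mean error rate per trial under the all-or-none model."""
    errors = [0] * n_trials
    for _ in range(n_subjects):
        conditioned = False
        for t in range(n_trials):
            if not conditioned:
                if rng.random() >= p:          # wrong guess in state U
                    errors[t] += 1
                if rng.random() < c:           # all-or-none transition U -> C
                    conditioned = True
            # in state C the response is always correct: no error is added
    return [e / n_subjects for e in errors]

rng = random.Random(1966)
c, p, n_trials = 0.3, 0.5, 8
sim = one_element_error_curve(c, p, n_trials, 20000, rng)
# Mean error curve predicted by both models once alpha = 1 - c:
theory = [(1 - p) * (1 - c) ** t for t in range(n_trials)]
for q_sim, q_th in zip(sim, theory):
    print(round(q_sim, 3), round(q_th, 3))
```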
As is evident enough from the two theoretical learning curves for individual subjects predicted by the two models, there are quite sharp behavioral differences in the predictions of the one-element stimulus-association model and the linear-incremental model. On the other hand, it is worth noting that the matter of discriminating the two models must be approached with some care. For example, the mean learning curve obtained by averaging data over a group of subjects, or a group of subjects and a list of items as well, is precisely the same for the two models. In the linear model it would naturally be written:

q_{n+1} = α^n q_1. (1)

In the one-element stimulus-association model the same mean learning curve would naturally be written:

q_{n+1} = (1 − c)^n q_1. (2)

In estimating parameters from behavioral data it is natural to equate p_1 and p (or q_1 and q) and, on that basis, the estimate of α will simply be the same as the estimate of 1 − c; there is no behavioral difference between the two models in the prediction of the mean learning curve. On the other hand, perhaps the most striking difference between the two models can be obtained by looking at data prior to the last error, that is, we sum data over subjects and items, but we restrict that summation to response data occurring before
the last error on a given subject-item. When data are summed in this fashion, the one-element stimulus-association model predicts the discontinuous learning curve shown above for an individual subject, whereas the linear-incremental model predicts a smooth incremental learning curve. That the data from experiments of this kind favor very much the one-element stimulus-association models over the linear-incremental models has been shown by a number of experiments (see, e.g., Bower [1961]). Let us now attempt to apply Bayes' theorem in a correspondingly direct way to an analysis of the paired-associate experiment. Without any loss of generality we may restrict the analysis to a single item, that is, to the learning of a single association between a given nonsense syllable and the correct response. Let H_1 be the hypothesis that the correct response for the single syllable is response 1, and let H_2 be defined similarly. It is natural to assume that the a priori probabilities P(H_1) and P(H_2) are each a half. (As we shall see, for this simple situation the particular assumption made about the a priori probabilities is of no real importance.) In the present case, the "evidence events" are easy to describe and amount essentially to a complete confirmation of one of the two hypotheses. Let us define the evidence event E_i as the event of being shown that the nonsense syllable is associated with the response i. It should be clear how to define the conditional probability P(E_j | H_i), which is called the likelihood of H_i when E_j is observed. In the present simple case the likelihoods must either be 0 or 1. The likelihoods are 1 when i = j, and 0 when i ≠ j. We may then compute the a posteriori probabilities P(H_i | E_j) according to the usual Bayes formula:

P(H_i | E_j) = P(E_j | H_i)P(H_i) / [P(E_j | H_1)P(H_1) + P(E_j | H_2)P(H_2)].
Again in the present case the computation of these a posteriori probabilities is simple and immediate. If i = j the a posteriori probability is 1 and if i ≠ j the a posteriori probability is 0. Thus,

P(H_1 | E_1) = 1, P(H_1 | E_2) = 0,
P(H_2 | E_1) = 0, P(H_2 | E_2) = 1.
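Written out as a small computation (the function name and the encoding of the 0/1 likelihoods are ours), the update runs as follows; whatever prior in the open interval (0, 1) is chosen, the posterior is driven to 1 or 0.

```python
def posterior_h1(prior_h1, j):
    """P(H_1 | E_j) by Bayes' formula, with the 0/1 likelihoods of the text."""
    p_e_given_h1 = 1.0 if j == 1 else 0.0   # P(E_j | H_1)
    p_e_given_h2 = 1.0 if j == 2 else 0.0   # P(E_j | H_2)
    num = p_e_given_h1 * prior_h1
    den = num + p_e_given_h2 * (1.0 - prior_h1)
    return num / den

print(posterior_h1(0.5, 1), posterior_h1(0.9, 1))   # 1.0 1.0
print(posterior_h1(0.5, 2), posterior_h1(0.2, 2))   # 0.0 0.0
```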
Note that the results are independent of the a priori probabilities P(H_1) and P(H_2) as long as these probabilities are in the open interval (0, 1). In the present case, then, how is the application of a Bayesian approach related to the two psychological models already sketched for the learning process? The answer I think is obvious. The one-element stimulus-association model yields exactly the same predictions as the Bayes model, if it is assumed that the conditioning parameter c has the value 1, that is, if it is assumed that the subject always learns in one trial a correct association between nonsense syllable and response key. There are too many experiments to need detailed citations here to show that the assumption that c = 1 is not a reasonable one for paired-associate experiments. There is no doubt that the Bayesian model does not provide a good account of actual behavior in these experiments. Its generalization in the form of the one-element model is much more satisfactory. Advocates of a Bayesian approach as the first approximation to actual behavior will be quick to retort that they have in mind the application of the model to situations in which the subject can utilize his full resources of memory and reasoning. Some may wish to point out that indeed c would equal 1 if the subject were permitted to use pencil and paper in the course of the experiment, and simply to write down the correct association between the stimulus and response once it has been shown to him. However, any serious consideration of the general purpose and intention of such a paired-associate experiment quickly shows that this defense of the Bayesian approach as an explanatory model of actual behavior is not really satisfactory at all. The paired-associate experiment is defined and set up in the manner that it is in order to provide an extremely simple paradigm of learning. The simplicity of that paradigm is destroyed once a subject is permitted such recording devices as a pencil and paper.
A scientific hope of such experiments is that an adequate fundamental theory of the learning process can be developed for learning stripped of complicated processes of memory, association and reasoning that are utilized in everyday decisions. If the fundamental theory is genuinely correct, then it will lead ultimately to extensions to more complicated situations including the sort in which the learning problem confronting the subject is not one that he can trivialize by the use of some additional simple devices. Some of my subsequent examples will in fact be instances of this kind of experiment. A second kind of objection that might be offered by Bayesians to a comparison of the three models is that the real purpose of the Bayesian approach is to prescribe a normative course of behavior and not describe actual
behavior. In spite of the persuasiveness of the important distinction between normative and descriptive theories, this argument is too facile by half. The kind of situations which decision makers are continually confronted with is precisely the kind of situation in which we place the subject. The subject faced with the paired-associate problem could, if he were given paper and pencil, readily and quickly solve the problem, but the point is that he is not given these additional aids. The decision maker, whether it be an executive faced with a major policy decision, a logistics expert deciding on the next quarter's inventory, or a legislator deciding on how to present a crucial and controversial bill, is in a situation analogous to that of our subject, for the complexities of the one correspond to the simple restrictions of the other. In certain cases, given unlimited budget for computing purposes and unlimited staff to furnish scientific information, it might be possible for the decision maker to act rather completely like a Bayesian strategist. When this is not possible, as it usually is not, the decision maker must make a large number of rough and ready judgments that do not easily fit within the frame of a detailed normative theory. Indeed, a primary aim of this paper is to show by the consideration of several simple examples that the attempt to structure the decision-making process entirely within the Bayesian framework will lead to serious miscalculations about actual performance and, in certain cases, to bad advice on normative performance. I now turn to a first simple example of concept identification. Let us suppose that a subject is to be shown triangles of various sizes, and let us also suppose that the instructions are meant to bias him in the direction of paying attention to size only. We tell him that he is to classify the triangles primarily according to size into Class A or Class B.
In actual fact, in addition to having triangles of three different areas, each triangle will have the property of having an angle less than 15° or no angle less than 22½°. Let us call the three sizes a, b and c and the two angle properties s and t. Suppose we fix that Class A will consist of the triangles with the combinations as and bt. Class B will then consist of the complement of Class A, that is, of the combinations at, bs, cs and ct. I have picked the angle property because it is a feature of triangles that does not have much saliency for untrained subjects, whereas size is ordinarily a highly salient property. With these instructions and in this situation it is quite probable that many subjects would have a Bayesian distribution of prior probabilities that is nonzero only on hypotheses about size. It does not really matter what specific prior probabilities we assume on hypotheses about size. Eliminating the hypothesis that all sizes belong in Class A and also the hypothesis of no sizes in Class A, there are six
size hypotheses remaining, which we may write, in terms of the sizes assigned to Class A, as Ha, Hb, Hab, Hc, Hac and Hbc. We may assume a positive probability for each of the six. It is also a condition of the experiment that the six possible types of triangles are presented by the experimenter on an equally likely basis. It is easy to see that the hypotheses Ha, Hb and Hab will each have a probability of two-thirds of being correct. The explicit situation is shown in the following table, where the entry 1 indicates a correct classification, and 0 an incorrect classification, under each of the six hypotheses for the six types of figures.
Hypotheses      Type of figure
                as    at    bs    bt    cs    ct
Class:          A     B     B     A     B     B

Ha              1     0     1     0     1     1
Hb              0     1     0     1     1     1
Hab             1     0     0     1     1     1
Hc              0     1     1     0     0     0
Hac             1     0     1     0     0     0
Hbc             0     1     0     1     0     0
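The table can be verified mechanically. The following Python sketch is mine, not part of the original text; it encodes the six figure types with their true classes and computes, for each size hypothesis, the fraction of the six equally likely types classified correctly:

```python
from fractions import Fraction

# The six figure types and their true classes (Class A = {as, bt}).
figures = {"as": "A", "at": "B", "bs": "B", "bt": "A", "cs": "B", "ct": "B"}

# A size hypothesis is given by the set of sizes assigned to Class A.
hypotheses = {"Ha": {"a"}, "Hb": {"b"}, "Hab": {"a", "b"},
              "Hc": {"c"}, "Hac": {"a", "c"}, "Hbc": {"b", "c"}}

def fraction_correct(sizes_in_A):
    """Fraction of the six equally likely figure types classified correctly
    when every figure whose size is in sizes_in_A is put in Class A."""
    hits = sum(("A" if fig[0] in sizes_in_A else "B") == cls
               for fig, cls in figures.items())
    return Fraction(hits, len(figures))

for name, sizes in hypotheses.items():
    print(name, fraction_correct(sizes))
```

Running it prints 2/3 for Ha, Hb and Hab and 1/3 for Hc, Hac and Hbc, in agreement with the table.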
In the table the four cases for which each of the three hypotheses Ha, Hb and Hab is correct are indicated. The remaining three hypotheses, namely, Hc, Hac and Hbc, will asymptotically each have a probability 0. Within the framework of the six hypotheses the subject can do no better than indifferently select among Ha, Hab and Hb. It is clear from this analysis that from the Bayesian standpoint any subject who begins with his entire prior distribution weighted on the six hypotheses concerned with size alone will not be able to solve the problem completely. It may of course be objected that the assumption made about prior probabilities is not a reasonable one. The issue is complicated and I do not mean to suggest that I think definitive arguments can be given in support of the kind of assumption made. There is, however, a certain amount of evidence, both from the behavior of subjects and from interrogation of them about their behavior, to show that in experiments in which an ultimately relevant concept has a very small degree of saliency, the subject begins initially by completely ignoring this concept or property. For such situations there would seem to be
only a Pickwickian sense in which a strictly positive distribution over hypotheses involving this concept can be postulated. Let us now consider how we would approach the analysis of the subject mastering the problem in terms of some of the ideas of concept formation that have been developed in the last couple of years. The theoretical account I shall give will be somewhat more elaborate than the models much tested in recent experiments on concept formation (as, for example, Bourne and Restle [1959], Bower and Trabasso [1964] and Suppes and Ginsberg [1962a], [1962b], [1963]). As the first stage of learning, let us assume that the subject, following the verbal cue given him by the experimenter, samples only the three size stimuli, which we have designated as a, b and c. Initially he does not know how each of these stimuli should be connected or associated with Class A or Class B. We may thus postulate that they are in the unconditioned state. When a stimulus presentation is given which permits a sampling of one of the three stimuli, then in terms of the correction procedure given, that is, the statement as to whether the figure shown belongs to Class A or to Class B, we may postulate a probability c that the size stimulus sampled will become conditioned to one of the two classes, that is, to one of the responses A or B. Notice that we are postulating initially that the subject samples with probability 1 whichever one of the size stimuli is available on a given trial. On this basis, the learning for stimulus c is particularly simple. We may just apply the one-element model described above for paired-associate learning. This stimulus starts in the unconditioned state, and with probability c on each occasion on which it is sampled it enters the conditioned state, in this case, conditioning to Class B or Response B.
When stimulus c is conditioned to Response B, on every occasion on which Response B is made on the presentation of this stimulus the classification proves to be correct, and therefore there are no grounds for the subject's changing or modifying this conditioning. In a complete sense, the conditioning of stimulus c exemplifies, as postulated here, the one-element model for paired-associate learning described above. (The use of 'c' to refer both to the stimulus and to its probability of conditioning should not be a source of confusion, as the specific reference intended is always clear from the context.) The situation is considerably more complicated for stimuli a and b. Half the time that stimulus a or b is sampled, the presented figure that has size property a or b is classified in Class A, and the other half of the time, on a random basis, in Class B. Intuitively, when the subject finds that he cannot use stimulus a or b to make a correct classification, he will be led to sample other
properties or aspects of the figure. Before he is led to make this additional sampling, there is often one strategy he will try. He may judge that his initial response connection for one of the stimuli was incorrect, and he will reverse the association. For example, if on the first occasion that stimulus b is sampled it turns out that the figure is classified as Class A, but on the second occasion that stimulus b is sampled the figure is put in Class B by the experimenter, he may reverse the association and not yet be led to sample other stimuli. Let us designate the probability of such a reversal of the association by r, and let us postulate that he will sample a new property with probability s when it turns out that the association that he has established is wrong. Extending the kind of assumptions that went into the development of the one-element model for paired-associate learning, we may postulate a four-state Markov process describing this stage of the subject's learning. He begins in state U, representing the fact that stimulus a, let us say, is unconditioned. We may pass from state U to either state A or state B, representing the two possible responses to which stimulus a may be conditioned. After reaching state A or B, he will on each trial on which a is sampled be incorrect with probability ½. The matrix is then constructed so as to postulate that with probability ½s he enters state N, the state in which he samples a new property, and with probability ½r he reverses the stimulus association from response A to response B or vice versa as the case may be. It is of course a constraint of the model that r + s ≤ 1. The complete matrix then is as follows:
        N       B               A               U
N       1       0               0               0
B       ½s      1 − ½s − ½r     ½r              0
A       ½s      ½r              1 − ½s − ½r     0
U       0       ½c              ½c              1 − c
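The four-state chain just described can be written out and checked mechanically. The following Python sketch is illustrative only (the function name and the use of exact fractions are my own); rows and columns are in the order N, B, A, U:

```python
from fractions import Fraction

def transition_matrix(c, r, s):
    """Four-state chain of the text: from U the sampled stimulus becomes
    conditioned to A or B with probability c (split evenly); from A or B
    the subject is incorrect with probability 1/2, and then samples a new
    property (entering N) with probability s or reverses the bond with
    probability r.  Rows and columns are ordered N, B, A, U."""
    assert r + s <= 1  # the constraint of the model
    half = Fraction(1, 2)
    return [
        [1, 0, 0, 0],                                      # N is absorbing
        [half * s, 1 - half * s - half * r, half * r, 0],  # B
        [half * s, half * r, 1 - half * s - half * r, 0],  # A
        [0, half * c, half * c, 1 - c],                    # U
    ]

P = transition_matrix(c=Fraction(3, 4), r=Fraction(1, 4), s=Fraction(1, 2))
assert all(sum(row) == 1 for row in P)
```

Each row sums to one, and the first row exhibits N as the absorbing state.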
Note that state N is the absorbing state of this chain, because we are postulating that the subject will always be led, on the basis of his failure with the association established for the size stimuli a and b, to the sampling of a new property. It should also be remarked that this matrix represents the situation for stimulus b as well as for stimulus a. We are, in the fashion of paired-associate learning, postulating that the process of being led to state N when stimulus a is sampled is statistically independent of the process of being led to state N when stimulus b is sampled. In actual practice this assumption is probably slightly violated, but it makes the quantitative treatment of the concept identification considerably simpler, and is therefore,
a desirable feature of a first approximation. For fast learners we may postulate that s = 1 and c = 1 (so that, by the constraint r + s ≤ 1, r = 0). The matrix then assumes the following simple form:

        N       B       A       U
N       1       0       0       0
B       ½       ½       0       0
A       ½       0       ½       0
U       0       ½       ½       0
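For this fast-learner chain the expected number of sampled trials to absorption in N follows from a first-step analysis; the short calculation below is my own illustration, not part of the original:

```python
# Expected number of sampled trials to reach the absorbing state N in the
# fast-learner chain (s = 1, c = 1, hence r = 0), by first-step analysis.
def expected_steps():
    E_B = 1 / (1 - 0.5)              # from B (or A): enter N w.p. 1/2 on each trial
    E_U = 1 + 0.5 * E_B + 0.5 * E_B  # from U: one trial, then A or B (symmetric)
    return E_B, E_U

print(expected_steps())  # (2.0, 3.0)
```

So a fast learner needs on average two sampled trials from a conditioned state, three from the unconditioned state.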
In a given experiment more detailed knowledge may be obtained by looking at the actual sequence of presentations of figures and observing their classification. If, for example, the subject was always wrong in classifying a figure with stimulus a in initial trials, he could be in state N on the third trial on which a figure with size property a is presented. It may also be noted that we have postulated that in this concept-identification task the subject is learning only on trials on which he makes an error. Bower and Trabasso [1964] present impressive evidence that for the kind of experiment described here this is roughly the situation. I shall have more to say on this point later. Upon entering state N the subject is now in a position to sample a new property. Note the difference from the Bayesian formulation. Up to this point the probability of sampling any property other than a size stimulus has been 0. It is only due to the failure of the size stimuli to lead to the correct solution that the subject has been forced to change his initial distribution and sample other properties. Suppose for instance that the subject now samples stimuli connected with the orientation of the base of the triangle. We may suppose that the base of the triangle varies from the horizontal in three different angles, namely 0°, 15° and 30° (these numerical values are taken for purposes of illustration only). We shall also suppose that the occurrence of figures with these respective orientations is randomly assigned, independent of other characteristics, and that therefore any particular orientation will occur in Class A figures approximately half of the time and in Class B figures the other half. Various things can be postulated at this point. We can assume that the subject disregards size stimuli entirely and samples only orientation stimuli, or we can postulate instead that he samples a size-orientation pattern combining stimuli exhibiting both properties.
For many situations this latter pattern assumption has been shown to be a sound one. But whichever sampling procedure he adopts at this point, that is, concentration only on orientation or pattern sampling of orientation and size together, he will be led to the same results as before and will once again enter state N, and be required to select a new property for sampling.
Parenthetically it may be remarked, for those readers who fear that the process of concept identification as described here is too slow to describe what actually takes place, that it is not difficult to cite experiments in which a large number of trials is required by subjects to master what may appear to an experimenter, or to an observer with full knowledge of the situation, as absurdly simple identification problems. When the number of trials to complete mastery of the problem is on the order of a hundred, many opportunities are presented for sampling different properties of the stimulus display presented. There are two important factors I have ignored in this analysis but which would in all likelihood enhance the rate of learning or the rate of concept identification. One is the factor of memory. When a new property is sampled, in many cases it is sampled and rejected simply on the basis of its ability to account for correct classification of items already seen and whose classification is remembered. On the other hand, it is not an unrealistic assumption to suppose that the transition matrix described above is essentially the sort that is used in testing newly sampled properties from memory. The experimental difficulty of course is that it is not a simple or direct matter to elicit behavioral data giving evidence on this point. The second, related phenomenon that I have ignored is the undoubted fact that when a given property is being sampled and used as a basis of classification, it is often the case that simultaneously other properties are being sampled and silently rehearsed, meaning by this that their ability to classify correctly is being noticed even though they are not the properties used by the subject in making his classification on the given trial. Again it is not unreasonable to suppose that the process of rehearsal may be represented by a transition matrix very similar to the one given above.
There is considerable indirect experimental evidence of the efficacy of rehearsal from the standpoint of learning. Several experimental studies have shown the positive effects of an increased amount of study time on the rate of learning of paired-associates. As I understand the matter, no simple Bayesian approach to information processing and decision making would take explicit account of these two aspects of concept formation and learning, namely, the effects of memory and rehearsal. It should be emphasized that in another sense memory may be taken account of in Bayesian procedures. Modern empirical Bayes procedures have in many cases been developed on the assumption of a finite memory, but the kind of use of memory suggested here is of a different sort, namely, memory of what happened on preceding trials is used in a new way on trial n to check out the efficacy of a property not considered or sampled prior to
trial n. From the Bayesian standpoint the use of this property on trial n in terms of items from memory would require the assumption that the property or concept had a positive prior distribution on earlier trials. The relatively simple concept-identification problem we have been considering is already beyond the resources of the standard systems of inductive logic, because the subject in the experiment is not told what the relevant elementary properties are. Although in principle inductive logics of the Carnapian variety have a method for handling questions of relevance, in practice they do not deal with the kind of thing that arises in any concept-identification experiment when the subject is not told what is relevant; to pick a very simple example, the relevant aspects might turn out to be relations rather than properties. Also, methods of constructing such logics, as matters now stand, do not provide any guidelines for enumerating large sets of properties among which the relevant ones are likely to lie. Focusing on concept-identification experiments makes it possible to draw an important distinction between Bayesian theory and the Carnapian sort of inductive logic. To make the standard inductive logic apply, it is necessary to codify explicitly in the language of the logic all the evidence of past experience the subject considers pertinent to the experiment, but this I would claim is always a hopeless task. It is difficult enough to narrow the situation down to a manageable set of properties and relations, but it is humanly impossible to lay out all the evidence that went into the selection of this set and the beliefs held about its members. To put it in simplest terms, it is at the least the problem of having a limited, finite memory. The Bayesian approach, on the other hand, is not bedeviled by this difficulty, because past experience can be encoded in the a priori distribution over the selected set of properties and relations.
Once again, it is a question of a realistic conception of rationality. If we want to explicate the concept of rational human behavior, and not that of omniscient rational behavior, limitations on memory and computing power must be taken seriously. Taking such limitations seriously is of course imperative in attempting to apply an inductive logic. (The fact that these limitations are fundamental is why, within the domain of deductive logic, the theory of recursive functions is of quite restricted use in theorizing about or applying actual computers.) 3. Some common problems of Bayesian and stimulus-sampling models. The discussion of the last section to a certain extent overemphasizes the differences between Bayesian and stimulus-sampling models for decisions, particularly when the decisions involve concept identification. In the present section I
want to emphasize some of the commonality between the two kinds of models and to point out some of the problems that beset them both. A particular point of this section is to show that many of the differences often emphasized in discussions of the cognitive or Bayesian approach as opposed to the stimulus-response approach are differences primarily in terminology, and not so much in something that is sharply defined and empirically observable. A convenient place to begin is with the classical case of a two-choice problem with noncontingent reinforcement. The problem for the subject on each trial is to predict which one of two lights will flash. Using familiar notation, let us call E1 the reinforcing event corresponding to the flashing of the left light and E2 the reinforcing event corresponding to the flashing of the right light. The response that consists of predicting that the left light will flash is designated A1, and the response that consists of predicting that the right light will flash is designated A2. The noncontingency of the situation is defined by making the probability of an E1 reinforcement on each trial equal to π, and the probability of an E2 reinforcement 1 − π. It is understood that the events E1 and E2 are mutually exclusive and exhaustive, that is, on each trial exactly one of the two lights flashes, and the probability of which will flash is fixed by the parameter π. Everything that we have to say in what immediately follows applies, mutatis mutandis, to other more complicated reinforcement schedules, but the basic principles are precisely the same. Let us begin by considering some Bayesian models for this situation.
In the first place these Bayesian models shall be defined in terms of several sets of hypotheses, and we shall call an exhaustive set of hypotheses, that is, a set of hypotheses that covers every contingency, a strategy. It is understood that in the ordinary Bayesian terminology what I am now calling strategies would very often be called hypotheses, but the present language is suggestive of game-theoretic language, as well as of the kind of language that has been used by various people interested in cognitive models of the learning process. For simplicity let us begin with four hypotheses:

h1:  an E1 reinforcement is followed by an E1;
h1′: an E1 reinforcement is followed by an E2;
h2:  an E2 reinforcement is followed by an E2;
h2′: an E2 reinforcement is followed by an E1.

Given the above four hypotheses and the fact that h1 and h1′ (and h2 and h2′) are incompatible, a strategy for the subject consists of believing, or acting as if he believed, one of the following four pairs of hypotheses: (h1, h2),
(h1, h2′), (h1′, h2), (h1′, h2′). Thus the strategy (h1, h2) requires that an A1 response will be made if on the preceding trial an E1 reinforcement occurred, and an A2 response will be made if an E2 reinforcement occurred on the preceding trial. As is apparent from what has already been said, the four strategies correspond to the four hypotheses relevant in the sense of a Bayesian model. Granted only a positive a priori probability for each of the four strategies, it is clear what is the asymptotic prediction of the Bayesian model, that is, what the asymptotic a posteriori probabilities of the strategies will be. Namely,

P(h1, h2)  = π(1 − π),
P(h1, h2′) = π²,
P(h1′, h2) = (1 − π)²,
P(h1′, h2′) = (1 − π)π.
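These four values sum to one and, for π > ½, are maximized by the strategy (h1, h2′), which always predicts E1. A small Python sketch (the function name and the value π = 0.7 are mine, for illustration) makes the comparison explicit:

```python
def strategy_probabilities(pi):
    """Asymptotic probabilities of the four strategies, as given in the text."""
    q = 1 - pi
    return {"(h1,h2)": pi * q, "(h1,h2')": pi * pi,
            "(h1',h2)": q * q, "(h1',h2')": q * pi}

# For pi > 1/2 the dominant strategy is (h1, h2'), which always predicts E1.
probs = strategy_probabilities(0.7)
assert abs(sum(probs.values()) - 1) < 1e-12
assert max(probs, key=probs.get) == "(h1,h2')"
```

For π < ½ the same comparison selects (h1′, h2), the strategy that always predicts E2.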
The Bayesian decision-maker with unlimited memory will then choose strategy (h1, h2′) with probability 1 for π > ½. On the face of it this familiar Bayesian result, leading to selection of event E1 with probability 1, seems very much in conflict with the standard theoretical results obtained in stimulus-sampling theory, which predicts that an A1 response will be made as a prediction of an E1 reinforcement with asymptotic probability π; this is the well-known matching law, first formulated by W. K. Estes. From what has been said thus far it is easy enough to formulate the stimulus-sampling model with N stimulus elements in the noncontingent situation. On each trial the organism samples one stimulus. It becomes conditioned to the response that is reinforced with probability c, and with probability 1 − c its conditioning does not change. Among the N stimuli exactly one is sampled on each trial. This sampling takes place on a random basis, that is, there is a probability 1/N of any particular stimulus' being sampled, independent of what else may have occurred on past trials. When a stimulus is sampled, the response is made to which that stimulus is conditioned. In terms of these theoretical assumptions, the behavior of the subject may be defined in terms of the parameters c, N and π. This description of stimulus-sampling theory seems quite different from the Bayesian approach. I now want to show how closely related they actually are, and how easily a formal isomorphism between models of the two theories may be set up. To bring the two together let us first examine the Bayesian model under a highly restricted memory assumption. In particular,
let us suppose that the subject, although he is a Bayesian, is only able to remember what happened the last time a test of any particular hypothesis in his strategy was made. Whenever the outcome of this test is negative, he immediately changes his strategy by replacing the incorrect hypothesis by the correct one. For example, if his strategy is (h1, h2) and he finds on trial n that the E1 reinforcement occurring on trial n − 1 is followed by E2, he then immediately changes his strategy to (h1′, h2). In other words, he is making no use of any evidence concerning trial pairs for which an E1 reinforcement is followed by an E1 before trial n. His memory is of minimal length to use any of the evidence relevant to the hypotheses at all. The transition matrix for this Bayesian model with memory of length one is then the following:

             (h1, h2′)     (h1, h2)        (h1′, h2′)            (h1′, h2)
(h1, h2′)    π             (1 − π)²        π(1 − π)              0
(h1, h2)     π(1 − π)      1 − 2π(1 − π)   0                     π(1 − π)
(h1′, h2′)   π²            0               1 − (π² + (1 − π)²)   (1 − π)²
(h1′, h2)    0             π²              π(1 − π)              1 − π
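The matching law for this chain can be checked numerically. The following sketch (Python; the function names and the choice π = 0.7 are mine) iterates the transition matrix to its stationary distribution, which works out to (π², π(1 − π), π(1 − π), (1 − π)²), and recovers an asymptotic A1 probability of π:

```python
def bayes_memory_one_matrix(pi):
    """States in the order (h1,h2'), (h1,h2), (h1',h2'), (h1',h2)."""
    q = 1 - pi
    return [[pi,      q * q,          pi * q,              0.0],
            [pi * q,  1 - 2 * pi * q, 0.0,                 pi * q],
            [pi * pi, 0.0,            1 - pi * pi - q * q, q * q],
            [0.0,     pi * pi,        pi * q,              1 - pi]]

def stationary(P, steps=200):
    """Power iteration from the uniform distribution."""
    w = [0.25, 0.25, 0.25, 0.25]
    for _ in range(steps):
        w = [sum(w[i] * P[i][j] for i in range(4)) for j in range(4)]
    return w

pi = 0.7
w = stationary(bayes_memory_one_matrix(pi))
# A1 is always made in (h1,h2'), never in (h1',h2); in (h1,h2) it is made
# just when the remembered reinforcement was E1, in (h1',h2') just when E2.
p_A1 = w[0] + w[1] * pi + w[2] * (1 - pi)
print(round(p_A1, 6))  # 0.7, i.e. the matching law
```

The computed stationary distribution for π = 0.7 is approximately (0.49, 0.21, 0.21, 0.09), and the asymptotic A1 probability equals π.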
It is a simple matter to show that the asymptotic probability of an A1 response with this transition matrix and the states as defined above is that of the familiar matching law, namely, π. Let us now look at a formulation of the analog of the Bayesian model in terms of stimulus-sampling theory. As has already been indicated, its theoretical assumptions are formulated along the following lines. There is available a set of N stimuli which the subject has conditioned or associated with various possible responses. At the beginning of a trial the organism is in a certain state of conditioning. A set, possibly a proper subset of the N stimuli, is presented to him and he samples on a random basis, that is, with a uniform distribution. Exactly one of the stimuli is sampled, and then the organism responds in terms of the association bond that stimulus has with one of the possible responses (in case the stimulus sampled is not conditioned to any response, a guessing response is made of the kind already described in discussing paired-associate learning). After the response is made, the reinforcement is given, and in case the response made was incorrect, the stimulus sampled changes its conditioning with probability c to the response reinforced. To apply these postulates and develop a model corresponding to the Bayesian model, we shall assume there are exactly two stimuli. One of them is the E1 reinforcing event occurring on the preceding trial and the other is the
E2 event occurring on the preceding trial. Thus, on each trial the subject has available exactly one of the two stimuli to sample, and the whole sampling process is thereby trivialized. Quite apart from this identification of the two stimuli, on the assumption of two stimuli (i.e., N = 2), there are exactly four states of conditioning, corresponding to the four possible subsets of stimuli conditioned to the A1 response. The complement of each subset is the set of elements conditioned to the A2 response (in the present analysis we shall assume that on every trial each stimulus is conditioned to exactly one response and that therefore there are no unconditioned stimuli on any trial). Representing the states of conditioning by the subset of elements conditioned to A1, we then have the following notation for the four states: {s1, s2}, {s1}, {s2}, ∅, where ∅ designates the empty set. To show how the assumptions of stimulus-sampling theory are used to derive a transition matrix in terms of these four states, we may draw the tree of possibilities, with the probabilities for each branch shown, when we begin in a typical state, let us say {s2}.
[Tree diagram: beginning in state {s2}, the branches show the stimulus sampled (s1 with probability π, s2 with probability 1 − π), the response made, the reinforcement (E1 with probability π, E2 with probability 1 − π), and, when the response is incorrect, reconditioning with probability c or no change with probability 1 − c, together with the resulting end-of-trial states.]
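The branch-and-collect computation behind such a tree can be programmed once and reused for each starting state. In the sketch below (Python; the representation of a state as the set of stimuli conditioned to A1 follows the text, the function name and the numerical values are mine), enumerating the branches from {s2} reproduces one row of the transition matrix:

```python
from fractions import Fraction

def row_from_state(conditioned_to_A1, pi, c):
    """Enumerate the branches of the trial tree from one conditioning state:
    the sampled stimulus is s1 with probability pi (an E1 on the preceding
    trial) and s2 with probability 1 - pi; the current reinforcement is E1
    with probability pi; when the response is incorrect, the sampled
    stimulus is reconditioned with probability c.  Returns the transition
    probabilities to each reachable end-of-trial state."""
    out = {}
    for stim, p_stim in (("s1", pi), ("s2", 1 - pi)):
        response = "A1" if stim in conditioned_to_A1 else "A2"
        for reinf, p_reinf in (("E1", pi), ("E2", 1 - pi)):
            correct = (response == "A1") == (reinf == "E1")
            p = p_stim * p_reinf
            if correct:
                branches = [(conditioned_to_A1, p)]
            else:
                branches = [(conditioned_to_A1 ^ {stim}, p * c),
                            (conditioned_to_A1, p * (1 - c))]
            for state, prob in branches:
                key = frozenset(state)
                out[key] = out.get(key, 0) + prob
    return out

pi, c = Fraction(7, 10), Fraction(1, 2)
row = row_from_state({"s2"}, pi, c)
assert row[frozenset({"s1", "s2"})] == c * pi * pi
assert row[frozenset()] == c * (1 - pi) * (1 - pi)
```

Collecting terms for {s2} gives cπ² into {s1, s2}, c(1 − π)² into ∅, and the remaining mass on staying put, matching the corresponding row of the transition matrix in the text.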
By drawing trees for the other three states of conditioning, to show what possibilities may arise at the end of the trial for each of the three when we start from that state, we may, upon completing the trees, collect terms and obtain the following transition matrix:
           {s1, s2}       {s1}               {s2}                      ∅
{s1, s2}   1 − c(1 − π)   c(1 − π)²          cπ(1 − π)                 0
{s1}       cπ(1 − π)      1 − 2cπ(1 − π)     0                         cπ(1 − π)
{s2}       cπ²            0                  1 − c(π² + (1 − π)²)      c(1 − π)²
∅          0              cπ²                cπ(1 − π)                 1 − cπ
Casual inspection shows that this transition matrix is not the same as the one derived for the Bayesian models. On the other hand, if we let c = 1, we obtain the following special case of the stimulus-sampling model:

           {s1, s2}     {s1}            {s2}                    ∅
{s1, s2}   π            (1 − π)²        π(1 − π)                0
{s1}       π(1 − π)     1 − 2π(1 − π)   0                       π(1 − π)
{s2}       π²           0               1 − (π² + (1 − π)²)     (1 − π)²
∅          0            π²              π(1 − π)                1 − π
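That the special case c = 1 coincides entry by entry with the Bayesian memory-one matrix can be checked mechanically. The sketch below is illustrative only (Python with exact rational arithmetic; the function names, and the ordering of states following the correspondence used in the text, are mine):

```python
from fractions import Fraction

def stimulus_sampling_matrix(pi, c):
    """General transition matrix; states (subsets conditioned to A1) in the
    order {s1,s2}, {s1}, {s2}, empty set."""
    q = 1 - pi
    return [[1 - c * q,   c * q * q,          c * pi * q,                0],
            [c * pi * q,  1 - 2 * c * pi * q, 0,                         c * pi * q],
            [c * pi * pi, 0,                  1 - c * (pi * pi + q * q), c * q * q],
            [0,           c * pi * pi,        c * pi * q,                1 - c * pi]]

def bayes_memory_one_matrix(pi):
    """Bayesian memory-one matrix; states in the corresponding order
    (h1,h2'), (h1,h2), (h1',h2'), (h1',h2)."""
    q = 1 - pi
    return [[pi,       q * q,          pi * q,                0],
            [pi * q,   1 - 2 * pi * q, 0,                     pi * q],
            [pi * pi,  0,              1 - (pi * pi + q * q), q * q],
            [0,        pi * pi,        pi * q,                1 - pi]]

pi = Fraction(3, 10)
assert stimulus_sampling_matrix(pi, 1) == bayes_memory_one_matrix(pi)
```

With exact fractions the two matrices agree identically at c = 1; for c < 1 the stimulus-sampling matrix is the more general object.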
And it is immediately apparent that the entries in this matrix are the same as those for the Bayesian model. The identity between the two matrices also suggests, what should already have been apparent, the formal isomorphism between the states of the Bayesian model and the states of the stimulus-sampling model. The following correspondence

(h1, h2′) ↔ {s1, s2}
(h1, h2)  ↔ {s1}
(h1′, h2′) ↔ {s2}
(h1′, h2) ↔ ∅

may be used to establish the isomorphism, with the restriction that c = 1. Without this restriction the stimulus-sampling model is a slight generalization of the Bayesian one. It may be remarked that the stimulus-sampling model, with c estimated from experimental data, does not appear to fit data very well (for some results in this connection, see Suppes and Atkinson [1960], Ch. 10). Of course, many Bayesians would almost be pleased that the stimulus-sampling model did not fit well, for they could say "I would hardly expect the simple Bayesian model you have defined to provide any sort of decent fit to human
prediction data". The reply to this is straightforward. This same kind of Bayesian model may easily be extended to memories of finite length greater than 1, but immediately a common problem of the most essential sort for either the Bayesian or the stimulus-sampling models arises. The problem is that we very quickly find ourselves in a combinatorial jungle out of which it is not easy to find a path. Consider, for example, the Bayesian model with finite memory of length 4, and again let us concentrate only on the pattern of reinforcement, ignoring, although it is unrealistic, the pattern of preceding responses. For a finite memory of length 4 there will be 16 patterns of preceding reinforcements, and thus a strategy will consist of a 16-tuple telling the Bayesian what to do when each of the 16 patterns is realized in the preceding 4 trials. This means there are 2^(2^4), or 2^16, strategies to consider, and thus this many states in the associated Markov process. The stimulus-sampling model with the additional parameter c has the same sort of difficulty. It is tedious but not impossible to obtain some results for models with this number of states. As the number of states increases, it rapidly becomes more difficult. The generalized conditioning models applied to a variety of data in Suppes and Atkinson [1960], Suppes and Schlag-Rey [1962a] and Suppes and Schlag-Rey [1962b] may prove useful in examining in more detail the relationships between Bayesian and stimulus-sampling ideas. The essential idea of these models is to generalize the probability c of conditioning, letting the probability of conditioning depend upon preceding responses and reinforcements. For the kind of application discussed particularly in Suppes and Schlag-Rey [1962a], one gets a formulation of conditioning models that is very similar to a kind of probabilistic Bayesian model with finite memory. It would seem to be primarily a choice of language and not of concepts as to how one prefers to describe these models.
We described them as conditioning models, but it is a simple matter to translate this description into Bayesian language. Some additional remarks about these models are made in the next section. 4. The structural problems besetting the theory of concept formation. In Section 2 I tried to make the point that any simple Bayesian approach to decisions, actions or choices encounters considerable difficulty in explaining or predicting the behavior of human subjects even in simple concept-identification experiments. I also tried to describe there some approaches that seemed promising from the standpoint of mathematical learning theory and, in particular, the version which originates with Estes and is ordinarily called stimulus-sampling theory. In order not to draw the distinction between
Bayesian models and stimulus-sampling models in too absolute a fashion, in the third section I tried to work out some of the formal similarities between the two approaches, and in this case I chose for discussion a familiar paradigm in the experimental psychology of recent years, namely, the two-choice situation with a noncontingent probabilistic schedule of reinforcement. In discussing various Bayesian models and the formally similar stimulus-sampling models that may be used to analyze the noncontingent case, I tried to sketch some of the combinatorial problems that quickly arise when more complicated and subtle models are considered. To mention these combinatorial problems first in connection with the noncontingent case is almost a mistake, for fairly simple stimulus-sampling models of a rather different sort than the kind considered in the preceding section give quite a good account of much data from noncontingent experiments. I have in mind the kind of pattern stimulus-sampling models first discussed in Estes [1959], and also discussed in Suppes and Atkinson [1960], Ch. 10, and Atkinson and Estes [1962]. By considering the standard pattern model of stimulus-sampling theory, it is possible to bypass some of the combinatorial problems I mentioned that arise for Bayesian models and the particular stimulus-sampling models corresponding to these Bayesian models. The point of this section is to examine empirical situations, or simplified experimental situations roughly corresponding to the empirical situations, in which it does not seem possible to avoid these combinatorial problems. It is a fundamental thesis of this paper that it is in dealing with situations in which new concepts must be formed that the standard formulations of the Bayesian approach are most inadequate.
To give the discussion some definiteness and concreteness, I shall primarily restrict myself to description of a class of experiments in which the problem facing the subject is to learn the grammar of a set of strings. It is to be emphasized, however, that in dealing with this grammatical example, I think of the problem of concept formation as being of a quite general nature. The difficulties besetting this example, particularly those of a combinatorial nature, apply equally well to any attempt to understand how humans learn to play well a complicated game like chess or make decisions rapidly when confronted with an incredibly wide choice of alternatives. To fix our ideas quite specifically, let us consider initially the thirty-two strings of length five made up of 1's and 0's. From a formal standpoint we may define a grammar for this set of strings as a subset of the set of thirty-two strings. The number of such grammars is then 2³² − 2, where we exclude the universal and the empty grammar. Let us suppose that the subject is shown
CONCEPT FORMATION AND BAYESIAN DECISIONS
the cards one at a time and is asked to classify them as codes or noncodes, where we think of a code as being a grammatical string and a noncode as being a nongrammatical string. The theoretical problem is now to describe how the subject proceeds to find the correct grammar. A simple Bayesian approach would be to attempt to describe a subject's a priori distribution on the 2³² − 2 possible grammars, and then to change this distribution as information is given to the subject concerning the classification of strings. It is just possible that for strings of length five something can be made of this Bayesian approach. For strings of length seven or eight, or for anything approaching the complexity of chess, we must turn to the imposition of a considerable structure on the set of all possible grammars. It is, I would take it, the central problem of a theory of concept formation to provide such a structure and to state the laws by which organisms use the structure to solve the problem confronting them. One way of approaching the problem of characterizing the structure of the space of all grammars is the following. The idea is to express any possible concept for solving the problem of classification as a point in the space of properties associated with the stimulus material of the problems. A new concept is formed by moving to a new point in the property space. In these terms the theory of concept formation relevant to solving a given set of problems consists of two parts: first, characterizing the appropriate space of properties, and, secondly, characterizing the laws of motion in the space. In terms of the kind of formulation of stimulus-sampling theory considered in earlier sections, it may be thought that the phrase "laws of motion" is too grandiose, and that what is described are simply the assumptions for sampling properties or stimuli. My reply to this possible objection has already been stated.
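For very short strings the Bayesian approach just mentioned can be written down explicitly. The following is a minimal sketch in modern notation, not anything given in the text: the uniform prior is my illustrative assumption (the text leaves the subject's a priori distribution unspecified), and the update simply discards grammars inconsistent with the classification information.

```python
from fractions import Fraction as F
from itertools import combinations, product

# The nontrivial grammars for strings of length two over {0, 1}; the
# general count for strings of length n is 2**(2**n) - 2.
strings = list(product((0, 1), repeat=2))
grammars = [frozenset(c)
            for k in range(1, len(strings))
            for c in combinations(strings, k)]
assert len(grammars) == 14
assert 2 ** (2 ** 3) - 2 == 254            # length three
assert 2 ** (2 ** 5) - 2 == 4294967294     # length five

# Uniform prior over the 14 grammars -- an assumption for illustration.
prior = {g: F(1, len(grammars)) for g in grammars}

def update(prior, string, is_code):
    """Condition the prior on the information that `string` is a code
    (grammatical) or a noncode, discarding inconsistent grammars."""
    post = {g: p for g, p in prior.items() if (string in g) == is_code}
    total = sum(post.values())
    return {g: p / total for g, p in post.items()}

# Being told that the string "11" is a code leaves the 7 grammars
# containing it, renormalized.
post = update(prior, (1, 1), True)
assert len(post) == 7
assert sum(post.values()) == 1
```

Even this toy computation makes the combinatorial point vivid: the same construction for strings of length five would require a prior over more than four billion grammars.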
The usual formulations of sampling assumptions neither assume nor impose any substantial structure on the set of stimuli (or concepts). The point of the present formulation is to impose such a structure. The space of properties is conceived as a multidimensional space with each dimension corresponding to a property. (Admittedly in many applications the space will consist of a finite set of points and thus will not satisfy the usual mathematical definition of a multidimensional space, but that is not a matter of serious concern here. I shall use the word "dimension" the way it is used in the psychological literature of concept formation and not in a mathematical sense.) It is only after a space is postulated (i.e., a set with a structure), that it is possible to talk about motion in the space. The concept of motion in an arbitrary set with no postulated structure is not well defined. On the other hand, it is precisely the imposition of structure that seems to be necessary to
bring some order and constraints to the discouragingly large number of possible concepts that may be considered in solving even a relatively simple problem. Once such a structure is imposed, laws of motion for the space, particularly laws governing random walks, can be formulated. To illustrate some of the possibilities for constructing the basic properties we may look at the problem when the set of strings is only of length two and, as before, at each position in the string there occurs either a 1 or a 0. According to the computations already indicated above it is immediately apparent that there are then 14 possible grammars for the set of four possible strings, excluding as before the universal grammar and the empty grammar. A simple ideographic space for this problem is the four-dimensional one, with one dimension for each card. The value on a given dimension is 1 if that string belongs to the hypothesis, and 0 otherwise. It is then trivial to represent any hypothesis as a point in this four-dimensional space. Such an ideographic space is not too unwieldy when the number of possible strings is small, but as has already been remarked, when this is the case the whole apparatus of a property space and the imposition of structure on this space is scarcely necessary. We may just as well use a straightforward stimulus-sampling or Bayesian model. A more natural space of properties, which would generalize to longer strings, is the following. Dimension 1 characterizes the first position. The value 1 on the first dimension indicates that symbol 1 must occur in the first position of a string, and the value 2 that symbol 0 must occur in this position. The value 0 on this dimension indicates that either a 1 or a 0 may occur in the first position. The second dimension is defined similarly in terms of occurrence of symbols in the second position of a string.
The third dimension is defined in terms of agreement or difference between occurrence of symbols in the two positions of the string. The value 1 in the third dimension is taken to indicate that the symbols occurring in the first and second positions of the string must be the same, the value 2 to indicate that the first and second positions of the string must be occupied by different symbols, and, as before, the value 0 that the first and second positions may be occupied by the same or different symbols. We have selected the 0 value for all dimensions to indicate that the dimension is not intuitively relevant to the concept, hypothesis, or grammar in question. The first thing of course to be noticed about this three-dimensional space is that there are a number of points that can be occupied by no concept that is nontrivial. Thus the concept to be represented by the coordinates (1, 2, 1) is the empty grammar, for it is not possible for a string to
have a 1 in the first position, a 0 in the second position, and yet to have the first and second positions occupied by the same symbol. Property spaces for other grammars or concepts connected with strings of this character indicate that this phenomenon is not easily eliminated. There does not seem to be a natural and simple way of defining orthogonal dimensions, but this does not seem to be an immediately crucial problem. Still another way of looking at the space of properties is in terms of properties of a given string rather than of the grammar of the set of strings. In this case the grammar is represented by a certain subset in the space of properties rather than as a point. Corresponding to the space just constructed, a space of this sort is easily described for the strings of length two, but I shall not go into details, because the present stage of our analysis of these problems, reinforced by some preliminary experimental evidence, indicates that this latter method is not the most desirable theoretical approach. The space I did describe above for the strings of length two is deceptively simple. The extension of this same kind of description to strings of lengths greater than two soon becomes rather awkward if sufficient dimensions are required to locate with precision any grammar (or concept) in the set of all grammars. From experiments now being undertaken with Madeleine Schlag-Rey, and some related experiments being conducted with elementary-school children in conjunction with Irene Rosenthal, it appears that for purposes of initial simplification of analysis we may in the case of strings of length three reduce the dimensions of the property space to a fairly small number, and lump the remaining unusual and not-likely-to-be-thought-of properties together. To give some rough indications of what we are finding, let me describe briefly the situation.
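As an aside, the three-dimensional property space for strings of length two described above is small enough to enumerate exhaustively. The sketch below follows the text's encoding of dimension values; the function name `strings_of` is my own label, not the author's. It confirms that the point (1, 2, 1) denotes the empty grammar.

```python
from itertools import product

def strings_of(point):
    """Set of length-two strings over {0, 1} picked out by a property-space
    point (d1, d2, d3): d1 and d2 constrain positions one and two (value 1
    demands symbol 1, value 2 demands symbol 0, value 0 leaves the position
    free); d3 constrains agreement (1 = same symbols, 2 = different, 0 =
    either)."""
    d1, d2, d3 = point
    result = set()
    for s in product((0, 1), repeat=2):
        if d1 and s[0] != (1 if d1 == 1 else 0):
            continue
        if d2 and s[1] != (1 if d2 == 1 else 0):
            continue
        if d3 == 1 and s[0] != s[1]:
            continue
        if d3 == 2 and s[0] == s[1]:
            continue
        result.add(s)
    return result

# (1, 2, 1) demands a 1 in position one, a 0 in position two, and yet
# identical symbols -- no string satisfies all three constraints.
assert strings_of((1, 2, 1)) == set()

# Several of the 27 points denote the empty or the universal grammar,
# illustrating the nontrivial-occupancy problem noted in the text.
universe = set(product((0, 1), repeat=2))
trivial = [p for p in product((0, 1, 2), repeat=3)
           if strings_of(p) in (set(), universe)]
assert (1, 2, 1) in trivial
```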
In dealing with strings of length three, with the strings being built up from two symbols, there are 2⁸ − 2 = 254 possible nontrivial grammars. We have found, however, that about 80–85 percent of the grammars conjectured by subjects may be classified under six main property headings, and so we have restricted the analysis to these six properties together with a catch-all seventh category in which we place the remainder. The main point of our investigation at the present time is to find out to what extent the behavior of subjects in selecting and rejecting grammars (or more generally concepts or hypotheses) may be accounted for in terms of the application of stimulus-sampling models to the seven properties. To use the physical language mentioned earlier, we are attempting to characterize the motions of the subjects' changes in concepts or hypotheses in terms of random walks with respect to the most salient properties, as for example, the occurrence of a 0 or 1 in one of the three positions, or the occurrence of a
matching pair in the first and second, the second and third, or first and third positions. It is too early yet to decide whether or not this particular approach to concept formation will prove to be a fruitful one. Before concluding, I do want to indicate how the generalized conditioning models studied in Suppes and Schlag-Rey [1962a], which were mentioned earlier, have a bearing on finding the properties which are most salient for subjects in structuring their approach to the solution of a problem. The experiment analyzed was one with a probabilistic reinforcement schedule in which the reinforcement in a two-choice situation, on a given trial, depended or was contingent upon the subject's own preceding two responses. We were particularly concerned to analyze the experimental data to find the nature of the patterns to which subjects seem most likely to condition their responses. Basic data examined in the experiment were the conditional probabilities of an Ai response given the reinforcements and responses of the two preceding trials. Ten different models, each postulating that the conditioning of the responses depended on a different pattern, were considered. In Class I of the models the sequential dependence or conditioning was defined in terms of the two physical sides 1 and 2 of the key and light apparatus (of course, for some subjects side 1 was on the left side and for some subjects on the right). The point is that the conditioning parameters in Class I were defined in terms of the side. The five special cases considered in this Class were defined by restricting the dependency of A1,n+1 to: (a) the response and reinforcement that occurred on trial n; (b) the two preceding reinforcements; (c) the two preceding responses; (d) the two preceding reinforcements and the immediately preceding response; (e) the two preceding responses and the immediately preceding reinforcement.
In Class II the conditioning parameters were defined, not in terms of the sides 1 and 2, but in terms of successful and unsuccessful responses, rewarding and punishing reinforcements, repetition or alternation responses, etc. In particular the five special cases were defined by: (a) the reinforcement on trial n was punishing or rewarding; (b) the reinforcements on trials n − 1 and n were punishing or rewarding; (c) the reinforcement on trial n was punishing or rewarding, and the response of trial n indicated anticipation of a repeating or alternating reinforcing event; (d) the reinforcement on trial n was punishing or rewarding, and the response on trial n was a repetition or alternation of the response on trial n − 1; (e) the reinforcements on trials n − 1 and n were punishing or rewarding, and the response on trial n was a repetition or alternation of the response on trial n − 1.
From the standpoint of the present paper these ten models provided an opportunity for gathering information on the kind of structure subjects tend to impose in such a probabilistic situation. The observed transition probabilities and the goodness-of-fit tests for the ten models of Class I and Class II are given as Table 2 of Suppes and Schlag-Rey [1962a], and will not be reproduced here. The most important observation about the results of the goodness-of-fit tests, however, is that with the same net number of degrees of freedom the fits of the Class II models were uniformly better than those of Class I. In addition, the assumption that the conditioning can be explained in terms of the last reinforcement's being punishing or rewarding yields a better fit than did any of the Class I assumptions with four parameters. The uniformly better results of the Class II models in comparison with the Class I models support the hypothesis that subjects are in many cases more likely to sample patterns of stimuli defined in terms of complex relational properties than in terms of relatively concrete single events. Detailed information about the relative saliency of such relation-defined patterns is one of the most important things needed to move ahead with an empirically adequate theory of concept formation. The ideas about concept formation set forth in this paper are meant to be suggestive rather than definitive. I do hope, however, that the various kinds of examples considered present adequate evidence for maintaining that any theory of complex problem solving cannot go far simply on the basis of Bayesian decision notions of information processing. The core of the problem is that of developing an adequate psychological theory to describe, analyze, and predict the structure imposed by organisms on the bewildering complexities of possible alternatives facing them.
The simple concept of an a priori distribution over these alternatives is by no means sufficient and does little toward offering a solution of any complex problem. Moreover, understanding the structures actually used is important not only for an adequate descriptive theory of behavior but also for any normative theory intended to be applicable to human beings with finite powers of memory and computation. As the standard literature of inductive logic comes to grips with more realistic problems, the overwhelming combinatorial possibilities that arise in any complex problem will make the need for higher-order structural assumptions self-evident.

References

ATKINSON, R. C. and W. K. ESTES, 1963, Stimulus sampling theory, in: Handbook of Mathematical Psychology, vol. 2, eds. R. R. Bush, R. D. Luce and E. Galanter (John Wiley and Sons, Inc., New York) pp. 121–268
BOWER, G. H., 1961, Application of a model to paired-associate learning, Psychometrika, vol. 26, pp. 255–280
BOWER, G. H. and T. TRABASSO, 1964, Concept identification, in: Studies in Mathematical Psychology, ed. R. C. Atkinson (Stanford University Press, Stanford, California)
BOURNE, L. E. and R. RESTLE, 1959, Mathematical theory of concept identification, Psychological Review, vol. 66, pp. 278–296
ESTES, W. K., 1959, Component and pattern models with Markovian interpretations, in: Studies in Mathematical Learning Theory, eds. R. R. Bush and W. K. Estes (Stanford University Press, Stanford, California) pp. 9–52
SAVAGE, L. J., 1954, Foundations of Statistics (John Wiley and Sons, Inc., New York)
SUPPES, P. and R. C. ATKINSON, 1960, Markov Models for Multiperson Interactions (Stanford University Press, Stanford, California)
SUPPES, P. and R. GINSBERG, 1962a, Application of a stimulus sampling model to children's concept formation with and without an overt correction response, Journal of Experimental Psychology, vol. 63, pp. 330–336
SUPPES, P. and R. GINSBERG, 1962b, Experimental studies of mathematical concept formation in young children, Science Education, vol. 46, pp. 230–240
SUPPES, P. and R. GINSBERG, 1963, A fundamental property of all-or-none models, binomial distribution of responses prior to conditioning, with application to concept formation in children, Psychological Review, vol. 70, pp. 139–161
SUPPES, P. and M. SCHLAG-REY, 1962a, Test of some learning models for double contingent reinforcements, Psychological Reports, vol. 10, pp. 259–268
SUPPES, P. and M. SCHLAG-REY, 1962b, Analysis of social conformity in terms of generalized conditioning models, in: Mathematical Methods in Small Group Processes, eds. J. Criswell, H. Solomon and P. Suppes (Stanford University Press, Stanford, California) pp. 334–361
PROBABILISTIC INFERENCE AND THE CONCEPT OF TOTAL EVIDENCE*

PATRICK SUPPES
Stanford University, Stanford, California
1. Introduction. My purpose is to examine a cluster of issues centering around the so-called statistical syllogism and the concept of total evidence. The kind of paradox that is alleged to arise from uninhibited use of the statistical syllogism is of the following sort.

The probability that Jones will live at least fifteen years given that he is now between fifty and sixty years of age is r.
Jones is now between fifty and sixty years of age.    (1)
Therefore, the probability that Jones will live at least fifteen years is r.

On the other hand, we also have:

The probability that Jones will live at least fifteen years given that he is now between fifty-five and sixty-five years of age is s.
Jones is now between fifty-five and sixty-five years of age.    (2)
Therefore, the probability that Jones will live at least fifteen years is s.

The paradox arises from the additional reasonable assertion that r ≠ s, or more particularly that r > s. The standard resolution of this paradox by Carnap [1950] p. 211, Barker [1957] pp. 76–77, Hempel [1965] p. 399 and others is to appeal to the concept of total evidence. The inferences in question are illegitimate because the total available evidence has not been used in making the inferences. Taking the premises of the two inferences together, we know more about Jones than either inference alleges, namely, that he is between fifty-five and sixty years of age. (Parenthetically I note that if Jones happens to be a personal acquaintance, what else we know about him may be beyond imagining, and if we were asked to estimate the probability of his living at least fifteen years we might find it impossible to lay out the total evidence that we should, according to Carnap et al., use in making our estimation.)
* The writing of this paper has been supported by a grant from the Carnegie Corporation of New York.
There are at least two good reasons for being suspicious of the appeal to the concept of total evidence. In the first place, we seem in ordinary practice continually to make practical estimates of probabilities, as in forecasting the weather, without explicitly listing the evidence on which the forecast is based. At a deeper, often unconscious level, the estimations of probabilities involved in most psychomotor tasks, from walking up a flight of stairs to catching a ball, do not seem to satisfy Carnap's injunction that any application of inductive logic must be based on the total evidence available. Or, at the other end of the scale, many actually used procedures for estimating parameters in stochastic processes do not use the total experimental evidence available, just because it is too unwieldy a task (see, e.g., the discussion of pseudo-maximum-likelihood estimates in Suppes and Atkinson [1960] Ch. 2). It might be argued that these differing sorts of practical examples have as a common feature just their deviation from the ideal of total evidence, but their robustness of range, if nothing else, suggests there is something wrong with the idealized applications of inductive logic with an explicit listing of the total evidence as envisioned by Carnap. Secondly, the requirement of total evidence is totally missing in deductive logic. If it is taken seriously, it means that a wholly new principle of a very general sort must be introduced as we pass from deductive to inductive logic. In view of the lack of a sharp distinction between deductive and inductive reasoning in ordinary talk, the introduction of such a wholly new principle should be greeted with considerable suspicion. I begin my critique of the role of the concept of total evidence with a discussion of probabilistic inference.

2. Probabilistic inference. As a point of departure, consider the following inference form:
P(A | B) = r
P(B) = p
∴ P(A) ≥ rp.    (3)

In my own judgment (3) expresses the most natural and general rule of detachment in probabilistic inference. (As we shall see shortly, it is often useful to generalize (3) slightly and to express the premises also as inequalities:

P(A | B) ≥ r
P(B) ≥ p
∴ P(A) ≥ rp.    (3a)
The application of (3a) considered below is to take r = p = 1 − ε.) It is easy to show two things about (3): first, that this rule of probabilistic inference is derivable from elementary probability theory (and Carnap's theory of confirmation as well, because a confirmation function c(h, e) satisfies all the elementary properties of conditional probability), and secondly, that no contradiction can be derived from two instances of (3) for distinct given events B and C, but they may, as in the case of deductive inference, be combined to yield a complex inference. The derivation of (3) is simple. By the theorem on total probability, or by an elementary direct argument,
P(A) = P(A | B)P(B) + P(A | B̄)P(B̄),    (4)
whence, because probabilities are always nonnegative, we have at once from the premises P(A | B) = r and P(B) = p that P(A) ≥ rp. Secondly, from the four premises
P(A | B) = r
P(B) = p
P(A | C) = s
P(C) = a,
we conclude at once that P(A) ≥ max(rp, sa), and no contradiction results. Moreover, by considering the special case of P(B) = P(C) = 1, we move close to (1) and (2) and may prove that r = s. First we obtain, again by an application of the theorem on total probability and observation of the fact that P(B̄) = 0 if P(B) = 1, the following inference form as a special case of (3):
P(A | B) = r
P(B) = 1
∴ P(A) = r.    (5)
The proof that r = s when P(B) = P(C) = 1 is then obvious:

(1) P(A | B) = r    Premise
(2) P(B) = 1        Premise
(3) P(A | C) = s    Premise
(4) P(C) = 1        Premise
(5) P(A) = r        1, 2
(6) P(A) = s        3, 4
(7) r = s           5, 6.    (6)
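The elementary character of these derivations means they can be checked mechanically on a toy probability space. In the sketch below the particular joint distribution is an arbitrary assumption of mine, chosen only to illustrate the rule of detachment (3).

```python
from fractions import Fraction as F

# A toy joint distribution over the four atoms generated by events A and B;
# the weights are an illustrative assumption, summing to 1.
joint = {('A', 'B'): F(3, 10), ('A', '~B'): F(2, 10),
         ('~A', 'B'): F(1, 10), ('~A', '~B'): F(4, 10)}

P_B = joint[('A', 'B')] + joint[('~A', 'B')]   # P(B) = 4/10
P_A = joint[('A', 'B')] + joint[('A', '~B')]   # P(A) = 5/10
r = joint[('A', 'B')] / P_B                    # P(A | B) = 3/4

# Rule (3): P(A) >= P(A | B) P(B), since by the theorem on total
# probability P(A) = P(A|B)P(B) + P(A|~B)P(~B) and the second term
# is nonnegative; equality is forced when P(B) = 1, as in (5).
assert P_A >= r * P_B
```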
The proof that r = s seems to fly in the face of statistical syllogisms (1) and (2)
as differing predictions about Jones. This matter I want to leave aside for the moment and look more carefully at the rule of detachment (3), as well as the more general case of probabilistic inference. For a given probability measure P the validity of (3) is unimpeachable. In view of the completely elementary, indeed obvious, character of the argument establishing (3) as a rule of detachment, it is in many ways hard to understand why there has been so much controversy over whether a rule of detachment holds in inductive logic. Undoubtedly the source of the controversy lies in the acceptance or rejection of the probability measure P. Without explicit relative frequency data, objectivists with respect to the theory of probability may deny the existence of P, and in similar fashion confirmation theorists may also, if the language for describing evidence is not explicitly characterized. On the other hand, for Bayesians like myself, the existence of the measure P is beyond doubt. The measure P is a measure of partial belief, and it is a condition of coherence or rationality on my simultaneously held beliefs that P satisfy the axioms of probability theory (forceful arguments that coherence implies satisfaction of the axioms of probability are to be found in the literature, starting at least with de Finetti [1937]). It is not my aim here to make a general defense of the Bayesian viewpoint, but rather to show how it leads to a sensible and natural approach to the concept of total evidence. On the other hand, I emphasize that much of what I have to say can be accepted by those who are not full-fledged Bayesians. For example, what I have to say about probabilistic inference will be acceptable to anyone who is able to impose a common probability measure on the events or premises in question.
For the context of the present paper the most important thing to emphasize about the rule of detachment (3) is that its application in an argument requires no query as to whether or not the total evidence has been considered. In this respect it has exactly the same status as the rule of detachment in deductive logic. On the other hand, it is natural from a logical standpoint to push for a still closer analogue to ordinary deductive logic by considering Boolean operations on events. It is possible to assign probabilities to at least three kinds of entities: sentences, propositions, and events. To avoid going back and forth between the sentence approach of confirmation theory and the event approach of standard probability theory, I shall use event language but standard sentential connectives to form terms denoting complex events. For those who do not like the event language, the events may be thought of as propositions
or elements of an abstract Boolean algebra. In any case, I shall use the language of logical inference to talk about one event implying the other, and so forth. First of all, we define A → B as Ā ∨ B, in terms of Boolean operations on the events A and B. And analogous to (3), we then have, as a second rule of detachment:

P(B → A) ≥ r
P(B) ≥ p
∴ P(A) ≥ r + p − 1.    (7)
P(B+A) = P(B v A) = P(B) + P(A)  P(B &A)
whence, solving for
P(A),
~r,
P(A) ~ r  P(B) + P(B &A) ~r(1p) ~r+pl,
as desired. The general form of (7) does not seem very enlightening, and we may get a better feeling for it if we take the special but important case that we want to claim both premises are known with near certainty, in particular, with probability equal to or greater than 1E. We then have
P(B+A) ~ 1 P(B) ~ 1 :.P(A)
E E
(8)
~ 1 2E.
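The lower bound in (8) can be checked, and shown to be attained, by brute force over a grid of joint distributions. This is an illustrative search over distributions with denominator 100, not a proof; the grid size and ε = 1/10 are my arbitrary choices.

```python
# Search for the smallest P(A) compatible with P(B -> A) >= 1 - eps and
# P(B) >= 1 - eps, where B -> A is the material conditional (not-B or A).
# Probabilities are in units of 1/n; eps = k/n = 1/10.
n, k = 100, 10
best = None
for ab in range(n + 1):                      # mass on A & B
    for a_nb in range(n + 1 - ab):           # mass on A & not-B
        for na_b in range(n + 1 - ab - a_nb):  # mass on not-A & B
            # P(B -> A) = 1 - P(not-A & B); P(B) = P(A & B) + P(not-A & B)
            if n - na_b >= n - k and ab + na_b >= n - k:
                p_a = ab + a_nb              # P(A) = P(A & B) + P(A & not-B)
                best = p_a if best is None else min(best, p_a)

# The minimum is exactly 1 - 2*eps, so the bound in (8) cannot be improved.
assert best == n - 2 * k
```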
It is worth noting that the form of the rule of detachment in terms of conditional probabilities does not lead to as much degradation from certainty as does (8), for

P(A | B) ≥ 1 − ε
P(B) ≥ 1 − ε
∴ P(A) ≥ (1 − ε)²,    (9)
and for ε > 0, (1 − ε)² > 1 − 2ε. It is useful to have this well-defined difference between the two forms of detachment, for it is easy, on casual inspection, to think that ordinary-language conditionals can be translated equivalently in terms of conditional probability or in terms of the Boolean operation
corresponding to material implication. Which is the better choice I shall not pursue here, for application of either rule of inference does not require an auxiliary appeal to a court of total evidence. Consideration of probabilistic rules of inference is not restricted to detachment. What is of interest is that classical sentential rules of inference naturally fall into two classes: those for which the probability of the conclusion is less than that of the individual premises, and those for which this degradation in degree of certainty does not occur. Tollendo ponens, tollendo tollens, the rule of adjunction (forming the conjunction), and the hypothetical syllogism all lead to a lower bound of 1 − 2ε for the probability of the conclusion given that each of the two premises is assigned a probability of at least 1 − ε. The rules that use only one premise, e.g., the rule of addition (from A infer A ∨ B), the rule of simplification, the commutative laws and De Morgan's laws, assign a lower probability bound of 1 − ε to the conclusion given that the premise has probability of at least 1 − ε. We may generalize this last sort of example to the following theorem.

THEOREM 1. If P(A) ≥ 1 − ε and A logically implies B, then P(B) ≥ 1 − ε.

PROOF: We observe at once that if A logically implies B then Ā ∪ B = X, the whole sample space, and therefore A ⊆ B; but if A ⊆ B, then P(A) ≤ P(B), whence by hypothesis P(B) ≥ 1 − ε.

It is also clear that Theorem 1 can be immediately generalized to any finite set of premises.

THEOREM 2. If each of the premises A₁, ..., Aₙ has probability of at least 1 − ε and these premises logically imply B, then P(B) ≥ 1 − nε. Moreover, in general the lower bound of 1 − nε cannot be improved on, i.e., equality holds in some cases whenever 1 − nε ≥ 0.

PROOF: By hypothesis for i = 1, ..., n, P(Aᵢ) ≥ 1 − ε. We prove by induction that under this hypothesis P(A₁ & ... & Aₙ) ≥ 1 − nε. The argument for n = 1 is immediate from the hypothesis. Suppose it holds for n.
Then by an elementary computation

P(A₁ & ... & Aₙ & Aₙ₊₁) = 1 − (1 − P(A₁ & ... & Aₙ)) − (1 − P(Aₙ₊₁)) + P(¬(A₁ & ... & Aₙ) & ¬Aₙ₊₁)
  ≥ 1 − (1 − P(A₁ & ... & Aₙ)) − (1 − P(Aₙ₊₁))
  ≥ 1 − nε − ε
  = 1 − (n + 1)ε,
as desired. (Details of how to handle quantifiers, which are not explicitly treated in the standard probability discussions of the algebra of events, may be found in Gaifman [1965] or in the article by Krauss and Scott in this
volume. The basic idea is to take as the obvious generalization of the finite case P((∃x)Ax) = sup {P(Aa₁ ∨ Aa₂ ∨ ... ∨ Aaₙ)}, where the sup is taken over all finite sets of objects in the domain. Replacing sup by inf we obtain a corresponding expression for P((∀x)Ax). Apart from details it is evident that however quantifiers are handled, the assignment of probabilities must be such that Theorem 1 is satisfied, i.e., that if A logically implies B then the probability assigned to B must be at least as great as the probability assigned to A, and this is all that is required for the proof of Theorem 2.)

The proof that the lower bound 1 − nε cannot in general be improved upon reduces to constructing a case for which each of the n premises has probability 1 − ε, but the conjunction, as a logical consequence of the premises taken jointly, has probability 1 − nε, when 1 − nε ≥ 0. The example I use is most naturally thought of as a temporal sequence of events A₁, ..., Aₙ. Initially we assign

P(A₁) = 1 − ε
P(Ā₁) = ε.

Then

P(A₂ | A₁) = (1 − 2ε)/(1 − ε)
P(A₂ | Ā₁) = 1,

and more generally

P(Aₙ | Aₙ₋₁Aₙ₋₂ ··· A₁) = (1 − nε)/(1 − (n − 1)ε)
P(Aₙ | Āₙ₋₁Aₙ₋₂ ··· A₁) = 1;

in other words, for any combination of preceding events on trials 1 to n − 1 the conditional probability of Aₙ is 1, except for the case Aₙ₋₁Aₙ₋₂ ··· A₁. The proof by induction that P(Aₙ) = 1 − ε and P(AₙAₙ₋₁ ··· A₁) = 1 − nε is straightforward. The case for n = 1 is trivial. Suppose now the assertion holds for n. Then by inductive hypothesis

P(Aₙ₊₁Aₙ ··· A₁) = P(Aₙ₊₁ | Aₙ ··· A₁)P(Aₙ ··· A₁)
  = [(1 − (n + 1)ε)/(1 − nε)](1 − nε)
  = 1 − (n + 1)ε,
and by the theorem on total probability

P(Aₙ₊₁) = P(Aₙ₊₁ | Aₙ ··· A₁)P(Aₙ ··· A₁) + [P(Aₙ₊₁ | Āₙ ··· A₁)P(Āₙ ··· A₁) + ... + P(Aₙ₊₁ | Āₙ ··· Ā₁)P(Āₙ ··· Ā₁)].

By construction all the conditional probabilities referred to in the bracketed expression are 1, and the unconditional probabilities in this expression by inductive hypothesis simply sum to nε, i.e., 1 − (1 − nε), whence

P(Aₙ₊₁) = [(1 − (n + 1)ε)/(1 − nε)](1 − nε) + nε = 1 − ε,
which completes the proof. It is worth noting that in interesting special cases the lower bound of Ine can be very much improved. For example, if the premises At, ... , An are statistically independent, then the bound is at least (1  e The intuitive content of Theorem 2 reflects a commonsense suspicion of arguments that are complex and depend on many premises, even when the logic seems impeccable. Overly elaborate arguments about politics, personal motives or circumstantial evidence are dubious just because of the uncertainty of the premises taken jointly rather than individually. A natural question to ask about Theorem 2 is whether any nondeductive principles of inference that go beyond Theorem 2 arise from the imposition of the probability measure P on the algebra of events. Bayes' theorem provides an immediate example. To illustrate it with a simple artificial example, suppose we know that the composition of an urn of black (B) and white (W) balls may be exactly described by one of two hypotheses. According to hypothesis H" the proportion of white balls is r, and according to H., the proportion is s. Moreover, suppose we assign a priori probability p to H, and 1 p to H,. Our four premises may then be expressed so:
P(W | H_r) = r,
P(W | H_s) = s,
P(H_r) = p,
P(H_s) = 1 − p.
Given that we now draw with replacement, let us say, two white balls, we have as the likelihood of this event as a consequence of the first two premises

P(WW | H_r) = r²,   P(WW | H_s) = s²,
PROBABILISTIC INFERENCE
and thus by Bayes' theorem, we may infer

P(H_r | WW) = r²p / (r²p + s²(1 − p)),   (10)
and this is clearly not a logical inference from the four premises. Logical purists may object to the designation of Bayes' theorem as a principle of inference, but there is little doubt that ordinary talk about inferring is very close to Bayesian ideas, as when we talk about predicting the weather or Jones' health, and such talk also has widespread currency among statisticians and the many kinds of people who use statistical methods to draw probabilistic inferences. The present context is not an appropriate one in which to engage upon a full-scale analysis of the relation between logical and statistical inference. I have only been concerned here to establish two main points about inference. First, in terms of standard probability theory there is a natural form of probabilistic inference, and inference from probabilistically given premises involves no appeal to the concept of total evidence. Second, not all forms of such probabilistic inference are subsumed within the forms of logical inference, and two examples have been given to substantiate this claim, one being the rule of detachment as formulated for conditional probability and the other being Bayes' theorem.

3. The statistical syllogism reexamined. There is, however, a difficulty about the example of applying Bayes' theorem that is very similar to the earlier difficulty with the statistical syllogism. I have not stated as explicit premises the evidence WW that two white balls were drawn, and the reason I have not provides the key for reanalyzing the statistical syllogism and removing all air of paradox from it. The evidence WW has not been included in the statement of the premises of the Bayesian example because the probability measure P referred to in the premises is the measure that holds before any taking of evidence (by drawing a ball) occurs.
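The Bayesian computation in (10) can be verified mechanically. The sketch below uses illustrative values of r, s and p that are assumptions of mine, not drawn from the text; it reproduces the posterior of H_r after two white draws.

```python
from fractions import Fraction

def posterior_r(r, s, p, draws):
    """Posterior probability of H_r after `draws` white balls are drawn
    with replacement: P(H_r | W...W) = r^n p / (r^n p + s^n (1 - p))."""
    num = r ** draws * p
    return num / (num + s ** draws * (1 - p))

# Illustrative values (not from the text): r = 3/4, s = 1/2, equal priors.
r, s, p = Fraction(3, 4), Fraction(1, 2), Fraction(1, 2)
print(posterior_r(r, s, p, 2))  # prints 9/13
```

Each additional white draw multiplies the likelihood ratio by r/s, so the posterior of H_r rises monotonically; this is the non-logical, Bayesian form of inference the text describes.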
The measure P does provide a means of expressing the a posteriori probability after the evidence is taken as a conditional probability, but the hypothetical or conditional nature of this assertion has been too little appreciated. Using just the measure P there is no way to express that in fact two white balls were drawn, rather than, say, a white ball and then a black ball. Using conditional probabilities we can express the a posteriori probabilities of the two hypotheses under any possible outcomes of one, two or more drawings. What we cannot express in these terms is the actual
evidence, and it is a mistake to try. (It should be apparent that these same remarks apply to Carnapian confirmation functions.) Commission of this mistake vitiates what appears to be the most natural symbolic formulation of the statistical syllogism, the inference form (5) as a special case of (3). We can symbolize statistical syllogism (1) as follows, where e(x) is the life expectancy of person x and a(x) is the age of person x, and let j = Jones:

P(e(j) ≥ 15 | 50 < a(j) < 60) = r
50 < a(j) < 60
∴ P(e(j) ≥ 15) = r.   (11)

Now let us schematize this inference in terms of hypothesis and evidence as these notions occur in Bayes' theorem:

P(hypothesis | evidence) = r
evidence
∴ P(hypothesis) = r,   (12)
and the incorrect character of this inference is clear. From the standpoint of Bayes' theorem it asserts that once we know the evidence, the a posteriori probability P(H | E) is equal to the a priori probability P(H), and this is patently false. The difficulty is that the measure P cannot be used to assert that in fact 50 < a(j) < 60.

The language under consideration allows the representation of this statement as sd:
(1)
The length of sd according to this convention is equal to n − n₁ + 1, which is obtained by counting the atomic statements x = a_i (i = n₁ + 1, …, n) and the atomic statement ¬Px. It can be shown that there is no description which is shorter in this sense. The principle of description leading to the construction of this expression is something like the following: the statement asserts that, as a rule, the individuals have the predicate P, but that there are several exceptions, namely the individuals a_{n₁+1}, …, a_n, which have the predicate ¬P. So the length of the expression is determined by the number of exceptional individuals which disturb the uniformity of the state described. Before considering the implications of this measure together with the principle of simplicity, I briefly present two other explications of simplicity. One observes that it is essential for the length of state descriptions in Kiesow's sense that the language used contains the equivalence symbol.

(Np₀ − ns + 1) : (Np₀ + 1)   (10′)

and

(Np₁ − ns + 1) : (Np₁′ − ns′ + 1).

By division we obtain the simpler inequalities:

(Np₀ − ns)/(Np₀′ − ns′ + 1) ≤ s/s′ ≤ (Np₀ − ns + 1)/(Np₀′ − ns′)   (10″)

and

(Np₁ − ns)/(Np₁′ − ns′ + 1) ≤ s/s′ ≤ (Np₁ − ns + 1)/(Np₁′ − ns′).

The inequalities (10″) are readily transformed into

Np₀s′ ≤ Np₀′s + s   and   Np₀′s ≤ Np₀s′ + s′,   (11)

and

Np₁s′ ≤ Np₁′s + s   and   Np₁′s ≤ Np₁s′ + s′.   (12)

Add Np₀s to both of the inequalities in (11), and Np₁s to both of the inequalities in (12); since p₀ + p₀′ = p₁ + p₁′ = 1 and s + s′ = 1, we then obtain

Np₀ < Ns + 1   and   Ns < Np₀ + 1,   (11′)

and

Np₁ < Ns + 1   and   Ns < Np₁ + 1.   (12′)

We conclude that
become smaller. The parameter α thus reflects our expectations
SEMANTIC INFORMATION AND INDUCTIVE LOGIC
concerning the occurrence of different kinds of individuals: the greater α, the more kinds are expected to be instantiated. The choice of this parameter determines how rapidly experience affects our a priori expectations. EXAMPLE 2. Take the situation described in Example 1, and let α = K = 4. Then the absolute values of hypothesis C_w for the different values of w are
  w    p(C_w)    inf(C_w)    cont(C_w)
  0    0.0000       ∞         1.0000
  1    0.0015     9.411       0.9985
  2    0.0235     5.427       0.9765
  3    0.1191     3.070       0.8809
  4    0.3765     1.403       0.6235
Let again w = c = 3; then the hypothesis C₃ receives the following relative values:

  n    c(C₃,e)   inf(C₃|e)   cont(C₃|e)    inf(e)    cont(e)
  3    0.428     1.220       0.00588241     6.226    0.98970579
  4    0.500     1.000       0.00147060     8.420    0.99705879
  5    0.571     0.807       0.00036765    10.088    0.99914215
  6    0.640     0.643       0.00009191    11.930    0.99974469
  7    0.703     0.507       0.00002298    13.661    0.99992256
  8    0.760     0.397       0.00000574    15.351    0.99997610
  9    0.808     0.306       0.00000144    17.032    0.99999251
 10    0.849     0.237       0.00000036    18.688    0.99999762
We observe that an information receiver characterized by α = K cannot be said to be as ready to learn from experience as one with α = 0 (cf. Example 1). Obviously, most people's inductive behavior will correspond to values larger than α = K.
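The figures of Example 2 can be reproduced by machine if one assumes (a reading reconstructed from the table itself, not stated in this excerpt) that the a priori probability of a constituent C_w is p(C_w) = (w/K)^α / Σ_j (K choose j)(j/K)^α, with inf(C_w) = −log₂ p(C_w) and cont(C_w) = 1 − p(C_w):

```python
from math import comb, log2

def p_constituent(w, K, alpha):
    """A priori probability of a constituent with w instantiated Q-predicates,
    assuming p(C_w) proportional to (w/K)**alpha (reconstructed reading)."""
    Z = sum(comb(K, j) * (j / K) ** alpha for j in range(K + 1))
    return (w / K) ** alpha / Z

K = alpha = 4
for w in range(K + 1):
    p = p_constituent(w, K, alpha)
    inf = -log2(p) if p > 0 else float("inf")  # inf(C_w) = -log2 p(C_w)
    cont = 1 - p                               # cont(C_w) = 1 - p(C_w)
    print(w, round(p, 4), round(inf, 3), round(cont, 4))
```

Under this reading the p and cont columns agree with the first table of Example 2 to four decimals.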
6. As was indicated in the first section of this paper, perhaps the most interesting question that comes up in the theory of semantic information concerns the role of the notion of information in the rational choice of hypotheses and theories. In recent literature the importance of this question has been emphasized especially vigorously by Sir Karl Popper [1935]. Popper has asserted that science does not aim at highly probable hypotheses and theories but rather "at a high informative content, well backed by experience". One can ask what light our concepts of information can throw on this claim. It is not entirely clear what exactly is implied by Popper's thesis. He has sometimes explained himself by saying that one ought to prefer the least
J. HINTIKKA AND J. PIETARINEN
probable hypothesis, and indicated that the notion of probability involved here is logical probability a priori. This suggests that he presupposes a concept of information not unlike our concept of absolute information. In Popper's view a statement is the more informative the more possibilities it excludes, i.e. the less probable it is. We agree with his view in the case of a priori probability, which corresponds to absolute information, but not in the case of a posteriori probability and the corresponding notion of relative information, for it is perfectly possible that the degree of confirmation of a hypothesis should be inversely related to its relative information. There are hypotheses with a high a posteriori probability which add more information to our knowledge expressed by the evidence than hypotheses with a lower degree of confirmation¹⁰. Here Popper's thesis of the inverse relation of probability and information seems to be inapplicable. There is another point, too. In so far as Popper wants to formulate his point in terms of a priori probability, what his notion of information amounts to is absolute information. Now there is no need for the absolute information and the a posteriori probability to be related inversely in the same way as absolute information and a priori probability are inversely related. In fact, we can show that in certain important cases the quest of high absolute information leads to the same results as the quest of high a posteriori probability. This holds e.g. for strong generalizations, as can be shown as follows. From (15) and from the corresponding formula for the content measure,

cont(C_w) = 1 − (w/K)^α / Σ_{j=0}^{K} (K choose j)(j/K)^α,   (17)
(17)
it is seen that a strong generalization is the more informative the fewer kinds of individuals it admits of. ln other words, the most informative (absolutely speaking) strong generalization compatible with evidence is the one which says that as few kinds of individuals as possible are instantiated. This is obviously the constituent C; which says that only those kinds of individuals are instantiated in the world at large as already have been instantiated in experience. When the evidence is extensive enough and (X C, there is at least one value of i such that the exponent is positive. When n grows without limit, there must be at least one n j which also goes to infinity. For these values of i andj (22) grows without a limit, which implies that (20)* decreases to zero. When w = c, the exponent of (22) is either negative or zero at all values of i, and it is negative at all values but i=O. For all the other values (22) approaches zero when n grows indefinitely, since there must then be at least one nj which also grows without a limit. This means that (20)* approaches one when n goes to infinity (while C remains constant). This shows that the degrees of confirmation of generalizations behave asymptotically in this case (a=O, A a numerical constant) exactly as we might hope them to behave. Since the only thing that is affected when a changes is the distribution of a priori probabilities to constituents and since this does not affect the asymptotic behavior of (13), the same holds for other (finite) values of a. 3 This generalizes what was found in Hintikka [1965b] for one particular system. It is to be noted, however, that although the asymptotic behavior of the degrees of confirmation of
3
A TWO-DIMENSIONAL CONTINUUM OF INDUCTIVE METHODS
A further observation can be made on the basis of (22) if we rewrite it into the form

[ ∏_{j=1}^{c} n_j ]^{1/(c + i − n/w)}.   (22)*
Since the n_j's are subject to the condition Σ n_j = n, it is seen from (22)* what for a constant n is the best sort of evidence for that particular constituent with w = c for which p(C_w | e) approaches one when n → ∞. The best evidence is the one which maximizes ∏ n_j, and this happens when the numbers n_j are all as close to each other and to n/c as possible. In other words, the best situation we can hope for in making inductive generalizations is one in which the n observed individuals are distributed evenly among the c Q-predicates which are exemplified in experience. Then we have as much evidence for the generalization which says that no other Q-predicates are exemplified in the world as we can hope to have on the basis of having observed no more than n individuals. All this is very natural, for if the observed individuals were unevenly distributed, this fact would lead us (intuitively speaking) to wonder whether there might be similar unobserved irregularities which will cause one of the previously unobserved Q-predicates to be instantiated. It means that a form of the requirement of the (quantitative) variety of instances is enforced in our systems. All this of course holds only asymptotically, i.e. for sufficiently large values of the numbers of observed individuals. Furthermore, our argument holds only if our present assumptions are satisfied (λ a finite numerical constant greater than zero).

12. We asked earlier what happens when α → ∞, i.e. when generalizations are judged on a priori grounds only. It may be asked likewise what happens when λ → ∞, i.e. when singular inductive predictions are judged solely on a priori grounds. In order to see this, we may consider (7)* which may be written as follows:
p(C_w) = (wλ/K)(wλ/K + 1) ⋯ (wλ/K + α − 1) / Σ_{i=0}^{K} (K choose i)(iλ/K)(iλ/K + 1) ⋯ (iλ/K + α − 1).   (23)
generalizations is quite natural in all the different systems one obtains by giving α different values, the pre-asymptotic behavior of the degrees of confirmation need not be equally natural.
JAAKKO HINTIKKA
When λ → ∞, this approaches

(w/K)^α / Σ_{i=0}^{K} (K choose i)(i/K)^α.   (24)

In the same way we can see that (10) approaches (1/w)^n when λ → ∞. Substituting these values into (13) and simplifying we obtain for the degree of confirmation of C_w the following expression:

p(C_w | e) = (w/K)^α (1/w)^n / Σ_{i=0}^{K−c} (K−c choose i)((c+i)/K)^α (1/(c+i))^n.   (25)

This is a direct generalization of the expression for the degree of confirmation of a constituent which was obtained in Hintikka [1965b]. It reduces to the old result when α = 0. It is also seen without much difficulty that the characteristic function one obtains here is in the case of an infinite universe the same as in the system presented in Hintikka [1965b]. As I pointed out in the earlier paper, in the case α = 0 we obtain an expression for the degree of confirmation of a constituent which corresponds to wildly over-optimistic inductive behavior, i.e. yields too large values when n is small. Now we can see that this is only natural, for putting α = 0 means that we let ourselves be guided very strongly by what we find in experience. Hence the results we obtain in this case are not disconcerting at all but rather to be expected. They can be easily corrected now that we have the parameter α at our disposal. The larger this parameter is, the more suspicious we are of prospective inductive generalizations.

The role of α in (25) is seen more vividly by considering an example. Let us recall the situation sketched in section 1 above. There we had but two primitive predicates, R(x) and B(x) (you may read them "raven" and "black"). If we assume that all the Q-predicates that one can form here have been instantiated in our experience except for the Q-predicate R(x) & ¬B(x), we have k = 2, K = 4, c = 3. The only question which is then left open by our evidence as far as generalizations involving R and B are concerned is whether the missing Q-predicate also fails to be instantiated outside our experience or not, i.e. whether all ravens are black or not.
Corresponding to the positive and negative answers to this question we have two constituents C₃ and C₄, which are the last two constituents written out in section 1. From (25) it is seen that the degree of confirmation of C₃ is then

p(C₃ | e) = (3/4)^α (1/3)^n / [ (3/4)^α (1/3)^n + (1/4)^n ].   (26)
From (26) it is seen that α is in this case the number of individuals we must observe before we are willing to make an even bet that all the remaining ravens will also be black. Observations of this kind may make it possible to relate our parameter α to the actual inductive behavior of different people in different circumstances⁴.

It was also pointed out in Hintikka [1965b] that one does not obtain very natural results in this case for singular inductive predictions. In fact, it is easily seen that the hypothesis which says that an unobserved individual instantiates a certain Q-predicate which is already instantiated in experience has the asymptotic value 1/c; see Hintikka [1965b] p. 287. This is not surprising any more in view of the fact that the present special case was obtained by putting λ = ∞, i.e. by deciding to judge singular hypotheses on as completely a priori grounds as possible. In fact, the value 1/c is just the natural consequence of this decision. We cannot obtain the value 1/K, for this would be incompatible with the degrees of confirmation which our constituents receive in this case. Hence 1/c is as close to a priori predictions as we can get, and hence just what we might expect as a consequence of putting λ = ∞.

13. All the special cases of our two-dimensional system so far considered are obtained by giving suitable numerical values to α and to λ. The value of the latter may depend on K, however. We obtain further systems by making λ dependent on w, λ = λ(w). Here we consider only the simplest system obtained in this way, which is clearly the one obtainable by putting λ = w. This generalizes in a sense the choice λ = K which gives rise to Carnap's c*. (In determining the a priori probabilities of constituents, by the same token we then have to put λ = λ(K) = K.)

⁴ An illustration of this kind is closely related to what Good calls "the device of imaginary results"; see Good [1950] pp. 35, 70, and Good [1965] pp. 19, 29.
Some of Good's uses of this device (e.g. the last-mentioned one) seem to be based on an implicit atomicity assumption: the possibility that our evidence might serve to establish a strict generalization is disregarded (as may be entirely appropriate for the purposes Good has in mind). One is apt to view imaginary results of the kind Good envisages in a different light if this possibility is taken seriously. Kemeny's criticism of Carnap is an instructive case in point: "If the evidence, e, states that we drew balls from an urn, always replacing them, and the first hundred balls were all small and white; and if our hypothesis, h, predicts that the next hundred balls will also be small and white; then c*(h, e) [i.e. the degree of confirmation of the hypothesis in Carnap's c*-system] is approximately ⅛. The reviewer would nevertheless refuse to bet 7:1 against it, because he would strongly suspect that all the balls in the urn are small and white" (Kemeny reviewing Carnap [1950] in the Journal of Symbolic Logic, vol. 16 (1951), p. 207; the italics are mine). The last clause brings out the reason why Kemeny's intuitions differ from Good's; cf. Good [1965] p. 29.
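The even-bet reading of α can be checked numerically. The sketch below assumes that (26) has the form p(C₃|e) = (3/4)^α(1/3)^n / [(3/4)^α(1/3)^n + (1/4)^n], which is what a formula of type (25) yields for K = 4, c = w = 3; this form is a reconstruction, since the display of (26) is not fully legible here.

```python
def p_c3(n, alpha):
    """Degree of confirmation of C3 ("all ravens are black") after n observed
    individuals, with K = 4 and c = 3 (reconstructed form of (26))."""
    a = (3 / 4) ** alpha * (1 / 3) ** n  # prior weight times likelihood for C3
    b = (1 / 4) ** n                     # corresponding term for C4 (w = 4)
    return a / (a + b)

# p(C3|e) passes 1/2 exactly at n = alpha: the even-bet point noted in the text.
for alpha in (1, 5, 20):
    print(alpha, p_c3(alpha - 1, alpha) < 0.5 < p_c3(alpha + 1, alpha))
```

At n = α the two terms are equal, since (3/4)^α(1/3)^α = (1/4)^α, so the confirmation of "all ravens are black" is exactly one half.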
In this simple case we obtain from (11) after obvious simplifications

p(C_w | e) = [(α + w − 1)!/(n + w − 1)!] / Σ_{i=0}^{K−c} (K−c choose i)(α + c + i − 1)!/(n + c + i − 1)!.   (27)
This is a straightforward generalization of the system presented in Hintikka [1965c], to which it reduces when α = 0. The fact that in this "combined system" one is again expected to jump to generalizations rather easily is predictable on the basis of the choice α = 0 on which it is based. This over-optimism is again eliminated by choosing the value of α sufficiently large. In the same way one obtains
p(h | e) = [(n₁ + 1)/(n + c)] · [ Σ_{i=0}^{K−c} (K−c choose i)(α + c + i − 1)!(n + c)!/(n + c + i)! ] / [ Σ_{i=0}^{K−c} (K−c choose i)(α + c + i − 1)!(n + c − 1)!/(n + c + i − 1)! ],   (28)
where h is the hypothesis that an unobserved individual satisfies Ct_{i₁}(x). This is again exactly the same as was found in the combined system (in an infinite universe) except for the presence of the parameter α. Our results may be compared with Carnap's results in the same way as in Hintikka [1965c]. In Hintikka [1965c] it was also shown for α = 0 that this system is obtained (in the limiting case of an infinite universe) by a three-stage procedure: first, equal a priori probabilities are given to the constituents; then the a priori probability of a constituent is divided evenly among the structure-descriptions that make it true; and finally the a priori probability of a structure-description is divided evenly among all the state-descriptions that make it true. In fact, we can obtain the more general system in the same way except that the a priori probability of each constituent C_w has to be chosen to be proportional to (α + w − 1)!/(w − 1)!, i.e. proportional to (7) with λ = K. This illustrates the naturalness of the system, and also its close relation to Carnap's old favorite system based on c*. The same relation is illustrated by the fact that when α → ∞, our system approaches Carnap's c*-system. In fact, by the same limit argument as we have used earlier, it can be shown that when α → ∞, p(h | e) approaches (n₁ + 1)/(n + K), which is exactly the characteristic function of a Carnapian system based on c*. Other choices of λ give rise to other systems, most of which remain to be investigated. It might be especially interesting to see what would happen if λ were made dependent on n₁, n₂, …, n_c.
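Formula (27) admits a quick coherence check: reading it as p(C_w|e) = [(α+w−1)!/(n+w−1)!] / Σ_i (K−c choose i)(α+c+i−1)!/(n+c+i−1)!, the degrees of confirmation of all constituents compatible with the evidence must sum to one, since (K−c choose w−c) constituents share each width w. A sketch under that reading, with illustrative parameter values:

```python
from math import comb, factorial

def p_cw(w, n, c, K, alpha):
    """p(C_w | e) in the lambda = w system, as in (27), for c <= w <= K."""
    num = factorial(alpha + w - 1) / factorial(n + w - 1)
    den = sum(comb(K - c, i) * factorial(alpha + c + i - 1) / factorial(n + c + i - 1)
              for i in range(K - c + 1))
    return num / den

n, c, K, alpha = 10, 3, 8, 2
total = sum(comb(K - c, w - c) * p_cw(w, n, c, K, alpha) for w in range(c, K + 1))
print(total)  # should be 1 up to rounding
```

With n > α the numerator (α+w−1)!/(n+w−1)! decreases in w, so the constituent with the fewest instantiated Ct-predicates (w = c) receives the largest degree of confirmation, in line with the asymptotic behavior discussed earlier.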
14. By way of summary, we can exhibit the relation of some clear-cut systems which are members of our two-dimensional continuum in the form of the following table:

  system                                   α                   λ
  Carnap's λ-continuum                     ∞                   λ
  a system based on Carnap's c†            ∞                   ∞
  a system based on Carnap's c*            ∞                   K
  straight rule                            (does not matter)   0
  Hintikka's Jerusalem system [1965b]      0                   ∞
  the same generalized                     α                   ∞
  a system based on (20)-(21)              0                   λ
  Hintikka's "combined system" [1965c]     0                   w
  the same generalized                     α                   w

The occurrence of α in the first column or λ in the second column means that they are free parameters in the (kind of) system in question.
15. In Hintikka [1965b] it was conjectured that one can make a distinction between two essentially different tasks, namely, between singular inductive inferences and inductive generalizations. What has been found in the present paper bears out this conjecture to a certain extent. We have been able to arrange into a two-dimensional continuum several interesting inductive procedures in such a way that one of the parameters describes the character of one's assumptions about singular inductive predictions and the other describes one's assumptions about inductive generalization. In a sense we have thus separated the two from each other. In Hintikka [1965b] it was likewise suggested that this distinction is relevant to some aspects of the Carnap-Popper controversy in that Popper's claims seem to be best justified when applied to inductive generalizations while most of Carnap's fully developed work deals with singular inductive inference. This suggestion also derives some support from what we have found. Indeed, it was found that Carnap's whole λ-continuum deals with the case α = ∞ in which inductive generalizations are dealt with on a priori grounds only, and is hence informative about singular inductive predictions only.
In contrast to this, some of our remarks on inductive generalizations can be related to certain ideas Popper [1959] has put forward. One way of seeing this is to ask what sense one is to make of Popper's remarks about the ease or difficulty that there might be about falsifying a hypothesis or theory. What is meant by these remarks is obviously different from what is probably their most literal sense. If a hypothesis is true, it is simply impossible to falsify it. When Popper says that we ought to prefer, ceteris paribus, theories that are easily falsifiable he therefore does not mean theories that are easily falsifiable in our actual world. What he means might perhaps be explained by saying that the preferable theories would be easy to falsify in a completely atomistic world. On the other hand Popper also makes remarks to the effect that the preferable theories have a low absolute logical probability. What we have just seen implies that this absolute logical probability has to be understood as a probability in an atomistic universe. This suggests a way of measuring the absolute logical probability, which is essentially our a priori probability, of different generalizations in a Popperian spirit. This task reduces at once to the task of assigning a priori probabilities to the different constituents. In a Popperian view, these probabilities should be indicative of the probability of the different generalizations in an atomistic universe. Now if this universe is infinite, these a priori probabilities are all zero, except in the case of a logically true generalization. Hence we have to use as a measure some finite atomistic universe (a universe, say, with α members) if we want to have any a posteriori probabilities that can be applied in large universes. In this way we are led to something very much like our assignment of a priori probabilities to generalizations in terms of (7).
In fact, we are led exactly to (7), I believe, if we decide that the probabilities that underlie our completely atomistic universe are to be Carnapian. Thus that part of our approach which turns on the use of the new parameter α is related very closely to Popper's ideas of the inverse relation between a priori probability on the one hand and the acceptability of theories and the ease of falsifying them on the other. It may not be entirely clear what I have meant here and in certain earlier remarks by a completely atomistic universe. An explicit explanation can now be given after we have first formulated a theory by using this heuristic idea in an informal way. A completely atomistic universe may in any view be characterized as one in which general laws hold only by virtue of chance, one in which the only regularities, statistical or otherwise, that obtain pertain to the interrelations of individuals. From the point of view which we have reached this means that an atomistic universe is one in which the best inductive
policy is obtained by putting α = ∞. This definition illustrates the possibilities of conceptualization that the use of the new parameter α offers us. In a way we can perhaps also partially justify certain Popperian ideas in those systems in which α is not infinite. These systems favor, it has been found, that constituent which has the smallest number of Q-predicates and yet is compatible with the evidence. They favor it in the sense of giving it the highest degree of confirmation when the number n of observed individuals is large enough. It is easily seen that this constituent also has, when α > 1, the lowest a priori probability among all those that are compatible with the evidence we have, and which is therefore in the Popperian sense the easiest to falsify among all of them. It is true that Popper does not think of this preferability as taking the form of a high a posteriori probability; and I agree that it does so only in the case of constituents, not in the case of other generalizations. But if it should take this form, even in the case of constituents, it is not surprising that it should do so only when α is finite. In fact, the smaller α is, the faster we switch our allegiance (measured by the degree of confirmation) from initially more probable to initially less probable constituents. Thus a small α is in a sense an indication of one aspect of that intellectual boldness Sir Karl has persuasively advocated.
References
CARNAP, R., 1950, The logical foundations of probability (University of Chicago Press, Chicago; second edition, 1963)
CARNAP, R., 1952, The continuum of inductive methods (University of Chicago Press, Chicago)
CARNAP, R., 1963, The philosopher replies: V. Probability and induction, in: The Philosophy of Rudolf Carnap (The Library of Living Philosophers), ed. P. A. Schilpp (Open Court, La Salle, Illinois) pp. 966-995
CARNAP, R. and W. STEGMÜLLER, 1959, Induktive Logik und Wahrscheinlichkeit (Springer-Verlag, Wien)
GOOD, I. J., 1950, Probability and the weighing of evidence (Hafner, New York)
GOOD, I. J., 1965, The estimation of probabilities: An essay on modern Bayesian methods, Research Monograph no. 30 (The MIT Press, Cambridge, Mass.)
HINTIKKA, J., 1965a, Distributive normal forms in first-order logic, in: Formal Systems and Recursive Functions, Proc. Eighth Logic Colloquium, Oxford, 1963, eds. J. N. Crossley and M. A. E. Dummett (North-Holland Publ. Co., Amsterdam) pp. 47-90
HINTIKKA, J., 1965b, Towards a theory of inductive generalization, in: Proc. 1964 Intern. Congress for Logic, Methodology, and Philosophy of Science, ed. Y. Bar-Hillel (North-Holland Publ. Co., Amsterdam) pp. 274-288
HINTIKKA, J., 1965c, On a combined system of inductive logic, in: Studia Logico-Mathematica
et Philosophica in Honorem Rolf Nevanlinna, Acta Philosophica Fennica, vol. 18, pp. 21-30
JOHNSON, W. E., 1932, Appendix (edited by R. B. Braithwaite) to Probability: deductive and inductive problems, Mind N.S., vol. 41, pp. 421-423
KEMENY, J. G., 1963, Carnap's theory of probability and induction, in: The Philosophy of Rudolf Carnap (The Library of Living Philosophers), ed. P. A. Schilpp (Open Court, La Salle, Illinois) pp. 711-738
POPPER, K. R., 1959, The logic of scientific discovery (Hutchinson, London)
ON INDUCTIVE GENERALIZATION IN MONADIC FIRST-ORDER LOGIC WITH IDENTITY
RISTO HILPINEN
University of Jyväskylä, Finland
1. Jaakko Hintikka has in a recent article [1965b] sketched a new system of inductive logic. When compared with the system of inductive logic constructed by Carnap [1950] the main advantage of Hintikka's system is that it gives fairly natural degrees of confirmation to inductive generalizations, whereas Carnap's confirmation function c* enables one to deal satisfactorily with singular inductive inference only. According to Carnap's system, general sentences which are not logically true receive non-negligible degrees of confirmation only if our evidence contains a large part of the individuals in the whole universe. In particular, in infinite domains of individuals a system of inductive logic based on c* gives all general sentences which are not logically true a zero probability independently of the amount of evidence. Hintikka's system is in principle applicable to all first-order languages. However, in the paper mentioned above degrees of confirmation of generalizations are actually calculated only in the case in which all our predicates are monadic. In the present paper we shall inquire what kinds of results one obtains by means of the system in question when the relation of identity is used in addition to monadic predicates.

2. Let us assume that our language contains k primitive monadic predicates P_i (i = 1, 2, …, k). By means of these predicates and propositional connectives it is possible to define K = 2^k different kinds of individuals. These kinds of individuals are specified by certain complex predicates Ct_j (j = 1, 2, …, K) which we shall call attributive constituents. In the present case in which we have in our language only monadic predicates, attributive constituents are simply Carnap's Q-predicates in a new guise¹. By specifying of each attributive constituent Ct_j whether it is instantiated or not it is possible to define constituents. Constituents describe all the different kinds of "possible worlds"
¹ For Q-predicates, see Carnap [1950] pp. 124-126.
that can be specified by means of our monadic predicates, quantifiers, and propositional connectives². If we have in our language, in addition to propositional connectives, quantifiers, and k monadic predicates, also the sign of identity, it is possible to specify of each attributive constituent Ct_j not only whether it is instantiated or not, but also how many individuals exemplify Ct_j in the universe of which we are speaking. If the maximal number of layers of quantifiers in the sentences of our language is q, it is possible to specify of each attributive constituent whether it is exemplified by 0, 1, …, q − 1 or ≥ q individuals. When at most q layers of quantifiers are used, the number of constituents in our language is (q + 1)^K. Constituents correspond to different partitions W = (W_q, W_{q−1}, …, W₁, W₀) of Ct-predicates, where each subclass W_j, j = 0, 1, …, q − 1, is the class of those attributive constituents that are exemplified by exactly j different individuals and where the class W_q is the class of those attributive constituents that are exemplified by at least q individuals in our universe. The maximal number of layers of quantifiers in a sentence is also called its depth. We shall first consider a simple case in which q = 2. In this case it is possible to define 3^K different constituents. These constituents correspond to partitions W = (W₂, W₁, W₀) of attributive constituents, where (1) W₂ is the class of those attributive constituents that are exemplified by at least 2 individuals, (2) W₁ is the class of those attributive constituents that are exemplified by exactly one individual each, and (3) W₀ is the class of such attributive constituents as are not instantiated at all. Let the numbers of Ct-predicates in the subclasses W₂, W₁ and W₀ be respectively w₂, w₁ and w₀. Because W is a partition of K Ct-predicates,

w₂ + w₁ + w₀ = K.
(4)
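For concreteness, the counts just mentioned are easy to tabulate by machine. The following small sketch (an illustration, not part of the original text) simply enumerates the (q+1)^K constituents of depth q = 2 over K = 2^k attributive constituents, for k = 2:

```python
from itertools import product

k, q = 2, 2
K = 2 ** k          # number of attributive constituents (Q-predicates)

# Each attributive constituent can be placed in one of the q + 1 classes
# W_q, W_{q-1}, ..., W_0, so there are (q + 1)**K constituents in all.
constituents = list(product(range(q + 1), repeat=K))
```

For k = 2 and q = 2 this yields K = 4 and 3^4 = 81 constituents.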
For simplicity, the constituent corresponding to the partition W will also be called W. If the attributive constituents are numbered in a suitable way, the constituent W can be written as follows:

W = (Ex)(Ct_{i_1}(x) & (Ey)(y ≠ x & Ct_{i_1}(y))) & ... &
    (Ex)(Ct_{i_{w_2}}(x) & (Ey)(y ≠ x & Ct_{i_{w_2}}(y))) &
    (Ex)(Ct_{i_{w_2+1}}(x) & (y)(Ct_{i_{w_2+1}}(y) ⊃ y = x)) & ... &
    (Ex)(Ct_{i_{w_2+w_1}}(x) & (y)(Ct_{i_{w_2+w_1}}(y) ⊃ y = x)) &
    (x)(Ct_{i_1}(x) v ... v Ct_{i_{w_2+w_1}}(x)).    (5)

² Constituents and attributive constituents have been characterized in greater detail in a number of papers by Hintikka. See e.g. Hintikka [1965a].
Let us assume that the whole domain of individuals of which we are speaking contains N individuals, and that we have observed n individuals sampled from the universe in question. These n observed individuals constitute our evidence. Our evidence can be represented as a partition of Ct-predicates, viz. as the partition C = (C_2, C_1, C_0), where (6) C_2 is the class of those attributive constituents that are exemplified in the evidence by at least 2 different individuals, (7) C_1 is the class of such Ct-predicates as are exemplified by exactly one individual in the evidence, and (8) C_0 is the class of those attributive constituents that are not instantiated in our evidence. In the sequel we shall use the expression "C" to refer both to the evidence and to the partition of Ct-predicates corresponding to it. Suppose again that the numbers of attributive constituents in the classes C_2, C_1 and C_0 are respectively c_2, c_1 and c_0. Because C is a partition of K Ct-predicates, (9) c_2 + c_1 + c_0 = K. An arbitrary constituent compatible with C will be called U. It corresponds to a partition U = (U_2, U_1, U_0) of Ct-predicates, where the subclasses U_2, U_1 and U_0 are defined in the same way as the subclasses W_2, W_1 and W_0 of the partition W. A constituent U is compatible with C if and only if the class U_2 contains all Ct-predicates in the class C_2 and possibly other attributive constituents as well, and U_1 contains all Ct-predicates that belong to the class C_1 but not to U_2, and possibly other Ct-predicates. In other words, the partitions U have to satisfy the following requirements in order to be compatible with C:

C_2 ∩ U_1 = ∅    (10)
C_2 ∩ U_0 = ∅    (11)
C_1 ∩ U_0 = ∅,    (12)
where "∅" refers to the empty class.

Which constituent W has the highest degree of confirmation with respect to C? This question may also be formulated as follows: How are the partitions W and C related to each other when the degree of confirmation of W with respect to C assumes its greatest value? To answer this question we shall consider the cross-partition D = C ∩ W; D = (D_{2,2}, D_{1,2}, D_{0,2}, D_{2,1}, D_{1,1}, D_{0,1}, D_{2,0}, D_{1,0}, D_{0,0}), where

C_2 ∩ W_2 = D_{2,2}    (13)
C_1 ∩ W_2 = D_{1,2}    (14)
C_0 ∩ W_2 = D_{0,2}, etc.    (15)
C_0 ∩ W_0 = D_{0,0}.    (16)
For instance, in the evidence C each of the Ct-predicates in D_{1,2} is exemplified by only one individual, but according to W there nevertheless are in the whole domain of individuals other individuals that exemplify the attributive constituents in question. It is obvious that the degree of confirmation of W with respect to C is different from zero only if W is compatible with C. We are here not interested in constituents incompatible with our evidence. W is compatible with C if and only if it is one of the constituents U, i.e. if

D_{2,1} = ∅    (17)
D_{2,0} = ∅    (18)
D_{1,0} = ∅.    (19)
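The compatibility requirements (10)-(12), or equivalently (17)-(19), amount to demanding that no Ct-predicate be better exemplified in the evidence than the constituent W allows. A small sketch (an illustration, not part of the original text; the class of each Ct-predicate is coded as 0, 1 or 2):

```python
def compatible(C, W):
    """C[ct] and W[ct] give the subclass index (0, 1 or 2) of the
    Ct-predicate ct in the evidence partition C and in the constituent W.
    W is compatible with C iff C_2 is contained in W_2 and C_1 in
    W_1 union W_2, i.e. iff W[ct] >= C[ct] for every ct."""
    return all(W[ct] >= i for ct, i in C.items())

C = {'Ct1': 2, 'Ct2': 1, 'Ct3': 0}
W_good = {'Ct1': 2, 'Ct2': 2, 'Ct3': 0}   # moves the singular Ct2 into W_2
W_bad = {'Ct1': 1, 'Ct2': 1, 'Ct3': 2}    # D_{2,1} would be non-empty
```

The names `compatible`, `W_good` and `W_bad` are of course mere labels for this illustration.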
Suppose now that the numbers of Ct-predicates in the classes D_{2,2}, ..., D_{0,0} are respectively d_{2,2}, ..., d_{0,0}. Because of (17)-(19), and because D is a partition of K Ct-predicates,

d_{2,2} + d_{1,2} + d_{1,1} + d_{0,2} + d_{0,1} + d_{0,0} = K.    (20)

(13)-(19) obviously imply

w_2 = d_{2,2} + d_{1,2} + d_{0,2}    (21)
w_1 = d_{1,1} + d_{0,1}    (22)
w_0 = d_{0,0}    (23)
c_2 = d_{2,2}    (24)
c_1 = d_{1,1} + d_{1,2}    (25)
c_0 = d_{0,0} + d_{0,1} + d_{0,2}.    (26)
Given the evidence C, the respective numbers of unobserved individuals that exemplify each Ct-predicate in each subclass of D are, according to the constituent W, subject to the following restrictions:

D_{2,2}: no restrictions    (27)
D_{1,2}: at least 1    (28)
D_{1,1}: 0    (29)
D_{0,2}: at least 2    (30)
D_{0,1}: 1    (31)
D_{0,0}: 0.    (32)
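The restrictions (27)-(32) can be stated compactly: a Ct-predicate in D_{i,j} must be exemplified by exactly j-i unobserved individuals if j < q, and by at least q-i of them if j = q. A sketch (an illustration, not part of the original text):

```python
q = 2

def admissible(i, j, extra):
    """May `extra` unobserved individuals exemplify a Ct-predicate of the
    subclass D_{i,j}?  Encodes (27)-(32), and in general form (68)-(70)."""
    if j == q:
        return extra >= q - i     # (27), (28), (30): at-least classes
    return extra == j - i         # (29), (31), (32): exact classes
```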
The restrictions (28)-(32) are obviously implied by the definitions of the classes D_{1,2}, ..., D_{0,0}.

3. In Hintikka's system of inductive logic, a priori probabilities are first distributed among constituents. The probability of each constituent is then divided evenly among the state descriptions that make the constituent in question true. In the simplest case, we may assume that all constituents have received an equal a priori probability. In the sequel we shall mainly restrict our considerations to this simple case. Because the number of different constituents in our language with q = 2 is 3^K, each constituent has in this case 1/3^K for its a priori probability. A posteriori probabilities, or degrees of confirmation, are computed for the constituents according to Bayes' formula

P(W|C) = P(W)P(C|W) / Σ_U P(U)P(C|U),    (33)

where the sum in the denominator is taken over all constituents U compatible with C. Because we assumed that all constituents have an equal a priori probability, (33) reduces to

P(W|C) = P(C|W) / Σ_U P(C|U).    (34)
If the number of those state descriptions which make the constituent W true, given the evidence C, is expressed by m(W), and the corresponding number in the absence of any evidence is expressed by M(W), and if the corresponding numbers for a constituent U are m(U) and M(U), we have

P(C|W) = m(W)/M(W)    (35)

and

P(C|U) = m(U)/M(U).    (36)

According to (34), (35) and (36), the degree of confirmation of W with respect to C is thus

P(W|C) = [m(W)/M(W)] / Σ_U [m(U)/M(U)].    (37)
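Formulas (33)-(37) can be verified directly in a miniature universe by enumerating state descriptions. The following sketch (an illustration, not part of the original text; the parameter values k = 1, q = 2, N = 4, n = 2 and the particular evidence are assumptions of the example) computes m(W), M(W) and P(W|C) by brute force:

```python
from itertools import product
from fractions import Fraction

k, q, N, n = 1, 2, 4, 2           # one predicate, depth 2, tiny universe
K = 2 ** k                        # number of attributive constituents
cells = range(K)

def constituent(state):
    """Partition W realized by a state description: for each Ct-predicate,
    the number of individuals exemplifying it, capped at q."""
    return tuple(min(state.count(c), q) for c in cells)

states = list(product(cells, repeat=N))   # all K**N state descriptions
evidence = (0, 0)                         # both observed individuals in Ct_1

M, m = {}, {}
for s in states:
    w = constituent(s)
    M[w] = M.get(w, 0) + 1                # state descriptions making W true
    if s[:n] == evidence:
        m[w] = m.get(w, 0) + 1            # ... and extending the evidence

# formula (37), i.e. (33) with equal a priori probabilities:
denom = sum(Fraction(m[w], M[w]) for w in m)
P = {w: Fraction(m[w], M[w]) / denom for w in m}
```

In this toy case three constituents are compatible with the evidence, and their posterior probabilities sum to one, as (37) requires.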
According to the elementary principles of combinatorial analysis, we obtain

m(W) = [(N-n)! / (N-n-d_{0,1})!] · [w_2^{N-n-d_{0,1}} + α_1]    (38)

and

M(W) = [N! / (N-w_1)!] · [w_2^{N-w_1} + α_2].    (39)

α_1 and α_2 in formulas (38) and (39) represent correction terms that are necessary because of the restrictions (28) and (30) and the definition of the class W_2. The exact values of these correction terms are
α_1 = Σ_{i=1}^{d_{1,2}+d_{0,2}} (-1)^i Σ_{k=0}^{i} C(d_{0,2}, k) C(d_{1,2}+d_{0,2}-k, i-k) · [(N-n-d_{0,1})! / (N-n-d_{0,1}-k)!] · (w_2-i)^{N-n-d_{0,1}-k}    (40)

and

α_2 = Σ_{j=1}^{w_2} (-1)^j Σ_{k=0}^{j} C(w_2, k) C(w_2-k, j-k) · [(N-w_1)! / (N-w_1-k)!] · (w_2-j)^{N-w_1-k},    (41)

where C(·,·) denotes a binomial coefficient.
They are correction terms in the sense that they do not affect the limit of (35) when N → ∞, as one can check. The case in which we are speaking of an infinite universe is the easiest to deal with, and it also seems the most interesting from the point of view of inductive logic. Therefore we shall assume in the sequel that we are considering an infinite (or very large) universe. When N grows without limit,
m(W)/M(W) approaches the value

m(W)/M(W) = (1/N^{d_{1,1}}) (1/w_2)^{n-d_{1,1}}.    (42)
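The limit (42) can be checked numerically by counting state descriptions in a small case. In the sketch below (an illustration, not part of the original text) we take k = 1, so K = 2, with evidence in which both observed individuals exemplify Ct_1 (c_2 = 1, c_0 = 1), and the constituent W saying that both Ct-predicates are exemplified at least twice, so that d_{1,1} = 0 and w_2 = 2; (42) then predicts the ratio (1/w_2)^n = 1/4:

```python
from itertools import product

q, N, n = 2, 12, 2
target = (2, 2)                       # W: both Ct-predicates in W_2

def levels(s):
    return tuple(min(s.count(c), q) for c in (0, 1))

M = m = 0
for s in product((0, 1), repeat=N):   # all 2**N state descriptions
    if levels(s) == target:
        M += 1                        # state descriptions making W true
        if s[:n] == (0, 0):           # ... extending the evidence as well
            m += 1

ratio = m / M                         # close to 1/4 already at N = 12
```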
The terms in the sum in the denominator of (37) correspond to all the constituents U compatible with the evidence C. For each constituent U we may again define the cross-partition E = C ∩ U, where the subclasses E_{2,2}, ..., E_{0,0} are defined in the same way as the subclasses D_{2,2}, ..., D_{0,0} in the case of the cross-partition D. If the numbers of attributive constituents in the classes E_{2,2}, ..., E_{0,0} are respectively e_{2,2}, ..., e_{0,0}, we obtain in analogy to (21)-(26)

u_2 = e_{2,2} + e_{1,2} + e_{0,2}    (43)
u_1 = e_{1,1} + e_{0,1}    (44)
u_0 = e_{0,0}    (45)
c_2 = e_{2,2}    (46)
c_1 = e_{1,1} + e_{1,2}    (47)
c_0 = e_{0,0} + e_{0,1} + e_{0,2}.    (48)
The unobserved individuals have to satisfy conditions corresponding to (27)-(32) also for all cross-partitions E. Accordingly, we obtain corresponding to (42)

m(U)/M(U) = (1/N^{e_{1,1}}) (1/u_2)^{n-e_{1,1}}.    (49)
(37), (42) and (49) together entail

P(W|C) = [(1/N^{d_{1,1}}) (1/w_2)^{n-d_{1,1}}] / Σ_U [(1/N^{e_{1,1}}) (1/u_2)^{n-e_{1,1}}],    (50)

i.e.

P(W|C) = 1 / Σ_U [(N^{d_{1,1}}/N^{e_{1,1}}) · (w_2^{n-d_{1,1}}/u_2^{n-e_{1,1}})].    (51)
Because the sum in the denominator of (51) contains terms corresponding to all constituents U compatible with C, there occurs also a term corresponding to such a constituent U in which e_{1,1} = 0. When N grows without limit, P(W|C) approaches therefore a value different from zero only if

d_{1,1} = 0.    (52)
When (52) holds, all terms with e_{1,1} > 0 in the denominator of (51) approach zero when N grows without limit. (51) thus becomes

P(W|C) = 1 / Σ_U (w_2/u_2)^n.    (53)

According to (21), (24), (25), (43), (46), (47), and because e_{1,1} = 0 in all the terms of the sum in the denominator of (53), (53) can be written as follows:

P(W|C) = 1 / Σ_U [(c_1 + c_2 + d_{0,2})/(c_1 + c_2 + e_{0,2})]^n.    (54)
P(W|C) assumes its largest value when

d_{0,2} = 0.    (55)

(54) thus becomes

P(W|C) = 1 / Σ_U [(c_1 + c_2)/(c_1 + c_2 + e_{0,2})]^n.    (56)
The terms in the sum in the denominator of (56) correspond to all the constituents U with e_{1,1} = 0, e_{0,2} ≥ 0 and e_{0,1} ≥ 0. (56) may thus be expressed as follows:

P(W|C) = 1 / [2^{c_0} + Σ_{i=1}^{c_0} C(c_0, i) 2^{c_0-i} ((c_1 + c_2)/(c_1 + c_2 + i))^n].    (57)
When n grows without limit, (57) approaches

P(W|C) = 1/2^{c_0}.    (58)
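Formula (57) and its limit (58) are easy to evaluate numerically. The sketch below (an illustration, not part of the original text; the sample values of c_1, c_2, c_0 are arbitrary) does so:

```python
from math import comb

def posterior(c1, c2, c0, n):
    """Formula (57): the degree of confirmation of the best constituent W
    (the one with d_{1,1} = 0 and d_{0,2} = 0), given evidence with
    class counts c1, c2, c0 and sample size n."""
    s = 2 ** c0
    for i in range(1, c0 + 1):
        s += comb(c0, i) * 2 ** (c0 - i) * ((c1 + c2) / (c1 + c2 + i)) ** n
    return 1 / s
```

As n grows, the correction terms vanish and the value approaches 1/2^{c_0}, in accordance with (58).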
The principal results of our considerations so far are seen from formulas (52), (55) and (58). According to (52), C gives the highest degree of confirmation to a constituent W in which the class D_{1,1} is empty, i.e. in which all the attributive constituents in C_1 belong to W_2. According to this constituent, we should assume that although there are in our evidence Ct-predicates exemplified by one individual only, there are in our universe also other individuals that exemplify each of these attributive constituents. A case in which a certain attributive constituent is exemplified by exactly one individual, or by a certain fixed finite number of individuals, could be called a singularity. According to (52), it is not reasonable to assume that the singularities in our evidence are real singularities in the whole universe. Formula (55) says that it is not advisable to expect on the basis of the evidence C that there exist in our universe an arbitrary number ≥ q of
individuals that exemplify Ct-predicates not instantiated in our evidence. This result is similar to that obtained in Hintikka's system of inductive logic (Hintikka [1965b], pp. 284-285) when the sign of identity is not used. However, the situation is different with respect to possible singularities. According to (57) and (58), the degree of confirmation of W is independent of the value of d_{0,1}. The degree of confirmation of a constituent which says that some of the Ct-predicates not instantiated in C are exemplified by exactly one unobserved individual is equal to the probability of a constituent which denies this. No amount of evidence can therefore distinguish all constituents of our language as far as their a posteriori probabilities are concerned. Given any value of n, no matter how great, there always remain 2^{c_0} constituents which have an equal degree of confirmation with respect to C. The probabilities of all these constituents approach the value P(W|C) = 1/2^{c_0} when n grows without limit. The constituents in question are those with w_2 = c_2 + c_1 and 0 ≤ d_{0,1} ≤ c_0. We shall discuss these results in greater detail later.

4. In the preceding section we have considered the probabilities of such constituents as can be defined by means of at most two layers of quantifiers, i.e. constituents for which q = 2. Our results can be easily extended to the general case in which the maximal number of layers of quantifiers is any finite number q. In this case our constituents W correspond to partitions W = (W_q, W_{q-1}, ..., W_1, W_0) of attributive constituents and our evidence C can be described as a partition C = (C_q, C_{q-1}, ..., C_1, C_0). Each subclass W_j, j = 0, 1, ..., q-1, of the partition W is the class of such attributive constituents as are exemplified by exactly j individuals in our whole domain of individuals, and the class W_q is the class of such Ct-predicates as are exemplified by at least q individuals. Correspondingly, each subclass C_j, j = 1, 2, ..., q-1, is the class of those Ct-predicates which are exemplified by exactly j observed individuals, and the class C_q is the class of such attributive constituents as are exemplified by at least q observed individuals. The numbers of Ct-predicates in the subclasses W_q, ..., W_0 and C_q, ..., C_0 are respectively w_q, ..., w_0 and c_q, ..., c_0. We define again the cross-partition D = C ∩ W, where each subclass D_{i,j} (i = 0, 1, ..., q; j = 0, 1, ..., q) is defined by

D_{i,j} = C_i ∩ W_j,    (59)

and where the number of Ct-predicates in each subclass D_{i,j} is d_{i,j}.
An arbitrary constituent compatible with C is again called U, and it corresponds to a partition U = (U_q, U_{q-1}, ..., U_1, U_0) with respective numbers u_q, u_{q-1}, ..., u_1, u_0 of Ct-predicates in the subclasses. We define the cross-partition E = C ∩ U in the same way as we defined D above. To be compatible with C the partitions U have to satisfy the condition

All subclasses E_{i,j} = C_i ∩ U_j with j < i are empty.    (60)

(60) is a generalization of the requirements (10)-(12). Now P(W|C) > 0 only if W is one of the constituents U, i.e.

All subclasses D_{i,j} with j < i are empty.    (61)
(59) and (61) together imply the generalizations of the formulas (21)-(26):

w_q = Σ_{i=0}^{q} d_{i,q}    (62.1)
w_{q-1} = Σ_{i=0}^{q-1} d_{i,q-1}    (62.2)
...
w_1 = Σ_{i=0}^{1} d_{i,1}    (62.q)
w_0 = d_{0,0}    (62.q+1)

and

c_q = d_{q,q}    (63.1)
c_{q-1} = Σ_{j=q-1}^{q} d_{q-1,j}    (63.2)
...
c_1 = Σ_{j=1}^{q} d_{1,j}    (63.q)
c_0 = Σ_{j=0}^{q} d_{0,j},    (63.q+1)

i.e. in short

w_j = Σ_{i=0}^{j} d_{i,j}    (64)

and

c_i = Σ_{j=i}^{q} d_{i,j}.    (65)
In the same way we obtain as generalizations of (43)-(48)

u_j = Σ_{i=0}^{j} e_{i,j}    (66)

and

c_i = Σ_{j=i}^{q} e_{i,j}.    (67)
Given the evidence C, the remaining N-n individuals may exemplify Ct-predicates in the different subclasses of D according to the following restrictions:

(68) Every attributive constituent in each class D_{i,q} with i = 0, 1, ..., q-1, q is exemplified by at least q-i unobserved individuals.
(69) Every attributive constituent in each class D_{i,i} with i = 0, 1, ..., q-1 is exemplified by none of the unobserved individuals, and
(70) Every attributive constituent in each class D_{i,j} with i < j (i = 0, 1, ..., q-2; j = 1, 2, ..., q-1) is exemplified by exactly j-i unobserved individuals.

The conditions (27)-(32) are a special case of the general conditions (68)-(70). Corresponding restrictions hold of course also for the classes E_{i,j}. Now we are ready to calculate the a posteriori probabilities of constituents. We shall again consider only the case in which our universe is infinite or very large. In the latter case our formulas give approximate values only, because our calculations concern the case in which N grows without limit. Now we have instead of (38) and (39)
m(W) = (N-n)(N-n-1) ⋯ [N - n - (Σ_{i=0}^{q-2} d_{i,i+1} + 2 Σ_{i=0}^{q-3} d_{i,i+2} + ... + (q-1) d_{0,q-1}) + 1]
    × [1 / (2!^{Σ_{i=0}^{q-3} d_{i,i+2}} · 3!^{Σ_{i=0}^{q-4} d_{i,i+3}} ⋯ (q-1)!^{d_{0,q-1}})]
    × [w_q^{N-n-(Σ_{i=0}^{q-2} d_{i,i+1} + 2 Σ_{i=0}^{q-3} d_{i,i+2} + ... + (q-1) d_{0,q-1})} + α_1]    (71)

and

M(W) = N(N-1) ⋯ [N - (Σ_{i=0}^{1} d_{i,1} + 2 Σ_{i=0}^{2} d_{i,2} + ... + (q-1) Σ_{i=0}^{q-1} d_{i,q-1}) + 1]
    × [1 / (2!^{Σ_{i=0}^{2} d_{i,2}} · 3!^{Σ_{i=0}^{3} d_{i,3}} ⋯ (q-1)!^{Σ_{i=0}^{q-1} d_{i,q-1}})]
    × [w_q^{N-(Σ_{i=0}^{1} d_{i,1} + 2 Σ_{i=0}^{2} d_{i,2} + ... + (q-1) Σ_{i=0}^{q-1} d_{i,q-1})} + α_2],    (72)
where α_1 and α_2 again represent correction terms which do not affect the limit of m(W)/M(W) when N → ∞. When N grows without limit, we obtain
according to (71) and (72)

m(W)/M(W) = (1/N)^{Σ_{i=1}^{q-1} d_{1,i} + 2 Σ_{i=2}^{q-1} d_{2,i} + ... + (q-1) d_{q-1,q-1}} · (1/w_q)^{n - (Σ_{i=1}^{q-1} d_{1,i} + 2 Σ_{i=2}^{q-1} d_{2,i} + ... + (q-1) d_{q-1,q-1})}
    × 2!^{Σ_{i=0}^{2} d_{i,2} - Σ_{i=0}^{q-3} d_{i,i+2}} · 3!^{Σ_{i=0}^{3} d_{i,3} - Σ_{i=0}^{q-4} d_{i,i+3}} ⋯ (q-1)!^{Σ_{i=0}^{q-1} d_{i,q-1} - d_{0,q-1}},    (73)
which is a generalization of (42). The corresponding generalization for (49) is obtained from (73) by substituting e for d and u for w everywhere in the formula. According to (37), (73), and the corresponding formula for m(U)/M(U), the degree of confirmation of W with respect to C is

P(W|C) = [(1/N)^{A(d)} (1/w_q)^{n-A(d)} Π_{j=2}^{q-1} j!^{B_j(d)}] / Σ_U [(1/N)^{A(e)} (1/u_q)^{n-A(e)} Π_{j=2}^{q-1} j!^{B_j(e)}],    (74)

where, for brevity, A(d) = Σ_{i=1}^{q-1} d_{1,i} + 2 Σ_{i=2}^{q-1} d_{2,i} + ... + (q-1) d_{q-1,q-1} and B_j(d) = Σ_{i=0}^{j} d_{i,j} - Σ_{i=0}^{q-1-j} d_{i,i+j} are the exponents occurring in (73), and A(e) and B_j(e) are the corresponding sums for the cross-partitions E. (74) can be written also as follows:

P(W|C) = 1 / Σ_U [N^{A(d)-A(e)} · (w_q^{n-A(d)} / u_q^{n-A(e)}) · Π_{j=2}^{q-1} j!^{B_j(e)-B_j(d)}].    (75)
The terms in the sum in (75) again correspond to all constituents U compatible with C. Therefore there is obviously also a term corresponding to a partition U in which

Σ_{i=1}^{q-1} e_{1,i} + 2 Σ_{i=2}^{q-1} e_{2,i} + ... + (q-1) e_{q-1,q-1} = 0.    (76)
Now P(W|C) approaches a value different from zero when N grows without limit only if

Σ_{i=1}^{q-1} d_{1,i} + 2 Σ_{i=2}^{q-1} d_{2,i} + ... + (q-1) d_{q-1,q-1} = 0,    (77)

i.e., because the numbers d_{i,j} are always nonnegative integers,

d_{1,1} = 0; d_{1,2} = 0; ...; d_{1,q-1} = 0,    (78.1)
d_{2,2} = 0; d_{2,3} = 0; ...; d_{2,q-1} = 0,    (78.2)
...
d_{q-1,q-1} = 0.    (78.q-1)

(78.1)-(78.q-1) can be expressed in a concise form by

d_{i,j} = 0 for i = 1, 2, ..., q-1 and j = 1, 2, ..., q-1, where i ≤ j.    (79)
(79) is obviously the generalization of the condition (52). (52) is obtained from (79) by substituting the number 2 for q. Because of (79), (75) reduces to

P(W|C) = 1 / Σ_U [(c_1 + c_2 + ... + c_q + d_{0,q})/(c_1 + c_2 + ... + c_q + e_{0,q})]^n.    (80)

(80) assumes its greatest value when

d_{0,q} = 0.    (81)

(55) is obviously a special case of (81). Because of (81), we obtain

P(W|C) = 1 / Σ_U [(c_1 + c_2 + ... + c_q)/(c_1 + c_2 + ... + c_q + e_{0,q})]^n.    (82)
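As in the q = 2 case, the denominator of (82) can be generated by letting each Ct-predicate in C_0 belong to any one of the subclasses U_0, ..., U_q of a compatible constituent satisfying the analogue of (79). The sketch below is an illustration, not part of the original text, and the enumeration of the compatible U is my reading of (82):

```python
from itertools import product

def posterior_general(q, c, n):
    """Formula (82) evaluated by direct enumeration; c = [c_0, c_1, ..., c_q].
    Each of the c_0 uninstantiated Ct-predicates may go to any of
    U_0, ..., U_q, and only e_{0,q} (the number sent to U_q) affects
    the corresponding term of the sum."""
    s = sum(c[1:])                     # c_1 + c_2 + ... + c_q
    denom = 0.0
    for assign in product(range(q + 1), repeat=c[0]):
        e0q = sum(1 for a in assign if a == q)
        denom += (s / (s + e0q)) ** n
    return 1 / denom
```

For q = 2 this agrees with the closed form (57), and for large n it approaches 1/q^{c_0}.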
It is easy to see from (82) that the value of P(W|C) is independent of all classes D_{0,j}, j < q. It is obvious that even if there were an attributive constituent exemplified by exactly one individual in our infinite universe, our chances to find the individual in question are minimal. This is clearly seen also from the formula (42): If we assume that the singularities found in our evidence represent real singularities of our universe, P(C|W) → 0 when N → ∞. The probability that we should have in our evidence such a Ct-predicate instantiated as is exemplified by only one individual in the whole universe approaches zero when the total number of individuals in our universe grows without limit. What we just said holds of course also for any finite number of individuals, as is seen from (79). On the other hand, it is of course possible that our finite evidence contains singularities although in the universe there are no such singularities. Our second main result (ii) seems somewhat problematic. According to (ii), all constituents that say that in our universe there are unobserved singularities are equally probable. In addition, they are as probable as such a constituent as denies the existence of singularities. It is clear that in the case discussed above a reasonable man would choose such a generalization as says that the singularities in question do not exist. This generalization would be the simplest one, and in choosing it one would not postulate the existence of any unobserved kinds of individuals. Even if it were not true, in the sense that in our universe there in fact are singularities denied by the generalization in question, the risk that our future experiences would contradict it is minimal. However, such a choice cannot be defended in terms of our inductive logic. What is the reason for this discrepancy between our intuitions and our formal results? One could say that the generalizations we have considered are somehow unnatural.
It seems rather strange to specify in one's generalizations numbers of individuals up to a given point, as we have done. In fact, these numerical assumptions are absolutely unverifiable when an infinite universe is concerned, and, as we have seen, they are not confirmable either. It is not possible to explain anything by postulating such singularities. Moreover, even if it were possible to express in our language the existence of singularities, our result (ii) suggests that we perhaps should not distribute a priori probabilities among the constituents according to the method used above. According to this method, each constituent with a fixed depth q received an equal a priori probability. Some of these constituents differed from each other only because of unconfirmable numerical assumptions, and the equality of their a priori probabilities was reflected again in the equality of their a posteriori probabilities. It would perhaps be advisable to give a relatively high a priori probability to such constituents as deny the existence of singularities, and a lower a priori probability to other constituents. In Section 8 we shall inquire whether this can be done in some simple and natural way. To some extent, both the results (i) and (ii) may be taken to reflect the limitations of the language systems we are here studying, rather than the limitations of the basic ideas of our inductive logic. It does not seem reasonable to expect that the exact numbers of the different kinds of individuals we can distinguish from each other can in any case be accounted for in terms of purely qualitative concepts, i.e. monadic predicates. Our results may perhaps be taken to justify this pessimism.

6. If N is a finite number, i.e. if we are speaking of a finite domain of individuals, we obtain results different from those obtained in the previous case. In particular, if n is not negligible in comparison to N, (i) and (ii) do not hold any more. Our formulas in the sequel are rough approximations and presuppose that both N and n are large in comparison with K. For the sake of simplicity, we shall restrict our remarks to the case in which q = 2.
Formulas (38) and (39) hold for the finite case as well as for the infinite one. If n is not negligible in comparison with N, and both are large in comparison with K, P(C|W) is, instead of (42), approximately

m(W)/M(W) ≈ ((N-n)/N)^{d_{0,1}} · (1/N^{d_{1,1}}) (1/w_2)^{n-d_{1,1}}.    (85)

Because the value of the denominator Σ_U P(C|U) of the formula for P(W|C) is independent of the choice of W, we shall in the sequel inquire which constituent W has the highest degree of confirmation with respect to C by considering formula (85) only. When (85) assumes its greatest value, P(W|C) assumes its greatest value, too.
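The effect on (85) of the choice of d_{1,1} can be explored numerically. The following sketch is an illustration, not part of the original text; it assumes d_{0,1} = 0 and d_{0,2} = 0 (the choices made in (86) and (89)), so that (85) reduces to the form given in (90):

```python
def p_c_given_w(N, n, c1, c2, d11):
    """P(C|W) as in (90), i.e. (85) with d_{0,1} = d_{0,2} = 0:
    N**(-d11) * (c2 + c1 - d11)**(d11 - n)."""
    return N ** (-d11) * (c2 + c1 - d11) ** (d11 - n)

def best_d11(N, n, c1, c2):
    """The value of d_{1,1} (0 <= d_{1,1} <= c_1) maximizing (90)."""
    return max(range(c1 + 1), key=lambda d11: p_c_given_w(N, n, c1, c2, d11))
```

For small n the maximum lies at d_{1,1} = 0 (the observed singularities are taken to be accidental), while for n above the threshold given by (95) it moves to d_{1,1} = c_1.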
It is easy to see from (85) that the value of P(C|W) with

d_{0,1} = 0    (86)

is larger than any of the values which it assumes when d_{0,1} > 0. When (86) holds, (85) reduces to

P(C|W) = (1/N^{d_{1,1}}) (1/w_2)^{n-d_{1,1}}.    (87)

Because of (21), (24) and (25), (87) can be written as follows:

P(C|W) = (1/N^{d_{1,1}}) [1/(c_2 + c_1 - d_{1,1} + d_{0,2})]^{n-d_{1,1}}.    (88)

Because we assumed that n > K and thus n > d_{1,1}, (88) assumes its greatest value when

d_{0,2} = 0.    (89)

(88) thus becomes

P(C|W) = (1/N^{d_{1,1}}) (c_2 + c_1 - d_{1,1})^{d_{1,1}-n}.    (90)

How should we now choose d_{1,1} so as to make (90) and therefore also P(W|C) as large as possible? The required choice of d_{1,1} depends on how the numbers n, N, c_1 and c_2 are related to each other. Let us consider two different possibilities. We may choose

d_{1,1} = 0    (91)

and obtain

P(C|W) = (c_2 + c_1)^{-n},    (92)

or alternatively

d_{1,1} = c_1,    (93)

whence

P(C|W) = (1/N^{c_1}) c_2^{c_1-n}.    (94)

By comparing formulas (92) and (94) one can see that if

n > log_{(c_2+c_1)/c_2} (N/c_2)^{c_1},    (95)

a constituent with (93) has a higher probability than one with (91). Conversely, if n

NOTES ON THE "PARADOXES OF CONFIRMATION"

MAX BLACK

(x)(Rx ⊃ Bx),    (1)
and a positive instance of it will accordingly have the form

Ra · Ba.    (2)

Now (1) is logically equivalent to its "contrapositive":

(x)(~Bx ⊃ ~Rx),    (3)

a positive instance of which has the form

~Bb · ~Rb,    (4)
and is also logically equivalent to what I shall call its "comprehensive",

(x)[(Rx v ~Rx) ⊃ (~Rx v Bx)],    (5)

a positive instance of which has the form

(Rc v ~Rc) · (~Rc v Bc).    (6)
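The logical equivalence of (1), (3) and (5), on which the argument turns, can be checked mechanically by a truth-table. The following sketch (an illustration, not part of the original text) does so:

```python
from itertools import product

def implies(p, q):
    return (not p) or q                         # material implication

for R, B in product((False, True), repeat=2):
    h1 = implies(R, B)                          # (1)  Rx ⊃ Bx
    h3 = implies(not B, not R)                  # (3)  ~Bx ⊃ ~Rx
    h5 = implies(R or not R, (not R) or B)      # (5)  (Rx v ~Rx) ⊃ (~Rx v Bx)
    assert h1 == h3 == h5                       # never diverge on any object
```

In particular, for a non-black non-raven (R and B both false) all three hypotheses come out true, which is why such an object counts as a confirming instance.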
On the assumption that logically equivalent hypotheses are "confirmed" (Hempel's term, which I shall discuss later) by the same instances, we are led to the following conclusions:

(iv) Any non-black non-raven confirms the raven hypothesis (cf. formula (4) above).
(v) Any non-raven also confirms the same hypothesis (because any object, c, for which ~Rc is the case, will satisfy formula (6) above).
(vi) Any black thing will also confirm the hypothesis (for Bc will logically imply (6) above).

Thus it would seem that the raven hypothesis is, or would be, "confirmed", for instance, by the existence of a white handkerchief (cf. (iv)), by the existence of a stone (cf. (v)) and by the existence of a black pearl (cf. (vi)). These conclusions are certainly startling. The foregoing arguments purport to show that any object, o, without exception, is relevant to the raven hypothesis, in the sense of either confirming it or falsifying it. For if

(x)(Rx ⊃ Bx) is compatible with (x)(Rx ⊃ ~Bx). But cf. the discussion below of "ordinary" uses of conditionals.

²⁰ This point has been well made by Nelson Goodman and, following him, by Scheffler [1963].

²¹ One may recall the disputes between Spearman and those psychologists who rejected his definition of "intelligence" as inadequate. Spearman and his defenders used to say that by "g" they meant "g" (as technically defined, e.g. in connection with factor analysis). If anybody wanted to identify "g" with "intelligence" that was his affair!
We might (a) reject Hempel's original argument as unsound in some identified respect; or, we might (b) accept the soundness of his argument. If the latter course seems right, we shall need to explain why the argument's conclusions should seem "paradoxical". In so doing, we might rely (b1) upon the temptation to invoke "background knowledge"; or (b2) upon disparities between the technical notion of "material implication", as used in the paradox-generating arguments, and "ordinary" uses of "if-then"; or (b3) upon confusion between low (or negligible) confirmation and irrelevance; or, finally, (b4) upon some other significant differences between "confirmation" and other notions with which it might easily be confused. (The last four are obviously compatible strategies.) I shall now make brief comments upon these possible solutions, reserving (b2) and (b3) for more extended treatment later.

(a) Rejecting the paradox-engendering argument. Given the relative simplicity and perspicuity of Hempel's original exposition, there seem to be only three ways in which this might plausibly be done.

(a1) We might try to reject the assumption that logically equivalent propositions receive exactly the same confirmation from given data. This is a decidedly uninviting stratagem. If "logically equivalent propositions" are understood to be such as are, of logical necessity, rendered true or false by precisely the same states-of-affairs, the conclusion seems inescapable that they cannot be confirmed in different degrees by the same evidence. (This verdict remains correct if we substitute "empirical support" for "confirmation".)

(a2) We might try to reject the remaining assumption that an instance of a simple generalization always confirms it.²² This is somewhat more plausible (given what we know, at the back of our minds, about the lurking notion of "empirical support"). However, I find it hard to see any reasonable way in which this assumption could be denied, once a formal definition of confirmation had been adopted.

(a3) We might try to question the alleged logical equivalence between the original raven hypothesis and its "contrapositive" (formula (3) above) or between it and its "comprehensive" (formula (5) above). There can be no question, however, that the three propositions, as expressed, are indeed logically equivalent. (Any lingering doubts about the equivalence of these

²² That is to say, that a truth of the form Aa · Ba always confirms the corresponding hypothesis (x)(Ax ⊃ Bx).
expressions, or, rather, the ordinary-language expressions corresponding to them, might better be dismissed under one of the subheadings that immediately follow.) On the whole, then, it seems to me that Hempel's argument, taken as intended, must be regarded as perfectly sound. There is no prospect of finding an internal flaw in it: if we are startled by its conclusions, the fault must lie in some stubborn confusion or prejudice.

(b) Accepting the paradox-engendering argument. Hempel has made a strong case for the view that the common sense principle of limited relevance, which I mentioned earlier, arises only from a "misleading intuition": he claims that "the impression of a paradoxical situation is not objectively sound; it is a psychological illusion" (Hempel [1945] p. 18). We tend, he suggests, to confuse "practical" with "logical" considerations: for the very form of the raven hypothesis reveals a practical interest in the application of the hypothesis to ravens, and a relative lack of interest in its bearing upon non-ravens, non-black things, etc.²³ Yet, in this case, as in others like it, "... the hypothesis nevertheless asserts something about, and indeed imposes restrictions upon, all objects (within the logical type of the variable occurring in the hypothesis ...)" (Hempel [1945]). The raven hypothesis, we need to see, is indeed an assertion about²⁴ every physical object in the world, claiming of each such thing that it is either a non-raven or a black thing; and, for similar reasons, any generalization whatever is about each and every thing of the appropriate logical type in the universe. Once we grasp this point, it should no longer appear startling or paradoxical that every physical object, without exception, bears one way or the other upon the truth of the raven hypothesis.
On the whole, Hempel's analysis impresses me as attractively straightforward and persuasive, by contrast with some of the more elaborate explanations of the "psychological illusion" that some subsequent writers have proposed. If further explanation seems needed in order to account for our proneness to overlook the simple logical point that Hempel mainly relies upon (the logical equivalence of formulas (1), (3) and (5) above), we have a choice between a number of plausible options (see (b1)-(b4) above). Since enough has already been said about the possible influence of "background knowledge" (i.e. (b1)), I shall proceed at once to discuss the possible influence of the logical gap between material implication and ordinary uses of "if-then" (option (b2)).

²³ Is it always true, or even generally true, that the subject of an assertion "interests" us more than the predicate, and that what is explicitly mentioned interests us more than what is implicit? How would "interest", in the relevant sense, be defined or detected?

²⁴ There is surely a covert extension here of the common notion of about? Even philosophers don't talk "about" everything whenever they utter generalizations.

6. Some relevant peculiarities of material implication. (a) Ordinary uses of singular conditionals. I wish to recall some familiar features of ordinary uses of sentences of the form If A then C, or of similar sentences obtainable from such sentences by changes of mood. The "logic" of such ordinary singular conditionals is closely related to the "logic" of general statements of the form All A are B and may be expected to throw some useful light on the latter. We shall find it convenient to distinguish between indicative singular conditionals, such as "If the temperature falls, we shall have snow", subjunctive singular conditionals, such as "If you were to touch that plate you would get burned", and counterfactual singular conditionals, such as "If I had betted on Excelsior, I would have won". I can think of no other types that are relevant to the present discussion.

It has often been observed that when a speaker asserts an indicative singular conditional, he normally implies some connection between the antecedent and the consequent. Suppose I say, "If you interrupt Robinson now he will be angry". If you do proceed to interrupt Robinson and he does become angry, that will not necessarily show that my original assertion was true: for if Robinson became angry because somebody entered the room at the moment you interrupted him, we should have to say that the truth of my original assertion remained unsettled. Thus the force of my original remark was approximately the same as that of "If you interrupt Robinson he will become angry because you interrupt him".
In this more explicit form, the word "because" expresses the intended presence of some reason (often, though not always, of a causal sort) why the antecedent and consequent should have the same truth-value. The character of the imputed connection between antecedent and consequent varies from case to case: antecedent and consequent may be intended to be both true in virtue of some common cause, or the implied link may be supplied by somebody's promise, decision, and so on. The general formula seems to be that the truth of the antecedent A is such as to provide a reason, of some sort, for the truth of B. (Hence, somebody who in ordinary life says "If A then B" can always properly be asked, in the kind of case I here have
in mind, why the truth of A should make B also true.) When a singular sentence is used in this familiar way, I shall speak of the statement as a connected singular conditional. For the reasons I have explained, a connected singular conditional statement is a stronger statement than the corresponding material conditional, symbolized as "A ⊃ B". Although if-then sentences are normally used in the way I have described, there are, I believe, special and exceptional occasions when a speaker wishes to be understood as making only the weaker statement. When I say "If that penny comes down heads when tossed now, so will that other penny", I cannot mean that the truth of the antecedent will constitute a reason for the truth of the consequent: my intended meaning is simply that if A is made true, B will as a matter of fact, and not for any specifiable reason, also be true, just that and nothing more. In such a case one might speak of an unconnected or, perhaps, an accidental singular conditional. Of course, a connected singular conditional logically implies the corresponding accidental singular conditional, but not vice versa. I do not wish to argue here that "if-then" has different meanings or senses in the types of cases I have called "connected" and "accidental". If I had to choose, I would say that the same meaning was involved each time. (b) Truth-conditions and direct verification of accidental conditionals. It is obvious that an accidental singular conditional is directly verified by A·B and is directly falsified by A·~B. If you toss both pennies and both show heads, my original assertion was true (since I made no further claim about there being any connection between the two states of affairs); if you toss them and the first shows heads, but the second tails, my original assertion was false. But suppose you throw the first penny into the fire as soon as I have made the prediction, so that the antecedent A remains unfulfilled.
Then it seems that the truth-value of the conditional remains open; and the original assertion has received, and can henceforward receive, no direct test. (If this is correct, there is a sharp contrast with the truth conditions for a material implication of the form A ⊃ B.) We need not abrogate the law of excluded middle in such a case: even if you refuse to make the test, I can still sensibly maintain, "What I said was true: if you had tossed both pennies, the second would have come down heads if the first did". But in the absence of direct test, any further argument about the conditional's truth-value will have to rest upon indirect evidence. (If I had some hidden device that would allow me to produce heads at will, I might have a good reason for reaffirming the
truth of the original accidental conditional, in the absence of direct verification.) Let us now compare these results with the corresponding results for the contrapositive, If not-B then not-A. It is seen at once that while this proposition is directly verified by ~B·~A and directly falsified by ~B·A, it has no direct verification or falsification if B is true. Thus we see that whereas the original proposition and its contrapositive are both falsified by the same complex state of affairs, A·~B, the case A·B, which directly verifies the original proposition, leaves its contrapositive's truth-value open, while ~B·~A, which directly verifies the contrapositive, leaves the truth-value of the original proposition open. In the case of ~A·B, neither of the two propositions receives direct verification. If we write P for the original proposition and Q for its contrapositive, we shall obtain the following summary: P is verified by A·B, falsified by A·~B, left open by ~A·B and by ~A·~B.
Q is verified by ~A·~B, falsified by A·~B, left open by ~A·B and by A·B. Thus P and Q have different ranges of direct verification: if two men betted on P and Q respectively, then if one lost so would the other, but one might win while the other neither lost nor won. It seems, therefore, that in ordinary uses an accidental singular conditional and its contrapositive are not logically equivalent. This point can be clinched by showing that situations can arise in which one of the two propositions is directly verified while the other is actually false. Suppose P is "If you now press the switch, the light will go on" and Q is the corresponding contrapositive, "If the light does not now go on, you will not in fact have pressed the switch", where both are intended to be taken "accidentally". Then if you do not press the switch and the light does not go on (i.e. if ~A·~B is the case) Q will be directly verified. But this result is compatible with the falsity of P: we might know, for instance, that the lamp was broken and therefore be in a position to assert, in retrospect, "If you had pressed the switch the light would not have gone on", and hence to derive the falsity of P. Such results as these are so unlike the corresponding results for material conditionals that conclusions based, as in Hempel's arguments, upon theorems of the standard propositional calculus must be interpreted with great caution. Before leaving this topic, we may notice the following simple way of representing "accidental singular conditionals" in terms of the familiar symbolism of the propositional calculus. Using the technique of Carnap's "reduction sentences" we may write the following formulas:
A ⊃ (P ≡ B)
~B ⊃ (Q ≡ ~A)

which highlight the indeterminacy of direct verification previously noticed. These expressions can, in turn, be "solved" for P and Q, respectively yielding:

P ≡ A·B ∨ ~A·X
Q ≡ ~A·~B ∨ B·Y
where X and Y are to be taken as indeterminate parameters, propositions of unspecified truth-values 25. As I have already suggested, values of X or Y, respectively, can sometimes be supplied by indirect reasoning from similar cases or, what comes almost to the same thing, by indirect reasoning from relevant generalizations. (c) The verification of restricted accidental generalizations. Let us now apply the results already obtained to the verification of the general statement All the white balls in this urn are solid (P', say). I have chosen a statement that, to common sense at least, seems to be about a finite, although unknown, number of objects, viz., the white balls contained in the urn in question. The corresponding "restricted" contrapositive may be taken to be All the nonsolid balls in this urn are nonwhite (Q', say). Our generalization, P', may reasonably be construed as a finite conjunction of an indefinite number of accidental singular conditionals: it says of each ball in the urn that if it is white, then, as a matter of fact (and not for any special "connection" or reason), that ball is also solid. The asymmetry between the conditions for direct test of an accidental singular conditional and its contrapositive will obviously reappear in the present case. When P' is directly tested by examining each ball in the urn separately, a given ball may be found to agree with P' by being both white and solid (W·S) or it may disagree with it by being white and not solid (W·~S), whereupon the testing process will terminate; but if it should be nonwhite, the instance will be dismissed as irrelevant. If, however, the restricted contrapositive, Q', is

25 It is easy to see that these parameters must be functions of A and B. Suppose we have P1 ≡ A·B ∨ ~A·X1 and P2 ≡ A·~B ∨ ~A·X2. We want P1 and P2 to be contradictories, which will require ~A ⊃ ~(X1·X2) to be the case. Thus the "parameters" cannot be chosen altogether freely, if the ordinary conventions for "if-then" are to be respected. I shall not pursue this topic here.
being directly tested, different judgments will be in point (except in the case of falsification). Let a be the class of white-and-solid balls in the urn; b the class of white-and-nonsolid balls in the urn; c the class of nonwhite-and-solid balls in the urn; and, finally, d the class of nonwhite-and-nonsolid balls in the urn. Then P' is partially verified by each member of a, is falsified by each member of b, and is unaffected by each member of c and by each member of d; while the restricted contrapositive, Q', is unaffected by each member of a, is falsified by each member of b, is unaffected by each member of c, and is partially verified by each member of d. We have, once again, the pattern previously observed of different though overlapping verification conditions, but identical falsification conditions. There is, however, the following new point. In order to establish P' directly, we must eventually examine each ball in the urn, in order to record it as partially verifying the hypothesis, or as irrelevant to it. If P' is in fact true (and not void on account of the absence of any white balls), completion of the entire process of direct testing will also, thereby, supply all the data we need for direct verification of the contrapositive, Q'. It is easily seen that if we have found by direct test that P' is true, then Q', if not vacuous (through the absence of nonsolid balls in the urn), must likewise be true. We may therefore say, without doing any violence to our "intuitions", that any direct test of P' will be an indirect test of Q', and vice versa. This will sometimes make it plausible to examine cases of nonsolid balls, even if we are setting out to test the direct generalization, P'.
For example, if we knew that we could locate three balls that were known to be the only nonsolid ones in the urn, examination of each of them for their color would provide us with a rapid way of indirectly testing P', without the lengthy and tedious routine of successively examining each ball in the urn. This might be regarded as a scrutiny of all the possible negative instances, those that might be of the form ~S·W. On the other hand, it would never be appropriate to consider cases that are ~W·S, whether we were testing P' or Q'. If we knew that we could extract a certain subset of the balls, each known to be both nonwhite and solid (~W·S), we should at once discard them as being irrelevant to the testing of either P' or Q'. If we now apply these results to a modified form of Hempel's example, such as the hypothesis, All ravens in the New York Zoo are black (R, say), we shall want to say that each black raven in the Zoo directly confirms R, each nonblack raven in the Zoo falsifies it, each white or other nonblack object is irrelevant (so far as direct test goes) and each black thing in the Zoo
either confirms it or is irrelevant, depending on whether or not it is a raven. Furthermore, each thing outside the New York Zoo must count as irrelevant. But suppose Hempel, or someone who agrees with his approach, asks us to consider instead the original raven hypothesis, All ravens are black? If this is intended to be understood as an unrestricted accidental generalization, to the effect that each raven (past, present and future) is in fact black, it is doubtful whether the notion of "direct verification", with its implication of successive scrutiny of every physical object, without exception, continues to make sense 26. At any rate, the chance of such a tremendous cosmic "accident" occurring is so small, on general grounds, that there would be something odd about saying that observation of a single black raven should count as "partial" verification. Here, perhaps, "verification" assumes the negative sense of "absence of falsification", and with this understanding the sting is removed from the paradoxes. There is certainly nothing paradoxical about saying that both a white shoe and a black cat fail to falsify the raven hypothesis, although even here we might wish to make a distinction between an object, like the first, that might have falsified the hypothesis, and one, like the second, that could not have done so. Paradoxical suggestions would be conveyed by this manner of description only if we were led to suppose that an appropriate method of testing an unrestricted accidental generalization might reasonably consist of an unsystematic and exhaustive scrutiny of every object of the appropriate logical type in the entire universe. (d) Transition to the case of connected conditionals. I have been arguing that an "accidental" singular conditional of the form If A then B does not have the same direct truth-conditions as its contrapositive, If not-B then not-A, and may therefore be treated as a distinct proposition.
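The tabulation of how each of the four observable cases bears on an accidental conditional "If A then B" (P) and on its contrapositive "If not-B then not-A" (Q) can be generated mechanically. The following sketch is an editorial illustration, not part of Black's text; it merely encodes the rule that a conditional is directly tested only when its antecedent is fulfilled:

```python
def bearing(antecedent: bool, consequent: bool) -> str:
    """Direct bearing of one observed case on an accidental conditional."""
    if antecedent and consequent:
        return "verified"
    if antecedent and not consequent:
        return "falsified"
    return "left open"          # antecedent unfulfilled: no direct test

# The four cases A·B, A·~B, ~A·B, ~A·~B:
for a, b in [(True, True), (True, False), (False, True), (False, False)]:
    p = bearing(a, b)           # P: "If A then B"
    q = bearing(not b, not a)   # Q: "If ~B then ~A"
    label = ("A" if a else "~A") + "·" + ("B" if b else "~B")
    print(f"{label:6}  P: {p:9}  Q: {q}")
```

Running this reproduces the summary above: P and Q are falsified by the same case, A·~B, but directly verified by different cases (A·B versus ~A·~B).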
I now wish to consider whether a similar point can be made with respect to "connected" singular conditionals. Take, as an example, If I press the switch the light will go on (If A then B, or P), with the intended implication that fulfillment of the antecedent will be a reason for the fulfillment of the consequent. The possibilities for direct testing of this strong conditional are even more restricted than in the case of the corresponding weak, accidental, conditional. For even if I do press the

26 In the case of the balls in the urn, discussed above, the examination of a single ball is one step forward in a procedure which I know, in advance, will terminate at a known instant. But to undertake to scrutinise "everything in the universe" would be to start something which could never be known to have been accomplished.
switch and the light does go on (A·B), that might still have been just a coincidence, and more evidence of an indirect sort (e.g. concerning the mechanism of the switch) will be needed before the assertion can be regarded as established. Similar remarks apply to the contrapositive, If the light will not go on then I will not have pressed the switch (Q). So, with respect to both P and Q we now have the situation that each of them is falsified by A·~B and neither is directly verified by any of the three remaining possibilities, A·B, ~A·B, and ~A·~B. It looks as if the asymmetry of the truth-conditions upon which I previously relied has vanished. Common sense, however, will still wish to make a distinction between the bearing upon P of the case A·B and the bearing upon it of the cases ~A·B or ~A·~B. Consider the first: If I press the switch, then for all I know the light may not go on, in which case P would be false; if, therefore, the light does go on, I have, to be sure, obtained no conclusive evidence of P's truth, but I have obtained some relevant information. A natural way to describe this would be to say that I have obtained some partial verification of P. (This would be all the more natural if we were to think of P as a conjunction of an accidental conditional and an assertion about the imputed "reason" or "connection"; the observation of A·B then directly verifies the first conjunct, while leaving the second open.) On the other hand, if I do not press the switch, then there is no chance of my falsifying P, so nothing that subsequently happens, whether the light goes on or not, can give me any direct information. These cases, one is inclined to say, are irrelevant. If we apply similar considerations to Q, we get the following patterns of truth conditions for the original assertion and its contrapositive: P is falsified by A·~B, is partially verified by A·B, and is unaffected by ~A·B and by ~A·~B; Q is falsified by A·~B, is partially verified by ~A·~B, and is unaffected by ~A·B and by A·B. Thus we get a modified asymmetry, somewhat resembling what we found in the case of accidental conditionals, and can proceed as before. We must notice, however, that the logical relations between "strong" conditionals such as P and Q differ from those of the corresponding "weak" conditionals. It is easy to see, indeed, that if P is true, Q must also be true; and if P is false, Q cannot be true. Thus we must admit a relation of logical equivalence between the two propositions. It follows that any case that partially verifies P by direct test (A·B) will also indirectly and partially verify its contrapositive Q. Our analysis must accordingly be modified as follows:
TABLE OF TRUTH CONDITIONS
A·~B directly falsifies P; directly falsifies Q.
A·B partially and directly verifies P; and hence partially and indirectly verifies Q.
~A·~B partially and directly verifies Q; and hence partially and indirectly verifies P.
~A·B leaves P unaffected; leaves Q unaffected.

(e) Conclusions. I believe that a prima facie case has now been made for thinking that the discomfort produced by the paradoxical cases of confirmation is partly due to the logical gap between material implication and "ordinary" implication. However, it is hard to be sure of this in the absence of any thorough and comprehensive examination of the discrepancies between the two concepts 27.

7. Bayesian approaches. I have left for the last a type of "solution", involving considerations of "inverse probability", that has been astonishingly popular 28, considering the notorious difficulties that have generally discredited the classical "Bayesian" approach to the confirmation of empirical generalizations 29. The argument fastens upon the circumstance that the number of ravens in the universe is very much smaller than the number of nonblack things. This being admitted, an attempt is made to show that the "prior" or antecedent likelihood of finding a raven to be black is smaller than the likelihood of finding a nonblack thing to be a nonraven. Indeed, if the class of ravens is much smaller than the class of nonblack things, the first likelihood is much smaller than the second. (Call this last contention Step one.) It is now urged that, on the basis of Step one, the increase in degree of confirmation produced by finding a raven to be black is much greater than the increase produced by

27 A valuable contribution to this neglected task is a recent paper by Adams [1965].
28 See, for instance, Hosiasson-Lindenbaum [1940]. This contains one of the best attempts to deal with the paradoxes in the way now to be explained. For critical comments, see Hempel [1945] p. 21, footnote 2. A recent attempt of this sort is Mackie [1963], which contains references to other essays in the same vein.
29 There is a useful summary of such difficulties in Von Wright [1957] pp. 112-117. Von Wright says that "... such uses of inverse probability as those of determining the probability that the sun will rise tomorrow or that the next raven will be black are illegitimate" (p. 115). Anybody who agrees will reject the "Bayesian approach" to the paradoxes at the very outset.
finding a nonblack thing to be a nonraven (call this last claim Step two). We now explain our initial reluctance to see that a white shoe, or any other nonblack nonraven, is an authentically confirming instance of the raven hypothesis, as arising from a confusion between low confirmation and irrelevance: intuitively grasping, as we are supposed to do, the negligible contribution to the confirmation of the hypothesis made by a "paradoxical" instance, we mistakenly suppose that it makes no contribution at all. Once this error has been detected, there is no further reason to reject such an instance as irrelevant, even though it makes a trifling contribution in practice to the support of the hypothesis that interests us. I believe I can here dispense with a detailed examination of the ingenious arguments by which this approach has been supported, since the following considerations seem sufficient to show its inadequacy. (i) The defense of what I have called Step two (the crucial link in the argument offered) is admittedly intricate and problematic 30. (ii) Hence, even if Step two is correct (which I doubt), common sense, if it does rely upon the supposedly different contributions made by the two sets of instances, must be using a fallacious argument. (iii) No reason has been given to believe that "common sense" does in fact believe in Step two; indeed the empirical evidence (and this is an empirical question!) suggests that common sense simply holds the paradoxical cases to be irrelevant. (iv) Consider the contrapositive, H', of the raven hypothesis, i.e. All nonblack things are nonravens (or Nothing that is not black is a raven). On the solution proposed, "common sense" ought to treat direct instances of H' (nonblack things that are nonravens) as irrelevant 31. (For the subject class of H', the nonblack things, is supposed to have enormously more members than the complement of its predicate class.) This is as paradoxical as what is supposed to be explained. On the whole, the Bayesian approach seems to me wrong in principle and ineffective in practice.

30 One might be inclined to think it obvious that an instance antecedently less likely to arise supports the hypothesis (H, say) more strongly than does an instance that is antecedently more likely to arise. (Roughly speaking, the less surprising an observed consequence of a law under empirical test, the less support such an observation gives to the law.) But any careful attempt to calculate the relevant degrees of "confirmation" quickly reveals the implicit fallacy. Call the positive instance (a black raven) a and the contrapositive instance (e.g. a white shoe) b. Let the probability of a being observed if H is true, P(a|H), be p1 and similarly, let P(b|H) = q1; let P(a|~H) = p2; and, finally, P(b|~H) = q2. Then the observation of a raises the antecedent odds in favor of H in the ratio p1/p2; and the observation of b raises those same odds in the ratio q1/q2. Whether the positive instance, a, supports H more strongly than the contrapositive instance, b, therefore depends on whether p1/p2 is greater than q1/q2. Clearly, more is involved than the sizes of the classes of ravens and nonblack things respectively. What is at stake may, with some simplification, be said to be whether a predominance of contrapositive instances over positive instances is more likely on the supposition that H is true than upon the supposition that H is false. It is hard to see how this question could possibly be answered. As Peirce said, universes are not as plentiful as blackberries; and hence speculation about the number of white shoes to be expected if not all ravens are black is bound to be idle.
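The likelihood-ratio calculation in footnote 30 can be made concrete. In this editorial sketch the prior odds and the four likelihoods p1, p2, q1, q2 are invented purely for illustration; nothing in the text fixes their values:

```python
def posterior_odds(prior_odds: float, p_obs_given_h: float,
                   p_obs_given_not_h: float) -> float:
    """Bayes: posterior odds = prior odds times the likelihood ratio."""
    return prior_odds * (p_obs_given_h / p_obs_given_not_h)

prior = 1.0               # even prior odds on H ("All ravens are black")
p1, p2 = 0.010, 0.008     # P(a|H), P(a|~H): a = a black raven is observed
q1, q2 = 0.500, 0.499     # P(b|H), P(b|~H): b = a white shoe is observed

odds_after_a = posterior_odds(prior, p1, p2)  # raised in the ratio p1/p2
odds_after_b = posterior_odds(prior, q1, q2)  # raised in the ratio q1/q2

# Which instance supports H more depends on whether p1/p2 exceeds q1/q2,
# not on the sizes of the raven and nonblack classes alone.
print(odds_after_a, odds_after_b)
```

With these arbitrary numbers the black raven raises the odds far more than the white shoe does; a different choice of likelihoods could reverse the comparison, which is exactly the indeterminacy the footnote complains of.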
8. Postscript. Considering the amount of sophisticated discussion that the paradoxes have received, the lack of some generally acceptable solution is disappointing. The preceding remarks have no claim to serve as a satisfactory basis for such a solution. If they have any merit, it may be that of drawing attention to some subtleties that have been overlooked in the past. I have been concerned throughout to stress the gap between the syntactical notion of confirmation and the common notion of "empirical evidence" (see especially section 4). But the latter notion is still shrouded in unnecessary obscurity.

References
ADAMS, E. W., 1965, The logic of conditionals, Inquiry, vol. 8, pp. 166-197
HEMPEL, C. G., 1945, Studies in the logic of confirmation, Mind, vol. 54, pp. 1-26, 97-121
HOSIASSON-LINDENBAUM, J., 1940, On confirmation, J. Symbolic Logic, vol. 5, pp. 133-148
MACKIE, J. L., 1963, The paradox of confirmation, British Journal for the Philosophy of Science, vol. 13, pp. 265-277
SCHEFFLER, I., 1963, The anatomy of inquiry (Alfred A. Knopf, New York)
VON WRIGHT, G. H., 1957, The logical problem of induction (Macmillan, New York)
31 This point has been well made by Professor I. Scheffler.
A BAYESIAN APPROACH TO THE PARADOXES OF CONFIRMATION *

PATRICK SUPPES
Stanford University, Stanford, Calif.
1. Introduction. What I have to say about the paradoxes of confirmation from a Bayesian standpoint is rather simple. The ideas have been implicitly expressed several times, probably first by Hosiasson-Lindenbaum [1940]. Perhaps the only virtue of the present paper is to make the Bayesian ideas very explicit. The remarks in the last section on the different probabilistic forms of causal and noncausal laws are very likely the most original aspect of the analysis. The paradoxes arise from two "facts". First, the sentence
(∀x)(Ax → Bx)    (1)

is logically equivalent to its contrapositive:

(∀x)(~Bx → ~Ax),    (2)

where '~' is the symbol of negation (and later of set complementation). Second, the singular sentence

Aa & Ba    (3)

seems to confirm (1) in a way that the singular sentence

~Aa & ~Ba    (4)
does not, but with respect to (2) the roles of (3) and (4) are reversed, even though (1) and (2) are logically equivalent.
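The first "fact", the logical equivalence of (1) and (2), reduces instance by instance to truth-functional contraposition, which a four-line check confirms (an editorial sketch, not part of the paper):

```python
def implies(p: bool, q: bool) -> bool:
    """Material conditional."""
    return (not p) or q

# Aa -> Ba and ~Ba -> ~Aa have identical truth tables.
for Aa in (True, False):
    for Ba in (True, False):
        assert implies(Aa, Ba) == implies(not Ba, not Aa)
print("contraposition holds in all four cases")
```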
2. Bayesian approach. On a Bayesian approach, we first look at the four classes and assign each a prior probability in the universe of objects; exactly how this universe is to be characterized I leave open for the moment.
* I am indebted to Ernest W. Adams and Paul Holland for several helpful comments on an earlier draft of this paper. The writing of this paper has been partly supported by the Carnegie Corporation of New York.
Using the familiar notation '{x : Ax}' for describing the set of objects x that have property A, we then have in terms of four mutually exclusive and exhaustive classes

P({x : Ax & Bx}) = p1
P({x : Ax & ~Bx}) = p2
P({x : ~Ax & Bx}) = p3
P({x : ~Ax & ~Bx}) = p4

and Σ pi = 1. Also for simplicity I assume throughout that pi ≠ 0, for i = 1, 2, 3, 4. If we take the familiar example and let 'Ax' be 'x is a raven' and 'Bx' be 'x is black', then p4 should be much larger than p1, p2 and p3 for any very broadly construed universe. The central question is why we are right in our intuitive assumption that we should look at randomly selected ravens and not randomly selected nonblack things in testing the generalization that all ravens are black. We may consider the general case, representing classes by 'A' and 'B' in the obvious way: A = {x : Ax}, etc. First of all, we note that
P(A) = p1 + p2,    (5)

P(B) = p1 + p3,    (6)

and thus in terms of conditional probability

P(B|A) = p1/(p1 + p2),    (7)

P(~A|~B) = p4/(p2 + p4).    (8)
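Formulas (5)-(8) are easy to evaluate numerically. The proportions below are invented for illustration, chosen only so that p4 dominates, as in the raven example:

```python
# p1 = P(raven & black), p2 = P(raven & nonblack),
# p3 = P(nonraven & black), p4 = P(nonraven & nonblack)
p1, p2, p3, p4 = 0.001, 0.0001, 0.1, 0.8989
assert abs((p1 + p2 + p3 + p4) - 1.0) < 1e-9

P_A = p1 + p2                        # (5)
P_B = p1 + p3                        # (6)
P_B_given_A = p1 / (p1 + p2)         # (7)  P(B|A)
P_notA_given_notB = p4 / (p2 + p4)   # (8)  P(~A|~B)

# With p4 large, P(~A|~B) is close to 1 whatever the hypothesis says,
# so an observed nonblack nonraven carries almost no information.
print(round(P_B_given_A, 4), round(P_notA_given_notB, 4))
```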
Now we want to justify the sampling rule that we look at A's rather than non-B's if P(B|A) < P(~A|~B). [...]

A probabilistic causal law may be expressed as an inequality between conditional probabilities,

P(B|A) > P(B|~A),    (15)

as, for the law "Smoking causes cancer",

P(cancer|smoking) > P(cancer|nonsmoking).    (16)
The first thing to note is that the obvious form of the paradox of confirmation disappears, for in general

P(B|A) ≠ P(~A|~B),

i.e., the direct analogue of contraposition is not valid in terms of conditional probability. On the other hand, it reappears in another form, which is innocuous in many applications. We need the usual 2 x 2 contingency table to bring out the point. The distribution of the population (or sample) is shown by the numbers nij:

         B      ~B
  A     n11    n12
  ~A    n21    n22        (17)
We may use this table to show that (15) holds if and only if

P(~A|~B) > P(~A|B),    (18)
and (18) is a sort of probabilistic contrapositive of (15). Using (17), we have
P(B|A) > P(B|~A)

if and only if

n11/(n11 + n12) > n21/(n21 + n22)

if and only if

n11·n21 + n11·n22 > n11·n21 + n12·n21

if and only if

n11·n22 > n12·n21

if and only if

n11·n22 + n21·n22 > n12·n21 + n21·n22

if and only if

n22/(n12 + n22) > n21/(n11 + n21)

if and only if

P(~A|~B) > P(~A|B),
which establishes the desired equivalence. In terms of smoking and cancer, we have:
P(cancer|smoking) > P(cancer|nonsmoking)

if and only if

P(nonsmoking|noncancer) > P(nonsmoking|cancer),

and not only does this seem reasonable, but it also seems reasonable to sample either the causes (smoking) or the effects (cancer) and their absences in establishing a probabilistic causal law. We may sample by looking at smokers and nonsmokers, or by looking at persons with cancer and those without cancer. (For detailed design of an experiment, the question of precisely what class seems a priori most appropriate to sample or, more realistically, in what proportions classes of individuals should be sampled, would follow the same line of analysis pursued earlier in discussing the raven example, and will not be considered in detail again.) However, a subtle point has been illegitimately smuggled in, and the situation changes when we consider something closer to the raven case, i.e., a noncausal law. We may entertain the noncausal probabilistic law:

Most ravens are black.    (19)

The natural probability expression of this hypothesis is not the analogue of (15),

P(B|R) > P(B|~R),    (20)

but rather

P(B|R) > P(~B|R),    (21)
and without further assumption the apparent "contrapositive" probability analogue of (21) is not necessarily equivalent to it. To be explicit, (21) is not necessarily equivalent to

P(~R|~B) > P(R|~B),    (22)
as may be seen from using table (17) as before, and with this observation, the paradoxes of confirmation vanish for (19). (It may be argued that the bare inequality of (21) does not reflect the exact meaning of most and that a stronger form of inequality should be used, but meeting this criticism is not crucial for the present discussion.) As far as I know, the relevance for the paradoxes of confirmation of the sharp distinction between causal and noncausal laws, particularly the relevance of the different probabilistic forms of such laws, has not been previously noticed. It should be apparent that the kind of causal law pertinent to this discussion is probabilistic rather than deterministic in character, and is of the sort ordinarily tested in biological, medical and psychological experiments and reported in contingency-table data. A certain lack of clarity in the distinction between causal and noncausal laws is also to be found in the terminology used in the statistical literature. Statisticians have developed measures of association for contingency-table data and the probabilistic causal laws tested by the tables. It would seem more natural to reserve the term association for testing the noncausal laws, but such tests are not ordinarily discussed in the same detailed fashion, undoubtedly because of the greater importance of causal laws from both a practical and conceptual standpoint. I do not mean to suggest that inequality (15) offers a very profound analysis of the probabilistic notion of cause. My limited objective in this paper has been to point out the conceptually sharp distinction between causal and noncausal laws when they are expressed in a probabilistic form. The ideas used here go no deeper than what I would call the level of naive causes. The identification of genuine causes, which to me seems necessarily relative to a particular conceptual scheme, requires a more elaborate probabilistic structure than I have introduced here.
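Both claims about table (17), namely that the causal form (15) is equivalent to its probabilistic contrapositive (18) on every table, while the noncausal form (21) can hold when (22) fails, can be checked by brute force. This is an editorial sketch; the counts are invented for illustration:

```python
from itertools import product

def conds(n11, n12, n21, n22):
    causal       = n11 / (n11 + n12) > n21 / (n21 + n22)  # (15) P(B|A) > P(B|~A)
    contrap      = n22 / (n12 + n22) > n21 / (n11 + n21)  # (18) P(~A|~B) > P(~A|B)
    noncausal    = n11 > n12                              # (21) P(B|A) > P(~B|A)
    noncausal_cp = n22 > n12                              # (22) P(~A|~B) > P(A|~B)
    return causal, contrap, noncausal, noncausal_cp

# (15) and (18) agree on every table; both reduce to n11*n22 > n12*n21.
for n in product(range(1, 6), repeat=4):
    c, cp, _, _ = conds(*n)
    assert c == cp

# But (21) can hold while (22) fails, e.g. 5 black ravens, 3 nonblack ravens,
# and only 2 nonblack nonravens:
_, _, nc, nccp = conds(5, 3, 10, 2)
print(nc, nccp)   # True False
```

Since (21) reduces to n11 > n12 and (22) to n22 > n12, the two inequalities share no constraint, which makes their independence obvious.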
But the introduction of additional structure would not change what I have said about the nonexistence of the paradoxes of confirmation for either causal or noncausal laws of a probabilistic sort.

References
HOSIASSON-LINDENBAUM, J., 1940, On confirmation, J. Symbolic Logic, vol. 5, pp. 133-148
THE PARADOXES OF CONFIRMATION *

G. H. VON WRIGHT
The Academy of Finland, Helsinki, Finland
1. We consider generalizations of the form "All A are B". An example could be "All ravens are black". We divide the things, of which A (e.g. ravenhood) and B (e.g. blackness) can be significantly (meaningfully) predicated, into four mutually exclusive and jointly exhaustive classes. The first consists of all things which are A and B. The second consists of all things which are A but not B. The third consists of all things which are B but not A. The fourth, finally, consists of all things which are neither A nor B. Things of the second category or class, and such things only, afford disconfirming (falsifying) instances of the generalization that all A are B. Since things of the first and third and fourth category do not afford disconfirming instances one may, on that ground alone, say that they afford confirming instances of the generalization. If we accept this definition of the notion of a confirming instance, it follows that any thing which is not A ipso facto affords a confirming instance of the generalization that all A are B. This would entail, for example, that a table, since notoriously it is not a raven, affords a confirmation of the generalization that all ravens are black. A consequence like this may strike one as highly "paradoxical". It may now be thought that a way of avoiding the paradox would be to give to the notion of a confirming instance a more restricted definition. One suggestion would be that only things of the first of the four categories, i.e. only things which are both A and B, afford confirmations of the generalization that all A are B. This definition of the notion of a confirming instance is sometimes referred to under the name "Nicod's Criterion". According to this criterion, only propositions to the effect that a certain thing is a raven and is
* The treatment of the Paradoxes of Confirmation which is suggested in this paper is substantially the same as the one given in my essay in Theoria, vol. 31 (1965), pp. 254–274. The non-formal parts of the discussion in the two papers are largely identical. The formal argument, as presented here, is more condensed and also, I hope, more perspicuous than in the Theoria paper.
black can rightly be said to confirm the generalization that all ravens are black. But if we adopt Nicod's Criterion as our definition of the notion of a confirming instance, we at once run into a new difficulty. Consider the generalization that all not-B are not-A. According to the proposed criterion we should have to say that only things which are not-B and not-A afford confirmations of this generalization. The things which are not-B and not-A are the things of the fourth of the four categories which we distinguished above. But, it is argued, the generalization that all A are B is the same as the generalization that all not-B are not-A. To say "all A are B" and to say "all not-B are not-A" appear to be but two ways of saying the same thing. It is highly reasonable, not to say absolutely necessary, to think that what constitutes a confirming or disconfirming instance of a generalization should be independent of the way the generalization is formulated, i.e. expressed in words. Thus any thing which affords a confirmation or disconfirmation of the generalization g must also afford a confirmation or disconfirmation, respectively, of the generalization h, if "g" and "h" are logically equivalent expressions. This requirement on the notion of a confirming instance is usually called "The Equivalence Condition". To accept Nicod's Criterion thus seems to lead to a conflict with the Equivalence Condition. This conflict constitutes another Paradox of Confirmation.

2. Before we proceed to a "treatment" of the paradoxes which we have mentioned, the following question must be asked and answered: Are confirmations of the generalization that all A are B through things which are not-A always and necessarily to be labelled "paradoxical", and never "genuine"? Simple considerations will show, I think, that the answer is negative. Let us imagine a box or urn which contains a huge number of balls (spheres) and of cubes, but no other objects.
Let us further think that every object in the urn is either black or white (all over). We put our hand in the urn and draw an object "at random". We note whether the drawn object is a ball or a cube and whether it is black or white. We repeat this procedure, without replacing the drawn objects, a number of times. We find that some of the cubes which we have drawn are black and some white. But all the balls which we have drawn are, let us assume, black. We now frame the generalization or hypothesis that all spherical objects in the box are black. In order to confirm or refute it we continue our drawings. The drawn object would disconfirm (refute) the generalization if it turned out to be a white ball. If it is a black ball or a white cube or a black
cube, it confirms the generalization. Is any of these types of confirming instance to be pronounced worthless? It seems to me "intuitively" clear that all three types of confirming instance are of value here and that no type of confirmation is not a "genuine" but only a "paradoxical" confirmation. (Whether confirmations of all three types are of equal value for the purpose of confirming the generalization may, however, be debated.) I would support this opinion by the following ("primitive") argument: What we are anxious to establish in this case is that no object in the box is white and spherical. Not knowing whether there are or are not any white balls in the box, we run a risk each time we draw an object from the box of drawing an object of the fatal sort, i.e. a white ball. Each time the risk is successfully stood, we have been "lucky". We have been lucky if the object which our hand happened to touch was a cube (which, since we could feel it was a cube, need not be examined for colour at all); and we have been lucky if the object was a ball which upon examination was found to be black. To touch a ball, one might say, is exciting, since our tension (fear of finding a white ball) is not removed until we have examined its colour. To touch a cube is not exciting at all, since it ipso facto removes the tension we might have felt. But to draw from the box is in any case exciting, since we do not know beforehand whether we shall, to our relief, touch a cube, or touch a ball and, to our relief, find that it is black, or touch a ball and, to our disappointment, find that it is white. Let "S" be short for "spherical object in the box", "C" for "cubical object in the box", "B" for "black", and "W" for "white". All things in the box can be divided into the four mutually exclusive and jointly exhaustive categories of things which are S and B, S and W, C and B, and C and W.
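The "primitive" argument can be made quantitative in a small Bayesian sketch of the urn. The model below, with its tiny urn and uniform prior over urn compositions, is my own illustrative construction and not part of the paper; it shows that every non-falsifying draw, whether a black ball, a black cube, or a white cube, raises the probability that all balls in the urn are black.

```python
from fractions import Fraction
from itertools import product

N = 4  # objects in the urn, kept small so all possible urns can be enumerated
SHAPES, COLOURS = ("ball", "cube"), ("black", "white")
# Uniform prior over every assignment of a shape and a colour to the N objects.
worlds = list(product(product(SHAPES, COLOURS), repeat=N))

def prob(event, given=lambda w: True):
    relevant = [w for w in worlds if given(w)]
    return Fraction(sum(1 for w in relevant if event(w)), len(relevant))

def all_balls_black(w):
    return all(colour == "black" for shape, colour in w if shape == "ball")

prior = prob(all_balls_black)  # (3/4)^4 = 81/256
# Condition on what the first drawn object turned out to be:
for obs in [("ball", "black"), ("cube", "black"), ("cube", "white")]:
    post = prob(all_balls_black, given=lambda w, o=obs: w[0] == o)
    print(obs, post > prior)  # each non-falsifying draw raises the probability
```

Conditioning on a white ball instead drives the probability to 0, matching the role of the second category as the only falsifying one.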
It is not connected with any air of paradoxicality to regard things of all four types as relevant (positively or negatively) to the generalization that all S are B. All things in the world can be divided into the four mutually exclusive and jointly exhaustive categories of things which are S and B, S but not B, B but not S, and neither S nor B. Things of the first category obviously bear positively and things of the second category negatively on the generalization. But of the things of the third and fourth category some, we "intuitively" feel, do not bear at all on the generalization, have nothing to do with its content, and therefore "confirm" it only in a "paradoxical" sense. The categories of things C & B and S & W differ from the categories of things ~S & B and ~S & ~B in this feature: All things of the first two categories are things in the box, but some things (in fact the overwhelming majority of things) of the last two categories are things outside the box. The things which we "intuitively" regard as affording "paradoxical" confirmations of the generalization that all S are B are those things of the third and fourth category which are not things in the box.

I shall here introduce the term range of relevance of a generalization. And I shall say that the range of relevance of our generalization above, that all spherical things in the box are black, is the class of all things in the box. I now put forward the following thesis: All things in the range of relevance of a generalization may constitute genuine confirmations or disconfirmations of the generalization. The things outside the range are irrelevant to the generalization. They cannot confirm it genuinely. Since, however, they do not disconfirm it either, we may "by courtesy" say that they confirm it, though only "paradoxically". In order to vindicate my thesis I shall try to show, by means of a formal argument, that the irrelevance of the "paradoxical" confirmations consists in the fact that they are unable to affect the probability of the generalization. Showing this is one way, and a rather good one it seems to me, of dispelling the air of paradoxicality attaching to these confirmations.

3. It is important to state explicitly the logico-mathematical frame of probability within which we are going to conduct our formal argument concerning the confirmation paradoxes. The probability concept of the confirmation theories of Carnap and Hintikka is a two-place functor which takes propositions (or, on an alternative conception, sentences) as its arguments. The probability concept used by us is a functor the arguments of which are characteristics (attributes, properties). Let "φ" and "ψ" stand for arbitrary characteristics of the same logical type (order).
The expression "P(φ|ψ)" may be read "the probability that a random individual is φ, given that it is ψ". Instead of "is" we can also say "has the characteristic", and instead of "given" we can say "on the datum" or "relative to". We stipulate axiomatically that, for any pair of characteristics which are of the same logical type and such that the second member of the pair is not empty, the functor "P( | )" has a unique, non-negative numerical value. Furthermore, the functor obeys the following three axioms:

A1. (Ex)ψx & (x)(ψx → φx) → P(φ|ψ) = 1,
A2. (Ex)ψx → P(φ|ψ) + P(~φ|ψ) = 1,
A3. (Ex)(χx & φx) → P(φ|χ) · P(ψ|χ & φ) = P(φ & ψ|χ).
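On a finite domain these axioms are satisfied by the elementary relative-frequency interpretation, and this can be checked exhaustively. The following sketch is my own illustration, not from the paper: characteristics are modelled extensionally as subsets, and P(φ|ψ) is the proportion of ψ-things that are also φ-things.

```python
from fractions import Fraction
from itertools import product

DOMAIN = frozenset(range(4))
# All characteristics over the domain, modelled extensionally as subsets.
subsets = [frozenset(s for s, keep in zip(sorted(DOMAIN), bits) if keep)
           for bits in product([0, 1], repeat=len(DOMAIN))]

def P(phi, psi):
    """P(phi | psi): proportion of psi-things that are also phi-things."""
    assert psi, "the conditioning characteristic must be non-empty"
    return Fraction(len(phi & psi), len(psi))

for phi, psi, chi in product(subsets, repeat=3):
    if psi and psi <= phi:          # A1: psi non-empty and included in phi
        assert P(phi, psi) == 1
    if psi:                         # A2: additivity with the complement
        assert P(phi, psi) + P(DOMAIN - phi, psi) == 1
    if chi & phi:                   # A3: the multiplication axiom
        assert P(phi, chi) * P(psi, chi & phi) == P(phi & psi, chi)
print("A1-A3 hold on the finite model")
```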
It is a rule of inference of the calculus that logically equivalent (names of) characteristics are intersubstitutable in the functor "P( | )" ("Principle of Extensionality"). The application of probabilities, which are primarily associated with characteristics, to individuals is connected with notorious difficulties. The application is sometimes even said to be meaningless. This, however, is an unnecessarily restricted view of the matter. If x is an individual in the range of significance of φ and ψ, and if it is true that P(φ|ψ) = p, then we may, in a secondary sense, say that, as a bearer of the characteristic ψ, the individual x has a probability p of being a bearer also of the characteristic φ.
4. If R is the range of relevance of the generalization that all A are B, and if this generalization holds true in that range, then it will also be true that (x)(Rx → (Ax → Bx)). This may be regarded as a "partial definition" of the notion of a range of relevance. For the sake of convenience, I shall introduce the abbreviation "Fx" for "Ax → Bx". "F", we can also say, denotes the property which a thing has by virtue of the fact that it satisfies the propositional function "Ax → Bx".

I define a second-order property 𝒰_R by laying down the following truth-condition: The (first-order) property X has the (second-order) property 𝒰_R if, and only if, it is universally implied by the (first-order) property R. That X is universally implied by R means that it is true that (x)(Rx → Xx). The property 𝒰_R, in other words, is the property which a property has by virtue of belonging to all things in the range R. A property which belongs to all things in a range can also be said to be universal in that range.

Assume we can order all things of which A, B and R can be significantly predicated into a sequence x₁, x₂, …, xₙ, …. Then we can define a sequence of second-order properties ℱ₁, ℱ₂, …, ℱₙ, … as follows: The (first-order) property X has the (second-order) property ℱₙ if, and only if, it is true that Rxₙ → Xxₙ. The property ℱₙ, in other words, is the property which a property has (solely) by virtue of belonging to a certain individual thing, if this thing is in the range R. ("If" here means material implication.) For the sake of convenience, I introduce the abbreviation "Φₙ" for the logical product of the first n properties in the sequence ℱ₁, ℱ₂, …, ℱₙ, …. "Φₙ", we can also say, denotes the property which a property has by virtue of the fact that it is not missing from any of those of the first n things in the world which also are things in the range R.
Finally, let "Θ" denote a tautological second-order property, i.e. a property which any first-order property tautologically possesses, for example the property of either having or not having the second-order property 𝒰_R (or ℱₙ).

5. We prove the following theorem of probability:

T. P(𝒰_R|Θ) > 0 → (P(𝒰_R|Φₙ₊₁) > P(𝒰_R|Φₙ) ↔ P(ℱₙ₊₁|Φₙ) < 1).

The first-order property R trivially has the second-order property Θ & Φₙ. For Θ(R) may become equated with the tautology Φₙ(R) v ~Φₙ(R), and "Φₙ(R)" is an abbreviation for "(Rx₁ → Rx₁) & … & (Rxₙ → Rxₙ)". Consequently, it is logically true (for all values of n) that (EX)(Θ(X) & Φₙ(X)). It follows immediately that it is logically true, too, that (EX)Θ(X) and (EX)Φₙ(X).

From A3 we derive, by substitution and detachment, that P(Φₙ & 𝒰_R | Θ) = P(Φₙ|Θ) · P(𝒰_R | Θ & Φₙ). "Φₙ & 𝒰_R" is logically equivalent with "𝒰_R" alone. This follows from the way the second-order properties were defined: that 𝒰_R(X) means that (x)(Rx → Xx), and that Φₙ(X) means that (Rx₁ → Xx₁) & … & (Rxₙ → Xxₙ). Similarly, "Θ & Φₙ" is logically equivalent with "Φₙ" alone. Substituting the simpler equivalents, the equality above reduces to P(𝒰_R|Θ) = P(Φₙ|Θ) · P(𝒰_R|Φₙ). By an exactly analogous argument we derive the equality P(𝒰_R|Θ) = P(Φₙ₊₁|Θ) · P(𝒰_R|Φₙ₊₁). Combining the two equalities we get P(Φₙ|Θ) · P(𝒰_R|Φₙ) = P(Φₙ₊₁|Θ) · P(𝒰_R|Φₙ₊₁).

Now assume that P(𝒰_R|Θ) > 0. Since probabilities are non-negative, it follows that P(𝒰_R|Φₙ₊₁) > P(𝒰_R|Φₙ) if, and only if, P(Φₙ|Θ) > P(Φₙ₊₁|Θ). By repeated application of A3 we detach the equalities P(Φₙ|Θ) = P(ℱ₁|Θ) · … · P(ℱₙ | Θ & Φₙ₋₁) and P(Φₙ₊₁|Θ) = P(ℱ₁|Θ) · … · P(ℱₙ | Θ & Φₙ₋₁) · P(ℱₙ₊₁ | Θ & Φₙ). The assumption that P(𝒰_R|Θ) > 0 guarantees that all the factors of the products are different from 0. Hence, after cancellation, we get that P(Φₙ|Θ) > P(Φₙ₊₁|Θ) if, and only if, P(ℱₙ₊₁|Φₙ) < 1. This completes the proof of T.

Let us now see what this theorem amounts to in plain words.
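Before that, the theorem can be checked by brute enumeration in a toy second-order model. The construction below is my own illustration, not from the paper: the "universe of properties" is taken to be all subsets of a four-thing domain, weighted uniformly, with a range R containing only the first two things (so that P(𝒰_R|Θ) > 0, as the theorem requires).

```python
from fractions import Fraction
from itertools import product

xs = [1, 2, 3, 4]        # the things x1, ..., x4
R = {1, 2}               # range of relevance: x3 and x4 fall outside it
# Universe of first-order properties: every subset X of the domain,
# each weighted equally by the second-order measure.
props = [set(x for x, b in zip(xs, bits) if b)
         for bits in product([0, 1], repeat=len(xs))]

U_R = lambda X: R <= X                                        # X universal in R
F = lambda n: (lambda X: xs[n - 1] not in R or xs[n - 1] in X)
Phi = lambda n: (lambda X: all(F(k)(X) for k in range(1, n + 1)))

def P(a, b):
    """P(a | b) over the uniform measure on the property universe."""
    B = [X for X in props if b(X)]
    return Fraction(sum(1 for X in B if a(X)), len(B))

for n in range(1, len(xs)):
    increase = P(U_R, Phi(n + 1)) > P(U_R, Phi(n))
    non_maximal = P(F(n + 1), Phi(n)) < 1
    assert increase == non_maximal   # the equivalence asserted by T
```

For n = 2 and n = 3 the things x₃ and x₄ lie outside R, so P(ℱₙ₊₁|Φₙ) = 1 and the corresponding confirmations leave the probability unchanged; that is exactly the situation analysed in section 6.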
"P(𝒰_R|Θ) > 0" says that the probability that a random property in the universe of properties is true of all things in the range R is greater than 0. "P(𝒰_R|Φₙ₊₁) > P(𝒰_R|Φₙ)" says that the probability that a random property is true of all things in the range R is greater, given that it is true of
those of the first n + 1 things in the world which fall in this range, than given (only) that it is true of those of the first n things which fall in this range. "P(ℱₙ₊₁|Φₙ) < 1", finally, says that the probability that a random property is true of the (n+1)st thing in the world, if this thing belongs to the range R, is smaller than 1, given that this property is true of those of the first n things in the world which fall in that range.

The theorem as a whole thus says the following: If the probability that a random property in the universe of properties is true of all things in the range R is not minimal (0), then the probability that this property is true of all things in the range is greater, given that it is true of those of the first n + 1 things which fall in the range, than given (only) that it is true of those of the first n things which fall in the range, if, and only if, the probability that it is true of the (n+1)st thing, if this belongs to the range, is not maximal (1), given that it is true of those of the first n things which belong to the range R.

Now apply the theorem to the individual property F. To say that F is true of all things in the range R is tantamount to saying that the generalization that all A are B is true in the range R. To say that F is true of those of the first n (or n + 1) things in the world which are also things in the range amounts to saying that the first n (or n + 1) things afford confirming instances of the generalization that all A are B in the range R. To say that F is true of the (n+1)st thing, if this thing belongs to the range, finally, comes to saying that this thing affords a confirming instance of the generalization that all A are B in the range R.
When applied to the individual property F, the theorem as a whole thus says the following: If, on tautological data ("a priori"), the probability that all A are B in the range R is not minimal, then the probability of this generalization is greater on the datum that the first n + 1 things in the world afford confirming instances of it than on the datum that the first n things afford confirming instances, if, and only if, the probability that the (n+1)st thing affords a confirming instance is not maximal on the datum that the first n things afford confirming instances. It follows by contraposition that, if this last probability is maximal (1), then the new confirmation of the generalization in the (n+1)st instance does not increase its probability. The new confirmation is, in this sense, irrelevant to the generalization.

6. Now assume that the thing xₙ₊₁ actually does not belong to the range
of relevance R of the generalization that all A are B. In other words, assume that ~Rxₙ₊₁. It is a truth of logic (tautology) that ~Rxₙ₊₁ → (Rxₙ₊₁ → Xxₙ₊₁). Since "X" does not occur in the antecedent, we can generalize the consequent in "X". It is a truth of logic, too, that ~Rxₙ₊₁ → (X)(Rxₙ₊₁ → Xxₙ₊₁). By definition, ℱₙ₊₁(X) can replace Rxₙ₊₁ → Xxₙ₊₁. Thus it is a truth of logic that ~Rxₙ₊₁ → (X)ℱₙ₊₁(X). From this it follows trivially that ~Rxₙ₊₁ → (X)(Φₙ(X) → ℱₙ₊₁(X)). According to axiom A1 of probability, (X)(Φₙ(X) → ℱₙ₊₁(X)) entails that P(ℱₙ₊₁|Φₙ) = 1, provided that at least one property has the (second-order) property Φₙ. The existential condition is satisfied, since the property R trivially has the property Φₙ: Φₙ(R) means by definition the same as (Rx₁ → Rx₁) & … & (Rxₙ → Rxₙ), which is a tautology.

Herewith it has been proved that, if it is the case that ~Rxₙ₊₁, i.e. if the (n+1)st thing in the world does not belong to the range R, then it is also the case that P(ℱₙ₊₁|Φₙ) = 1, i.e. the probability that this thing will afford a confirmation of any generalization to the effect that something or other is true of all things in this range is maximal. This probability being maximal, the confirmation which is trivially afforded by the thing in question is irrelevant to any such generalization in the sense that it cannot contribute to an increase in its probability. And this constitutes a good ground for saying that a thing which falls outside the range of relevance of a generalization affords only a "vacuous" or "spurious" or "paradoxical", and not a "genuine", confirmation of the generalization in question.

7. After all these formal considerations we are in a position to answer such questions as this: Is it possible to confirm genuinely the generalization that all ravens are black through the observation, e.g., of black shoes or white swans?
The answer is that this is possible or not, depending upon what the range of relevance of the generalization is, upon what the generalization "is about". If, say, shoes are not within the range of relevance of the generalization that all ravens are black, then shoes cannot afford genuine confirmations of this generalization. This is so because no truth about shoes can then affect the probability of the generalization that, in the range of relevance in question, all things which are ravens are black. So what, then, is the range of relevance of the generalization that all ravens are black? Here it should be noted that it is not clear by itself what the range of relevance of a given generalization is, such as, e.g., that all ravens are black. Therefore it is not clear either which things will afford genuine and
which only paradoxical confirmations. In order to tell this we shall have to specify the range. Different specifications of the range lead to so many different generalizations, one could say. The generalization that all ravens are black is a different generalization when it is about ravens and ravens only, when it is about birds and birds only, and when it is, if it ever is, about all things in the world unrestrictedly. As a generalization about ravens, only ravens are relevant to it, and not, e.g., swans. As a generalization about birds, swans are relevant to it, but not, e.g., shoes. And as a generalization about all things, all things are relevant, and this means: of no thing can it then be proved that the confirmation which it affords is maximally probable relative to the bulk of previous confirmations and therefore incapable of increasing the probability of the generalization. When the range of relevance of a generalization of the type that all A are B is not specified, then the range is, I think, usually understood to be the class of things which fall under the antecedent term A. The generalization that all ravens are black, the range being unspecified, would normally be understood to be a generalization about ravens, and not about birds or about animals or about everything there is. I shall call the class of things which are A the natural range of relevance of the generalization that all A are B. It would be a mistake to think that, when the range of relevance of a generalization is unspecified, it must be identified with the natural range. If it strikes one as odd or implausible to regard the genus bird, rather than the species raven, as the range of relevance of the generalization that all ravens are black, this is probably due to the fact that the identification of birds as belonging to this or that species is comparatively easy.
But imagine the case that species of birds were in fact very difficult to distinguish, that it would require careful examination to determine whether an individual bird was a raven or a swan or an eagle. Then the generalization that all birds which (upon examination) turned out to be ravens are black might be an interesting hypothesis about birds. Perhaps we can imagine circumstances too under which all things, blankets and shoes and what not, would be considered relevant to the generalization that all ravens are black. But these circumstances would be rather extraordinary. (We should have to think of ourselves as beings who, as it were, put their hands into the universe and draw an object at random.) Only in rare cases, if ever, do we therefore intuitively identify the unspecified range with the whole logical universe of things. It would also be a mistake to think that the range of a generalization must become specified at all. But even when the range is left unspecified we may
have a rough notion of what belongs to it and what does not, and therefore also a rough idea about which things are relevant to testing (confirming or disconfirming) the generalization. No ornithologist would ever dream of examining shoes in order to test the hypothesis that all ravens are black. But he may think it necessary to examine some birds which look very like ravens, although they turn out actually to belong to some other species.

8. In conclusion I shall say a few words about the alleged conflict between the so-called Nicod Criterion and the Equivalence Condition (cf. above, section 1). The Nicod Criterion, when applied to the generalization that all A are B, says that only things which are both A and B afford genuine confirmations of the generalization. Assume now that the range of relevance of the generalization in question is A, i.e. assume that we are considering this generalization relative to what we have here called its natural range. Then, by virtue of what we have proved (sections 4-6), anything which is not-A cannot afford a genuine confirmation of the generalization. In other words: Within the natural range of relevance of a generalization, the class of genuinely confirming instances is determined by Nicod's Criterion. But is this not in conflict with the Equivalence Condition? This condition, as will be remembered, says that what shall count as a confirming (or disconfirming) instance of a generalization cannot depend upon any particular way of formulating the generalization (out of a number of logically equivalent formulations). Do we wish to deny, then, that the generalization that all A are B is the same generalization as that all not-B are not-A? We do not wish to deny that "all A are B" as a generalization about things which are A expresses the very same proposition as "all not-B are not-A" as a generalization about things which are A.
Generally speaking: when taken relative to the same range of relevance, the generalization that all A are B and the generalization that all not-B are not-A are the same generalization. But the generalization that all A are B with range of relevance A is a different generalization from the one that all not-B are not-A with range of relevance not-B. If we agree that, the range of relevance not being specified, a generalization is normally taken relative to its "natural range", then we should also have to agree that, the ranges not being specified, the forms of words "all A are B" and "all not-B are not-A" normally express different generalizations. The generalizations are different, because their "natural" ranges of relevance are different. This agrees, I believe, with how we naturally tend to understand the two formulations.
Speaking in terms of ravens: The generalization that all ravens are black, as a generalization about ravens, is different from the generalization that all things which are not black are things which are not ravens, as a generalization about all not-black things. But the generalization that all ravens are black as a generalization about, say, birds is the very same as the generalization that all things which are not black are not ravens as a generalization about birds. (For then "thing which is not black" means "bird which is not black".) Within its natural range of relevance, the generalization that all A are B can become genuinely confirmed only through things which are both A and B, and is "paradoxically" confirmed through things which are B but not A, or neither A nor B. Within its natural range of relevance, the generalization that all not-B are not-A can become genuinely confirmed only through things which are neither A nor B, and is "paradoxically" confirmed through things which are both A and B, or B but not A. Within the natural range of relevance, Nicod's Criterion of confirmation is necessary and sufficient. Within another specified range of relevance R, the generalization that all A are B may become genuinely confirmed also through things which are B but not A, or neither A nor B. And within the same range of relevance R, the class of things which afford genuine confirmations of the generalization that all A are B is identical with the class of things which afford genuine confirmations of the generalization that all not-B are not-A. Thus, in particular, if the range of relevance of both generalizations is the class of all things whatsoever, i.e. the whole logical universe of things of which A and B can be significantly predicated, then everything which affords a confirming instance of the one generalization also affords a confirming instance of the other generalization, and vice versa, all confirmations being "genuine" and none "paradoxical".
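The range-relativity of the two formulations can be made concrete in a small extensional sketch. The toy world and its membership lists below are my own hypothetical illustration, not from the paper: relative to their natural ranges the Nicod-style confirmer classes come apart, while relative to one shared range the two forms have the same falsifiers and hence the same non-falsifying things.

```python
# A toy world of things, with A = "raven" and B = "black" read extensionally.
A = {"raven1", "raven2"}                       # the ravens
B = {"raven1", "raven2", "coal"}               # the black things
world = {"raven1", "raven2", "coal", "swan", "shoe"}

falsifier = lambda x: x in A and x not in B    # the same for both formulations

# Nicod-style confirmers within each formulation's *natural* range:
confirmers_AB = {x for x in A if x in B}                       # range: A
confirmers_contra = {x for x in world - B if x not in A}       # range: not-B

# Relative to one shared range (here: the birds), the classes coincide:
birds = {"raven1", "raven2", "swan"}
genuine_AB = {x for x in birds if not falsifier(x)}
genuine_contra = {x for x in birds if not (x not in B and x in A)}
print(confirmers_AB, confirmers_contra, genuine_AB == genuine_contra)
```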
ASSIGNING PROBABILITIES TO LOGICAL FORMULAS*

DANA SCOTT
Stanford University, Stanford, Calif.

and

PETER KRAUSS
University of California, Berkeley, Calif.
1. Introduction. Probability concepts nowadays are usually presented in the standard framework of the Kolmogorov axioms. A sample space is given together with a σ-field of subsets, the events, and a σ-additive probability measure defined on this σ-field. When the study turns to such topics as stochastic processes, however, the sample space all but disappears from view. Everyone says "consider the probability that X ≥ 0", where X is a random variable, and only the pedant insists on replacing this phrase by "consider the measure of the set {ω ∈ Ω : X(ω) ≥ 0}". Indeed, when a process is specified, only the distribution is of interest, not a particular underlying sample space. In other words, practice shows that it is more natural in many situations to assign probabilities to statements rather than to sets. Now it may be mathematically useful to translate everything into a set-theoretical formulation, but the step is not always necessary or even helpful. In this paper we wish to investigate how probabilities behave on statements, where to be definite we take the word "statement" to mean "formula of a suitable formalized logical calculus". It would be fair to say that our position is midway between that of Carnap and that of Kolmogorov. In fact, we hope that this investigation can eventually make clear the relationships between the two approaches. The study is not at all complete, however. For example, Carnap wishes to emphasize the notion of the degree of confirmation, which is like a conditional probability function. Unfortunately the mathematical theory of general conditional probabilities is not yet in a very good state. We hope in future papers to comment on this problem. Another question concerns the formulation of
* This work was partially supported by grants from the National Science Foundation and the Sloan Foundation.
interesting problems. So many current probability theorems involve expectations and limits that it is not really clear whether consideration of probabilities of formulas alone really goes to the heart of the subject. We do make one important step in this direction, however, by having our probabilities defined on infinitary formulas involving countable conjunctions and disjunctions. In other words, our theory is σ-additive. The main task we have set ourselves in this paper is to carry over the standard concepts from ordinary logic to what might be called probability logic. Indeed ordinary logic is a special case: the assignment of truth values to formulas can be viewed as assigning probabilities that are either 0 (for false) or 1 (for true). In carrying out this program, we were directly inspired by the work of Gaifman [1964] who developed the theory for finitary formulas. Aside from extending Gaifman's work to the infinitary language, we have simplified certain of his proofs making use of a suggestion of C. Ryll-Nardzewski. Further we have introduced a notion of a probability theory, in analogy with theories formalized in ordinary logic, which we think deserves further study. In Section 2 the logical languages are introduced along with certain syntactical notions. In Section 3 we define probability systems which generalize relational systems, as pointed out by Gaifman. In Section 4 we show how, given a probability system, the probabilities of arbitrary formulas are determined. In Section 5 we discuss model-theoretic constructs involving probability systems. In Section 6 the notion of a probability assertion is defined, which leads to the generalization of the notion of a theory to probability logic. In Section 7 we specialize and strengthen results for the case of finitary formulas. In Section 8 examples are given. An appendix (by Peter Krauss) is devoted to the mathematical details of a proof of a measure-theoretic lemma needed in the body of the paper. 2.
The languages of probability logic. Throughout this paper we will consider two different first-order languages, a finitary language ℒ^(ω) and an infinitary language ℒ. To simplify the presentation both languages have an identity symbol = and just one non-logical constant, a binary predicate R. Most definitions and results carry over with rather obvious modifications to the corresponding languages with other non-logical constants, and we will occasionally make use of this observation when we give specific examples. The language ℒ^(ω) has a denumerable supply of distinct individual variables vₙ, for each n < ω, and ℒ has distinct individual variables v_ξ, for each ξ < ω₁, where ω₁ is the first uncountable ordinal. Both languages have logical
constants &, ∨, ¬, ∀, ∃, and =, standing for (finite) conjunction, disjunction, negation, universal and existential quantification, and identity as mentioned before. In addition the infinitary language ℒ has logical constants ⋀ and ⋁ standing for denumerable conjunction and disjunction respectively. The expressions of ℒ are defined as transfinite concatenations of symbols of length less than ω₁, and the formulas of ℒ^(ω) and ℒ are built from atomic formulas of the forms Rv_ξv_η and v_ξ = v_η in the normal way by means of the sentential connectives and the quantifiers. Free and bound occurrences of variables in formulas are defined in the well-known way. (For a more explicit description of infinitary languages see the monograph Karp [1964].) A sentence is a formula without free variables. We will augment the non-logical vocabulary of our languages with various sets T of new individual constants t ∈ T and denote the resulting languages by ℒ^(ω)(T) and ℒ(T) respectively. It is then clear what the formulas and sentences of ℒ^(ω)(T) and ℒ(T) are. For any set T of new individual constants let 𝒮 and 𝒮(T) be the set of sentences of ℒ and ℒ(T) respectively, and let 𝒪(T) be the set of quantifier-free sentences of ℒ(T). We adopt analogous definitions for the language ℒ^(ω). If Σ is a set of sentences and φ is a sentence, then φ is a consequence of Σ if φ holds in all models in which all sentences of Σ hold, and we write Σ ⊨ φ. φ is valid if it is a consequence of the empty set, and we write ⊨ φ. For both languages ℒ^(ω) and ℒ we choose standard systems of deduction, and we write Σ ⊢ φ if φ is derivable from Σ. φ is a theorem if it is derivable from the empty set, and we write ⊢ φ. (For details concerning the infinitary language we again refer the reader to Karp [1964].)
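The quotient construction used below, sentences modulo logical equivalence, can be illustrated for quantifier-free sentences by deciding equivalence with a truth table. In this sketch (my own simplification, not from the paper) the atoms R(s, t) are treated as independent propositional letters and equality atoms are left out:

```python
from itertools import product

# Quantifier-free sentences over two constants; each atom R(s, t) is an
# independent propositional letter (equality atoms omitted for brevity).
T = ["a", "b"]
atoms = [(s, t) for s in T for t in T]      # the four atoms R(s, t)

# A sentence is a function from a truth assignment (dict atom -> bool) to bool.
Rat = lambda s, t: (lambda v: v[(s, t)])
Neg = lambda p: (lambda v: not p(v))
And = lambda p, q: (lambda v: p(v) and q(v))
Or = lambda p, q: (lambda v: p(v) or q(v))

def equivalent(p, q):
    """Same truth value under every assignment: the relation behind O(T)/=."""
    for bits in product([False, True], repeat=len(atoms)):
        v = dict(zip(atoms, bits))
        if p(v) != q(v):
            return False
    return True

p = Neg(And(Rat("a", "b"), Rat("b", "a")))
q = Or(Neg(Rat("a", "b")), Neg(Rat("b", "a")))
print(equivalent(p, q))  # True: by De Morgan, p and q name the same element
```

A probability on the quotient then has to assign p and q the same value, which is exactly the Principle of Extensionality in measure-theoretic dress.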
By the well-known Completeness Theorem of finitary first-order logic we have, for every Σ … {φ/⊢ ∈ 𝒬(T_𝔐)/⊢ : μ_𝔐(φ) = 0} is isomorphic to 𝒜, and it is a well-known fact that the probability m on 𝒜 may be uniquely recaptured from the probability μ_𝔐 on 𝒬(T_𝔐)/⊢. (See, e.g., Halmos [1963] pp. 64ff.) Thus the probability system 𝔐 is, up to the obvious isomorphism, determined by the ordered pair ⟨T_𝔐, μ_𝔐⟩, where μ_𝔐 is restricted to 𝒬(T_𝔐)/⊢. In general any ordered pair ⟨T, μ⟩, where T is a set of new individual constants and μ is a probability on 𝒬(T)/⊢, uniquely determines a probability system 𝔐. Indeed, let A = T, let 𝒜 be the quotient algebra of 𝒬(T)/⊢ modulo the σ-ideal {φ/⊢ : φ ∈ 𝒬(T), μ(φ) = 0}, let m be the probability on 𝒜 induced by μ, and let Id(t, t′) and R(t, t′) be the images of (t = t′)/⊢ and R(t, t′)/⊢ under the canonical homomorphism of 𝒬(T)/⊢ onto 𝒜. Then 𝔐 = ⟨A, R, Id, 𝒜, m⟩ clearly is a probability system; it is easy to check that the valuation homomorphism h is the canonical homomorphism, and μ is the restriction of μ_𝔐 to 𝒬(T)/⊢. Moreover, if μ(t = t′) = 0 for all t, t′ ∈ T where t ≠ t′, then 𝔐 has strict identity. Thus we may also regard a probability system as an ordered pair ⟨T, m⟩, where T is a set of new individual constants and m is a probability on 𝒬(T)/⊢. The probability systems with strict identity are then characterized by the condition m(t = t′) = 0 for all t, t′ ∈ T where t ≠ t′. This is the form in which Gaifman [1964] introduces the concept of a probability model and, whenever convenient, we will also adopt this terminology. From this new point of view we have the following extension theorem:

THEOREM 4.3. Let ⟨T, m⟩ be a probability system. Then there exists a unique probability m* on 𝒮(T)/⊢ which extends m and satisfies the Gaifman Condition: whenever ∃v φ ∈ 𝒮(T), then

(G)   m*(∃v φ) = sup_{F ∈ T^(ω)} m*(⋁_{t ∈ F} φ(t)),

where T^(ω) is the set of all finite subsets of T.
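Condition (G) says that the probability of an existential sentence is the supremum of the probabilities of its finite-disjunction approximations. For a measure concentrated on finitely many models the supremum is attained, as the following sketch illustrates (the representation and all names here are illustrative assumptions, not the paper's):

```python
from itertools import chain, combinations
from fractions import Fraction

# Toy setting (illustrative assumption): four constants and a probability
# measure m concentrated on four "models", each model recording which
# constants satisfy a fixed quantifier-free formula phi(t).
T = ["t0", "t1", "t2", "t3"]
models = [
    ({"t0"}, Fraction(1, 2)),          # phi holds of t0 only
    ({"t1", "t2"}, Fraction(1, 4)),    # phi holds of t1 and t2
    (set(), Fraction(1, 8)),           # phi holds of nothing
    ({"t3"}, Fraction(1, 8)),          # phi holds of t3 only
]

def m_disjunction(F):
    """Probability of the finite disjunction OR_{t in F} phi(t)."""
    return sum(p for sat, p in models if sat & set(F))

def finite_subsets(xs):
    return chain.from_iterable(combinations(xs, k) for k in range(len(xs) + 1))

# Gaifman condition: m*(exists v phi) = sup over finite F of m*(OR_{t in F} phi(t)).
sup_value = max(m_disjunction(F) for F in finite_subsets(T))
exists_value = sum(p for sat, p in models if sat)  # models where some t satisfies phi
print(sup_value, exists_value)  # 7/8 7/8
```

With T infinite the supremum need not be attained by any single finite F, which is exactly why (G) is stated as a supremum rather than a maximum.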
DANA SCOTT AND PETER KRAUSS
Proof: The existence of m* is clear from our considerations above. The uniqueness of the extension will be proved by transfinite induction. During the course of our proof we will make use of analogues of Lemma 7.9, which will be established separately, and of course independently, for the finitary language ℒ^(ω)(T) in Section 7 of this paper. For every ordinal ξ < ω₁, we shall define sets of sentences 𝒬_ξ(T) ⊆ 𝒮_ξ(T) ⊆ 𝒮(T) by recursion: First let 𝒬₀(T) = 𝒬(T). Then if ξ > 0, let 𝒬_ξ(T) be the closure of ⋃_{η<ξ} 𝒮_η(T) under denumerable propositional combinations, 𝒮_ξ(T) being obtained from 𝒬_ξ(T) by adjoining the sentences ¬∃v φ_n(v) ∨ ⋁_{j<ω} φ_n(t_j), for n < ω.
The following lemma is essentially due to Ehrenfeucht and Mostowski [1961]. LEMMA 6.6. If … Consider ν ∈ M ∩ N(μ; x₀, …, x_{n−1}; ε). Let δ = ε − max_{i<n} |ν(x_i) − μ(x_i)|. By the Stone–Weierstrass … ∏ |F_i|, where |F_i| is the number of elements of F_i. We enumerate the Cartesian product set F = ∏ F_i, say F = {f_k : k < …}, where … is the given probability system.

THEOREM 7.12 (DIRECTED UNION THEOREM). Let {⟨T_i, m_i⟩ : i ∈ I} be a directed family of probability systems; that is, for all i, j ∈ I there exists k ∈ I such that both ⟨T_i, m_i⟩ and ⟨T_j, m_j⟩ … Let T = ⋃_{i∈I} T_i, and define, for
every φ ∈ 𝒬^(ω)(T), m(φ) = m_i(φ), where φ ∈ 𝒬^(ω)(T_i). Then …

THEOREM. … > 0. Finally, for every m, n < ω let b_mn ∈ ℬ. Then there exists a finitely additive probability measure ν on ℬ such that

(i) |ν(x_i) − ν′(x_i)| < ε for all i < n;
(ii) for every m < ω, ν(⋀_{n<ω} b_mn) = lim_{i<ω} ν(⋀_{n<i} b_mn);
(iii) ν(x) = μ(x) for all x ∈ 𝒜.
By definition, ν_{n+1}(x ∩ b_n) = ν_n(x ∩ a_n′). In case n = 0, let x ∈ 𝒜₀. Then x ∩ a_n′ ∈ 𝒜 and, by the definition of a_n′, … Now suppose that for every x ∈ 𝒜_n, ν_{n+1}(x ∩ b_n) = sup{μ(z) : z …}.

… if lim_{n→∞} P_n(B) = 1 for all B in S, but lim_{n→∞} P_n(A) < 1, then A is not a reasonable consequence of S. The converse is also true (i.e., if the fact that lim_{n→∞} P_n(B) = 1 for all B in S guarantees that lim_{n→∞} P_n(A) = 1, then A is a reasonable consequence of S), but this is not so obvious. Finally, I have deliberately used the tautological consequence symbol '⊢' here as the symbol for strict consequence. The justification has already been noted: namely, that strict and tautological consequence are equivalent. This is shown in Theorem 1, below, which establishes some other elementary consequences of the basic definitions.
THEOREM 1. Let ℒ be a language, and let S and A be, respectively, a set of formulas and a formula of ℒ.
PROBABILITY AND THE LOGIC OF CONDITIONALS
1.1. S ⊢ A if and only if A is a tautological consequence of S.
1.2. If S ⊩ A then S ⊢ A.
1.3. ⊩ is a deduction relation when restricted to finite sets: i.e., if S and S′ are finite sets of formulas, then (i) if A is in S, then S ⊩ A; (ii) if S′ ⊩ B for all B in S, and S ⊩ A, then S′ ⊩ A; (iii) if S′ and A′ result from S and A, respectively, by substituting a truth-functional formula φ for an atomic formula α throughout, then if S ⊩ A, S′ ⊩ A′.

… for every ε > 0 there exists δ > 0 such that if P(B) ≥ 1 − δ for all B in S, then P(A) ≥ 1 − ε. Since S′ ⊩ B for all B in S, there exists a δ_B for each B in S such that if P(C) ≥ 1 − δ_B for all C in S′, then P(B) ≥ 1 − δ. Since S is finite, there exists a minimum δ_B for B in S, which is positive; let δ₀ be this minimum. Clearly, then, if P(C) ≥ 1 − δ₀ for all C in S′, then P(B) ≥ 1 − δ for all B in S, and therefore P(A) ≥ 1 − ε. Hence S′ ⊩ A, as was to be shown. Part (iii) follows directly from the fact that, for any probability function P′ of ℒ, it is possible to construct another probability function P of ℒ such that P(B) = P′(B′) for all formulas B and B′ of ℒ, where B′ results from B by replacing all occurrences of α in B by φ. The construction of P is elementary and will not be described here. Assuming this construction, it follows directly that if not S′ ⊩ A′, then not S ⊩ A. For, if there were some ε > 0 such that for all δ > 0 there existed a probability function P′ such that P′(B′) ≥ 1 − δ for all B′ in S′, but P′(A′) ≤ 1 − ε, then it would also be the case that P(B) ≥ 1 − δ for all B in S but P(A) ≤ 1 − ε, and hence not S ⊩ A. This concludes the proof.

Theorem 1.1 shows that the notion of strict consequence is of no formal interest, since it is equivalent to tautological consequence. The intuitive significance of Theorem 1.1 is that it suggests that we should not 'get in trouble' in analyzing logical relations among conditional statements by treating them as material conditionals, so long as the premises of our arguments can be asserted with logical certainty. That is, where we may expect trouble in applications of standard logic is in situations in which we are reasoning from premises which are not known with certainty. Theorem 1.3 is significant in showing that the reasonable consequence relation has at least some minimal properties of deduction relations, and therefore justifies calling this a 'consequence' relation, at least as applied to finite sets of premises.
That the probabilistic consequence relation is not a deduction relation when its domain is extended to include infinite sets of formulas is seen from the fact that it fails to satisfy the compactness condition: i.e., there are infinite sets of formulas S, and formulas A, such that S ⊩ A, but not S′ ⊩ A for any finite subset S′ of S. An example of a set S and formula A having this property is as follows. Let S be the set of all formulas B_i = 'a_i ∨ a_{i+1} → a_{i+1} & ¬a_i' for i = 1, 2, … (where the 'a_i' are distinct atomic formulas), and let A = 'a₁ → F'. Now it is a trivial consequence of the axioms of probability that if P(B_i) > ⅔ for all i = 1, 2, …, then P(a_i) ≤ ½P(a_{i+1}) for all i,
from which it follows that P(a₁) must be 0, hence P(a₁ → F) = 1. Clearly, therefore, S ⊩ A, since an arbitrarily high probability for A can be guaranteed by requiring that all formulas of S have probability greater than ⅔. On the other hand, the same argument shows that for any finite subset S′ of S, an assignment P(a₁) > 0, and therefore P(a₁ → F) = 0, is consistent with assigning arbitrarily high probabilities to all formulas of S′, so it is not the case that S′ ⊩ A.
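The finite-subset half of this argument can be checked concretely. The sketch below is an illustrative reconstruction: it assumes the conditional-probability reading P(φ → ψ) = P(φ & ψ)/P(φ), with P(φ → ψ) = 1 when P(φ) = 0. For a finite subset S′ = {B₁, …, B_k} it constructs a probability function under which every B_i has probability r/(r+1), arbitrarily close to 1, while P(a₁) > 0, so that P(a₁ → F) = 0:

```python
from fractions import Fraction

def finite_counterexample(k, r=100):
    """For S' = {B_1,...,B_k}, build P with P(B_i) = r/(r+1) for all i,
    yet P(a_1) > 0 (so P(a_1 -> F) = 0 under the conditional reading)."""
    # Worlds w_1..w_{k+1}: world w_j makes exactly the atom a_j true,
    # with geometrically increasing weights r^1, ..., r^{k+1}.
    weights = [Fraction(r) ** j for j in range(1, k + 2)]
    total = sum(weights)
    p = [w / total for w in weights]  # p[j-1] = P(w_j)

    def P_B(i):  # P(a_{i+1} & ~a_i | a_i v a_{i+1}), for i = 1..k
        return p[i] / (p[i - 1] + p[i])

    return [P_B(i) for i in range(1, k + 1)], p[0]

probs, p_a1 = finite_counterexample(5)
print(all(q == Fraction(100, 101) for q in probs), p_a1 > 0)  # True True
```

Raising r pushes every P(B_i) as close to 1 as desired while P(a₁) stays positive, which is exactly why no finite subset of S yields A.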
In what follows we shall be concerned exclusively with the reasonable consequence relation restricted to finite sets of premises. It will turn out that the reasonable consequence relation restricted to finite sets is equivalent to several other conditions with intuitive significance, and in fact it is possible to give a system of rules of inference within a natural deduction system such that a conclusion A follows from a finite set S of premises if and only if A is derivable from S by those rules. These rules will be given in the following section, in the definition of the relation of 'probabilistic consequence', and it will be shown that derivability in accordance with these rules is a sufficient condition for a conclusion to be a reasonable consequence of premises. The proof that probabilistic consequence is also a necessary condition for reasonable consequence (the completeness proof) is more difficult, and requires further preliminaries.

3. Probabilistic consequence. We now give a set of rules for deriving 'probabilistic consequences' from sets of formulas S, and show that if a formula A is a probabilistic consequence of S, then A is a reasonable consequence of S. The rules for deriving probabilistic consequences form the clauses of Definition 6, below. DEFINITION 6. Let S be a set of formulas. Then the set of probabilistic consequences (abbreviated 'p.c.s') of S is the smallest set S′ having S as a subset such that for all truth-functional formulas φ, ψ, and γ: PC1. if
& ψ → γ is reasonable in complete generality. Something similar also holds in the case of PC5; the stronger version of that rule (that φ → γ can be derived from φ ∨ ψ → γ) is not reasonable in complete generality. In fact, we shall eventually prove the following: if either PC5 or PC8 is replaced by its 'stronger' version in the definition of probabilistic consequence, then all tautological consequences are derivable by the modified rules. The next theorem asserts what was taken for granted above: namely, that all probabilistic consequences are reasonable consequences. All that is required to prove this is to show that any conclusion derivable by a single application of any of PC1–PC8 is a reasonable consequence of the formulas from which it is immediately derived, since Theorem 1.3 guarantees that reasonable consequences of reasonable consequences are themselves reasonable consequences.

THEOREM 2. Let S be a set of formulas, and let A be a formula. If A is a probabilistic consequence of S then A is a reasonable consequence of S.

Proof. As noted above, all that is required is to show that all immediate inferences in accordance with rules PC1–PC8 are reasonable consequences of the formulas from which they are derived. This will be shown in only two cases, Rules PC1 and PC4, since the proofs for the other rules proceed in entirely similar fashion. PC1 is trivial. If φ is tautologically equivalent to ψ, then for any probability function P, P(φ) = P(ψ), and therefore P(φ → γ) = P(ψ → γ). Hence ψ → γ is clearly a reasonable consequence of φ → γ.
The proof of rule PC4, as well as of the other 'nontrivial' rules PC5–PC8, is most easily obtained using the following simple inequality (which will prove important in other developments): for any probability function P and truth-functional formulas φ, ψ, γ and μ such that P(φ) and P(γ) are both positive,

P(ψ ∨ μ)/P(φ ∨ γ) ≤ P(ψ)/P(φ) + P(μ)/P(γ).

This follows as a matter of simple algebra. For, if we set P(ψ − μ) = a, P(ψ & μ) = b, P(μ − ψ) = c, P(φ − γ) = d, P(φ & γ) = e, and P(γ − φ) = f, then P(ψ ∨ μ) = a + b + c, P(φ ∨ γ) = d + e + f, P(ψ) = a + b, P(μ) = b + c, P(φ) = d + e, and P(γ) = e + f, and what must be shown is that

(a + b + c)/(d + e + f) ≤ (a + b)/(d + e) + (b + c)/(e + f).
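Since the inequality is central to what follows, a mechanical spot-check may be reassuring. The sketch below (illustrative code, not from the paper) draws random probability functions on the sixteen valuations of φ, ψ, γ, μ and confirms the inequality whenever P(φ) and P(γ) are positive:

```python
import random

random.seed(0)

def check_once():
    # A random probability function on the 16 valuations of (phi, psi, gamma, mu).
    w = [random.random() for _ in range(16)]
    total = sum(w)
    w = [x / total for x in w]
    vals = [(phi, psi, gam, mu)
            for phi in (0, 1) for psi in (0, 1)
            for gam in (0, 1) for mu in (0, 1)]
    def P(pred):
        return sum(p for p, v in zip(w, vals) if pred(*v))
    Pphi, Pgam = P(lambda f, s, g, m: f), P(lambda f, s, g, m: g)
    if Pphi == 0 or Pgam == 0:
        return True  # inequality only asserted for positive P(phi), P(gamma)
    lhs = P(lambda f, s, g, m: s or m) / P(lambda f, s, g, m: f or g)
    rhs = P(lambda f, s, g, m: s) / Pphi + P(lambda f, s, g, m: m) / Pgam
    return lhs <= rhs + 1e-12

print(all(check_once() for _ in range(1000)))  # True
```

The check mirrors the algebraic argument: the numerator on the left is at most (a + b) + (b + c), while the denominator dominates both d + e and e + f.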
ERNEST W. ADAMS
In the case in which one or both of P(φ), P(ψ) is zero, the proof is even simpler, for it follows directly from the axioms in that case that if, say, P(φ) = 0, then P(φ ∨ ψ → γ) = P(ψ → γ), and therefore a value of at least 1 − ε for P(φ ∨ ψ → γ) is assured by requiring that P(ψ → γ) be at least 1 − ε. This concludes the proof that φ ∨ ψ → γ is a reasonable consequence of φ → γ and ψ → γ, and therefore finishes the argument.

An interesting sidelight on the proof of Theorem 2 is the following: it is possible to guarantee a probability of at least 1 − ε for an immediate inference from a single premise by requiring that the premise have probability at least 1 − ε, and it is possible to guarantee a probability of at least 1 − ε for an immediate inference from two premises by requiring that both premises have probabilities of at least 1 − ½ε. It will be shown later that this result generalizes to remote inferences as well: that is, if an inference of a conclusion from n premises is reasonable, then it is possible to guarantee a probability of at least 1 − ε for the conclusion by requiring the premises to have probabilities of at least 1 − ε/n. Thus, one may establish conclusively that an inference of a conclusion from n premises is not reasonable by finding some ε such that the probabilities of all the premises are greater than 1 − ε/n but the probability of the conclusion is less than 1 − ε.

We conclude this section by listing a number of probabilistic consequences which follow from rules PC1–PC8. These consequences will be used in proving the completeness of the rules.

THEOREM 3. Let γ and φ₁, …, φ_n and ψ₁, …, ψ_n be truth-functional formulas.
3.1. If γ is a tautological consequence of ψ₁ then φ₁ → γ is a p.c. of φ₁ → ψ₁.
3.2. φ₁ → ψ₁ and φ₁ → φ₁ & ψ₁ are p.c.s of one another.
3.3. φ₁ ∨ φ₂ → ¬(φ₁ − ψ₁) is a p.c. of φ₁ → ψ₁.
3.4. φ₁ ∨ … ∨ φ_n → ¬(φ₁ − ψ₁) & … & ¬(φ_n − ψ_n) is a p.c. of φ₁ → ψ₁, …, φ_n → ψ_n.
3.5. If φ₂ & ψ₂ is a tautological consequence of φ₁ & ψ₁ and φ₁ − ψ₁ is a tautological consequence of φ₂ − ψ₂, then φ₂ → ψ₂ is a p.c. of φ₁ → ψ₁.

Proof. The proofs of four parts of this theorem are most conveniently presented in the form of schemata of natural deduction derivations of the conclusions of the inferences from their premises. The proof of 3.1 goes as follows:
1. φ₁ → ψ₁   given.
2. ψ₁ ⊢ γ (ψ₁ tautologically implies γ)   given.
3. φ₁ & ψ₁ → γ   2, PC3.
4. φ₁ → γ   1, 3, PC5.
The derivation of φ₁ → φ₁ & ψ₁ from φ₁ → ψ₁ goes as follows:
1. φ₁ → ψ₁   given.
2. φ₁ tautologically implies φ₁.
3. φ₁ → φ₁   2, PC3.
4. φ₁ → φ₁ & ψ₁   1, 3, PC7.
The derivation of φ₁ → ψ₁ from φ₁ → φ₁ & ψ₁ is also simple.
1. φ₁ → φ₁ & ψ₁   given.
2. φ₁ & ψ₁ tautologically implies ψ₁ & φ₁.
3. φ₁ → ψ₁ & φ₁   1, 2, Theorem 3.1.
4. φ₁ → ψ₁   1, 3, PC6.
The proof of 3.3 goes:
1. φ₁ → ψ₁   given.
2. φ₁ & ψ₁ ⊢ ¬(φ₁ − ψ₁)   tautology.
3. φ₁ & ψ₁ → ¬(φ₁ − ψ₁)   2, PC3.
4. φ₁ → ¬(φ₁ − ψ₁)   1, 3, PC5.
5. φ₂ − φ₁ ⊢ ¬(φ₁ − ψ₁)   tautology.
6. φ₂ − φ₁ → ¬(φ₁ − ψ₁)   5, PC3.
7. φ₁ ∨ (φ₂ − φ₁) → ¬(φ₁ − ψ₁)   4, 6, PC4.
8. φ₁ ∨ φ₂ → ¬(φ₁ − ψ₁)   7, PC1.
Theorem 3.4 is obtained by simple iteration of applications of Theorem 3.3, plus use of rule PC7. Thus, Theorem 3.3 entails that φ₁ ∨ … ∨ φ_n → ¬(φ_i − ψ_i) is a p.c. of φ_i → ψ_i for each i = 1, …, n. And applying PC7 n − 1 times yields a derivation of the desired formula as a p.c. of the formulas φ₁ ∨ … ∨ φ_n → ¬(φ_i − ψ_i), for i = 1, …, n.
Theorem 3.5 requires a somewhat longer derivation.
1. φ₁ → ψ₁   given.
2. (φ₁ & φ₂) ∨ (φ₁ − φ₂) → ψ₁   1, PC1.
3. φ₁ & ψ₁ ⊢ φ₂ & ψ₂   given.
4. φ₁ − φ₂ ⊢ ¬ψ₁   3.
5. φ₁ − φ₂ → ¬ψ₁   4, PC3.
6. φ₁ & φ₂ → ψ₁   2, 5, PC5.
7. φ₁ & φ₂ & ψ₁ ⊢ ψ₂   3.
8. φ₁ & φ₂ & ψ₁ → ψ₂   7, PC3.
9. φ₁ & φ₂ → ψ₂   6, 8, PC8.
10. φ₂ − ψ₂ ⊢ φ₁ − ψ₁   given.
11. φ₂ − φ₁ ⊢ ψ₂   10.
12. φ₂ − φ₁ → ψ₂   11, PC3.
13. …

… lim_{n→∞} P_n(φ ∨ ψ → ψ) > 0. We will not go through the argument to show that, provided the limits involved exist, lim_{n→∞} P_n(φ ∨ ψ → ψ) is positive if and only if the limit of P_n(ψ)/P_n(φ) is positive (or possibly equal to plus infinity), and therefore the intuitive interpretation of φ ≤ ψ is justified, since we will not use this fact in what follows. The proof, however, is elementary. Likewise, it follows trivially from Definition 8 that if P₁, P₂, … is associated with ≤, then φ < ψ holds if and only if lim_{n→∞} P_n(φ ∨ ψ → φ) = 0, and (provided the limit exists) this is equivalent to the condition that lim_{n→∞} P_n(φ)/P_n(ψ) = 0. What are important for present purposes are the facts asserted in the next theorem.

THEOREM 5. Let ℒ be a finite language, and let A and S be, respectively, a formula and a finite set of formulas of ℒ.
5.1. If P₁, P₂, … is a uniform sequence of probability functions for ℒ, then there is a unique P-ordering ≤ of ℒ such that P₁, P₂, … is associated with ≤.
5.2. If ≤ is a P-ordering of ℒ then there is a uniform sequence of probability functions of ℒ associated with ≤.
5.3. If ≤ is a P-ordering of ℒ, and P₁, P₂, … is a uniform sequence associated with ≤, then A holds in ≤ if and only if lim_{n→∞} P_n(A) = 1.
5.4. If S ⊩ A then A holds in all P-orderings of ℒ in which all formulas of S hold.

Proof of 5.1. This proof proceeds by showing that the binary relation defined by the condition: for all φ and ψ, φ ≤ ψ if and only if lim_{n→∞} P_n(φ ∨ ψ → ψ) > 0, satisfies conditions (i)–(iv) of Definition 7.1, and is therefore a P-ordering of ℒ, and is, furthermore, one such that P₁, P₂, … is a uniform sequence associated with it, according to Definition 8. The proof of each of conditions (i)–(iv) of Definition 7.1 is routine, and we shall actually carry out only the proof of (i): that ≤ is a weak ordering of the truth-functional formulas of ℒ.
That either φ ≤ ψ or ψ ≤ φ must hold follows, since

lim_{n→∞} P_n(φ ∨ ψ → φ ∨ ψ) = 1

and P_n(φ ∨ ψ → φ ∨ ψ) ≤ P_n(φ ∨ ψ → φ) + P_n(φ ∨ ψ → ψ). Hence at least one of the two limits on the right above must be positive, and therefore either φ ≤ ψ or ψ ≤ φ. The transitivity of the relation ≤ follows from the following inequality
of the pure calculus of probability: for any probability function P_n whatever,

P_n(φ ∨ γ → γ) ≥ P_n(φ ∨ ψ → ψ) · P_n(ψ ∨ γ → γ).

This inequality follows by simple algebra from the axioms of probability as given in Definition 4.1. Assuming this inequality, transitivity follows immediately, since if both φ ≤ ψ and ψ ≤ γ, then both of the limits of P_n(φ ∨ ψ → ψ) and P_n(ψ ∨ γ → γ) must be positive, hence the limit of their product must be positive, and therefore, by the inequality, the limit of P_n(φ ∨ γ → γ) is also positive, hence φ ≤ γ holds. This concludes the proof of 5.1.

Proof of 5.2. Let ≤ be a P-ordering for ℒ, and let SD⁺ = SD ∪ {F}, where SD is the SD-set for ℒ. Assume that the elements of SD⁺ are ordered as follows: … i.e., the elements of SD⁺ are ordered in increasing 'blocks' α_{n_{i−1}+1}, …, α_{n_i}, where the elements within each block are all equivalent to one another. Now, for each i = 1, …, k, let β_i be the disjunction of the elements in the i-th block: i.e.,
We first define the values of P_n(β_i), for n = 2, 3, … and for i = 1, …, k. If k = 2, then everything is trivial; we set P_n(β₁) = 0 and P_n(β₂) = 1 for all n = 2, 3, …. If k > 2 then the probabilities are defined as follows:

P_n(β_i) = 1/n^{k−i}   for i = 1, …, k − 1,
P_n(β_k) = 1 − Σ_{i=1}^{k−1} 1/n^{k−i},

for n = 2, 3, …. Now, it follows by simple algebra from the foregoing equations that

Σ_{i=1}^{k} P_n(β_i) = 1,

and that P_n(β_i)/P_n(β_{i+1}) → 0 as n → ∞ for each i < k.
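What the proof uses about this assignment is only the two properties just stated: the values sum to 1, and each successive ratio P_n(β_i)/P_n(β_{i+1}) tends to 0 with n. The snippet below checks both properties for the simple assignment P_n(β_i) = 1/n^{k−i} for i < k (an illustrative choice meeting the requirements, not necessarily the author's exact formula):

```python
from fractions import Fraction

def P_n(n, k):
    """Assignment (illustrative assumption): P_n(beta_i) = 1/n^(k-i)
    for i = 1, ..., k-1, with the remaining mass placed on beta_k."""
    vals = [Fraction(1, n ** (k - i)) for i in range(1, k)]
    vals.append(1 - sum(vals))  # P_n(beta_k): whatever mass is left over
    return vals

k = 4
for n in (2, 10, 100):
    vals = P_n(n, k)
    assert sum(vals) == 1            # the values form a probability assignment
    ratios = [vals[i] / vals[i + 1] for i in range(k - 1)]
    print(n, [float(r) for r in ratios])
# As n grows, every ratio P_n(beta_i)/P_n(beta_{i+1}) shrinks toward 0,
# which is what makes each block strictly 'less probable' than the next.
```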
… lim_{n→∞} P_n(φ → ¬ψ) = 0. Now, only a finite number of values of P_n(φ) can be zero, for if an infinite number were zero, then an infinite number of values of P_n(φ → ¬ψ) would be equal to 1, hence the limit could not be zero. We can assume without loss of generality that all values of P_n(φ) are positive, in which case P_n(φ → ¬ψ) = 1 − P_n(φ → ψ), from which it follows that the limit approached by P_n(φ → ψ)
must be 1, since the limit of P_n(φ → ¬ψ) is zero. The argument is reversible, so if the limit of P_n(φ → ψ) is 1, then φ → ψ holds in ≤. In case φ ≤ F, the limit approached by P_n(φ ∨ F → F) must be 1. Since P_n(φ ∨ F → F) can only be 0 or 1, and 1 only if P_n(φ) = 0, it can only be that P_n(φ) is 0 for all but a finite number of values of n. But, in this case, P_n(φ → ψ) can differ from 1 for only a finite number of values of n, and therefore lim_{n→∞} P_n(φ → ψ) = 1. Conversely, if lim_{n→∞} P_n(φ → ψ) = 1 and φ ≤ F, clearly φ → ψ holds in ≤. This concludes the proof of 5.3.

Proof of 5.4. This follows trivially from 5.2 and 5.3. Suppose that there is a P-ordering ≤ such that all members of S hold in ≤, but A does not hold in it. Then by 5.2 there is a uniform sequence P₁, P₂, … associated with ≤, and by 5.3, lim_{n→∞} P_n(B) = 1 for all B in S, but lim_{n→∞} P_n(A) …

…, S_p is the reduction sequence of S.

9.4. If it is not the case that SD(S) = SD, and if the ordered partition of SD generated by S is SD₁, …, SD_{p+1}, then the standard P-ordering of ℒ associated with S is the P-ordering ≤ such that for all α and β in SD, if α is in SD_i and β is in SD_j, then α ≤ F if and only if i = p + 1, and α ≤ β if and only if j ≤ i.

Definition 9.4 presupposes what has not been shown: namely, that the ordered partition of SD generated by S is a partition of SD. This and other important properties of these partitions and their associated P-orderings will be derived below. First, however, it may help to make the intuitive bases of the concepts introduced in Definition 9 clearer to illustrate them as they apply in a simple example. Consider the language generated from just the two atomic sentences 'p' and 'q' (plus 'T' and 'F'). The SD-set of ℒ may be taken to consist of the four formulas a = '¬p & ¬q', b = 'p & ¬q', c = 'p & q', and d = '¬p & q', whose relations may be most easily visualized with the aid of a Venn diagram.
Now, let S be the set containing just the four formulas p → q, ¬q → ¬p, q → ¬p, and p ∨ q → p. To determine the immediate reduction of S, we must determine which formulas A in S have the property that SD(A) ⊆ SD*(S). These two concepts are characterized in Definitions 3.2 and 3.3: SD(A) is the set of SDs belonging to Ant(A) (the antecedent of A), SD*(A) is the set of SDs belonging to Ant(A) & ¬Cons(A), and SD*(S) is the union of all SD*(A) for A in S. The formulas A of S, together with the sets SD(A) and SD*(A), are conveniently represented in a table, as below:

formula A          SD(A)         SD*(A)
1. p → q           {b, c}        {b}
2. ¬q → ¬p         {a, b}        {b}
3. q → ¬p          {c, d}        {c}
4. p ∨ q → p       {b, c, d}     {d}

SD*(S) = {b, c, d}.

The formulas A of S such that SD(A) ⊆ SD*(S) are clearly the first, third and
fourth in the above table: i.e., the formulas p → q, q → ¬p, and p ∨ q → p. These, then, comprise the set Red(S). To construct the reduction sequence of S, we simply iterate the process described above. Thus, we set S₁ = S, and set S₂ = Red(S₁) = {p → q, q → ¬p, p ∨ q → p}. To construct S₃, we should find Red(S₂), which again is easily determined from the above table by finding all A in S₂ such that SD(A) ⊆ SD*(S₂) …

…, SD_{p+1} that for i = 1, …, p, SD*(S_i) = SD_{i+1} ∪ … ∪ SD_{p+1}, where S₁, …, S_p is the reduction sequence of S. Since A ∈ S = S₁, A is in at least one S_i in the reduction sequence, and if A is in S_i then SD*(A) ⊆ SD*(S_i) = SD_{i+1} ∪ … ∪ SD_{p+1}. Suppose first that A is in S_p. Since S_p = Red(S_p), SD(A) ⊆ SD*(S_p) = SD_{p+1}, and the latter set is the set of all SDs which are equivalent to F in the ordering ≤. But SD(A) is the set of all SDs belonging to φ, and therefore all SDs belonging to φ are equivalent to F, and so φ ≤ F. This in turn entails that φ → ψ holds in ≤. Suppose that A is not in S_p. Then there is some i such that A is in S_i but not in S_{i+1}. As before, SD*(A) is a subset of SD*(S_i) = SD_{i+1} ∪ … ∪ SD_{p+1}. On the other hand, since A is not in S_{i+1} = Red(S_i), it is not the case that SD(A) ⊆ SD*(S_i), so SD(A) must contain an SD in the union SD₁ ∪ … ∪ SD_i. Now, the members of SD*(A) are all SDs belonging to φ − ψ, and
therefore all SDs belonging to φ − ψ are in SD_{i+1} ∪ … ∪ SD_{p+1}. The members of SD(A) are all SDs belonging to φ, so there must be some SD in SD₁ ∪ … ∪ SD_i belonging to φ, and therefore to φ & ψ, since that element cannot belong to φ − ψ. It follows immediately from this that φ − ψ < φ & ψ, since the maximal elements in the ordering belonging to φ − ψ are in SD_{i+1} ∪ … ∪ SD_{p+1}, and these are strictly less than the maximal elements belonging to φ & ψ, which are in the union SD₁ ∪ … ∪ SD_i. But φ − ψ < φ & ψ entails that φ → ψ holds in ≤. Therefore, the assumption that A is in S entails that A holds in ≤, so Theorem 6.2 is proved.

6. Completeness and a decision procedure. Let A be a formula and S a finite set of formulas. We are now ready to establish the equivalence of the following three conditions: (1) A is a probabilistic consequence of S (in the sense of Definition 6); (2) A is a reasonable consequence of S; and (3) A holds in all P-orderings in which all members of S hold. Actually, we have already shown that condition (1) entails condition (2) (Theorem 2), and that condition (2) entails condition (3) (Theorem 5.4), so what we have to do now is 'close the ring' by showing that condition (3) entails condition (1). It turns out to be easier to do this if three more links are added to the chain: conditions (4), (5) and (6) such that condition (3) entails (4), (4) entails (5), (5) entails (6), and (6) entails (1). Adding these links actually simplifies the proof and, moreover, yields an immediate decision procedure for determining whether a conclusion is a reasonable consequence of premises.

THEOREM 7. Let ℒ be a finite language, let SD be its SD-set, let A be a formula and S a finite set of formulas of ℒ; let S′ be the set S ∪ {∼A}, let S₁′, …, S_f′ be the reduction sequence for S′, let SD₁, …
, SD_{f+1} be the ordered partition of SD generated by S′, and let S₀ be the set S_f′ ∼ {∼A} (i.e., S₀ is the set resulting from the deletion of ∼A from S_f′). Then the following conditions are equivalent:
(1) A is a probabilistic consequence of S;
(2) A is a reasonable consequence of S;
(3) A holds in all P-orderings of ℒ in which all members of S hold;
(4) SD(A) ⊆ SD*(S_f′);
(5) SD*(A) ⊆ SD*(S₀) and SD(S₀) ∼ SD*(S₀) ⊆ SD(A) ∼ SD*(A);
(6) for some subset S″ of S, SD*(A) ⊆ SD*(S″) and SD(S″) ∼ SD*(S″) ⊆ SD(A) ∼ SD*(A).
Proof. That condition (1) entails condition (2) and condition (2) entails
condition (3) were proven in Theorems 2 and 5.4, and that (5) entails (6) is
trivial. What will now be shown is that (3) entails (4), (4) entails (5), and (6) entails (1). Assume first that condition (3) is satisfied: i.e., A holds in all P-orderings of ℒ in which all members of S hold. We can assume here and later that A is a conditional formula φ → ψ, since if A is some unconditional formula ψ, then A can be replaced by T → ψ throughout without altering any of the conditions under consideration. Consider now the augmented set S′ = S ∪ {∼A}, where ∼A was defined (Definition 2.1) to be the formula φ → ¬ψ. Either SD(S′) = SD, in which case the standard ordering associated with S′ is undefined, or SD(S′) ≠ SD, in which case the standard ordering associated with S′ is defined. Consider the case in which SD(S′) = SD first. In this case condition (4) holds trivially, since SD(A) is clearly a subset of the set of all SDs of ℒ. Now, suppose that SD(S′) ≠ SD, and therefore that the standard ordering ≤ associated with S′ is determined. Since ∼A is in S′, and all formulas of S′ hold in ≤ (by Theorem 6.2), ∼A holds in ≤. Likewise, since S is a subset of S′, all formulas of S hold in ≤, and therefore A must hold in ≤, by the assumption that condition (3) holds. Now, it follows directly from the definition of 'holding' that the two 'contrary' formulas A = φ → ψ and ∼A = φ → ¬ψ can both hold at once in a P-ordering ≤ only if φ ≤ F: for A holds if and only if either φ − ψ < φ & ψ or φ ≤ F, and ∼A holds if and only if either φ & ψ < φ − ψ or φ ≤ F, so both can hold at once only if φ ≤ F. But condition (4) follows directly from the fact that φ ≤ F, since φ ≤ F can hold in the standard ordering ≤ if and only if all SDs belonging to φ are in the last member of the ordered partition SD₁, …, SD_{f+1}. And SD_{f+1} is by definition equal to SD*(S_f′) (Definition 9.3), and SD(A) is the same as the set of all SDs belonging to φ; hence SD(A) ⊆ SD*(S_f′). Therefore we have shown that condition (3) entails condition (4). Suppose now that condition (4) holds.
We consider separately the two cases in which ∼A = φ → ¬ψ is not a member of S_f′, and in which ∼A is a member of S_f′. In the first case, the derivation of condition (5) from condition (4) is trivial. For, if ∼A is not a member of S_f′ then S₀ = S_f′. Assuming that condition (4) is satisfied, SD(A) ⊆ SD*(S_f′), and therefore SD*(A) ⊆ SD*(S_f′), since SD*(A) is a subset of SD(A). By the definition of the reduction sequence (Definition 9.2), however, SD(S_f′) = SD*(S_f′), so SD*(A) ⊆ SD*(S_f′) = SD*(S₀). And, that SD(S₀) ∼ SD*(S₀) is a subset of SD(A) ∼ SD*(A) follows because S₀ = S_f′, and SD(S_f′) = SD*(S_f′), and therefore SD(S₀) ∼ SD*(S₀) is empty. Now assume that condition (4) holds, and that ∼A = φ → ¬ψ is a member of S_f′. To prove that SD*(A) ⊆ SD*(S₀), suppose that α is a member of SD*(A):
294
ERNEST W. ADAMS
i.e., α is an SD belonging to φ − ψ (see Definition 3.2). SD*(A) is a subset of SD(A); hence, by condition (4), α, which is a member of SD(A), is a member of SD*(S_f′) = SD(S_f′). Also, since S_f′ = S₀ ∪ {φ → ¬ψ}, SD*(S_f′) = SD*(S₀) ∪ SD*(φ → ¬ψ); hence α is a member of SD*(S₀) ∪ SD*(φ → ¬ψ). But the elements of SD*(φ → ¬ψ) are just the SDs belonging to φ & ψ, and so α does not belong to SD*(φ → ¬ψ). Therefore α is a member of SD*(S₀), and we have shown that SD*(A) is a subset of SD*(S₀). To show that SD(S₀) ∼ SD*(S₀) is a subset of SD(A) ∼ SD*(A), suppose that α is a member of SD(S₀) ∼ SD*(S₀). Since S₀ is a subset of S_f′, α must be an element of SD(S_f′) = SD*(S_f′) = SD*(S₀) ∪ SD*(φ → ¬ψ). By hypothesis, though, α is not in SD*(S₀), so it must be a member of SD*(φ → ¬ψ) = SD*(∼A). But the members of SD*(φ → ¬ψ) are all of the SDs belonging to φ & ψ, and these are the SDs composing the set SD(A) ∼ SD*(A). Hence α is a member of SD(A) ∼ SD*(A), and we have shown that condition (4) entails condition (5).

Now suppose that condition (6) holds: i.e., SD*(A) ⊆ SD*(S″) and SD(S″) ∼ SD*(S″) ⊆ SD(A) ∼ SD*(A) for some subset S″ of S. We will show that A is a probabilistic consequence of S″, and hence of S, since S″ is a subset of S. Let us assume that S″ is the set of conditional formulas φ₁ → ψ₁, …, φ_n → ψ_n; if any unconditional formula ψ_i occurs in S, then we can replace it by T → ψ_i, since the latter is a probabilistic consequence of ψ_i (Rule PC2 of Definition 6). Now construct the formula:
B = φ₁ ∨ … ∨ φ_n → ¬(φ₁ − ψ₁) & … & ¬(φ_n − ψ_n).

According to Theorem 3.4, B is a probabilistic consequence of S″. Moreover, it is the case that SD(B) = SD(S″) and SD*(B) = SD*(S″), and therefore SD(B) ∼ SD*(B) = SD(S″) ∼ SD*(S″). That SD(B) = SD(S″) follows from the fact that the antecedent of B is the disjunction of the antecedents of the formulas φ₁ → ψ₁, …, φ_n → ψ_n in S″. Therefore SD(B) is the set of all SDs belonging to φ₁ ∨ … ∨ φ_n, which is the union of the SDs belonging to φ₁, φ₂, etc., which is equal to SD(S″). That SD*(B) = SD*(S″) follows from the fact that Ant(B) & ¬Cons(B) is tautologically equivalent to (φ₁ − ψ₁) ∨ … ∨ (φ_n − ψ_n). SD*(B) is the set of all SDs belonging to the formula Ant(B) & ¬Cons(B), and is therefore the same as the union of the sets of SDs belonging to φ₁ − ψ₁, …, φ_n − ψ_n. But SD*(φ_i → ψ_i) is the set of all SDs belonging to φ_i − ψ_i, so SD*(S″) is equal to the union of all these sets, and is therefore equal to SD*(B). Now, given any two formulas A and B, it can only be the case that
SD*(A) ⊆ SD*(B) if Ant(B) − Cons(B) is a tautological consequence of Ant(A) − Cons(A). For SD*(A) and SD*(B) are, respectively, the SDs belonging to Ant(A) − Cons(A) and Ant(B) − Cons(B), and the set of SDs belonging to one formula is a subset of the set of SDs belonging to a second if and only if the second is a tautological consequence of the first. Therefore Ant(B) − Cons(B) is a tautological consequence of Ant(A) − Cons(A), since SD*(A) ⊆ SD*(S″) = SD*(B). The same kind of argument can be used to show that if SD(B) ∼ SD*(B) ⊆ SD(A) ∼ SD*(A), then the formula Ant(A) & Cons(A) is a tautological consequence of Ant(B) & Cons(B). Therefore Ant(A) & Cons(A) is a tautological consequence of Ant(B) & Cons(B), since SD(B) ∼ SD*(B) = SD(S″) ∼ SD*(S″) ⊆ SD(A) ∼ SD*(A). We have now shown both that Ant(B) − Cons(B) is a tautological consequence of Ant(A) − Cons(A), and that Ant(A) & Cons(A) is a tautological consequence of Ant(B) & Cons(B). It follows from Theorem 3.5 in this case that A is a probabilistic consequence of B, and hence A is a probabilistic consequence of S. This completes the proof.

Theorem 7 presents the key mathematical results of this paper. In the following section we will derive some relatively easy correlates of Theorem 7, some of which have, perhaps, a more immediate intuitive significance than does the main theorem. In concluding this section, note that Theorem 7 provides the basis for a fairly direct decision procedure for determining whether a formula A is a reasonable consequence of a finite set S of formulas. Probably the simplest procedure for determining whether A is a reasonable consequence of S is to construct the reduction sequence of the augmented set S′ = S ∪ {∼A}, and determine whether SD(A) is a subset of SD*(S_f′), where S_f′ is the final term in the reduction sequence of S′. According to condition (4) of the theorem, A is a reasonable consequence of S if and only if SD(A) is a subset of SD*(S_f′).
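This decision procedure is mechanical enough to program directly. In the sketch below the representation and helper names are illustrative assumptions; only the criterion itself, condition (4) of Theorem 7, comes from the text. Formulas are pairs of antecedent and consequent, SDs are truth-value assignments to 'p' and 'q', and the reduction sequence is iterated until it stabilizes:

```python
from itertools import product

ATOMS = ("p", "q")
SDS = list(product((False, True), repeat=len(ATOMS)))  # the state descriptions

def sd(form):       # SD(A): state descriptions satisfying the antecedent
    ant, _ = form
    return {w for w in SDS if ant(*w)}

def sd_star(form):  # SD*(A): state descriptions satisfying Ant(A) & ~Cons(A)
    ant, cons = form
    return {w for w in SDS if ant(*w) and not cons(*w)}

def sd_star_set(forms):
    return set().union(*(sd_star(f) for f in forms)) if forms else set()

def reduction_sequence(forms):
    seq = [forms]
    while True:
        cur = seq[-1]
        nxt = [f for f in cur if sd(f) <= sd_star_set(cur)]  # Red(cur)
        if nxt == cur:
            return seq
        seq.append(nxt)

def reasonable(S, A):
    # Form S' = S + {~A}, where ~A = phi -> ~psi, and apply condition (4).
    S_prime = S + [(A[0], lambda *w: not A[1](*w))]
    last = reduction_sequence(S_prime)[-1]
    return sd(A) <= sd_star_set(last)

# The inference discussed in the illustration below: S = {T -> p v q}, A = ~p -> q.
S = [(lambda p, q: True, lambda p, q: p or q)]
A = (lambda p, q: not p, lambda p, q: q)
print(reasonable(S, A))  # False: the inference is not reasonable
```

Run on the example treated next in the text, the program rejects the tautologically valid inference of ¬p → q from p ∨ q, in agreement with the hand computation.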
The following illustration will help to make clear how this decision procedure works. Suppose that the problem is to determine whether the tautologically valid inference of ¬p→q from p ∨ q is reasonable. In this case S is the singleton set {p ∨ q}, and A = ¬p→q. As always, we begin by replacing all nonconditional formulas by their corresponding conditionals in the standard way. In this case, p ∨ q is replaced by T→p ∨ q, the justification being that the two formulas are each reasonable consequences of the other. The next step is to form the augmented set S′ = S ∪ {¬A}, where ¬A in this case is ¬p→¬q. The next step is to form the reduction sequence, S′₁, ..., S′ₙ, of S′. One simple way is to use the tabular representation of formulas B, and the sets SD(B) and SD̄(B). These are exhibited here, letting a, b, c and d be the
ERNEST W. ADAMS
formulas '¬p & ¬q', 'p & ¬q', 'p & q' and '¬p & q', respectively (see diagram on p. 289). Then we have:

formula B      SD(B)           SD̄(B)
T→p ∨ q       {a, b, c, d}    {a}
¬p→¬q         {a, d}          {d}

SD̄(S′₁) = SD̄(S′) = {a, d}.
The second term of the reduction sequence, S′₂, is the set of all B in S′₁ = S′ such that SD(B) ⊆ SD̄(S′₁): so in this case, S′₂ is the singleton set {¬p→¬q}. S′₃ is determined in similar fashion, by finding all B in S′₂ such that SD(B) ⊆ SD̄(S′₂) = {d}. In this case there are no members B of S′₂ with this property, so S′₃ = Λ. And, since SD(S′₃) = SD̄(S′₃), S′₃ terminates the sequence: i.e., the reduction sequence of S′ is the sequence S′₁, S′₂, S′₃, where:

S′₁ = {T→p ∨ q, ¬p→¬q}    SD̄(S′₁) = {a, d}
S′₂ = {¬p→¬q}             SD̄(S′₂) = {d}
S′₃ = Λ                    SD̄(S′₃) = SD(S′₃) = Λ.
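The tabulated sequence can be checked mechanically. In the following standalone sketch, the encoding of the four state descriptions as pairs (truth of p, truth of q), and the assignment of the letters b and c in particular, are my assumptions; the sets it computes match the table above.

```python
# State descriptions: a = '~p&~q', b = 'p&~q', c = 'p&q', d = '~p&q'
# (values are (truth of p, truth of q); the b/c assignment is assumed).
SDS = {'a': (False, False), 'b': (True, False),
       'c': (True, True),  'd': (False, True)}

def sd(ant):                        # SDs belonging to the antecedent
    return {x for x, (p, q) in SDS.items() if ant(p, q)}

def sd_bar(ant, cons):              # SDs belonging to antecedent & ~consequent
    return {x for x, (p, q) in SDS.items() if ant(p, q) and not cons(p, q)}

b1 = (lambda p, q: True,  lambda p, q: p or q)    # T -> p v q
b2 = (lambda p, q: not p, lambda p, q: not q)     # ~p -> ~q

sdbar_S1 = sd_bar(*b1) | sd_bar(*b2)                      # {'a', 'd'}
S2 = [b for b in (b1, b2) if sd(b[0]) <= sdbar_S1]        # keeps only ~p -> ~q
sdbar_S2 = set().union(set(), *(sd_bar(*b) for b in S2))  # {'d'}
S3 = [b for b in S2 if sd(b[0]) <= sdbar_S2]              # [] : the empty set

# Ordered partition of SD generated by S':
SD1 = set(SDS) - sdbar_S1           # {'b', 'c'}
SD2 = sdbar_S1 - sdbar_S2           # {'a'}
SD3 = sdbar_S2 - set()              # {'d'}
SD4 = set()                         # SD-bar(S'_3) is empty
print(SD1, SD2, SD3, SD4)
```

The computed differences SD₁, ..., SD₄ are exactly the cells of the ordered partition used to build the standard P-ordering.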
Now, SD(A) = SD(¬p→q) = {a, d}, and this is not a subset of SD(S′₃) = Λ, and so it is not the case that A is a reasonable consequence of S. The procedure outlined above also leads immediately to the construction of a P-ordering in which all members of S hold, but A does not hold, if it proves to be the case that A is not a reasonable consequence of S. In particular, if ≼′ is the standard P-ordering associated with the augmented set S′ = S ∪ {¬A}, but SD(A) is not a subset of SD(S′ₙ), then all formulas of S hold in ≼′, but A does not. Thus, in the foregoing example, the ordered partition of SD generated by S′ is the sequence SD₁, ..., SD₄, where SD₁ = SD∼SD̄(S′₁) = {b, c}, SD₂ = SD̄(S′₁)∼SD̄(S′₂) = {a}, SD₃ = SD̄(S′₂)∼SD̄(S′₃) = {d}, and SD₄ = SD̄(S′₃) = Λ. The standard P-ordering is thus determined from the following ordering of the set SD ∪ {F}: F

¬(φ₁ & ¬ψ₁) & ¬(φ₂ & ¬ψ₂), and if P(C) > 1 − δ and P(D) > 1 − ε, then P(C&D) > 1 − δ − ε. Assume first that neither P(φ₁) nor P(φ₂) is zero. Then:
P(¬C) = P(φ₁→¬ψ₁) = P(φ₁ & ¬ψ₁)/P(φ₁) = 1 − P(C) < δ, and

P(¬D) = P(φ₂→¬ψ₂) = P(φ₂ & ¬ψ₂)/P(φ₂) = 1 − P(D) < ε.
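The claim under proof, that the 'defect' 1 − P(C&D) of the quasi-conjunction never exceeds the sum of the defects of its conjuncts, can be spot-checked numerically. The random sampling scheme below is my own; the probability expressions are the ones just displayed (conditional probabilities for the conditionals, and the quasi-conjunction evaluated on the condition φ₁ ∨ φ₂):

```python
import random

def defect_check(seed, points=8):
    """One random model: a weighted finite sample space and random
    events for phi1, psi1, phi2, psi2; returns True iff
    1 - P(C&D) <= (1 - P(C)) + (1 - P(D))."""
    rng = random.Random(seed)
    w = [rng.random() for _ in range(points)]
    tot = sum(w)

    def P(event):                    # event: a set of sample points
        return sum(w[i] for i in event) / tot

    phi1, psi1, phi2, psi2 = (
        set(rng.sample(range(points), rng.randint(1, points)))
        for _ in range(4))

    # P(C) = P(psi1 | phi1), P(D) = P(psi2 | phi2)
    PC = P(phi1 & psi1) / P(phi1)
    PD = P(phi2 & psi2) / P(phi2)
    # C&D: the quasi-conjunction, evaluated on the condition phi1 v phi2
    bad = (phi1 - psi1) | (phi2 - psi2)    # (phi1 & ~psi1) v (phi2 & ~psi2)
    PCD = 1 - P(bad) / P(phi1 | phi2)
    return (1 - PCD) <= (1 - PC) + (1 - PD) + 1e-12

print(all(defect_check(k) for k in range(500)))   # True
```

The small tolerance only absorbs floating-point rounding; the inequality itself is exact, as the remainder of the proof shows.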
By the inequality of general probability theory cited in the proof of Theorem 2,
P((φ₁ & ¬ψ₁) ∨ (φ₂ & ¬ψ₂))/P(φ₁ ∨ φ₂) ≤ P(φ₁ & ¬ψ₁)/P(φ₁) + P(φ₂ & ¬ψ₂)/P(φ₂) < δ + ε.
But, by the elementary calculus of probability,
P(C&D) = 1 − P((φ₁ & ¬ψ₁) ∨ (φ₂ & ¬ψ₂))/P(φ₁ ∨ φ₂) > 1 − δ − ε.

In case either P(φ₁) or P(φ₂) is zero, the proof is even simpler. For example, suppose that P(φ₁) = 0. Then by elementary probability,

P(C&D) = P(φ₁ ∨ φ₂ → ¬(φ₁ & ¬ψ₁) & ¬(φ₂ & ¬ψ₂)) = P(φ₂→ψ₂) = P(D),

hence, clearly P(C&D) > 1 − δ − ε. Thus, we have shown that the 'probability defect' of the conjunction C&D is not greater than the sum of the 'defects' of the conjuncts. Iterating this, it follows directly that if P(φᵢ→ψᵢ) > 1 − ε for i = 1, ..., n, then P(C(S′)) > 1 − nε. We have supposed that C(S′) was so constructed that Ant(A) & ¬Cons(A) tautologically implies Ant(C(S′)) & ¬Cons(C(S′)) and Ant(C(S′)) & Cons(C(S′)) tautologically implies Ant(A) & Cons(A). To shorten writing, let us set A = φ→ψ and C(S′) = γ→μ. Then φ & ¬ψ tautologically implies γ & ¬μ and
γ & μ tautologically implies