PROBABILISTIC LOGIC IN A COHERENT SETTING
TRENDS IN LOGIC
Studia Logica Library
VOLUME 15

Managing Editor
Ryszard Wójcicki, Institute of Philosophy and Sociology, Polish Academy of Sciences, Warsaw, Poland

Editors
Daniele Mundici, Department of Computer Sciences, University of Milan, Italy
Ewa Orłowska, National Institute of Telecommunications, Warsaw, Poland
Graham Priest, Department of Philosophy, University of Queensland, Brisbane, Australia
Krister Segerberg, Department of Philosophy, Uppsala University, Sweden
Alasdair Urquhart, Department of Philosophy, University of Toronto, Canada
Heinrich Wansing, Institute of Philosophy, Dresden University of Technology, Germany
SCOPE OF THE SERIES
Trends in Logic is a book series covering essentially the same area as the journal Studia Logica, that is, contemporary formal logic and its applications and relations to other disciplines. These include artificial intelligence, informatics, cognitive science, philosophy of science, and the philosophy of language. However, this list is not exhaustive; moreover, the range of applications, comparisons and sources of inspiration is open and evolves over time.
Volume Editor Heinrich Wansing
The titles published in this series are listed at the end of this volume.
GIULIANELLA COLETTI University of Perugia, Italy
ROMANO SCOZZAFAVA University of Roma "La Sapienza", Italy
PROBABILISTIC LOGIC IN A COHERENT SETTING
KLUWER ACADEMIC PUBLISHERS DORDRECHT/BOSTON/LONDON
A C.I.P. Catalogue record for this book is available from the Library of Congress.
ISBN 1-4020-0917-8
Published by Kluwer Academic Publishers, P.O. Box 17, 3300 AA Dordrecht, The Netherlands. Sold and distributed in North, Central and South America by Kluwer Academic Publishers, 101 Philip Drive, Norwell, MA 02061, U.S.A. In all other countries, sold and distributed by Kluwer Academic Publishers, P.O. Box 322, 3300 AH Dordrecht, The Netherlands.
Printed on acid-free paper
All Rights Reserved © 2002 Kluwer Academic Publishers No part of this work may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, microfilming, recording or otherwise, without written permission from the Publisher, with the exception of any material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Printed in the Netherlands.
Preface

The theory of probability is usually based on very peculiar and restrictive assumptions: for example, it is maintained that the assessment of probabilities requires an overall design on the whole set of all possible envisaged situations. A "natural" consequence is that the use of probability in the management of uncertainty is often challenged, due to its (putative) lack of "flexibility". Actually, many traditional aspects of probability theory are not so essential as they are usually considered: for example, the requirement that the set of all possible "outcomes" should be endowed with a beforehand given algebraic structure (such as a Boolean algebra), or the aim of getting, for these outcomes, uniqueness of their probability values, with the ensuing introduction of suitable relevant assumptions (such as σ-additivity, conditional independence, maximum entropy, ...), or interpretations (such as a strict frequentist one, which unnecessarily restricts the domain of applicability).
The approach adopted in this book is based on the concept of coherence, which can be framed in the most general view of conditional probability (as proposed by Bruno de Finetti), and it is apt to avoid the usual criticisms, making also a clear-cut distinction between the meaning of probability and the various multifaceted methods for its assessment. In other words, referring to de Finetti's approach is not a "semantic" attitude in favour of the subjectivist position; rather it is mainly a way of exploiting the "syntactic" advantages of this view (which differs radically from the usual one, based on a measure-theoretic framework). For example, in a coherent setting a natural handling of partial probability assessments is possible, and the process of updating is ruled by coherence through an algorithm involving linear systems and linear programming, which does not necessarily lead to unique values of the relevant assessments.
Contrary to what could appear at first glance, dealing with coherence gives rise to a number of delicate and subtle problems, and has little to do with a conventional Bayesian approach. To say the least, in the latter the main emphasis is on the so-called priors and posteriors, which after all are just two particular probability assessments referring to two different "states of information". In our general coherent setting, we refer to an arbitrary family of conditional events and to the corresponding conditional probability assessments, including all their possible revisions. In this way we are able to show how the theory of coherent conditional probability can act as a unifying tool: through a direct assignment of conditional probabilities, we get a general theory of probabilistic reasoning able to encompass also other approaches to uncertain reasoning, such as fuzziness, possibility functions and default reasoning. Furthermore, we put forward a meaningful concept of conditional independence, which avoids many of the usual inconsistencies related to logical dependence. In the last Chapter we give a short account of how to extend our methodology and rules to more general (decomposable) uncertainty measures.
Let us emphasize that we will not attempt here to enter into any controversy concerning whether probability may or may not be the only appropriate tool for reasoning under uncertainty, even if we underline the unifying role of coherent conditional probability.
The book is kept self-contained, provided the reader is familiar with the elementary aspects of propositional calculus, linear algebra and analysis. Much of the material presented appears already, possibly in different form, in many published papers, so that the main contribution of the book is the assembling of it for a presentation within a unified framework.
Finally, we want to express our thanks to an anonymous referee for many valuable comments, and to Barbara Vantaggi for a careful reading of the manuscript and the ensuing useful suggestions.
Contents

1 Introduction 7
  1.1 Aims and motivation 7
  1.2 A brief historical perspective 12
2 Events as Propositions 17
  2.1 Basic concepts 17
  2.2 From "belief" to logic? 18
  2.3 Operations 20
  2.4 Atoms (or "possible worlds") 21
  2.5 Toward probability 24
3 Finitely Additive Probability 25
  3.1 Axioms 25
  3.2 Sets (of events) without structure 26
  3.3 Null probabilities 27
4 Coherent Probability 31
  4.1 Coherence 31
  4.2 Null probabilities (again) 34
5 Betting Interpretation of Coherence 37
6 Coherent Extensions of Probability Assessments 43
  6.1 de Finetti's fundamental theorem 43
  6.2 Probabilistic logic and inference 45
7 Random Quantities 49
8 Probability Meaning and Assessment: a Reconciliation 53
  8.1 The "subjective" view 53
  8.2 Methods of evaluation 55
9 To Be or not To Be Compositional? 57
10 Conditional Events 61
  10.1 Truth values 63
  10.2 Operations 65
  10.3 Toward conditional probability 70
11 Coherent Conditional Probability 73
  11.1 Axioms 73
  11.2 Assumed or acquired conditioning? 74
  11.3 Coherence 76
  11.4 Characterization of a coherent conditional probability 80
  11.5 Related results 90
  11.6 The role of probabilities 0 and 1 94
12 Zero-Layers 99
  12.1 Zero-layers induced by a coherent conditional probability 99
  12.2 Spohn's ranking function 101
  12.3 Discussion 102
13 Coherent Extensions of Conditional Probability 109
14 Exploiting Zero Probabilities 117
  14.1 The algorithm 117
  14.2 Locally strong coherence 122
15 Lower and Upper Conditional Probabilities 127
  15.1 Coherence intervals 127
  15.2 Lower conditional probability 128
  15.3 Dempster's theory 134
16 Inference 137
  16.1 The general problem 137
  16.2 The procedure at work 139
  16.3 Discussion 151
  16.4 Updating probabilities 0 and 1 155
17 Stochastic Independence in a Coherent Setting 163
  17.1 "Precise" probabilities 164
  17.2 "Imprecise" probabilities 179
  17.3 Discussion 186
  17.4 Concluding remarks 190
18 A Random Walk in the Midst of Paradigmatic Examples 191
  18.1 Finite additivity 191
  18.2 Stochastic independence 193
  18.3 A not coherent "Radon-Nikodym" conditional probability 194
  18.4 A changing "world" 197
  18.5 Frequency vs. probability 198
  18.6 Acquired or assumed (again) 202
  18.7 Choosing the conditioning event 202
  18.8 Simpson's paradox 204
  18.9 Belief functions 206
19 Fuzzy Sets and Possibility as Coherent Conditional Probabilities 215
  19.1 Fuzzy sets: main definitions 216
  19.2 Fuzziness and uncertainty 219
  19.3 Fuzzy subsets and coherent conditional probability 225
  19.4 Possibility functions and coherent conditional probability 232
  19.5 Concluding remarks 240
20 Coherent Conditional Probability and Default Reasoning 241
  20.1 Default logic through conditional probability equal to 1 243
  20.2 Inferential rules 247
  20.3 Discussion 251
21 A Short Account of Decomposable Measures of Uncertainty 257
  21.1 Operations with conditional events 258
  21.2 Decomposable measures 262
  21.3 Weakly decomposable measures 266
  21.4 Concluding remarks 270
Bibliography 271
Index 285
Chapter 1
Introduction

1.1 Aims and motivation

The role of probability theory is neither that of creating opinions nor that of formalizing any relevant information in the framework of classical logic; rather its role (seemingly less ambitious) is to manage "coherently" opinions using all information that has been anyhow acquired or assumed. The running of this process requires, first of all, overcoming the barriers created by prevailing approaches, based on trivially schematic situations, such as those relying just on combinatorial assessments or on frequencies observed in the past.
The starting point is a synthesis of the available information (and possibly also of the modalities of its acquisition), expressing it by one or more events: to this purpose, the concept of event must be given its more general meaning, not just looked on as a possible outcome (a subset of the so-called "sample space"), but expressed by a proposition. Moreover, events play a two-fold role, since we must consider both those events which are the direct object of study and those which represent the relevant "state of information": so conditional events and conditional probability are the tools that allow us to manage specific (conditional) statements and to update
degrees of belief on the basis of the evidence. We refer to the state of information (at a given moment) of a real (or fictitious) person, who will be denoted (following de Finetti [53]) by "You". A typical situation is the following: You are not able to give categorical answers about all the events constituting the relevant environment, and You must therefore act under uncertainty. In fact You have - about the problem - some knowledge that should help in assessing degrees of belief in relevant events, singled out by suitable sentences. Even if beliefs may come from various sources, they can be treated as being of the same quality and nature, since the relevant events (including possibly statistical data) can always be considered as being assumed (and not asserted) propositions. We maintain that these beliefs can be measured (also in the management of partial and revisable information in automated reasoning) by probability (conditional or not).
In the aforementioned typical situation, propositions may include equalities or inequalities involving values taken on by random variables: often the latter are discrete, so that each one has a finite (sometimes, countable) range of possible values. The usual probabilistic models refer to a set of random variables, and the relevant joint probability distribution should completely specify the probability values that You assign to all involved propositions. Even if the joint distribution can in principle answer any question about the whole range, its management becomes intractable as the number of variables grows: therefore conditional independence is often assumed to make probabilistic systems simpler. So a belief network (represented by a suitable graph, a DAG, i.e. a directed acyclic graph, having no directed cycles) can be used to represent dependencies among variables and to give a concise specification of the joint probability distribution: the set of random variables makes up the nodes of the graph, while some pairs of nodes are connected
by arrows, whose intuitive meaning is that a parent of a node X (i.e., any node having an arrow pointing to X) has some direct "influence" on X itself; moreover, this influence is quantified by a conditional probability table for each relevant node. Essentially, given all possible envisaged situations (which are usually expressed by uncertain conditional statements, modelling information in a weaker way than that given in the form of "if-then" rules), the problem consists in suitably choosing only some of them, concerning "locally" a few variables (where "locally" means that they are regarded as not being "influenced" by too many other variables).
In this book we discuss (in the framework of conditional events and conditional probability, and giving up any "ad hoc" assumption) how to deal with the following problem, which clearly encompasses that sketched above. Given an arbitrary family E of conditional events (possibly just "a few", at least at the initial stage) and a suitable assessment of a real function P defined on E, You must tackle, first of all, the following question: is this assessment coherent? This essentially means that it can be framed in the most general view of conditional probability as proposed by Bruno de Finetti, which differs radically from the usual one (based on a measure-theoretic approach). For example, its direct assessment allows us to deal with conditioning events whose probability can be set equal to zero, a situation which in many respects represents a very crucial feature (even in the case of a finite family of events). In fact (as we will show), if any positivity condition is dropped, the class of admissible conditional probability assessments is larger, that of possible extensions is never empty, the ensuing algorithms are more flexible, and the management of stochastic independence (conditional or not) avoids many of the usual inconsistencies related to logical dependence.
The concept of coherence privileges probability as a linear operator rather than as a measure, and regards the minimal Boolean algebra (or product of Boolean algebras) spanned by the given events only as a provisional tool to handle coherence, so that this tool can possibly change when new events and new information come to the fore. So, taking de Finetti's approach as starting point is not just a "semantic" attitude in favour of the subjectivist position; rather it is mainly a way of exploiting the "syntactic" advantages of this view by resorting to an operational procedure which allows us to consider, for example, partial probability assessments. Moreover, it is possible to suitably "propagate" the above probability assessments to further conditional events preserving coherence (in the relevant literature this result is known, for unconditional events, as de Finetti's fundamental theorem of probabilities). This process of updating is ruled by coherence through an algorithm involving linear systems and linear programming, and does not necessarily lead to unique values. These aspects, both from the syntactic and the semantic point of view, are discussed at length in the expository papers [23] and [115].
Many real examples are given throughout the text: in particular, some referring to medical diagnosis are discussed as Example 4 in Chapter 2, Example 8 in Chapter 4, and Examples 23, 24, 25 in Chapter 16.
The concept of conditional event plays a central role for the probabilistic logic dealt with in this book: we give up (or better, in a sense, we generalize) the idea of de Finetti of looking at a conditional event E|H as a three-valued logical entity (true when both E and H are true, false when H is true and E is false, "undetermined" when H is false) by letting the third value suitably depend on the given ordered pair (E, H) and not be just an undetermined common value for all pairs. We introduce suitable (partial) operations of sum and product between conditional events (looked on as random quantities), and this procedure gives
rise to the rules of a coherent conditional probability. Contrary to a conventional Bayesian approach, we will not refer to the rigid schematic view privileging just priors and posteriors (not to mention, by the way, that also the role of the so-called likelihood is crucial in the global check of coherence), and we will also get rid of the simplifying assumption of mutually exclusive and exhaustive events. Making inference requires a space "larger" than the initial one (i.e., to consider "new" conditional events), and in our general context it is possible to take into account as relevant information also a new probability assessment (once its "global" coherence - with respect to the previous assessments - has been checked) without resorting to the so-called "second order probabilities" (referring in fact to a new probability assessment as a conditioning event is an awkward procedure, since an event is a logical entity that can be either true or false, and a probability assessment has undoubtedly a quite different "status").
Notice that the very concept of conditional probability is deeper than the usual restrictive view emphasizing P(E|H) only as a probability for each given H (looked on as a given fact). Regarding instead also the conditioning event H as a "variable", we get something which is not just a probability (notice that H also plays, like E, the role of an uncertain event whose truth value is not necessarily given and known). So it is possible to represent (through conditional events) and manage (through coherent conditional probability) "vague" statements such as those of fuzzy theory, and to look on possibility functions as particular conditional probabilities; moreover, a suitable interpretation of the extreme values 0 and 1 of P(E|H) for situations which are different, respectively, from the trivial ones E ∧ H = ∅ and H ⊆ E, leads to a "natural" treatment of default logic. Finally, in Chapter 21 we extend the methodology and rules on which our approach is based to more general uncertainty measures,
starting again from our concept of conditional event, but introducing (in place of the ordinary sum and product) two operations ⊕ and ⊙ for which some of the fundamental properties of sum and product (commutativity, associativity, monotonicity, distributivity of ⊙ over ⊕) are required.

1.2 A brief historical perspective
Bruno de Finetti (1906-1985) lived in the twentieth century, writing extensively and almost regularly from 1926 (at the age of 20) through 1982 (and not only in probability, but also in genetics, economics, demography, educational psychology, and mathematical analysis). He put forward a view not identifying probability as a measure on a σ-algebra of sets, but rather looking at it (and at its generalization, i.e. the concept of prevision) as a linear operator defined on a family of random quantities (e.g., events, looked on as propositions). He was also challenging (since the mid-1920s) the unnecessary limitations imposed on probability theory by the assumption of countable additivity (or σ-additivity): his ideas came to international attention in a series of articles in which he argued with Maurice Fréchet regarding also the status of events assessed with probability zero. Then Fréchet invited de Finetti for a series of lectures at the Institut Henri Poincaré in Paris in 1935, whose content was later published in the famous paper "La prévision: ses lois logiques, ses sources subjectives" [51], where, through the concept of exchangeability, he also established the important connection between the subjective view of probability and its possible evaluation by means of a past frequency.
In the article [52] published in 1949 (and appearing in English only in 1972), de Finetti critically analyzed the formalistic axiomatization of Kolmogorov: he was the first to introduce the axioms for a direct definition of conditional probability (for the connections with Popper measure, see Section 10.3), linking it to the concept of coherence, which allows one to manage also "partial" assessments. All his work exhibits an intuitionist and constructivist view, with a natural bent for submitting the mathematical formulation of probability theory only to the needs required by any practical application.
In the preface to his book [53], de Finetti emphasizes how probabilistic reasoning merely stems from our being uncertain about something: it makes no difference whether the uncertainty relates, for instance, to an unforeseeable future, or to an unnoticed past, or to a past doubtfully reported or forgotten. Moreover, probabilistic reasoning is completely unrelated to general philosophical controversies, such as determinism versus indeterminism: for example, in the context of heat diffusion or transmission, it makes no difference to the probabilistic model whether one interprets the underlying process as being random or strictly deterministic; the only relevant thing is uncertainty, since a similar situation would in fact arise if one were faced with the problem of forecasting the digits in a table of numbers, where it makes no difference whether the numbers are random, or are some segment (for example, the digits between the 2001st and the 3000th) of the decimal expansion of π (possibly available somewhere or, in principle, computable, but unknown to You). The actual fact of whether or not the events under consideration are in some sense determined, or known by other people, is for You of no consequence on the assessment of the relevant probabilities. Probability is the degree of belief assigned by You (the "subject" making the assessment: this is the essential reason why it is called subjective probability) to the "occurrence" (i.e., in being possibly true) of an event.
The most "popular" and well known methods of assessment are
based on a combinatorial approach or on an observed frequency: de Finetti notes that they essentially suggest taking into account only the most schematic data and information, and in the most schematic manner, which is not necessarily bad, but not necessarily good either. Nevertheless these two approaches can be recovered if looked on as useful (even if very particular) methods of coherent evaluation. They are subjective as well, since it is up to You to judge, for example, the "symmetry" in the combinatorial approach or the existence of "similar" conditions in the different trials of the frequentist approach. Not to mention that they unnecessarily restrict the domain of applicability of probability theory. On the other hand, the natural condition of coherence leads to the conclusion that subjective probability satisfies the usual and classic properties, i.e.: it is a function whose range is between zero and one (these two extreme values being assumed by, but not reserved only for, the impossible and the certain event, respectively), and which is additive for mutually exclusive events. Since these properties constitute the starting point in the axiomatic approach, de Finetti rightly claims that the subjective view can only enlarge and never restrict the practical purport of probability theory.
An important remark (that has a strong connection with our discussion of Section 2.2) is now in order: de Finetti makes absolutely clear the distinction between the subjective character of the notion of probability and the objective character of the elements (events, or any random entities whatsoever) to which it refers. In other words, in the logic of certainty there exist only TRUE and FALSE as final (not asserted!) answers, while with respect to the present knowledge of You there exist, as alternatives, certain or impossible, and possible. Other scholars (he claims) in speaking of a random quantity assume a probability distribution as already attached to it: so adopting a different view is a consequence of the unavoidable fact that a "belief" can vary (not only from person to
person, but also) with the "information", yet preserving coherence. Then, besides the above "semantic" argument in favour of keeping "logic" and "belief" distinct, there is also a "syntactic" one: coherence does not single out "a unique probability measure that describes the individual's degrees of belief in the different propositions" (as erroneously stated by Gärdenfors in [67], p.36). In this respect, see also Example 8, Chapter 4, and let us quote again from de Finetti's book [53]: "Whether one solution is more useful than another depends on further analysis, which should be done case by case, motivated by issues of substance, and not - as I confess to having the impression - by a preconceived preference for that which yields a unique and elegant answer even when the exact answer should instead be any value lying between specifiable limits".
We are going to deepen these (and other) aspects in this book; other comments on de Finetti's contributions are scattered here and there in the text, while a much more extensive exposition of the development of de Finetti's ideas (but with a special attention to statistical inference) is in the long introduction of the book [89] by Frank Lad.
Chapter 2
Events as Propositions

2.1 Basic concepts

An event can be singled out by a (nonambiguous) statement E, that is, a (Boolean) proposition that can be either true or false (corresponding to the two "values" 1 or 0 of the indicator I_E of E). Obviously, different propositions may single out the same event, but it is well known how an equivalence relation can be introduced between propositions through a double implication: recall that the assertion A ⊆ B (A implies B) means that if A is true, then also B is true.
Example 1 - You are guessing on the outcome of "heads" in the next toss of a coin: given the events

A = You guess right,
B = the outcome of the next toss is heads,

clearly the two propositions A and B single out the same event. On the other hand, if You are making many guesses and, among them, You guess also on the outcome of "heads" in the next toss of a coin, then B ⊆ A, but not conversely.
Closely connected with each event E is its contrary E^c: if the event E is true, then the event E^c is false, and vice versa. Two particular cases are the certain event Ω (that is always true) and the impossible event ∅ (that is always false): notice that Ω is the contrary of ∅, and vice versa. Notice that only in these two particular cases do the relevant propositions correspond to an assertion. Otherwise the relevant events (including possibly statistical data) need always to be considered (going back to a terminology due to Koopman [87]) as being contemplated (or, similarly, assumed) and not asserted propositions. To make an assertion, we need to say something extralogical or concerning the existence of some logical relation, such as "You know that E is false" (so that E = ∅). Other examples of assertions, given two events A and B, are "A implies B", or "A and B are incompatible" (we mean to assert that it is impossible for them both to occur: the corresponding formal assertion needs the concept of conjunction, see below, Section 2.3).

Remark 1 - In the relevant literature, the word event is often used in a generic sense, for example in statements like "repetitions (or trials) of the same event". We prefer to say (again, following de Finetti) "repetitions of a phenomenon", because in our context an "event" is a single event. It is not simply a question of terminology, since in two different trials (for example, tosses of a coin) we may have that "heads" is TRUE (so becoming Ω) in one toss, and FALSE (so becoming ∅) in the other: anyway, two distinct events are always different, even if it may happen that they take the same truth value.
2.2 From "belief" to logic?
It should be clear, from the previous introduction of the main preliminary concepts (see also the discussion in the final part of Section 1.2), that our approach does not follow the lines of those theories (such as that expounded in the book [67] by Gärdenfors) that try to explain a "belief" (in particular, probability) for arbitrary objects through the concept of epistemic state, and then to recover the logical and algebraic structure of these objects from the rules imposed on these beliefs. We maintain that the "logic of certainty" deals with TRUE and FALSE as final, and not asserted, possible answers, while with respect to a given state of information there exist, as alternatives concerning an event (and measured, for example, by probabilities), those of being certain or impossible, and possible.
To us the concept of "epistemic state" appears too faint and clumsy to be taken as a starting point, and certainly incompatible with our aim to deal with partial assessments. In other words, we do not see the advantages of resorting to it in order to (possibly) avoid presupposing a "minimal" logic of propositions. Not to mention the special role (that we will often mention and discuss in the sequel) that events of probability 0 or 1 have in our setting; in fact, in Gärdenfors' approach (as in similar theories) the so-called "accepted propositions" are identified with those having maximal probability (p.23 of [67]: on the same page it is claimed that "to accept a proposition is to treat it as true in one way or another"). Moreover, on p.39 of the same book Gärdenfors claims: "Some authors, for example de Finetti ... allow that some sentences that have probability 1 are not accepted ... Even if a distinction between acceptability and full belief is motivated in some cases, it does not play any role in this book" (our bold).
On the other hand, in our approach the concept of "accepted" is ... a stranger: a proposition may be "true" (or "false") only if looked on as a contemplated statement, otherwise (if asserted) it reduces to the certain (or impossible) event. Anyway, we do not see how to handle (from a "syntactic" point of view) subtleties
such as "one must distinguish the acceptance of a sentence from the awareness of this acceptance" (cf. again [67], p. 23). We are ready now to recall the classic operations among events, even if we do not presuppose a beforehand given algebraic structure (such as a Boolean algebra, a a-field, etc.) of the given relevant family.
2.3 Operations

We will refer in the sequel to the usual operations among events (such as conjunction, denoted by ∧, and disjunction, denoted by ∨), and we shall call two events A and B incompatible if A ∧ B = ∅ (notice that the implication A ⊆ B can be expressed also by the assertion A^c ∨ B = Ω). The two operations are (as the corresponding ones - intersection and union - between sets) associative, commutative and distributive, and they satisfy the well-known De Morgan's laws.
Considering a family E of events, it may or may not have a specific algebraic structure: for example, a Boolean algebra is a family A of events such that, given E ∈ A, also its contrary E^c belongs to A, and, given any two events A and B of the family, A contains also their conjunction A ∧ B; it follows easily that A contains also the disjunction of any two of its events. But it is clearly very significant if You do not assume that the chosen family of events has such a structure (especially from the point of view of any real application, where You need consider only those events that concern that application: see the following Example 4). On the other hand, E can always be extended (by adding "new" or "artificial" events) in such a way that the enlarged family forms a (Boolean) algebra.
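These operations are easy to experiment with on a machine. The following sketch (Python; the representation of events as sets of "cases" of a provisional truth table is our own illustrative device, not the book's) checks De Morgan's laws and the expression of implication just recalled.

```python
from itertools import product

# A provisional "truth table": each case records the truth values of two
# basic propositions; an event is the set of cases where it is true.
OMEGA = frozenset(product([0, 1], repeat=2))        # the certain event
A = frozenset(c for c in OMEGA if c[0] == 1)
B = frozenset(c for c in OMEGA if c[1] == 1)

def contrary(E):
    return OMEGA - E

def conj(E1, E2):       # conjunction E1 ∧ E2
    return E1 & E2

def disj(E1, E2):       # disjunction E1 ∨ E2
    return E1 | E2

# De Morgan's laws:
print(contrary(conj(A, B)) == disj(contrary(A), contrary(B)))   # True
print(contrary(disj(A, B)) == conj(contrary(A), contrary(B)))   # True
# Implication: A ⊆ B holds iff A^c ∨ B = Ω.
print((A <= B) == (disj(contrary(A), B) == OMEGA))              # True
```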
2.4 Atoms (or "possible worlds")
Each event can clearly be represented as a set of points, and in the usual approaches it is customary to refer to the so-called "sample space", or "space of alternatives": nevertheless its systematic and indiscriminate use may lead to a too rigid framework. In fact, even if any Boolean algebra can be represented (by Stone's theorem: a relevant reference is [119]) by an algebra of subsets of a given set Ω, the corresponding "analogy" between events and sets is nothing more than an analogy: a set is actually composed of elements (or points), and so its subdivision into subsets necessarily stops when the subdivision reaches its "constituent" points; on the contrary, with an event it is always possible to go on with the subdivision. These aspects are discussed at length and thoroughly by de Finetti in [53], p.33 of the English translation. The following example aims at clarifying this issue.
Example 2 - Let X be the percentage of time during which there are (for instance tomorrow, between 9 a.m. and 1 p.m.) more than 10 people in the line at a counter of the nearest post office, and consider the event E = {X = x0} (for example, x0 = 37%). Then E can be regarded either as an "atomic" event (since a precise value such as 37.0 does not admit a further refinement) or as belonging to an infinite set (i.e., the set of the events {X = x : 0 ≤ x ≤ 100}). However it also belongs to the family consisting of just the two events E = {X = x0} and E^c = {X ≠ x0}, and it can be decomposed into E = (E ∧ A) ∨ (E ∧ A^c), where A is the event "at least one woman is in the line", or else into E = (E ∧ B) ∨ (E ∧ B^c), where B is the event "outside it is raining", or else with respect to the partition {A ∧ B, A^c ∧ B, A ∧ B^c, A^c ∧ B^c}, and so on.
Another important aspect pointed out in the previous example is that no intrinsic meaning can be given to a distinction between events belonging or not to a finite or infinite family. In the same
way, all possible topological properties of the sets representing events are irrelevant, since these properties do not pertain to the logic of probabilistic reasoning. Concerning the aforementioned problem of the choice of atomic events, in any application it is convenient to stop, of course, as soon as the subdivision is sufficient for the problem at hand, but ignoring the arbitrary and provisional nature of this subdivision can be misleading. Not to mention that "new" events may come to the fore not only as a "finer" subdivision, but also by involving what had previously been considered as certain.
Example 3 - Given an election with only three candidates A, B, C, denote by the same symbols also the single events expressing that one of them is elected. We have A ∨ B ∨ C = Ω, the certain event. Now suppose that C withdraws and that we know that then all his votes will go to B: so we need to go outside the initial "space" {A, B, C}, introducing a suitable proposition (representing a new information) which is given by E ⊆ A ∨ B, with E = "C withdraws and all his votes go to B". We will see in Chapter 10 how to manage a new information through the concept of conditional event. This example has been discussed by Schay in [108], in the context of conditional probability: we will deal with it again in Chapter 18, Example 33, challenging Schay's argument.
Let us now write down the formal definition that is needed to refer to the "right" partition in each given problem.
Definition 1 - Given an arbitrary finite family

E = {E1, ..., En}

of events, the atoms A1, ..., Am generated by these events are all the conjunctions E1* ∧ E2* ∧ ... ∧ En*, different from the impossible event ∅, obtained by putting (in all possible ways) in place of each Ei*, for i = 1, 2, ..., n, the event Ei or its contrary Ei^c.
Atoms are also called (mainly in the logicians' terminology) "possible worlds". Notice that m ≤ 2^n, where the strict inequality holds if there exist logical relations among the Ei's (such as: an event implies another one; two or more events are incompatible, ...). When m = 2^n (i.e., we have the maximum number of atoms), the n events are called logically independent. This means that the truth value of each of these events remains unknown, even if we assume to know the truth value of all the remaining others.
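Since the atoms are just the consistent patterns of truth values, their enumeration is easily mechanized. A minimal sketch (Python; encoding logical relations as a predicate is our own illustrative choice, not from the book):

```python
from itertools import product

def atoms(n, consistent=lambda p: True):
    """Enumerate the atoms generated by n events: each surviving pattern
    (e1, ..., en) of truth values stands for the conjunction taking Ei
    where ei is True and its contrary where ei is False; `consistent`
    encodes the logical relations by rejecting impossible patterns."""
    return [p for p in product([True, False], repeat=n) if consistent(p)]

# Three logically independent events: the maximum number 2^3 of atoms.
print(len(atoms(3)))                                     # 8

# Imposing A ⊆ B rules out the pattern "A true, B false":
print(len(atoms(2, lambda p: not (p[0] and not p[1]))))  # 3 < 2^2
```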
Definition 2 - Given an arbitrary finite family

E = {E1, ..., En}

of events, let A be the set of relevant atoms. We call indicator vector of each Ei (with respect to A) the m-dimensional vector whose r-th component is

I_Ei(Ar) = 1 if Ar ⊆ Ei ,
I_Ei(Ar) = 0 if Ar ∧ Ei = ∅ .

The usual indicator of an event E corresponds to the trivial partition A = {E, E^c}.
Example 4 - A patient feels serious generalised abdominal pains, fever and retches. The doctor puts forth the following hypotheses concerning the possible relevant disease:

H1 = ileus , H2 = peritonitis , H3 = acute appendicitis, with an ensuing local peritonitis.

Moreover the doctor assumes a natural logical condition such as

H3 ⊆ H1^c ∧ H2 ,

so that the given events are not logically independent. Correspondingly there are then five atoms

A1 = H1 ∧ H2 ∧ H3^c ,  A2 = H1 ∧ H2^c ∧ H3^c ,  A3 = H1^c ∧ H2 ∧ H3^c ,
A4 = H1^c ∧ H2 ∧ H3 ,  A5 = H1^c ∧ H2^c ∧ H3^c .

Clearly, the events H1, H2, H3 have been chosen as the most natural according to the doctor's experience: they do not have any specific algebraic structure and do not constitute a partition of the certain event Ω. Moreover, a doctor often assigns degrees of belief directly to sets of hypotheses (for example, he could suspect that the disease the patient suffers from is an infectious one, but he is not able to commit any belief to particular infectious diseases).
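As a cross-check on Example 4, the enumeration of the five atoms and the indicator vectors of Definition 2 can be reproduced in a few lines (a sketch under our own encoding; nothing here is part of the original text):

```python
from itertools import product

# Truth-value patterns (h1, h2, h3) for (H1, H2, H3); the doctor's
# condition H3 ⊆ H1^c ∧ H2 rejects every pattern where H3 holds
# without H1^c ∧ H2 holding as well.
atoms = [p for p in product([1, 0], repeat=3)
         if not (p[2] and not (p[0] == 0 and p[1] == 1))]
print(len(atoms))                 # 5, as in the list above

# Indicator vectors of H1, H2, H3 with respect to these atoms
# (Definition 2): component r is 1 iff atom Ar implies Hi.
for i in range(3):
    print(f"H{i + 1}:", [p[i] for p in atoms])
```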
2.5 Toward probability

Since in general it is not known whether an event E is true or not, we are uncertain about E. In our framework, probability is looked upon as an "ersatz" for the lack of information on the actual "value" of the event E, and it is interpreted as a measure of the degree of belief in E held by the subject making the assessment. As we shall see in the next chapters, we can only judge, concerning a probability assessment over any set of events whatsoever, whether or not it is among those evaluations which are coherent.
Notice also that a careful distinction between the meaning of probability and all its possible methods of evaluation is essential: ignoring this distinction would be analogous to identifying the concept of temperature with the number shown by a thermometer, so that one would not be entitled to speak of the temperature in a room without a thermometer (these aspects will be further discussed in Chapter 8).
Chapter 3
Finitely Additive Probability

3.1 Axioms

A usual way of introducing probability is through the following framework: given a non-empty set Ω (representing the certain event) and an algebra A of subsets (representing events) of Ω, a probability on (Ω, A) is a real-valued set function P satisfying the following axioms:

(A1) P(Ω) = 1;
(A2) P(A ∨ B) = P(A) + P(B) for incompatible A, B ∈ A;
(A3) P(E) is non-negative for any E ∈ A.
Remark 2 - A simple consequence of (A1)-(A3) is that P(E) = 0 if E = ∅, but (obviously) the converse is not true. Even if we will deal in this book mainly with a "finite world", nevertheless the consideration of (not impossible) events of zero probability is unavoidable (see Section 3.3).

The algebraic condition put on the definition of probability (i.e. the requirement that A be an algebra) strengthens the effectiveness of
axioms (A1)-(A3): for instance, a trivial consequence of the additivity is the monotonicity of P, that is: A ⊆ B implies P(A) ≤ P(B). What is more, they imply that, given any finite partition B = {B1, ..., Bn} ⊆ A of Ω, the probability of any event E belonging to the algebra spanned by B is completely specified by the probabilities P(Bi), Bi ∈ B, since necessarily

    P(E) = Σ_{Bi⊆E} P(Bi) .
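In other words, on the algebra spanned by a partition the whole assessment is determined by finitely many numbers. A one-function sketch (Python; the partition labels and values are hypothetical):

```python
from fractions import Fraction as F

# Probabilities assigned to a three-element partition of Ω:
prob = {"B1": F(1, 2), "B2": F(1, 3), "B3": F(1, 6)}

def P(event):
    """Probability of an event of the algebra spanned by the partition;
    the event is given as the set of partition elements contained in it."""
    return sum(prob[b] for b in event)

print(P({"B1", "B3"}))    # 2/3, forced by additivity
print(P(set(prob)))       # 1, the certain event
```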
3.2 Sets (of events) without structure

In many real situations we cannot expect that the family of events we need to deal with has some algebraic structure. So, if E is just any collection of subsets of Ω, representing events and subject only to the requirement that Ω ∈ E, then (A1)-(A3) are insufficient to characterise P as a probability on (Ω, E): for example, if E contains no union of disjoint sets, (A2) is vacuously satisfied. Moreover, it may even happen that there does not exist an extension of P onto an algebra A containing E, with P satisfying (A1)-(A3).
Example 5 - Let {F, G, H} be a partition of Ω: consider the family

E = {E1 = F ∨ G , E2 = F ∨ H , E3 = G ∨ H , H , Ω}

and the assignment

P(E1) = 4/9 , P(E2) = 4/9 , P(E3) = 2/3 , P(H) = 5/9 , P(Ω) = 1 .

It can be easily verified that (A1)-(A3) hold on E: nevertheless monotonicity of P does not hold, since P(H) > P(E2) while H ⊂ E2. Now, even if we consider the family E' obtained by deleting the event H from E (giving up also the corresponding assessment P(H) = 5/9,
so that monotonicity holds), there does not exist an extension of P on the algebra A spanned by the partition {F, G, H} verifying (A1)-(A3). In fact this extension should satisfy the system

    P(F) + P(G) = 4/9
    P(F) + P(H) = 4/9
    P(G) + P(H) = 2/3
    P(F) + P(G) + P(H) = 1 ,

while we get (by summation)

    2P(F) + 2P(G) + 2P(H) = 4/9 + 4/9 + 2/3 = 14/9 ,

that is

    P(F) + P(G) + P(H) = 7/9 < 1 .
So the extension of P from E' to the algebra A is not a probability on A. This can also be expressed by saying that P is not coherent (see Chapter 4) on E'.
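The summation argument takes a few lines to verify numerically. A minimal sketch (Python, exact rational arithmetic; not part of the original text):

```python
from fractions import Fraction as F

# Example 5: any extension of P to the algebra A would have to satisfy
#   P(F)+P(G) = P(E1),  P(F)+P(H) = P(E2),  P(G)+P(H) = P(E3).
pE1, pE2, pE3 = F(4, 9), F(4, 9), F(2, 3)

# Summing the three equations counts each of P(F), P(G), P(H) twice:
total = (pE1 + pE2 + pE3) / 2
print(total)         # 7/9
print(total == 1)    # False: the extension cannot satisfy (A1)-(A3)
```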
3.3 Null probabilities

If You ask a mathematician to choose at his will - in a few seconds - and tell us a natural number n, he could choose and tell any element of ℕ, such as "the factorial of the maximum integer less than e^27". If You judge that choice as not privileging any natural number with respect to any other one, then a probability distribution expressing all these possible choices is necessarily "uniform", i.e. P(n) = 0 for every n.
This also means that a finitely additive setting may be a better framework than the more usual σ-additive one: obviously, the adoption of finite additivity as a general norm does not prevent us from considering probabilities which are σ-additive (possibly with respect to some particular subfamily of events), when this turns out to be suitable. What is essential is that the latter property be seen as a specific feature of the information embodied in that particular situation and not as a characteristic of every distribution. For a deepening of these aspects, see the interesting debate between de Finetti and Fréchet (the relevant papers must be read in the order [45], [63], [46], [64], [47]), and the expository papers [109], [110]. A concrete example concerning a statistical phenomenon (the so-called first digit problem) is discussed in Chapter 18, Example 30.
A common misunderstanding is one which makes finite or countable additivity correspond to the consideration, respectively, of a finite or infinite set of outcomes: we may instead have an infinite set of possibilities, but this does not imply countable additivity of the relevant probability on this set.
We end this Chapter with an example that concerns a zero probability assessment in a "finite world".
Example 6 - You toss a coin twice and consider the following outcomes, for k = 1, 2:

Sk = the coin stands (e.g., leaning against a wall) at the k-th toss,

and, analogously, denote by Hk and Tk, respectively, heads and tails. The "natural" probability assessments are

    P(Sk) = 0 , P(Hk) = 1/2 , P(Tk) = 1/2 ,

since the events Sk are not impossible, neither logically nor practically, but the classic probability assignments to heads and tails force
P(Sk) = 0. Now, You may wish to assign probabilities to the possible outcomes of the second toss conditionally on the result of the first one, for example conditionally on S1. Even if You had no idea of the formal concept of conditional probability, nevertheless for You "natural" and "intuitive" assignments are, obviously,

    P(S2|S1) = 0 , P(H2|S1) = 1/2 , P(T2|S1) = 1/2 .
As we shall discuss at length in Chapter 11, it is in fact possible to assign directly (i.e., through the concept of coherence and without resorting to the classic Kolmogorov definition) the above probabilities, even if the conditioning event has zero probability. Other "real" examples of zero probability assignments are in [28], [109], [110], and some will be discussed in detail in Chapter 18.
Chapter 4
Coherent Probability

The role of coherence is that of ruling probability evaluations concerning a family containing a "bunch" of events, independently of any requirement of "closure" of the given family with respect to logical operations. Even if its intuitive semantic interpretation can be expressed in terms of a betting scheme (as we shall see in Chapter 5), nevertheless this circumstance must not hide the fact that its role is essentially syntactic.

4.1 Coherence
To illustrate the concept of coherence, consider, for i = 1, 2, ..., n, an assessment pi = P(Ei) on an arbitrary finite family

E = {E1, ..., En} ,

and denote by A1, ..., Am the atoms generated by these events.

Definition 3 - An assessment pi = P(Ei), i = 1, 2, ..., n, on an arbitrary finite family E is called coherent if the function P can be extended from E to the algebra A generated by these events in such a way that P is a probability on A.
In particular, P is then defined on the set of atoms generated by E, and so coherence amounts to the existence of at least one solution of the following system, where xr = P(Ar):

    Σ_{Ar⊆Ei} xr = pi ,   i = 1, 2, ..., n ,
                                                    (4.1)
    Σ_{r=1}^{m} xr = 1 ,   xr ≥ 0 ,   r = 1, 2, ..., m .
Remark 3 - In the above system (containing also m inequalities), the number of equations is n + 1, where n is the number of events of the family E, and the number of unknowns is equal to the number m of atoms. When the n events are logically independent (see Definition 1), any assessment pi (i = 1, 2, ..., n) with 0 ≤ pi ≤ 1 is coherent (cf. de Finetti [53], Vol. 1, p.109 of the English translation).

Example 7 - As we have shown in Example 5, the assessment on E' (and, all the more so, on E) is not coherent. A simple check shows, however, that the assessment obtained by substituting P(E2) = 8/9 in place of P(E2) = 4/9 is instead coherent (even on E). Since the atoms are exactly the same as before, the solution of the corresponding system, that is

    P(F) = 1/3 , P(G) = 1/9 , P(H) = 5/9 ,
follows by an elementary computation.
Notice that in the previous examples we have met instances of the following three situations:
• a probability (that is, a P satisfying (A1)-(A3) on an algebra A) is coherent;
• a coherent function P (on a family E) is the restriction to E of a probability on a Boolean algebra A ⊇ E;
• a function P satisfying (A1)-(A3) on E may not be extendible as a probability.
In conclusion, we have (roughly speaking) the following set inclusion

    P ⊂ C ⊂ F ,
where P is the set of "all" probabilities, C the set of "all" coherent assessments, and F the set of "all" functions P just satisfying (A1)-(A3). Clearly, the latter set (as the previous discussion has pointed out) is not interesting, and we shall not deal any more with it.

Remark 4 - In the previous example the system (4.1) has just one solution. This is a very particular circumstance, since in general this system has an infinite number of solutions, so that a coherent assessment is usually the restriction of many (infinitely many) probabilities defined on the algebra generated by the given events.
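Since coherence is exactly the feasibility of system (4.1), any LP solver can check it. A sketch for Example 7 (Python; scipy is assumed to be available, and the matrix encoding is ours, not the book's):

```python
import numpy as np
from scipy.optimize import linprog

# Atoms are the partition {F, G, H}; row i marks the atoms implying Ei,
# for E1 = F∨G, E2 = F∨H, E3 = G∨H, H, Ω (Examples 5 and 7).
incidence = np.array([[1, 1, 0],
                      [1, 0, 1],
                      [0, 1, 1],
                      [0, 0, 1],
                      [1, 1, 1]], dtype=float)
p = np.array([4/9, 8/9, 2/3, 5/9, 1.0])   # corrected value P(E2) = 8/9

# System (4.1): incidence @ x = p, sum(x) = 1, x >= 0; coherence is
# exactly the feasibility of this linear system.
A_eq = np.vstack([incidence, np.ones(3)])
b_eq = np.append(p, 1.0)
res = linprog(np.zeros(3), A_eq=A_eq, b_eq=b_eq, bounds=[(0, 1)] * 3)
print(res.status == 0)   # True: the assessment is coherent
print(res.x)             # approximately [1/3, 1/9, 5/9]
```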
The following example (a continuation of Example 4) refers to a situation in which the relevant system (4.1) has infinitely many solutions.

Example 8 - We go on with Example 4: the doctor gives (initially: we will deal with updating in Chapters 11, 13, and 16) the following probability assessments:

    P(H1) = 1/2 , P(H2) = 1/5 , P(H3) = 1/8 .
Clearly, this is not a complete assessment (as has been previously discussed), and so the extension of these evaluations to other events - once coherence is checked - is not necessarily unique.
The above (partial) assessment is coherent, since the function P can be extended from the three given events to the set of relevant atoms in such a way that P is a probability on the algebra generated by them, i.e. there exists a solution of the following system with unknowns xr = P(Ar):

    x1 + x2 = 1/2
    x1 + x3 + x4 = 1/5
    x4 = 1/8
    Σ_{r=1}^{5} xr = 1 ,   xr ≥ 0 .
For example, given λ with 0 ≤ λ ≤ 3/40,

    x1 = λ , x2 = 1/2 − λ , x3 = 3/40 − λ , x4 = 1/8 , x5 = 3/10 + λ

is such a solution.
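The one-parameter family of solutions can be verified directly; a small sketch (Python, exact arithmetic; not in the original text):

```python
from fractions import Fraction as F

def candidate(lam):
    """The family of solutions of Example 8, one for each λ in [0, 3/40]."""
    return [lam, F(1, 2) - lam, F(3, 40) - lam, F(1, 8), F(3, 10) + lam]

for lam in (F(0), F(3, 80), F(3, 40)):
    x = candidate(lam)
    ok = (x[0] + x[1] == F(1, 2)             # P(H1) = x1 + x2
          and x[0] + x[2] + x[3] == F(1, 5)  # P(H2) = x1 + x3 + x4
          and x[3] == F(1, 8)                # P(H3) = x4
          and sum(x) == 1
          and all(v >= 0 for v in x))
    print(lam, ok)                           # True for every λ tried
```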
Since (as we shall see in the next Chapter) the compatibility of system (4.1) is equivalent to the requirement of avoiding the so-called Dutch Book, this example (of non-uniqueness of a coherent assessment) can be seen also as an instance of the (important) issue raised at the end of Section 1.2, concerning a quotation from Gärdenfors' book [67].
4.2 Null probabilities (again)

Notice that coherent assessments (even if strictly positive) may possibly assign (compulsorily!) zero probability to some atoms Ar: for example let A, B, C be three events, with C ⊂ A ∧ B, and assess P(A) = P(B) = P(C) = 1/2. This assessment may come from a uniform distribution on the square E = [0,1] × [0,1] ⊂ ℝ², taking, e.g.,

    A = {(x, y) ∈ E : 0 ≤ x < 1 , 0 < y ≤ 1/2} ,
    B = {(x, y) ∈ E : 0 < x ≤ 1 , 0 ≤ y < 1/2} ,

    C = {(x, y) ∈ E : 0 < x < 1 , 0 < y < 1/2} \ {(x, y) ∈ E : x = 1/4} .
The relevant atoms are

    A1 = A ∧ B ∧ C = C ,  A2 = A ∧ B ∧ C^c ,  A3 = A^c ∧ B ∧ C^c ,
    A4 = A ∧ B^c ∧ C^c ,  A5 = A^c ∧ B^c ∧ C^c ,
= ~
x1 +x2 +x4 = 21 x1 + x2 +xa
= 21
5
LXr
= 1
r=1
Xr ~
0.
Its only solution is 1
X1
=2
7
X2
= X3 = X4 = 0
1
7
Xs
=2
7
i.e. it assigns 0 probability to the atoms A2 , A 3 , ~In conclusion, this is another instance of the fact that dealing with zero probability is unavoidable, even in a "finite world"!
Chapter 5 Betting Interpretation of Coherence In the relevant literature, the term "coherence" refers to the betting paradigm introduced by de Finetti (see, e.g., [53]). Our aim is now to show that the two concepts of coherence are syntactically equivalent : for this, we need resorting to a classic theorem of convex analysis (also known as "alternative theorem": for the proof, see, for instance, [66]). Theorem 1 - Let M and N be, respectively, real (k x m) and (h- k) x m matrices, x an unknown (m x 1) column vector, and J.t and v, respectively, (1 x k) and 1 x (h- k) unknown row vectors. Then exactly one of the following two systems of linear inequalities has solution:
Mx>O {
Nx~O
(5.1)
x~O,
p,M + vN::::; 0 { J.t, V~ 0 p,#O.
37 G. Coletti et al., Probabilistic Logic in a Coherent Setting © Kluwer Academic Publishers 2002
(5.2)
38
CHAPTER 5
where 0 is the null row vector.
Remark 5 - The above theorem holds not only for real matrices and vectors, but also if their elements are rational numbers, so that no computational problems, due to the relevant truncations, can arise. By relying on the previous theorem, we can establish the following result:
Theorem 2 - Let B be an (n x m) matrix, y an unknown (m x 1) column vector, and~ an unknown (1 x n) row (real} vector. Then exactly one of the following systems has solution: {
By=O
y;::: 0,
IIYII =1
~B>O,
(5.3) '
(5.4)
m
with
IIYII= L
Yi ·
i=l
Proof- First of all, we note that if (5.1) admits a solution x, then it admits also a solution y, with Yr = Xr/llzll· Take, in the above Theorem, k = 1 and h = 2n+ 1, and take M as the unitary (1 x m) row vector, and N as a (2n x m) matrix, whose first n lines are those of the matrix B, and the remaining those of its opposite -B. m
Then My > 0 gives
L Yr = 1,
that is the second line of (5.3),
r=l
while the (2n
X
1) column vector
Ny equals c~:y)' so that the
first line of (5.3) follows. On the other hand, JLM is a (1 x m) row vector with all components equal to a nonnegative real number J.L, and so the first line of (5.2) gives vN < 0; denoting by v 1 the vector whose columns
BETTING INTERPRETATION OF COHERENCE
39
are the first n columns of v and by v 2 the vector whose columns are the remaining one, this can be written (v1 - v 2 )B < 0. In conclusion, considering a vector .X = v 1 - v 2 with real components, we get .XB < 0, or else (by changing the sign of .X) the form (5.4). • We go back now to the system (4.1), which expresses coherence of the assessent Pi= P(Ei), with i = 1, ... , n, on the finite family £: it can be written in the matrix form (5.3), where B denotes the m x n matrix whose i-th row (i = 1, 2, ... , n) is I~ -Pi, and J~ is the indicator vector of Ei (see Definition 2), that is ] Al E; -
Pi ' . . . '
JAm E; -
Pi .
By Theorem 2, it has a solution if and only if the "dual" system
.XB > 0, has no solutions. Now, putting .XB
(5.5)
= G,
the columns of G are
n
9r
= ~ Ai(I:; -Pi)
r=1,2, ... ,m,
i=l
z. e. all the possible values, corresponding to all the "outcomes" singled-out by the relevant atoms, of the function n
G = ~ >.i(IE; -Pi).
(5.6)
i=l
So the system ( 4.1) has a solution if and only if, for any choice of the real numbers >.i , n
inf G {Ar}
=
inf ~ >.i(IE; -Pi) ~ 0.
(5.7)
{Ar} i=l
And what is the meaning of G? First of all, a possible interpretation of Pi = P(Ei) is to regard it as the amount paid to bet on the event Ei, with the proviso of
CHAPTER 5
40
receiving an amount 1 if Ei is true (the bet is won) or 0 if Ei is false (the bet is lost), so that, for any event E "the indicator I E is just the amount got back by paying P(E) in a bet onE".
It is possible (and useful) to consider, in a bet, also a "scale factor" (called stake) Ai, that is to refer to a payment PiAi to receive - when the bet is won - an amount Ai (we were previously referring to the case Ai = 1: "wealthy people" would choose a bigger .Ai !). Since each Ai is a real number, its consideration is useful also to exploit its sign to make bets in both directions (exchanging the role between bettor and bank, that is the role between the two verbs "to pay" and "to receive"). Following this interpretation, notice that {5.6) represents the random gain for any combination of bets on some (possibly all) events of the given family E : the events Ei on which a bet is actually made are those corresponding to Ai =I= 0 (by the way, this is not equivalent to paying 0 for the events on which we do not bet, since we might pay 0 - and bet - also for some of the former Ei, if Pi= 0; for example, betting on the events Sk, Hk, Tk of Example 6, the expression G = .AI(/s~c - 0)
+ .A2(IH~c
1
- 2)
+ A3(lr~c
1
- 2)
represents the relevant random gain). Then the coherence condition (5.7) -equivalent to the compatibility of system {1) - corresponds to the requirement that the choice of the Pi's must avoid the so-called Dutch-Book: "possible gains all positive" (or all negative, by changing the sign of the ..Xi's). Notice that coherence does not mean that there is at least an outcome in which the gain is negative: it is enough that at least an outcome corresponds to a gain equal to 0. In other words: no sure losers or winners!
BETTING INTERPRETATION OF COHERENCE
41
For example, given A and B, with A 1\ B = 0, You may bet (as bettor) on A and on B by paying (respectively) p' and p", and bet (as bank) on A VB by "paying" -p'- p" (i.e. by receiving p' + p"): this is a coherent combination of bets, since the relevant possible gains are obviously all equal to zero. Remark 6 - Since coherence requires that (5.5) has no solution, it follows that, for any choice of the unknowns Ai 's, the coherent values of the Pi's must render (5.5) not valid. In other words: coherence is independent of the way you bet (that is - according to the sign of Ai - it is irrelevant whether you are paying money being the bettor, or whether you are receiving money being the bank) and it is also independent of your ... "wealth" (that drives the choice of the size of .Ai}. Recall that, given n events, the gain (5.6) refers to any combination of bets on some (possibly all) of these events: they are singled-out by choosing Ai =/:- 0, and so there is no need to mention their number k :::; n. Conversely, we could undertake a number of bets greater than n, i.e. consider some events more than once, say h times, since this is the same as just summing the corresponding .Ai to get h.Ai. Therefore we can express the definition of coherence (for a finite family of events) taking as number of bets any k E lN (choosing from the set { 1, 2, ... , n} some indices with possible repetitions). These (obvious) remarks suggest to retain the definition of coherence (in terms of betting) also for an infinite (arbitrary) family £ of events. Therefore (recalling Definition 3) a real function P defined on £ is called coherent if, for every finite subfamily :F c £, the restriction of P to :F is a coherent probability (i.e., it is possible to extend it as a probability on the algebra g spanned by :F). We proved elsewhere (for details, see [25]) that this is equivalent (similarly to the finite case) to the existence of an extension f of P from £ to the minimal algebra g generated by £ : we need resorting
CHAPTER 5
42
to the system
{
/(/~)=Pi,
f(Ir:) = 1, where f is an unknown linear functional on g, and 1:_. are the indicator functions (see Chapter 2) of the events Ei , defined on the set A of atoms generated by e (their definition is similar to that of the finite case, but allowing infinite conjunctions). If this system has a solution, the function f is a finitely additive probability on g' agreeing with p on e. Moreover, by using an alternative theorem for infinite systems (see, for instance, [61], p.123), it is possible to show that the above system has a solution if and only if the coherence condition (in terms of betting) holds for every finite subfamily of e. Summing up: coherence of a probabilistic assessment on an arbitrary set (that is, the existence of a finitely additive probability 1 on the algebra g spanned bye and agreeing with P on e ) is equivalent to coherence on any finite subset of e ; it is therefore of paramount importance to draw particular attention also to finite families of events.
e
Chapter 6 Coherent Extensions of Probability Assessments 6.1
de Finetti's fundamental theorem
Given a coherent assessment Pi = P(Ei), i = 1, 2, ... , n, on an arbitrary finite family£ = {E1 , ... ,En}, consider a further event En+ 1 and the corresponding extended family IC = £ U { En+l} . If En+l is logically dependent on the events of £, i.e. En+l is a union of some of the atoms Ar generated by £, then, putting Xr = P(Ar), we have Pn+l
=
Lr
Xr,
Ar 0, that t 2 + t 3 + t 4 < (t 3 + t4)t(BIK) (which is, taking into account (10.8) and (10.9), a contradiction). On the other hand, if t(Hl!l) = 0, then t 1 = t 2 = ts = ta = t 9 = t 10 = 0, so that t 3 + t 4 = (t 3 + t 4)t(BIK), which implies either t(BIK) = 1 ~ t(AIH) or t 3 = t 4 = o, that is t(Kl!l) = 0. In the latter case, consider again eq.(10.7), but substitute, in the above two triples, HV Kin place of!l; putting Zr = t(ArlHV K), by (10.5) we get again (*) and (o), with Zr in place of tr. Then, arguing as above (with Zr in place of tr) and assuming t(BIK)- t(AIH) < 0, we would get again, if t(HIH V K) > 0, a contradiction; and when t(HIH V K) = 0, arguing again as above we would get either t(BIK) = 1 ~ t(AIH) or t(KIH V K) = 0, but the latter is impossible, since
t(KIH V K) ~
t(KIH V K)
10.3
+ 0 = t(KIH V K) + t(HIH V K)
+ t(H 1\ KCIH V K)
=
~
t(H V KIH V K) =
1. •
Toward conditional probability
To conclude this Chapter on conditional events, and to pave the way for the extension of the concept of coherence to conditional probability, some important remarks are now in order.
CONDITIONAL EVENTS
71
• The above conditions (i) ', (ii), (iii) coincide exactly with the axioms given by de Finetti in 1949 (see (52]) to define a conditional probability (taking go as set of conditioning events). They will be reported again- introducing the usual symbol P(·l·) in place oft(·l·)- at the beginning of the next Chapter. Properties (i)' and (iii) are also in the definition of "generalized conditional probability" given by Renyi in (104], where condition (ii) is replaced by the stronger one of o--additivity (obviously, the two conditions are equivalent if the algebra E is finite). Actually, in (104] Renyi takes, as set of conditioning events, an arbitrary family B (not requiring to be an additive one), and this choice may entail some "unpleasant" consequences (see Section 11.5). Popper also dealt with conditional probability from a general point of view (which includes its "direct" assignment, and the possibility of zero probability for the conditioning event) in a series of papers, starting from 1938 (in Mind, vol. 47), but in this paper - as Popper himself acknowledges in the new appendix *II of the book [101] - he did not succeed in finding the "right" set of axioms. These can be found (together with a discussion of many interesting foundational aspects) in the long new appendix *IV of the same book, where he claims: "I published the first system of this kind only in 1955" (in British Journal for the Philosophy of Science, vol. 6). Popper's definition (based essentially on the same axiom system as that given by de Finetti) is known in the relevant literature as "Popper measure" . Other relevant references are Csaszar (43], Krauss (88] and Dubins (56]: see the discussion in Section 11.5. • Conditions (i)-(iii) hold even for a family C which is not the cartesian product of an algebra and an additive set: this re-
72
CHAPTER 10
mark will be the starting point for the introduction of the concept of coherent conditional probability. On the other hand, since the value p = t(EiH) is the amount paid to bet on EjH, it is obviously sensible to regard this function as the natural "candidate" to be called (on a suitable family of conditional events) conditional probability: recall in fact the particular case of an event E - corresponding to H = n and the more general case of a random variable Y, in which the analogous amounts correspond, respectively, to probability and to prevision. • From the point of view of any real application, it is important not assuming that the family C of conditional events had some specific algebraic structure. We stress that only partial operations have been introduced for conditional events: in fact we did neither refer to Boolean-like structures, nor try to define logical operations for every pair of conditional events. In the relevant literature (going back to the pioneering paper by de Finetti [49] and, more recently, to Schay [108], Bruno and Gilio [16], Calabrese [17], Dubois and Prade [59], Goodman and Nguyen [78]) there are different proposals on how to define, for example, conjunction and disjunction between any two conditional events. Many pros and contras concerning the "right" choice among these different possible definitions of operations for conditional events are discussed in [75].
Chapter 11 Coherent Conditional Probability The "third" value t(EIH) of a conditional event has been interpreted (following the analogy with the probability of an event and the prevision of a random variable) as the amount paid to bet on El H. As we have seen in the discussion in the last Section of the previous Chapter, this entails - in a sense, "automatically" - the axiomatic definition of conditional probability (in the sequel we will identify the set C of conditional events and that C8 of their Boolean supports).
11.1
Axioms
Definition 5 -If the set C = g x so of conditional events EIH is such that g is a Boolean algebra and S ~ g is closed with respect to (finite) disjunctions (additive set), then a conditional probability on g x so is a function P -+ [0, 1] satisfying the following axioms (i) P(HIH) = 1, for every HE so (ii) P(·IH) is a (finitely additive) probability on g for any given 73
G. Coletti et al., Probabilistic Logic in a Coherent Setting © Kluwer Academic Publishers 2002
74
CHAPTER 11
HE8° {iii} P((EI\A)IH) = P(EIH) ·P(AI(EI\H)), for every A E and E, HE 8°, E 1\ H f= 0.
g
Axiom {iii) can be replaced by {iii}' P(AIC) = P(AIB)P(BIC) if A ~ B ~ C, with A E g and B, C E 8°; in fact, by {iii} with A~E=B~H=C,
we get {iii} '. Conversely, from {iii} ', taking in particular the three events E 1\ A 1\ H ~ E 1\ H ~ H , we have P( (E 1\ A 1\ H)IH)
= P( (E 1\ A 1\ H)I(H 1\ E) )P( (E 1\ H)IH),
that is, taking into account that P((F 1\K)IK) = P(FIK), axiom {iii}. Putting P(·IH) =PH(·), property {iii} can be written (11.1)
This means that a conditional probability PH (·) is not singledout by its conditioning event H, since its values are bound to suitable values of another conditional probability, i.e. PE/\H(-). Then Pn(·) cannot be assigned (so to say) "autonomously". On the contrary, what is usually emphasized in the literature - when a conditional probability P(EIH) is taken into account is only the fact that P(·IH) is a probability for any given H: this is a very restrictive (and misleading) view of conditional probability, corresponding trivially to just a modification of the "world" (or sample space) n.
11.2
Assumed or acquired conditioning?
It is essential to regard the conditioning event H as a "variable", i.e. the "status" of H in EIH is not just that of something repre-
COHERENT CONDITIONAL PROBABILITY
75
senting a given fact, but that of an (uncertain) event (like E) for which the knowledge of its truth value is not required (this means, using Koopman's terminology, that H must be looked on as being contemplated, even if asserted: similar terms are, respectively, assumed versus acquired). So, even if beliefs may come from various sources, they can be treated in the same way and can be measured by (conditional) probability, since the relevant events ( including statistical data!) can always be considered as being assumed propositions (for example, the "statistical" concept of likelihood is nothing else that a conditional probability seen as a function of the conditioning event). An interesting aspect can be pointed out by referring to a situation concerning Bayesian inferential statistics: given any event H (seen as hypothesis), with prior probability P(H), and a set of events E 1 , ... , En representing the possible statistical observations, with likelihoods P(E1 IH), ... , P(EniH), all posterior probabilities P(HIE1), ... , P(HIEn) can be pre-assessed through Bayes' theorem (which, by the way, is a trivial consequence of conditional probability rules). In doing so, each Ek (k = 1, ... , n) is clearly regarded as "assumed". If an Ek occurs, P(HIEk) is chosen - among the prearranged posteriors - as the updated probability of H: this is the only role played by the "acquired" information Ek (the sample space is not changed!). In other words, the above procedure corresponds (denoting the conditional probability P(HIE) by p) to regard a conditional event HIE as a whole and to interpret p as (look at the position of the brackets!) "the probability of [H given E]" and not as "[the probability of H], given E". On the other hand, the latter interpretation is unsustainable, since it would literally mean "if E occurs, then the probability of H is p", which is actually a form of a logical deduction leading to absurd conclusions (for a very simple situation, see Example
CHAPTER 11
76
35 in Chapter 18). So we are able to challenge a claim of Schay (at the very beginning of the paper [108]) concerning ... the position of the brackets in the definition of the probability of H given E .
11.3
Coherence
Let us now discuss the main problem of this Chapter: how to assess P on an arbitrary set C of conditional events. Similarly to the case of unconditional probabilities, we give the following Definition 6 - The assessment P(·l·) on an arbitrary family c =cl Xc2 of conditional events is coherent if there exists C' 2 c, with C' = Q x 8° (Q a Boolean algebra, B an additive set, with B ~ Q), such that P (·I·) can be extended from C to C' as a conditional probability. Notice that a conditional probability on C is also (obviously) a conditional probability on a subfamily C" ~ C' with the same algebraic features: therefore Definition 6 can be formulated with reference to the minimal algebra g generated by cl u c2 and to the minimal additive set B generated by C2 . Among the peculiarities (which entail a large flexibility in the management of any kind of uncertainty) of this concept of coherent conditional probability versus the usual one, we mention the following two: • due to its direct assignment as a whole, the knowledge (or the assessment) of the "joint" and "marginal" unconditional probabilities P(E 1\ H) and P(H) is not required; • moreover, the conditioning event H (which must be a possible one) may have zero probability, but in the assignment of P(EIH) we are driven by coherence, contrary to what is done in those treatments (see, e.g. (65]) where the relevant
COHERENT CONDITIONAL PROBABILITY
77
conditional probability is given an arbitrary value in the case of a conditioning event of zero probability. For a short discussion of the approach to conditioning with respect to null events in the framework of the so-called Radon-Nykodim derivative, see Chapter 18, Example 32. On the other hand, if Pn (·) = P( ·) is strictly positive on 8°, we can write, putting H = n in (11.1),
P(E A A) = P(E) · P(AIE) . Then- in this case- all conditional probabilities P(·IE), for any E, are uniquely determined by a single "unconditional" P (as in the usual Kolmogorov's definition), while in general- see the next Theorem of this Chapter - we need a class of probabilities P0 's to represent the "whole" conditional probability. Now, a question arises: is it possible, through a suitable alternative theorem, to look again (as in the case of a probability of events) on the above definition of coherence from the point of view of a "semantic" interpretation in terms of betting? Since a conditional event is a (particular) random quantity, it seems reasonable to go back to the interpretation of coherence given in terms of betting for random quantities. Given an arbitrary family C of conditional events, we can refer, as in the case of (unconditional) probabilities (see the discussion in the final part of Chapter 5), to each finite subset
So, denote by Pi>..i the amount paid to undertake a bet on each EiiHi to receive T(EiiHi)>..i, where T(EIH) is given by (10.1), with Pi = t(EiiHi)· It follows, by (7.3) and (10.1), the following expression for the relevant gain n
G=
L >..i(T(EiiHi) -Pi) = i=l
CHAPTER 11
78 n
n
i=l
i=l
= E >..i(IEi/\Hi + Pi(1 -]Hi) -Pi) = E AilHi (IEi
-Pi) .
(11.2)
Recall that the condition of coherence requires that, for any choice of the real numbers >..i, the values of the random gain G are not all positive (or all negative), where "all" means for every possible outcome corresponding to the m atoms generated by the events E1, ... ,En, H1, ... , Hn. Now, by a slight modification of the argument in Chapter 5, this requirement is equivalent to the compatibility - in place of system (4.1) or, equivalently, (5.3)- of the following system, whose unknowns Xr are the probabilities of the atoms An
L
Xr
-Pi
Arc;;,Ei/\Hi
L
Xr
= 0,
i = 1, 2, ... , n
Arc;;,Hi
(11.3)
m
E
Xr
= 1'
Xr
~ 0'
r = 1, 2, ... ,m.
r=l
But, since the first n equations are all homogeneous, many of them can be satisfied giving zero to all the relevant unknowns, independently of the choice of the assessments Pi· In particular, if the disjunction H~ of the Hi's is properly contained in n (that is, there is some atom contained in Hf A... A H~), then we can put equal to zero the probabilities of the remaining atoms, so that all the first n equations are trivially satisfied. Therefore we may have a solution of system (11.3)- and then a gain with values not all of the same sign- not because of a "good" (coherent) choice of the Pi's, but just because of a combination of n bets that have been all called off: in fact, each atom Ar singling-out the outcome corresponding to this combination of bets is contained in all Hf 's, so that Ar A Hi= 0 for any i = 1, 2, ... , n; then, by (11.2), this outcome gives G = 0 for any choice of the p/s. This arbitrariness is not allowed for those indices i (possibly all) such that Xr = P(Hi) > 0, (11.4) Arc;;, Hi
L
COHERENT CONDITIONAL PROBABILITY
79
since in this case system (11.3) gives
P(Ei 1\ Hi) P(Hi)
(11.5)
and each amount Pi plays the role of a "natural candidate" to extend the results of Chapters 4 and 5 from a probability P(Ei) to a conditional probability P(EiiHi)· Moreover, with respect to the subfamily singled-out by (11.4), the choice (11.5) is a coherent one, since, denoting by
the value of G corresponding to the outcome Ar, where the sum is over the Hi's satisfying (11.4), we get
~r XrGr = ~r Xr ~i AiiHJAr (JE; =
~i Ai ( ~r
Xr - Pi
Arr;,_E;I\H;
=
~- Ai ( z
~
r Arr;,_E;I\H;
~r
-Pi)
Xr)
=
=
Arr;;_H; Lr
Xr
Xr- Arr;,_E;I\H;
L
A CrH· r_
X
r
~
r Arr;,_H;
Xr)
= 0. '
•
then, since the real numbers Xr ~ 0 are not all equal to zero, the possible values Gr of G are neither all positive nor all negative. In conclusion, in the general case (i.e., without any restrictive assumption ofpositivity), system (11.3) is not apt -contrary to the case of the analogous system (4.1) for unconditional events - to characterize coherence. It is only a necessary condition, entailing the relations
which may hold possibly even as 0 = 0, and so in this case the choice of Pi is arbitrary (even negative or greater than 1 ... ) .
80
CHAPTER 11
In order to cope with this situation, de Finetti [53] introduced the so-called "strengthening" of coherence, which amounts to require that, for any choice of the real numbers .\i, the values {11.2} of the random gain G are not all positive (or all negative), where "all" means for every possible outcome corresponding to the m atoms generated by Et, ... , En, H 1 , ... , Hn and contained in n
Ho= V{Hi: Ai
i- 0}.
1
For details, see the discussion in [31], where it is shown that this strengthened form of coherence (we call it dF-coherence) is equivalent to that given by Definition 6 in terms of extension of the function P( ·I·) as conditional probability. Slightly different formulations of dF-coherence have been given, e.g., in [94], [84], [102]; in [69] coherence is characterized by resorting (not to betting, but) to the so-called "penalty" criterion. Remark 8 - In terms of betting, dF-coherence of an assessment P( ·I·) on an arbitrary family C of conditional events is equivalent - as in the case of (unconditional) probability - to dF-coherence of P(·l·) on every finite subset :F ~ C. Then, since in a finite set dF-coherence is equivalent to the formulation given in terms of extension of P( ·I·) as conditional probability, it follows easily that we can refer as well {for the formulation based on Definition 6) to every finite subset of C.
11.4
Characterization of a coherent conditional probability
We characterize coherence by the following fundamental theorem ([21], [22]), adopting an updated formulation, which is the result of successive deepenings and simplifications brought forward in a series of papers, starting with [25].
COHERENT CONDITIONAL PROBABILITY
81
Theorem 4 - Let C be an arbitrary family of conditional events, and consider, for every n E IN, a finite subfamily
:F = {E1IHb ... , EniHn} ~ C; we denote by Ao the set of atoms Ar generated by the (unconditional) events E1, H1, ... , En, Hn and by g the algebra spanned by them. For an assessment on C given by a real function P, the following three statements are equivalent: (a) P is a coherent conditional probability on C; {b) for every n E IN and for every finite subset :F ~ C there exists a sequence of compatible systems, with unknowns x~ ~ 0, Lr
X~= P(EiiHi) Lr X~,
~~~A~
[if
Er Ar~H;
~~~
x~- 1 = 0, o: ~ 1]
(i = 1, 2, ... , n)
with o: = 0, 1, 2, ... , k S n, where H~ = H 0 = H 1 V... VHn and Hg denotes, foro:~ 1, the union of the Hi's such that Er x~- 1 = 0; Ar~H;
(c) for every n E IN and for every finite subset :F ~ C there exists (at least) a class of (coherent) probabilities {P[, P{, ... P[}, each probability P[ being defined on a suitable subset Aa ~ Ao {with Aa' C Aa" for o:' > o:" and P:, ( Ar) = 0 if Ar E Aa' ) such that for every G E g , G =/=- 0 , there is a unique P[, with Lr Pt(Ar) > 0 j
(11.6)
Ar~G
moreover, for every EiiHi E :F there is a unique PJ satisfying {11.6) with G = Hi and o: = /3, and P(EiiHi) is represented in the form
(11.7)
CHAPTER 11
82
Proof- We prove that (a)
=}
(b).
Suppose that P (defined on C) is coherent, so that it is coherent in any finite subset F ~ C ; put F = F 1 x F 2 , and denote by the same symbol P the extension (not necessarily unique) of P, which is (according to Definition 6) a conditional probability on g x B, where B is the additive class spanned by the events {H1 , ... , Hn} = F 2 and g the algebra spanned by the events { E 1, ... , En, H1, ... , Hn} = F1 U F2; soP satisfies axioms (i), (ii), (iii) of Definition 5. Put
Po(·) = with
P(·!H~), n
H~
=V Hi.
(11.8)
1
The probability P0 is defined on g and so, in particular, for all Ar ~ H~; notice that for at least an Hi we have P0 (Hi) > 0, and we have P0 (Ar) 0 for Ar CZ. H~. Then define recursively, for a 21,
Pa(·) = P(·!H~), with (11.9) Each probability Pa is defined on g and so, in particular, for all Ar ~ H:!; notice that for at least an Hi ~ H:! we have Pa(Hi) > 0, and we have Pa(Ar) = 0 for Ar CZ. H:!. Obviously, by definition of H:! and Pa there exists k ~ n such that a ~ k for any a ; moreover, for every Hi there exists f3 such that Lr Pfi(Ar) > 0 Arc;;_H;
holds. On the other hand, for every K E B, the function P(·!K) is a probability, and if Hi ~ K we have, by (11.1) in which H, A, E are replaced, respectively, by K, Ei, Hi, that
COHERENT CONDITIONAL PROBABILITY
83
Since Hr; E B, and the probabilities PK(Ei/\Hi) and PK(Hi) can be expressed as sums of the probabilities of the relevant atoms, then condition (b) easily follows by putting X~= Pa(Ar)
(notice that in each system (Sa) the last equation does not refer to all atoms as in system (11.3) -which coincides with (So) -but only to the atoms contained in Hr; ) . To prove that (b) implies (a), i.e. that on C the assessment Pis coherent, we show that P is coherent on each finite family :F ~ C (see Remark 8). Consider, as family C' 2 :F (recall Definition 6), the cartesian product Q x B, and take any event FIK E C' = Q x B. Since B is an additive set, then K is a disjunction of some (possibly all) of the Hi's: let f3 be the maximum of the indexes a's such that K ~ Hr; (i.e., the corresponding system (S 13 ) contains all the equations relative to the Hi ~ K ). Therefore the solution x~ = P13 ( Ar) of this system is nontrivial for at least one of the aforementioned equations and K C£. H~+l ; it follows P13(K)
=
Lr X~ Ar 0, P{(H2 ) = 1 > 0. For A = ~ we have P~'(Hs) = 0, so that H~ = A1 V As. Solving (Sa) for a = 1 gives u1
= 0,
us= 1,
with Ur = P{' (Ar). Notice that for the unique element of this class satisfying (11.6}, with G = Hs, we have P{'(Hs) = 1 > 0. In conclusion, we have found three classes - those defined under (c)- i.e.: {P0 } , {P~, P{}, {P~', P{'}; the corresponding representations (11. 7) for p 1 = P(E1 IHI) = 1 are
A 0 + 0 +A l2
- 0+0+ ~
=
-
P{(As) P{(A1 V A2 V As)
P"(A 0 s) P~'(A1 V A2 V As)'
and similar expressions can be easily obtained to represent P2 and P3·
Remark 11 - As we have seen in the previous Chapter (Theorem 3), a conditional probability can be, in a sense, regarded as a sort of monotonic function, that is
AIH ~0 BIK ~ T(AIH) ~ T(BIK) '
(11.10)
COHERENT CONDITIONAL PROBABILITY
89
where T is the truth-value defined by ( 10.1) and the inequality (obviously) refers to the numerical values corresponding to every element of the partition obtained as intersection of the two partitions {A 1\ H, Ac 1\ H, ne} and {B 1\ K, BC 1\ K, Kc}. Recalling that the present notation for t( ·I·) is P( ·I·) and that it is easy to check by a complete scrutiny of the truth-values of the relevant (unconditional) events - the validity of {11.10) when the truth-values of the given conditional events are 0 or 1, we can easily show that {11.10) characterizes coherence. The relevant system (So) is
(So)
I
+ x2 + Xs + x6 + Xg + x10) Xt + X2 + Xa +~4 = P(BIK)(xl + X2 + x 3 + X4 + x 5 ) X1 + ... + Xu -1 x1
= P(AIH)(xl
Xr
2:: 0
where the unknowns Xr 's are the probabilities of the eleven atoms introduced in the proof of Theorem 3. Notice that, to take into account of the possibility that P(H) = 0 or P(K) = 0, we need to go on by considering also system (SI). The computations are {"mutatis mutandis") essentially those already done in the just mentioned proof. The following theorem shows that a coherent assignment of P(·l·) to a family of conditional events whose conditioning ones are a partition of n is essentially unbound.
Theorem 5 - Let C be a family of conditional events {EiiHihEh where card(!) is arbitrary and the events Hi's are a partition of n. Then any function p: C -t [0, 1] such that
is a coherent conditional probability. Proof- Coherence follows easily from Theorem 4 (the characterization theorem of a coherent conditional probability); in fact, for any finite subset :F ~ C we must consider the relevant systems
CHAPTER 11
90
(Sa): each equation is "independent" from the others, since the events Hi's have no atoms in common, and so for any choice of P(EiiHi) each equation (and then the corresponding system) has trivially a solution (actually, many solutions). •
11.5
Related results
As already (briefly) discussed in Section 10.3, in (104] Renyi considers axioms (i}-(iii} for a (countably additive) function P(·l·) defined on g x 8°, where g is an algebra of subsets of nand Ban arbitrary subset of g (let us call such a P(·l·) a weak conditional probability). While a conditional probability - as defined in Section 11.1, Definition 5- is (trivially) coherent, a weak conditional probability may not be extendible as a conditional probability, i.e. it is not necessarily coherent (in spite of the fact that g is an algebra, and even if we take g and B finite), as shown by the following
Example 13 -Let A, B, C, D events such that A= B 1\C 1\D and B ~ C V D , B ~ C , B ~ D . Denote by g the algebra generated by the four given events, and take B = {B, C, D}. Among the assessments constituting a weak conditional probability P(·l·) on g x 8° , we may consider the one which takes, in particular, for the restrictions (unconditional probabilities) P(·IB), P(·IC), P(·ID), the following values : P(AIB) = 0 , P(AIC) = P(AID) =
~;
it satisfies (trivially} axioms (i}-(iii), but P is not coherent: in fact, extending it to the additive class generated by B, we must necessarily have
P(AIC V D)= P(AiC)P(CIC V D)= P(AID)P(DIC V D), (*)
COHERENT CONDITIONAL PROBABILITY
91
which implies P(CICV D)= P(DICV D). So at least one of these two conditional probabilities is positive, since
P(CICV D) +P(DICV D)~ 1, and then, by (*}, P(AIC V D) > 0. But
P(AIC V D)
= P(AIB)P(BIC V D) = 0
(contradiction). Renyi proves that a weak conditional probability can be obtained by means of a measure m defined in g (possibly assuming the value +oo) by putting, for every B E 8° such that 0 < m(B) < +oo and for A E g, (11.12) Conversely, he finds also a sufficient condition for a weak conditional probability P( ·I·) to be represented by a measure m in the sense of (11.12). Renyi poses also the problem of finding conditions for the existence of a class of measures {ma} (possibly assuming the value +oo) that allows- for every BE 8° such that 0 < ma(B) < +oo for some o: -a representation such as (11.12), with m= ma. Moreover (in the same year - 1955 - and in the same issue of the journal containing Renyi's paper), Csaszar (43] searches for a weak conditional probability P on g x 8° such that there exists a dimensionally ordered class of measures Ma defined in g , apt to represent, for any AIB E g x 8°, the function P. This means that, if A E g and P,-y(A) < +oo for an index-y, then p,p (A) = 0 for f3 < 1' ; moreover, if for every B E 8° there exists an o: such that 0 < Ma(B) < +oo, then (11.12) holds with m= /1-a. He proves that a necessary and sufficient condition for P to admit such a representation is the validity of the following condition
(C):
CHAPTER 11
92
{C) If Ai ~ Bi 1\ Bi+ 1 (with Ai Bn+l = B1 ), then
E
n
n
i=l
i=l
g, Bi
E
B 0 , i = 1, ... , n, and
II P(AiiBi) = II P(AiiBi+l) . Notice also that this condition was obtained by Renyi as a consequence of axioms (i)-(iii) in the case in which the family B is an additive set (that is, when the weak conditional probability is a conditional probability according to Definition 5); and Csaszar proves that (C) implies that P can be extended in such a way that the family B is an additive set. On the other hand, in 1968 Krauss [88] goes on by considering (in a finitely additive setting) a function P( ·I·) satisfying axioms (i)-(iii) on g x A 0 , with g and A Boolean algebras and A~ g (let us call this P(·l·) a strong conditional probability, which is, obviously, a conditional probability). In particular, P is called a full conditional probability when A= g. (We recall also that Dubins [56] proves that a strong conditional probability can always be extended as a full conditional probability, while Rigo [105] proves that a weak conditional probability can be extended as a full conditional probability if and only if condition (C) of Renyi-Csaszar holds). Krauss characterizes strong conditional probabilities in terms of a class of (nontrivial) finitely additive measures ma (not necessarily bounded), each defined on an ideal Ia of A, with If3 ~ Ia for (3 > a: for every B E A 0 there exist an ideal Ia such that
B E Ia \ U{.T,. : I,. ~ Ia} r
and for every A E Ia one has ma (A)
A
E
Uf4 : I,.
= 0 if and only if
~ Ia} U {0} ;
r
then, for any Ia and A, B E Ia ,
ma(A 1\ B)
= P(AIB) ma(B) .
COHERENT CONDITIONAL PROBABILITY
93
Notice that, if in our Theorem 4 (characterizing coherence) we take the set C = g x .A0 , with g and .A finite Boolean algebras and .A ~ g (in this case coherence of P is obviously equivalent to satisfiability of axioms {i}-{iii}), Krauss' theorem corresponds to the equivalence between conditions (a) and (c), with ma(·) = ma(Hg) Pa(-), and the family {Pa} is unique (as already observed after the proof of characterization theorem). We stress that none of the existing "similar" results on conditional probability (including those concerning weak and strong conditional probabilities) covers our framework based on partial assessments. In fact, for both Csaszar and Krauss (and Renyi), given a P(·l·) on g x 8° , the circumstance that g (and, for Krauss, also B ) are algebras plays a crucial role, as well as the requirement for P to satisfy condition {C): notice that both the subsets Ia and the measures ma need (to be defined) values already given for P, and the same is true for checking the validity of (C). In particular, to build the family {ma} Krauss starts by introducing, for any given event BE 8°, :F(B)
= {Bi
E ~: P(BIB V Bi)
> 0}
(so to say, B has not zero probability with respect each event Bi ), showing that :F(C) ~ :F(B) {::} P(CIC v B) = 0;
then for any B E 8° a relevant measure is defined in :F(B), by putting, for A E :F(B), P(AIAV B) mB(A) = P(BIA V B) '
and he proves that the set of :F(B)'s (and so that of the corresponding measures) is linearly ordered.
CHAPTER 11
94
In conclusion, all these results constitute just a way - so to say - to "contemplate" and ri-organize existing "data", while in our approach we must search for the values which are necessary to define the classes {Pa} ruling coherence. Then condition (b) of Theorem 4 becomes essential to build such classes (in the next Chapter we will come back to them, showing their important role also for the concept of zerolayer).
11.6
The role of probabilities 0 and 1
The following example shows that ignoring the possible existence of null events restricts the class of admissible conditional probability assessments. Example 14 -Given three conditional events E1IH1, E2IH2, EaiHa such that Ao = {A 1 , ... , A 5 }, with
H1 = A1 V A2 V Aa V A4 , H2 = A1 V A2, E1 A H1 = A1 , E2 A H2 = A2 , Ea
A
Ha = Aa V A4 , Ha = Aa ,
consider the assessment
P1
3
= P(EdH1) = 4,
If we require positivity of the probability of conditioning events, we must adjoin to the system (Sa) with a = 0 also the conditions
and this enlarged system (as it is easily seen) has no solutions. Instead the given assessment is coherent, since the system (Sa) has the solution 3
xl
= 4'
COHERENT CONDITIONAL PROBABILITY where Xr = P0 (Ar)· Then, solving now the system (Sa) foro: (notice that H~ = A 3 V A4) gives
95
=
1
1
Y3
with Yr
= Y4 = 2,
= PI(Ar)· In conclusion
are the representations {11. 7} of the given assessment. As far as conditioning events of zero probability are concerned, let us go back to Example 6 (Chapter 3) to show that what has been called a "natural" and "intuitive" assessment is a coherent one.
Example 15 (Example 6 revisited) - Given the assessment
consider the atoms generated by the events SI, H2, T2, s2:
A4
= H2 1\ sr, As= T2 1\ sr, A6 = 821\ sr
so that, putting Xr = P0 (Ar), to check coherence of the above assessment we should start by studying the compatibility of the following system XI+ X2 + X3 = O(xi + X2 + X3 + X4 + X5 + X6) xi= Hxi + x2 + x3) x2 = Hxi + x2 + x3) X3 = O(xi + x2 + x3) XI + X2 + X3 + X4 + X5 + X6 = 1 Xr;::: 0
96
CHAPTER 11
which has the solution X1 = x2 = Xg = Xs = 0, X4 = x 5 = ~ . So we can represent P(S1 If2) as ¥ = 0. Going on with the second system (SI), we get
(SI)
Y1 = !(y1 + Y2 + yg) Y2 = !(y1 + Y2 + yg) Y3 = O(y1 + Y2 + yg) Y1 + Y2 + Y3 = 1 Yr ~ 0
whose solution Y1 = Y2 = ~ , y3 = 0 allows to represent, by the probabilities P 1 (Ar) = Yr defined on A 11 also the three remaining given conditional probabilities.
Remark 12 - A sensible use of events whose probability is 0 (or 1) can be a more general tool in revising beliefs when new information comes to the fore. So we can challenge a claim contained in [118} that probability is inadequate for revising plain belief, expressed as follows: "'I believe A is true' cannot be represented by P(A) = 1 because a probability equal to 1 is incorrigible, that is, P(AIB) = 1 for all B such that P(AIB) is well defined. However, plain belief is clearly corrigible. I may believe it is snowing outside but when I look out the window and observe that it has stopped snowing, I now believe that it is not snowing outside". In the usual framework, the above reasoning is correct, since P(A) = 1 and P(B) > 0 imply that there are no logical relations between B and A (in particular, it is A AB I 0) and P(AIB) = 1. Taking instead P(B) = 0, we may have A AB = 0 and so also P(AIB) = 0. On the other hand, taking B= "looking out the window, one observes that it is not snowing" (again assuming P(B) = 0), and putting A="it is snowing outside", we can put P(A) = 1 to express
COHERENT CONDITIONAL PROBABILITY
97
a strong belief in A, and it is clearly possible (as it can be seen by a simple application of Theorem 4) to assess coherently P(AIB) = p for every value p E [0, 1]. So, contrary to the aforementioned claim, a probability equal to 1 can be, in our framework, updated.
Chapter 12 Zero-Layers We introduce now the important concept of zero-layer [29], which naturally arises from the nontrivial structure of coherent conditional probability brought out by Theorem 4.
12.1
Zero-layers induced by a coherent conditional probability
Definition 7 - Let
c = cl X c2
be a finite family of conditional events and P a coherent conditional probability on C. If P = { Pa} a=O,l,2, ... ,k is a relevant agreeing class, for any event E -=/=- 0 belonging to the algebra generated by C1 U C2 we call zerolayer of E, with respect to the class P, the (nonnegative) number /3 such that PfJ(E) > 0: in symbols, o(E) = j3.
Zero-layers single-out a partition of the algebra generated by the events of the family C1 u C2 . Obviously, for the certain event n and for any event E with positive probability, the zero-layers are o(O) = o(E) = 0, so that, if the class P contains only an everywhere positive probability Po, there is only one (trivial) zero-layer with a= 0. 99
G. Coletti et al., Probabilistic Logic in a Coherent Setting © Kluwer Academic Publishers 2002
100
CHAPTER 12
As far as the impossible event 0 is concerned, since Pa(0) = 0 for any o , we adopt the convention of resorting to the symbol +oo to denote its zero layer, i.e. o(0) = +oo. Moreover, it is easy to check that zero-layers satisfy the relations o(A V B) = min{ o(A), o(B)},
and o(A 1\ B) ~ max{ o(A), o(B)}.
Notice that zero-layers (a concept which is obviously significant mainly for events of zero probability) are a tool to detect "how much" a null event is ... null. In fact, if o(A) > o(B) (that is, roughly speaking, the probability of A is a "stronger" zero than the probability of B), then P(AI(A v B))= 0 (and so P(BI(A V B))= 1), since, by Theorem 4,
P(AI(A V B))
Pa(A)
= Pa(A V B) ,
where o is the zero-layer of the disjunction A VB (and so of B); it follows Pa(A) = 0. On the other hand, we have o(A) = o(B) if and only if P(AI(A v B))· P(BI(A V B))
> 0.
Two events A, B satisfying the above formula were called commensurable in a pioneering paper by B. de Finetti [50]. Definition 8 - Under the same conditions of Definition 7, consider a conditional event EIH E C: we call zero-layer of EIH, with respect to a class P = {Pa} of probabilities agreeing with P, the (nonnegative} number
o(EIH)
= o(E 1\ H) -
o(H) .
101
ZERO-LAYERS
Notice that P(EIH) > 0 if and only if o(E A H) = o(H), i.e. o(EjH) = 0. So, also for conditional events, positive conditional probability corresponds to the zero-layer equal to 0. Moreover, by the convention adopted for the zero-layer of 0, we have EA H = 0 =? o(EjH) = +oo.
Example 16 - Revisiting Example 15, it is easy to check that the zero-layers of the null events 8 1 and 8 1 A 8 2 are, respectively, 1 and 2; so the zero-layer of the conditional event 82 181 is 2- 1 = 1. Other examples of zero-layers can be easily obtained by revisiting the other examples given in the final part of the previous Chapter and resorting to the corresponding agreeing classes.
12.2
Spohn's ranking function
Spohn (see, for example, [121], [122]) considers degrees of plausibility defined via a ranking function, that is a map "" that assigns to each possible proposition of a finite "world" W a natural number (its rank) such that (a) either K(A) = 0 or K(Ac) = 0, or both; (b) K(A V B)= min{K(A), K(B)}; (c) for all A A B =j:.
K(BIA)
0, the conditional rank of B given A is
= K(A A B) -
K(A).
Ranks represent (according to Spohn terminology) degrees of "disbelief". For example, A is not disbelieved iff K(A) = 0, and it is disbelieved iff K(A) > 0. They have the same formal properties of zero-layers; the set of not disbelieved events is called the core E of K, that is
E
= {w
E
W : K( { w})
= 0}.
102
CHAPTER 12
It corresponds (in our setting) to the set of events whose zero-layer is a = 0 , i.e. events of positive probability P( ·IHg) (possibly,
Hg =
n).
Ranking functions are seen by Spohn as a tool to manage plain belief and belief revision, since he maintains that probability is inadequate for this purpose. But in our framework this claim can be challenged, as it has been discussed in Remark 12 of the previous Chapter (a simple computation shows that the zero-layer of the null event B considered in that Remark is equal to 1). See also the paper [39].
12.3
Discussion
Even if ranking functions have the same formal properties of zerolayers, notice that - contrary to Spohn - we do not need an "autonomous" definition, since zero-layers are - so to say "incorporated" into the structure of a coherent conditional probability : so our tool for belief revision is in fact coherent conditional probabilities and the ensuing concept of zero-layer. Moreover, ranking functions need to be defined on all subsets of a given "world" W, since otherwise their (axiomatic) properties could be, in some cases, trivially satisfied without capturing their intended meaning (compare this remark with the discussion of the axioms for probability, at the beginning of Section 3.2). The starting point of our theory is instead an arbitrary family cl u c2 of events (see Definition 7), from which zero-layers come out.
Example 17 - Let E, F, C be events such that E V F VC= n, E 1\ F 1\ C = 0 , Ec 1\ Fe = Fe 1\ cc = Ec 1\ cc = 0 . The following rank assignment
x:(E) = 1 , x:(F) = 2 , x:(C) = 0 satisfies the axioms, nevertheless it is not extendible to the algebra generated by the three given events.
103
ZERO-LAYERS
There are in fact three atoms
and we have
now, since
then ii;(A2 ) = 0 or ii;(A 3 ) = 0 (or both}. But the values of the rank assigned to E, F, G clearly imply ii;(A 2 ) ;::::: 2 and ii;(A3 ) ;::::: 1. Now, a brief discussion concerning further differences between zerolayers and ranking functions follows. In our framework, the assignment (and updating) of a zero-layer of an event through conditioning is ruled by coherence, and can give rise both to events remaining "inside" the same layer or changing the layer (this aspect will be deepened also in the last section of Chapter 16 on Inference, concerning the problem of updating probabilities 0 and 1); on the other hand, the definition of condizionalization given by Spohn [122] is, in a sense, free from any syntactic rule. In fact, to make inference a ranking function ii; is updated by a function ii;A,n (where A is an event of Wand n a natural number) given by
I
/i;(BIA) = /i;(B A A) - /i;(A) , if B ~ A
ii;A,n(B) =
ii;(BIAc)
+ n, if B
~
Ac
min{ii;A,n(B A A), ii;A,n(B A Ac)}, for all other B.
The "parameter" n is a measure of the "shifting" of ii; restricted to A with respect to ii; restricted to Ac, and Spohn himself ascribes
104
CHAPTER 12
to the value n a wholly subjective meaning (he claims: "there is no objective measure of how large the shift should be") ; but the value of n plays a crucial role in the new assessment of r;,, which is influenced by n also in the third case (B g A and B g A c ) • Anyway, what comes out is a new "scenario" relative only to the situation A . So it is not possible, with a ranking function, to consider at the same time many different conditioning events Hi in the same context, as we do in our setting; moreover, there is no needin the approach based on coherence- of the (arbitrary) number n, since coherent conditional probabilities allow "automatic" assignment of both probability values and zero-layers. The following example may help in making clearer this issue : Example 18 - Consider five conditional events Ei!Hi, obtained from the square E = [0, 1] x [0, 1] c JR? in this way: take the (unconditional) events
Ea with x1
= {(x, y) E E
:x
= y} ,
= ~ , Y1 = Y2 = ~ , x2 = ~ , and
Then (assuming a uniform distribution on E) consider the assessment: P(E1IH1)
= P(E2IH2) = P(Ea!Ha) = 0,
P(E4IH4) The relevant atoms are
1
= 2, P{E5IH5) = 0.
105
ZERO-LAYERS
and system (80 ) is X1 = 0 ·(XI+ X2 + X3 + X2 = 0 ·(xi+ X2 + X3 +
X4)
X3 = 0 ·(xi+ X2 + X3 +
X4)
X4)
X1 = ~ · (x1 + x2) X1 = 0 • (xi+ X3) X1 + X2 + X3 + Xr ~
X4
= 1
0.
Its only solution is
and then o(A4 ) = 0. Going on with system (SI), we get Y1 = ~ · (YI + Y2) Y1 = 0 · (YI + Y3) Y1 + Y2 + Y3 = 1 Yr ~ 0,
whose only solution is Y1 = Y2
=0,
Y3 = 1 ,
so that o(E3 ) = 1 . Finally, the system (S2 ) gives z1 = ~ · (z1 + z2) { z1 + z2 = 1 Zr ~
0,
that is z1 = z2 = ~, so that o(EI) = o(E2 ) = 2 (and since we have E 4 = E 5 = E 1 , then also E 4 and E 5 are on the same layer). Then
CHAPTER 12
106
o(Hs) = o(EI V E3) = min{ o(EI), o(E3)} = 1, so that, in conclusion,
o(E4IH4) = o(E4)- o(H4) = 2- 2 = 0 (in fact P(E4IH4) > 0 ), while
o(EsiHs)
= o(Es)- o(Hs) =
2- 1 = 1,
z.e. conditioning on H 5 makes E 5 a "weaker" zero (a picture of the unit square with the relevant events may be helpful to appreciate the intuitive meaning of these conclusions!) In this example we have also another instance of the possibility of updating (coherently!) a probability equal to 1: consider in fact, for example, P(E4) = 1, and notice that P(E4IH4) = ~. In conclusion, coherent conditional probability complies, in a sense, with Spohn's requirements; he claims in Section 7 of [120]: "... Popper measures are insufficient for a dynamic theory of epistemic states . . . the probabilistic story calls for continuation. It is quite obvious what this should look like: just define probabilistic counterparts to ranks which would be something like functions from propositions to ordered pairs consisting of an ordinal and a real number between 0 and 1 . . . the advantage of such probabilified ranks over Popper measures is quite clear". We have shown that we do not need to distinguish between the two elements of the ordered pair that Spohn associates to each proposition, since all the job is done by just one number. In this more general (partial assessment allowed!) setting, the same tool is used to update both probabilities (and zero-layers) of the events initially taken into account (or else of those belonging to the same context, i.e. logically dependent on them), and probabilities (and zero-layers) of "new" events "come to the fore" later. In fact updating is nothing else than a problem of extension (see the next
ZERO-LAYERS
107
Chapter and Chapter 16 on Inference), so that a Popper measure (which is the "nearest" counterpart to de Finetti's conditional probability: see Section 10.3) is certainly apt to do the job, since it is a particular coherent conditional probability, whose updating is always possible (see also the remarks at the end of Section 12.2). Notice also that the set of events belonging to the same zerolayer is not necessarily an algebra, so the role of coherence is crucial to assign a probability to them. On the other hand, it is unclear, starting from the assignment of ranks, how to get a "probabilified" rank without conditioning to the union of events of the same rank (regarding this conditional probability as a restriction of the whole assessment on W ), but this is a matter of conditioning- except for the rank 0 - with respect to events of zero probability; then, since a tool like coherent conditional probability (or Popper measure) is anyway inevitable, why not introducing it from the very beginning instead of letting it "come back through the back-door"? Another issue raised by Spohn in [120] is to resort to nonstandard numbers (i.e., the elements of the iperreal field R * , a totally ordered and nonarchimedean field, with R* :::) R) as values of the relevant conditional probabilities. We deem that a (ticklish) tool as the iperreal field is not at all easily manageable, for example when we need considering both reals and iperreals (as it may happen, e.g., in Bayes' theorem). Moreover, it is well known (see [88]) that an iperreal probability P* gives rise to a conditional probability P(EIH) = R [P*(E A H)] e P*(H) '
where Re denotes the function mapping any iperreal to its real part (see, e.g., [106]); conversely, given a conditional probability, it is possible to define (not uniquely) an iperreal one. Then, if the above ratio is infinitesimal, we get P(EIH) = 0. Anyway, in our coherent setting the process of defining autonomously ranks to be afterwards "probabilified", or of introducing iperreal probabilities, is not needed (not to mention- again-
108
CHAPTER 12
the further advantage of being allowed to manage those real situations in which partial assessments are crucial). The role of zero-layers for the concept of stochastic independence is discussed in Chapter 17, where also the "unpleasant" consequences coming out from resorting (only) to ranking functions to define independence are shown (see, in particular, Remark 16).
Chapter 13 Coherent Extensions of Conditional Probability A coherent assessment P, defined on a finite set C of conditional events, can be extended in a natural way (through the introduction of the relevant atoms) to all conditional events EIH logically dependent on g, i.e. such that E 1\ H is an element of the algebra g spanned by the (unconditional) events Ei, Hi (i = 1, 2, ... , n) taken from the elements of C, and H is an element of the additive class spanned by the Hi's. Obviously, this extension is not unique, since there is no uniqueness in the choice of the class {P01 } related to condition (c) of Theorem 4. In general, we have the following extension theorem (essentially due to B. de Finetti [52] and deepened in its various aspects in [94], [126], [84], [102]).
Theorem 6 - If C is a given family of conditional events and P a corresponding assessment, then there exists a {possibly not unique) coherent extension of P to an arbitrary family K of conditional events, with /C 2 C, if and only if P is coherent on C. Notice that if P is coherent on a family C, it is coherent also on 109 G. Coletti et al., Probabilistic Logic in a Coherent Setting © Kluwer Academic Publishers 2002
CHAPTER 13
110
E c_; C. In order to have a complete picture of the problems related to the extension to a new conditional event EIH of a coherent conditional assessment P on a finite family C, we will refer to the following two points: (i) finding all possible coherent extensions of the conditional probability P(EIH) when EIH E g x go; (ii) extending this result to any conditional event FIK (i.e., possibly with FIK rt g x go).
Consider (i). First of all, notice that, given two coherent assessments relative to n + 1 conditional events
IT' = {P(EiiHi) =Pi, i = 1, ... , n; P(EIH) = p'} and
with p'
~ p",
then also the assessment
IIa ={Pi ,i = 1, ... ,n; ap' + (1- a)p"} is coherent for every a E (0, 1]: this plainly follows by looking at the relevant gain n
G
= L AilH;(IE;- Pi)+ AolH(IE- (ap' + (1- a)p")) i=l
and noting that, for A0 > 0, n
G 2 L AifH; (/E, -Pi)
+ AolH(JE -
P11 )
i=l
and
n
G ~ LAifH;(/E;- Pi)+ AofH(JE- P1 ) , i=l
CONDITIONAL PROBABILITY EXTENSIONS
111
so that the conclusion follows from the coherence of IT' and IT". Therefore the values of the possible coherent extensions of P to EIH constitute a closed interval [p', p''] (possibly reducing to a single point). Now, denote by pi the set of all classes {Pa}; related to the often mentioned characterization theorem. For the conditioning event H E go there are the following two situations: • (A) there exists, for every class {Pa} E pi, an element PfJ (defined on the subset AfJ of the set Ao of atoms: cf. condition (c) of Theorem 4) such that Hi ~ AfJ for some i, with PtJ(H) > 0;
• (B) there exists a class {Pa} E pi such that for every a one has Hi~ Aa for some i, and Pa(H) = 0. In the case (A), we evaluate by means of formula (11.7) all the corresponding values P(EIH), and then we take infimum and supremum of them with respect to the set pi. By writing down the relevant programming problem, we get
L: ArCEI\H
11
p = sup 1'3
"" L.....J
y~ a
Yr
'
Ar~H
where y~ = Pa(Ar) and Ar E Ao, the set of atoms of the algebra g (we denote by the same letter a all the indices ai corresponding to the class containing the probability Pa such that Pa(H) > 0).
CHAPTER 13
112
It is easily seen that this problem is equivalent to the following linear one
p'=inf
L
z~
p" = sup
'PJ Arr:;_EAH
where
z~ = y~ / L y~
L
z~ ,
p; Arr:;_EAH
.
Arr:;_H
Clearly, infimum and supremum will be reached in correspondence to those classes such that the relevant systems have the minimum number of constraints. In the next Chapter we will expound a strategy to make easier the computation of the solution of the above programming problem by suitably "exploiting" zero probabilities : this means that we search for classes pi in which Pa(H) = 0 for the maximum number of indices a.
In the case (B) we are in the situation discussed in Remark 10 of the previous Chapter: we can assign to the conditional events EIH arbitrary values, so that p' = 0 and p" = 1. Consider now point (ii): we must take a conditional event FIK ~ Q x go, so that the events F 1\ K and K are not both in Q ; we show now that we can find suitable conditional events F.IK. and F*IK* such that the events F., K., F*, K* are union of atoms, proving then that a coherent assessment of the conditional probability P(FIK) is any value in the closed interval p. ~ P(FIK) ~ p*,
where p. = inf P(F.IK.), 'PJ
p* = supP(F*IK*). pj
(13.1)
CONDITIONAL PROBABILITY EXTENSIONS
113
Obviously, if 1l is the algebra spanned by g U {F, K}, there is (by Theorem 6) a coherent extension of P to 1l x 1l0 • Now, let p a possible value of a coherent extension of P(·l·) (initially given on C) to the conditional event FIK fj. g x go, and consider the set Bo of the atoms Br generated by Ei, Hi (i = 1, 2, ... , n), F, K (that is, Ar 1\ (F 1\ K), Ar 1\ (Fe 1\ K), Ar 1\ Kc for any Ar E Ao ). Since p is coherent, there exists (at least) a class {Pa} containing a probability Pa (to simplify notation, for the index singling-out this probability we use the same symbol which denotes the generic element of the class) such that
P.a (F 1\ K)
p-
-
Pa(K)
Er
Pa(Br)
BrCFI\K
- ----=--==-------=----:-=---:--
-
Er Pa(Br)
Br~K
>
X
Er
Br~FI\K
Pa(Br)
+
Er
Pa(Ar)
Ari\Fci\K-:j=0
X+ a
Since a ~ 0, the latter function (of x) is increasing for any x, so that, taking into account that, for the atoms (of the two classes Ao and 8 0 ) contained in F 1\ K we have Vr Ar ~ Vr Br , we get
Er p
2::
Er
Ar~FI\K
Pa(Ar)
ArCFI\K
Pa(Ar)
+
Er
Pa(Ar)
= P(F*IK*)'
Ari\Fci\K-:j=0
where each probability Pa assures - according to condition (c) of the characterization theorem - also the coherence of the initial assessment on C, and
CHAPTER 13
114 Moreover, clearly,
F.
1\
K. =
F; 1\ K. =
V Ar
Ar~FI\K
V
~
F
1\
Ar 2 pc
K, 1\
K.
ArAFCI\K-:f;0
Notice that F. IK. is the "largest" conditional event belonging to g x go and "included" in FIK, according to the definition of inclusion ~o for conditional events recalled in Theorem 3 of Chapter 10 and in Remark 11 of Chapter 11. Now, letting Pa. vary on all different classes of pJ assuring the coherence of the initial assessment on C, we get the left-hand side of (13.1). For the right-hand side the proof is similar, once two events F* and K* are suitably introduced through the obvious modifications of their "duals" F. and K • . In conclusion, we can summarize the results of this Chapter in the following Theorem 7 - Given a coherent conditional probability P on a finite set
c =cl X c2 = {EliHl, ... 'EniHn} of conditional events, let pJ = { Pa.} be the set of classes agreeing with P, and let g be the algebra generated by C = C1 UC2 • Consider a further conditional event FIK fj C, and put F.IK.
= AIBg>FIK sup { AIB} , AIBEQxgo
F* IK*
= FIKg>AIB inf {AIB} AIBEC/XC/
0
.
Then a coherent assessment of P(FIK) is any value of the interval [p.,p*], wherep. = 0 andp* = 1 if F.IK. or F*IK* satisfy condition (B), while, if both satisfy condition (A), p. = infP(F.IK.), p1
p* = supP(F*IK*). Pi
CONDITIONAL PROBABILITY EXTENSIONS
115
Remark 13 - When condition (A) holds for both F* IK* and F* IK* , we may have p* = 0 and p* = 1 as well: it is easily seen that this occurs when there exists a class {Pn} such that A,e 2 Hg and A,e R. K* (or K*) for an index f3 . This is equivalent to the existence of a solution of system (To) under (3.2) of Section 14.1 of the next Chapter.
Chapter 14 Exploiting Zero Probabilities The previous results related to the coherence principle and to coherent extensions can be set out as an algorithm for handling partial conditional probability assessments, the corner-stone of all the procedure being the characterization Theorem 4 of Chapter 11.
14.1
The algorithm
If C is an arbitrary family of conditional events EiiHi (i = 1, ... , n), suitably chosen as those referring to a "minimum" state of information relative to the given problem, supply all the known logical relations among the relevant events Ei, Hi , and give a "probabilistic" assessment P = {Pi = P(EiiHi)}. The procedure to check coherence can be implemented along the following steps:
• (1): build the family of atoms generated by the events Ei, Hi (taking into account all the existing logical relations); • (2): test the coherence of P. 117 G. Coletti et al., Probabilistic Logic in a Coherent Setting © Kluwer Academic Publishers 2002
CHAPTER 14
118
The second step is really a subprocedure implemented by the following algorithm: • (2.1): introduce the system (Sa) with na unknowns,
• (2.2): put a = 0 in (Sa) ; • (2.3): if (Sa) has solutions, go to (2.4); otherwise the assessment is not coherent and must be anyhow revised (in the latter case, go to step (2.3') to get suggestions for the revising process); • (2.3'- a): introduce subsystems (Sa,k) of (Sa) obtained by deleting, in all possible ways, any k equations; • (2.3'- b): put k = 1 in (2.3'- a) and go to (2.3'- c); • (2.3'- c): if there exist compatible subsystems (Sa,k), then for each of them choose, among the conditional events EiiHi appearing in (Sa) and not in (Sa,k), a conditional event EjiHi: to find its interval of coherence, go to step (3.2), putting there F*
= Ei , K* = Hi ;
• (2.4): if (Sa) has a solution Pa(Ar) such that
Pa(Hi) =
L
Pa(Ar) > 0
Ar0 Yr=O
L
Yr
= P(Ei!Hi) L
~~~A~
Yr
(i
= 1, ... ,n)
~~~
(notice that the last n equations are trivially satisfied); • (3.3): if (To) has a solution (cf. also Remark 13 at the end of previous Chapter), go to step (3.8); • (3.4): if (T0 ) has no solutions, introduce a system (S~) obtained by adding in the following equation to (Sa) :
E
X~=O;
Ar~K.
• (3.5): put a= 0 in (S~) • (3.6): if (S~) has no solutions, go to (3.9); • (3.7): if (S~) has a solution, put a+ 1 in place of a and go to (3.4) until a not compatible system is found - in this case go to step (3.9)- or until the exhaustion of the Hi's- in this case go to (3.8);
122
CHAPTER 14
• (3.8): put p. = 0 and go to step (3.10); • (3.9): solve the following linear programming problem
L
min
x~,
A,.~F.AK.
with constraints
I
L
x~-Pi
A,.~E,I\Hi
L
X~
L
x~=O
A,.~Hi
=1,
X~ ~ 0,
Ar ~ H~
A,.~K.
• (3.10): consider the conditional event F*IK* (i.e., the "dual" of F.IK... as introduced at the end of the previous Chapter) and repeat the procedure from step (3.2) by replacing, in all steps, K,.. by K*, K.I\F; by K* 1\F*, p,.. = 0 by p* = 1 and, x~) by (max x~) . finally, replacing (min
L
L
A,.~F*I\K*
14.2
Locally strong coherence
In this Chapter we are showing how to exploit zero probabilities through the possibility of searching for conditioning events H such that Pa(H) = 0 for the maximum number of probabilities Pa. Furthermore, checking coherence "locally" to get "global" coherence is also strictly connected with the existence of logical relations among the given events, and it is then useful to find suitable subfamilies that may help to "decompose" the procedure: in other words, we need to build only the atoms generated by these subfamilies. This procedure has been deepened in all details (and implemented in XLISP-Stat language) by Capotorti and Vantaggi in [18] through the concept of locally strong coherence, which applies in fact to subfamilies of the given set of conditional events: checking
EXPLOITING ZERO PROBABILITIES
123
whether the assessment on a subfamily does not affect coherence of the whole assessment allows to neglect this subfamily. Hence, even if looking at subfamilies has, in a sense, a "local" character, their elimination has a global effect in the reduction of computational complexity. We start (all results that follow are contained in reference [18]) with the following
Definition 9 - Given the family
of conditional events, an assessment P in C is called strongly coherent with respect to B, where B is an event such that BI\Hi =f 0 for all i = 1, ... , n, if the assessment P' defined on C' = {Eii(Hi 1\ B), (Hi 1\ B)IO: 1 = 1, ... , n}
by putting P'(EiiHi 1\ B) coherent.
=
P(EiiHi) and P'(Hi 1\ B)
>
0 zs
Obviously, strong coherence (with respect to B) implies coherence, but the converse is not true. Moreover, strong coherence implies that it is possible to choose the coherent extension of P to the atoms (generated by the - unconditional - events of the family C) contained in ne by giving them zero probability.
Definition 10 - Let :F = :F1 x :F2 be a subfamily of C, and put v = c \ :F = V1 x v2 . If B:r- = (
V
Hit'
HiE'D2
then the assessment P is locally strong coherent in :F when the restriction of P to :F is strongly coherent with respect to B :F. It follows that B:r- 1\ Hi = 0 for every Hi E :F2. The following theorem points out the connections between coherence of the assessment on C and locally strong coherence in a suitable subset :F.
124
CHAPTER 14
Theorem 8 - Let P : C -+ [0, 1] be locally strong coherent on :F . Then P is coherent (on C) if and only if its restriction to V = C\:F is coherent. The proof of the theorem is based on the following observations : if p is locally strong coherent in :F ' then :F2 n v2 = 0 and the first system (Sa) (of the characterization theorem, Theorem 4 in Chapter 11) has a solution such that x~ = 0 for any atom Ar c; BJ=- and such that x~ > 0 for every Hi E V 2 ; therefore the second sys-
L
Art;Hi
tern (SI) contains only equations relative to the conditional events EiiHi E V, so that coherence on C depends only on coherence on V. The relevant aspect of the above theorem is that locally strong coherence on a subset :F of C makes this subset :F a sort of "separate body" that allows to ignore the relationships among conditional events in :F and those in V: as a consequence, the size of both the family of conditional events EiiHi and the set of atoms where coherence must be checked can be more and more strongly reduced by an iterative procedure, thanks also to necessary and sufficient logical conditions for locally strong coherence relative to specific subsets. For example, in [18] there is a complete characterization of locally strong coherence when :F is a singleton, and many sufficient conditions have been found when :F contains two or three conditional events. We report here only the characterization relative to a single conditional event EIH. If :F = {EIH}, then Pis locally strong coherent in :F if and only if one of the following conditions holds: (a)
P(EIH)
1\
= 1 and E /\ H
Hj
=f. 0 ;
Hj#H
(b)
P(EIH)
= 0 and Ec /\ H
1\ Hj"#H
Hj
=f. 0
EXPLOITING ZERO PROBABILITIES
125
(c)
Therefore, if a conditional event of C satisfies one of the conditions (a), (b), (c), then it is clearly enough to prove coherence only for the remaining n - 1 conditional events: but, before doing so, we can repeat the procedure, searching if there is among them another conditional event satisfying one of the three conditions, and so on, until this is possible. When none of the remaining conditional events verifies (a), (b), or (c), we can proceed by analyzing the (possible) locally strong coherence f~r subsets of C containing two conditional events, and so on. Finally, we meet with a subset of C which is not locally strong coherent with respect to any of its subsets, and here coherence must be checked in the usual way. ln [18] it is proved that the result does not depend on the "path" that has been followed to reach the subset of C where coherence must be checked. Here is a simple example (for more sophisticated ones, see the aforementioned paper).
Example 19 - Let
be such that
and consider the assessment
126
CHAPTER 14
Now, we search for locally strong coherence relative to singletons contained in C: with respect to :F1 = {E1IH1} locally strong coherence fails, since E 1 A H 1 A H~ = 0 , while P is locally strong coherent in :F2 = {E2IH2}, since E~ A H2 A Hf A H3-=/= 0. Then we need to check coherence of P on :F = {E1IH1, E3IH3}, and now P (or, better, its restriction to :F) is locally strong coherent on the set :F1 = {EdH1}, because E1 A H1 A H3 -=/= 0 and
Ef A H1 A H3 -=/= 0 . Therefore it is enough to check coherence only on the singleton :F3 = {E3IH3}, but this is assured by any value in [0, 1] of the relevant conditional probability. In conclusion, the given assessment is coherent. Notice that, by resorting to the usual procedure through the sequence of systems (Sa) , we would need to consider- in this example - eleven atoms.
Chapter 15 Lower and Upper Conditional Probabilities 15.1
Coherence intervals
The extension Theorem 6 (Chapter 13) is the starting point to face the problem of "updating" (conditional) probability evaluations. In particular, the extension to a single "new" conditional event F 1 IK1 (cf. Theorem 7) gives rise to an interval [p~, p~] of coherent values for P(FIIK1). Choosing then a value p E [p~, p~], we can go on with a further new conditional event F 2 IK2 , getting for it a coherence interval [p~, p~], which (besides depending on the choice of p) can obviously be smaller than the interval we could have obtained by extending directly (that is, by-passing F1IK1) the initial assessment to F2 IK2 • Therefore, given an initial assessment P( ·I·) on n conditional events E 1IH1, ... , En!Hn, and h "new" conditional events
if we do not proceed step-by-step by choosing a coherent value in each subsequent interval, we could make h "parallel" coherent extension [p~, p~], ... , [p~, p~], but in this way we are not warranted
127 G. Coletti et al., Probabilistic Logic in a Coherent Setting © Kluwer Academic Publishers 2002
128
CHAPTER15
that choosing then a value Pi in each [p~,p?], i = 1, ... , h, the ensuing global assessment (including the initial one) would be coherent (this particular - and unusual - circumstance is called "total coherence" in [73]). On the other hand, if we choose as values of the further assessment all the left extremes p~ of the above intervals we get, as global evaluation, what is called a lower probability (and we get analogously an upper probability if we choose the right extremes). In particular, we may obviously find that some values of the lower probability are equal to zero, so that the assumption of positivity which is usually done in many approaches to conditioning for "imprecise" probabilities is a very awkward and preposterous one. A thorough discussion of these aspects is in [32] and will be dealt with in the next Chapter in the framework of inferential problems. Moreover, most of the functions introduced in the literature as measures of uncertainty (aiming at extending probability) can be looked upon as particular lower and upper probabilities: so this seems a further argument which renders natural to undertake an alternative treatment of these functions.
15.2
Lower conditional probability
We will refer - here and in the sequel - only to lower conditional probabilities P; clearly, we can easily get corresponding results concerning upper probabilities: in fact an upper probability is a function P defined as in the subsequent formula (15.1) by replacing "inf' by "sup".
Definition 11 - Given an arbitrary set C of conditional events, a coherent lower conditional probability on C is a nonnegative function P such that there exists a non-empty dominating family
LOWER AND UPPER CONDITIONAL PROBABILITIES
129
P = {P( ·I·)} of coherent conditional probabilities on C whose lower envelope is P, that is, for every EIH E C, P(EIH) = i~f P(EIH) .
(15.1)
Example 20 - Given a partition {E1 , E 2 , Ea, E 4 } of n, consider the event H = Ea V E 4 and the assessment
To see that this is not a coherent conditional probability it is enough to refer to Theorem 6 (Chapter 13) : in fact there does not exist, for instance, a coherent extension to the conditional event HIH, since p(EaiH)
1
+ p(E4IH) = 2 i= 1 = p(HIH).
Nevertheless there exists a family P = {P', P"} of coherent conditional probabilities, with
P'(E1In) P"(E1In)
= ~, P'(E2In) = ~, P'(EaiH) = ~, =
l,
P"(E2In)
= ~, P"(EaiH) = ~,
P'(E4IH)
= ~,
P"(E41H)
=
l·
and p is its lower envelope.
We show now that, when C is finite, if P is a coherent lower conditional probability, then there exists a dominating family P' ;2 P such that
P(EIH)
= II.}!n P(EIH) .
Since any element of the dominating family must be a coherent conditional probability (then extendible to g x 8°, with g algebra and 8° additive class), we may argue by referring to C = g x 8°.
130
CHAPTER 15
Let EIH E C be such that P(EIH) = i~f P(EIH), but not the minimum; then for any € > 0 there exists a conditional probability PF. E P with
Define a new conditional probability P' = lim PF. ( P' is a condiF.-tO tional probability, since the limit operation keeps sum and product and also the equality PF.(HIH) = 1 ). Now P'(EIH)
= limPF.(EIH) = P(EIH) F.-tO
and for any other conditional event FIK E C we have lim0 PF.(FIK) = P'(FIK) ~ P(FIK), E-t
Definition 12 - Given a coherent lower conditional probability P on C and any conditional event FiiKi E C, the element P of the dominating family P such that P(FiiKi) = P(FiiKi) will be called i-minimal conditional probability. The following relevant theorem has been given and discussed in [27] and [32]. For simplicity, we prefer to formulate it for a finite family of conditional events, but it could obviously (see Remark 8, Chapter 11) be expressed in a form similar to that of Theorem 4 (the characterization theorem for "precise" conditional probabilities). Theorem 9 - Let C be an arbitrary finite family of conditional events FiiKi, and denote by Ao the usual relevant set of atoms. For a real function P on C the following two statements are equivalent: (a) the function P is a coherent lower conditional probability on C; {b) there exists, for any FiiKi E C (at least) a class of probabilities ITi = { P~, Pf, ... } , each probability P~ being defined on a
LOWER AND UPPER CONDITIONAL PROBABILITIES suitable subset A~ ~ unique P~ with
Ao,
131
such that for any F3 jK3 E C there is a
Lr P~(Ar) > 0 Ar~Kj
and
Er P~(Ar) P(F.·IK·) < Ar~Fji\Kj . 3 3 Er P~(Ar)
i= i
if
j
if
j=i
Ar~Kj
Er
P(PIK·) -
3
= Ar~Fji\Kj
P~(Ar)
Er P~(Ar)
3
Ar~Kj
and, moreover, A~, C A~, for a' > a" , while P~, ( Ar) = 0 if ArEA~,.
Proof- Let .FiiKi E C: there exists a coherent conditional probability pi (i-minimal) on C such that Pi(EiiHi) = P(EiiHi) and Pi(E3 jH3) ~ P(E3 jH3) for j i= i. Then this clearly amounts, by Theorem 4, to the validity of condition (b). • Actually, it is possible to build the classes {P~} as solutions of sequences of systems (one for each conditional event FiiKi E C) like the following one:
Lr
P~(Ar)
= P(FijKi)
~~~!\~
Lr P~(Ar) [if P~-l(Ki)
= 0],
~~~
Lr
P~(Ar) ~ P(F;IK;) Lr P~(Ar) [if P~-l(K;)
~~~!\~
~~~
= 0],
Lr P~(Ar) = 1 Ar~K;:,;
where the second line gives rise to many inequalities, one for each j i= i , and K~i is, for a ~ 0, the union of the Ki 's such that P~_ 1 (Ki) = 0 (and P~ 1 (Ki) = 0 for all K/s). This can give rise, as in the case of probabilities, to an actual algorithm to prove the consistency (coherence) of a lower
132
CHAPTER 15
probability assessment on a finite set of events. Clearly, for a partial lower (upper) probability assessment we have less stringent requirements, since systems with inequalities have more solutions than those with only equalities, i.e. there are better "chances" (with respect to a probability assessment) to fulfill the requirement of coherence. But the relevant check is computationally more burdensome (in fact we must repeat the same procedure n times, where n is the cardinality of the given set of conditional events). Example 21 - Given two (logically independent) events A and B, consider the following (unconditional) assessment P(A)
P(B)
=
=
1
1
4 , P(A 1\ B) =
16 , P(A V B) =
3
4.
To prove that Pis a lower probability, we resort to Theorem 9 (taking all conditioning events equal to n ): so we need to write down four systems, one for each event; the unknowns are the probabilities of the atoms
Consider the system referring to the event A :
i XI +x2 ~ i
XI+ X3 =
XI~ I~
i
XI+ X2 + X3
~
XI + X2 + X3
+ X4
Xi~
=
1
0
A solution is 1
XI= X3 =
S'
1
X2
=-, 2
X4
=
1
4.
LOWER AND UPPER CONDITIONAL PROBABILITIES
133
Solutions of the other three systems are the following: that corresponding to B is Y1 = Y2 =
1
1
B,
Y4 =
4,
that corresponding to A I\ B is 1
Zg
= 4'
Ug
= 16'
and that corresponding to A V B is 1
u -1-
16'
1
3 U4
= 4,
which easily follow from the relevant systems, that we did not (for the sake of brevity) write down explicitly. An algorithm to check coherence of a lower probability assessment (again based, as in the case of conditional probability, on the concept of locally strong coherence) has been set out in [19] (and implemented in XLISP-Stat Language), showing also how to solve some relevant inferential problems (as those dealt with in the following Chapter 16). A (seemingly) similar procedure refers to "imprecise" probabilities, for which some authors require a very weak form of coherence, that is : the existence, given a family of imprecise assessments [a~, a~'], i = 1, ... , n, of at least a set {p1 , ... ,Pn}, with PiE [a~, a~'], constituting a coherent conditional probability. This concept, called coherent generalized probabilistic assessment, has been introduced by Coletti in [22], and has been (independently) considered also by Gilio in [71] and later (under the name of g-coherence) by Biazzo and Gilio [8]. For a relevant algorithm, see [9]. Let us now face the problem of coherent eztensions of lower conditional probabilities. Taking into account Theorem 9 and
CHAPTER 15
134
the results of Chapter 13, it follows that the coherent enlargement to a "new" event FIK of a lower conditional probability P, defined on a finite family of conditional events {.FiiKi}, is given by mjnPi(F.IK.), I
where Pi is the infimum with respect to a class IJi characterizing P in the sense of the just mentioned theorem, and F.IK. is the conditional event introduced in the final part of Chapter 13. Notice that, if there exists an index i and a family IIi of probabilities P~ such that P~(K) = 0 for every a, then P(FIK) = 0; otherwise the value of P(FIK) is obtained in the general case as the minimum of the solutions of n linear programming problems. (Analogous considerations could be easily rephrased for upper probabilities).
15.3
Dempster's theory
It is well-known that any lower (unconditional) probability P is a superadditive (or 0-monotone) function, that is, given two events A andB,
A 1\ B
=0
=*
P(A V B) 2:: P(A)
+ P(B) .
Notice that this is true also if we consider a conditional lower probability P( ·IK) relative to the same conditioning event K: this can be easily seen by resorting to Theorem 9. In fact, given AIK, BIK, (A V B) IK E C, let i and a be the indices such that P~(K) > 0 and
now, since A and Bare incompatible events, the right-hand side is the sum of two similar expressions relative to AIK and BIK, and
LOWER AND UPPER CONDITIONAL PROBABILITIES
135
then we get, taking into account the inequalities corresponding to the case j =1- i of the theorem, P(A V BIK) 2: P(AIK) + P(BIK) . On the other hand, a lower probability may not be 2-monotone, i.e. may not satisfy P(A V B) 2: P(A)
+ P(B)- P(A 1\ B),
as shown by the following Example 22 - Given a partition {A, B, C, D} of probability assessments
Pr(A) = 0, Pr(B) = Pr(C) = P2 (A) =
1
4 , P2(B)
n,
consider two
1
2 , P1(D) = 0, 1
4,
= 0, P2(C) =
g(D) =
1
2,
on the algebra generated by these four events. The lower probability obtained as lower bound of the class {P1 , P2 } has, in particular, the following values
P(A)
1
1
1
= 0, P(A V B) = 4 , P(A V C) = 2 , P(A VB V C) = 2 ,
so that 1
2 = P(A VB V C) < P(A V C)+ P(A V B) -
P(A) =
1
1
2+ 4 -
0.
Obviously, all the more reason a lower probability is not necessarily an n-monotone function, that is n
P(Al V ... V An) 2:
L P(Ai)- L P(Ai 1\ Aj) +. ·. i=l
i<j
Lower and upper probabilities induced by multivalued mappings were introduced by Dempster in [55]. In particular, these lower
136
CHAPTER 15
probabilities, which are n-monotone for any n E IN have been called by Shafer [116] belief functions. We will not deal with the theory of belief functions in this book, except for the discussion in Chapter 18 of a classical example (cf. Example 38) which is claimed (in [117]) as not being solvable without resorting to belief functions. Our aim will be to show instead that lower and upper conditional probability are a useful tool to find a simple probabilistic solution for this kind of examples, in a way that fits and encompasses the solution obtained via the belief function approach. For the sake of interested (and not acquainted) reader, we will recall (in that Chapter) the main definitions (for details, see [116]) concerning belief functions and Dempster's rule of combination.
Chapter 16 Inference 16.1
The general problem
We refer to an arbitrary family 1£ = {HI! H 2 , ••• , Hn} of events (hypotheses), i.e. 1£ has neither any particular algebraic structure nor is a partition of the certain event n. We detect logical relations among the given events (the latter could represent, e.g., some possible diseases), and some further information is carried by probability assessments, relative to an event E (e.g., a symptom) conditionally to some of the Hi's ("partiallikelihood"). If we assess (prior) probabilities for the events Hi's, ensuing problems are: (i) Is this assessment coherent? (ii) Is the partial likelihood coherent "per se"?
(iii) Is the global assignment (the initial one together with the likelihood) coherent? If the relevant answers are all YES, we may try to "update" (coherently) the priors P(Hi) into the posteriors P(HilE). This is an 137 G. Coletti et al., Probabilistic Logic in a Coherent Setting © Kluwer Academic Publishers 2002
138
CHAPTER 16
instance of a more general issue, dealt with in Chapter 13: the problem of coherent eztensions. A very particular case is Bayes' updating for exhaustive and mutually exclusive hypotheses, in which this extension is unique. In the general case the lack of uniqueness gives rise to upper and lower updated probabilities, and we could now update again the latter, given a new event F and a corresponding (possibly partial) likelihood. In this Chapter we discuss many relevant features of this problem (keeping also an eye on the distinction between semantic and syntactic aspects). To start with, the first "natural" step is that of detecting all possible logical relations among the given events: in fact, as we have seen in Chapter 4, Remark 3, if there are no logical relations among the n events (that is, if the number of the relevant atoms equals 2n), any assessment (with values between 0 and 1) is coherent. This result has been extended to conditional events in [73]. Obviously, logical relations reduce the number of relevant atoms. Further information is carried by suitable conditional probability assessments, relative to an (observable or observed) event E (that could possibly be, with reference to the medical problem, a symptom or the evidence coming from a suitable test) conditionally to some of the Hi's, or to events obtained from them through elementary logical operations. We shall call "partial likelihood" this set of conditional probabilities. Going back to the previous questions (i}, (ii}, (iii}, some remarks are now in order: • is the initial assessment coherent? The syntactic counterpart of this concept is the requirement that the function P defined on the set 1l can be extended as a probability on the minimal algebra generated by 1l (see Chapter 4: we just need to check the compatibility of system (4.1)); • is the partial likelihood coherent "per se"? Notice that usu-
INFERENCE
139
ally likelihoods come from observed frequencies: if we refer just to a single conditional event, its probability can be assessed by an observed frequency in the past (since a frequency is a number between 0 and 1, and this is a necessary and sufficient condition for coherence when only a single conditional event is considered). But things are not so easy when further events (conditional or not) are involved, since consistency problems (coherence!) must then be taken into account, due to the circumstance that the relevant conditioning events are not, in this general case, incompatible (cf. Theorem 5). Coherence for conditional assessments has been the subject of Chapter 11; • is the global assignment (the initial one together with the likelihood) coherent? As it will be shown in the following two examples, the whole assignment (prior probabilities and likelihood) can be incoherent even if the two separate assessment were not. So it is better, in these circumstances, to avoid a hasty Bayesian updating of probability assessments (since it can lead to wrong conclusions) and to resort instead to a direct check of coherence.
16.2
The procedure at work
Example 23 (continuing Examples 4 in Chapter 2, and Example 8 in Chapter 4) - We recall that a doctor had considered three possible diseases H 1 , H 2 , H 3 with the logical condition H 3 c Hf 1\ H 2 , giving the coherent assessment
(see the two aforementioned examples). The doctor considers now the event E = "pressing in particular points of the abdomen does not increase pain", and he gives the
140
CHAPTER 16
following relevant logical and probabilistic information
E
1\
Ha
= 0 , P(EIH2) = ~, P(EIH~) = ~.
Obviously, the latter assignment is coherent, since it refers to a {trivial) partition {with respect to the conditioning events). If we update the {prior} probability of H 2 by means of the above likelihood {through Bayes' theorem}, we get
This new probability of H 2 is coherent with the previous probabilities of H 1 and Ha. To prove that, consider the atoms obtained when we take into account also the new event E :
= A1 1\ Ec , B1 = A4 1\ Ec ,
= A2 1\ Ec, Bs = As 1\ E ,
B4
= Aa 1\ Ec , Bg = As 1\ Ec .
Bs
B6
To check coherence we consider the following system, with unknowns
Yi = P(Bi) Y1
+ Y2 + Y4 + Ys =
Y1 = Y1
i
+ Ya =
~ (YI
~
+ Y2 + Ya + Ys)
9
LYi= 1 i=l
Yi 2 0. It is easily seen that the system (So) has {infinite) solutions and, since there are also solutions such that
Y1
+ Y2 + Ya + Ys > 0 ,
INFERENCE
141
this is sufficient to ensure that the assessment is coherent. This is true even if we take into account the updating of the probability of H 3 , that is P(H3 IE) = 0: in fact this corresponds to ignoring the second equation of system (80 ). But to consider this assessment as an updating of the previous one can be a too hasty (and wrong) conclusion, since the value of P(H2 IE) has been obtained by considering in fact as "prior" the assessment
P(Hc) 2
= ~5'
and not that actually given by the doctor, which involves also the evaluation of P(H1 ) and P(H3 ). The updating of that assessment obviously requires that the "whole" prior and the likelihood must be jointly coherent. Instead in this case coherence does not hold: considering indeed the following system
+ Y2 + Y4 + Y5 = ~ Y1 + Y3 + Y4 + YB + Y1 =
Y1
*
Y1 = Y1 + Y3 Y2
k
= ~ (YI + Y3 + Y4 + YB + Y1)
+ Ys =
HY2
+ Y5 + Ys + Y9)
9
LYi
=1
i=l
Yi 2': 0 ,
simple computations {solving for Y1 +y3 the fourth and the second eq. and inserting this and the third eq. into the second one) show that it does not admit solutions, so that the assessment is not coherent. The following example shows that even the "local" coherence of prior and "pseudoposterior" obtained in the previous example was just accidental.
142
CHAPTER 16
Example 24 - A patient feels a severe back-ache together with lack of sensitiveness and pain in the left leg; he had two years before a lung cancer that was removed by a surgical operation. The doctor considers the following exhaustive hypotheses concerning the patient situation: H 1 =crushing of L5 and Sl vertebrae, H 2 =rupture of the disc, H 3 =inflammation of nerve- endings, H 4 =bone tumor. The doctor does not regard them as mutually exclusive; moreover, he assumes some logical relations:
H1 Hf
A A
H4 A (H1 V H2 V H3) = 0, H2 A H3 = 0 , H1 A H~ A H~ = 0 , H2 A H~ = 0 , Hf A H~ A H3 = 0 .
Correspondingly, we have only the four atoms
A3 = Hf
A
H2
A
H3
A H~
,
A4 = Hf
A H~ A H~ A
H4 .
The doctor makes the following probabilistic assessments
Its coherence is easily checked by referring to the usual system with unknowns Xr = P(Ar), which has a unique solution 1 Xl
= 12 '
X4
=
1
2.
Let now E be the event E = an X-ray test is sufficient for a reliable and decisive diagnosis so that
INFERENCE
143
The doctor assigns the likelihood P(EIHI) =
~,
P(EIHf) =
~.
If we update the (prior) probability P(H1) by the above likelihood through Bayes' theorem, we get P(H1IE) = ~- But now (contrary to the situation of Example 23) this updated probability of H 1 is not coherent with the given probabilities of H 2 and H3 • Notice in fact that the atoms obtained when we take into account the new event E are exactly those generated by the events Hi, so that to check coherence we need to study the solvability of the system, with unknowns Xr = P(Ar), XI = X4
~(x1
+ x4)
= ~
=~ X2 + X3 = I52
XI+ X3
4
LXi = 1 i=l
But the first two equations give system is inconsistent.
X1
~ , hence x 3
< 0, so this
The circumstance that the whole assignment (prior probabilities and likelihood) can be incoherent even if the two separate assessment are not, cannot occur in the usual case where Bayes' theorem is applied to a set of ezhaustive and mutually ezclusive hypotheses: this is clear by Theorem 5 (Chapter 11). In fact, looking at the systems (Sa) introduced in Chapter 11 to characterize coherence, each equation (corresponding to the "product rule" of probability) is "independent" from the others, since the events Hi's have no atoms in common, and so each equation (and then the system) has trivially a solution.
144
CHAPTER 16
When the answers to the previous questions (i), (ii), (iii) are all YES, the next aim is to face the problem of "updating" the priors P(Hi) into the posteriors P(HiiE). In general, the problem of coherent extensions can be handled by Theorem 6 of Chapter 13: if C is a given family of conditional events and P a corresponding assessment on C, then there exists a (possibly not unique) coherent extension of P to an arbitrary family Q of conditional events, with g 2 C, if and only if the assessment P is coherent on C. Since the lack of uniqueness gives rise to upper and lower updated probabilities, to go on in the updating process (i.e., to update again the "new" conditional probabilities - possibly upper and lower - given a new event F and a corresponding - possibly partial-likelihood) we must resort to the general Theorem 9 (Chapter 15) characterizing upper and lower probabilities. This will be shown in the final part of the following example (which shows also that, if coherence of the "global" - i.e. prior and likelihood together - assessment holds, it is possible to update (prior) probability by Bayes' rule - also in situations in which the given events are not mutually exclusive - by resorting to the partitions given by the relevant atoms). Example 25 - A patient arrives at the hospital showing symptoms of choking. The doctor considers the following hypotheses concerning the patient situation:
H1 =cardiac insufficiency, H 2 =asthma attack, H 3 = H 2 A H, where H = cardiac lesion. The doctor does not regard them as mutually exclusive; moreover, he assumes the following natural logical relation: Correspondingly, we have the atoms
INFERENCE
145
A4 = H 1 1\ H~ 1\ H~ , As = Hf 1\ H~ 1\ H~ . The doctor makes the probability assessments
P(H1) =
1
1
2 , P(H2) = 3 , P(H3)
=
1
5 , P(H1 V H2)
=
3
5.
(16.1)
Its coherence is easily checked by referring to the usual system with unknowns Xr = P(Ar), which has a unique solution 1
XI
= 5'
X2
1 30 '
=
X3
=
1 10 '
2
4 X4
= 15 '
Xs
= 5.
Let now E be the event E = taking medicine M against asthma does not reduce choking symptoms . Since the fact E is incompatible with having asthma attack (H2 ), unless the patient has cardiac insufficiency or lesion (recall that H 3 implies both H 1 and H 2 }, then
H2
1\
Hf
1\
E = 0.
The doctor now draws out from his database the ''partial likelihood" P(E!HI)
=
3 10 ,
(16.2)
Then the process of updating starts by building the new atoms
B4 = A4 1\ Ec , B7
= A2 1\ E
,
Bs = As 1\ Ec , Bs
= A4 1\ E
,
B6 = A1
1\ E ,
Bg = As 1\ E ,
and to check coherence we need to consider the usual system (Sa) with unknowns Yi = P(Bi) and whose first six equations come from {16.1} and {16.2). Given .X and J1, with 7
7
- 0"" x~ x1 - ~ z
i=1 8
x~ + x~ +
xg + xA ~ 0 L xi i=1
xi1 + xi2 +xi3-- 8'!..(xiI + xi2 +xi3 +xi5 +xi6 +xi) 7 x1I +xi2-- 6.!! (xi1 + x12 + xi5 + xi) 6 8
ExJ = i=I
xf
1
~ 0.
Among all possible solutions there is, given J-t with 14 0< u 1. In conclusion, we proved that, given the family
the corresponding assessment
{9 0 0 7 23 ' ' ' 8'
~6}
is a coherent lower probability. We are now going to prove that the assessment 7 9 7 5} {9 16 ' 23 ' 20 ' 8 ' 6 is a coherent upper conditional probability for the same family of conditional events. We need to consider a system which is ("mutatis mutandis") the analogue of (s~) for lower probabilities, that is
8
u 11
< -
.l 23 ""u~ L..J ~ i=1 8 "" 1 + u51 + u61-< 9 20 L..J ui
1+ 1 u1 u2
i=1
u~ + u~ + u~ = ~ (u~ + u~ + u~ + u~ u 11 + u 2 1 -- 6 §. (u 11 + u 21 + u 51 + u 61)
+ u~ + u~)
8
L:ui = 1 i=1
uf
~ 0,
which has a solution such that 1 u1
7
= 23
'
1 u2
2
= 69
'
1 u3
61
= 384
1
' u5
1
+ u6 =
1 15 '
CHAPTER 16
150 1
u4
1
+ Us =
7
7
1
16 ' u7 = 1920 .
This is a solution also of the system obtained from the previous one changing the second inequality into an equality. Then it remains to be proved that has a solution the system
s 3 u1
3
3 u5
3_9""'3 u6 - 20 ~ ui i=1
+ u2 + + u31 + u32 + u33_- sI (u31 + u32 + u33 + u35 + u36 + u3) 7 u~ + u~ = ~ (u~ + u~ + u~ + u~) s
:Eu~ = 1 i=1
u~
2: 0.
It is easily seen that there is a solution such that 3 u1
=
0
'
u3 _ ~
3
8 '
2 -
3
u4
u3 3
= 2
3
20 '
+ Us = 5 '
3 u5
3_0
u7 -
3
+ u6 =
3
40 '
.
Now it is possible to go on by introducing a new conditional event and checking its coherence (as briefly discussed at the end of Section 15.2): the relevant range is a suitable closed interval.
Remark 14 - In the previous example, two among the values of the updated lower probability P of the Hi's were equal to zero. To go on in the updating process, these values have been taken as new "prior" assignments: then it is of paramount importance (also from a practical point of view) to have a theory privileging the possibility of managing conditioning events of zero probability (since they may appear in the relevant likelihood}.
INFERENCE
16.3
151
Discussion
Notice that an important syntactic consequence of our choice (to deal only with those "imprecise" probabilities 'P and 'P' arising as coherent extensions, so that they are lower and upper probabilities) is the following: since the relevant enveloping probability distributions (those singling-out lower and upper probabilities) are unique, there is no ambiguity concerning the information "carried" by 'P and 'P' (see the discussion at the end of Chapter 6). On the other hand, we prefer to rely more on the syntactic aspects than on the semantic ones, so avoiding any deepening of vague statements such as "losing" or "carrying" information, which are not clearly and unambiguously interpretable, especially in the framework of the so-called "imprecise" probabilities. For example, does it carry more information a precise assessment p, with p = .5, or an imprecise one [p', p"], with p' = .8 and p" = .95? If this question had any essential significance, we would prefer - in this case - an "imprecise" conclusion (since it looks more "informative"). Summing up, the procedure applied to the previous specific examples (to handle uncertainty in the process of automatic medical diagnosis) can be put forth in general, as expressed in the next Theorem 10. First of all, we need to consider the following starting points: • consider a family of hypotheses (that is, events Hi (with i = 1, 2, ... , n) represented by suitable propositions) supplied by You: they could explain a given initial piece of information referring to the specific situation. No structure and no simplifying and unrealistic assumption (such as mutual exclusiveness and exhaustivity) is required for this family of events; • detect all logical relations between these hypotheses, either
152
CHAPTER 16
already included in the knowledge base, or given by You on the basis of the specific situation; • assess probability of the given hypotheses. Clearly, this is not a complete assessment, since these events have been chosen by You as the most natural according to your experience: they do not constitute, in general, a partition of the certain event 0, and so the extension to other events of these probability evaluations is not necessarily unique. • refer to a data base consisting of conditional events ElK and their relevant probabilities P(EIK), where each event K may represent a possible information which is in some way related to the given hypotheses Hi, while each evidence E (regarded as assumed) is an event coming as the result of a suitable evidential test. These probabilities could have been obtained by means of relevant frequencies and should be recorded in some files. Then, once this preliminary preparation has been done, the first step of our procedure consists in building the family of atoms (generated by the hypotheses H 1 , H 2 , ... , Hn): they are a partition of the certain event, but they are not the "natural" events to which You are willing to assign probabilities. Nevertheless these atoms are the main tool for checking the coherence of the relevant assessment: in fact coherence amounts to finding on the set of atoms (by solving a linear system) a probability distribution (not necessarily unique) compatible with the given assignment. If the assessment turns out not being coherent, You can be driven to a different assignment based on the relevant mathematical relations contained in the corresponding linear system. Another way-out is to look for suitable subfamilies of the set {H1, H2, ... , Hn} for which the assignment is coherent, and then proceed by resorting to the extension theorem. On the contrary, coherence of the probabilities P(Hi) allows to
INFERENCE
153
go on by checking. now the coherence of the whole assessment including also the probabilities P(EIK). This requires the introduction of new atoms, possibly taking into account all logical relations involving the evidences E and the hypotheses Hi. In particular, some of the latter may coincide with some K. As the previous examples have shown, the whole assignment (prior probabilities and likelihood) can be incoherent even if the two separate assessment were not. On the basis of the results obtained by means of the evidential tests, You can now update the probabilities of the hypotheses Hi, i.e. You assess the conditional probabilities P(HiiE). Then You need to check again coherence of the whole assessment including the latter and the former probability evaluations. When prior probabilities and likelihood are jointly coherent, You can get formulas representing each posterior probability (of an hypothesis Hi given an evidence E) by Bayes' theorem P(H·IE) = P(Hi)P(EIHi) ' P(E) ' but the denominator P(E), with P(E) > 0, cannot by computed by the usual "disintegration" formula n
P(E) =
L P(Hi)P(EIHi) ,
(16.3)
i=l
since the Hi's are not a partition. Nevertheless we can express P(E) in terms of the atoms, but this representation is not unique, since the corresponding linear system may have more than just one solution: computing upper and lower bounds of P(E) we get, respectively, lower and upper bounds for the posterior probabilities P(HiiE). In conclusion Theorem 10 - Let 1l = {H11 ••• , Hn} be an arbitrary set of events {"hypotheses") and {P(H1), ... , P(Hn)} a coherent assessment ("prior" probabilities). Given any event E ("evidence"), a
CHAPTER 16
154
set of events IC = {K 11 ••• , Km} {possibly Ki = Hi for some j and i) and the relevant coherent assessment {P(EIKI), ... , P(EIKm)} {"likelihood"), then there exists a (not necessarily unique) assessment {P(HdE), ... , P(HniE)} (''posterior" probabilities) if and only if the global assessment
is coherent as well. In particular, if Ki = Hi for some j and i , denote by Ar the atoms generated by 1lUICU{E} and by P the family of conditional probabilities extending the global assessment also to the events Hi lE (i = 1, ... , n); if inf P(Ar) > 0, then P(HiiE) E [p',p"],
L
'P Ar 0, so that we will consider the following three situations
= 0;
• (1)
P(H) > 0, P(E)
• (2)
P(H) = 0 , P(E) = 0;
• (3)
P(H) = 0, P(E) > 0.
(1) Evidence has zero probability (and P(H) > 0) Since P(H) > 0, then P(E) = 0 if and only if P(EIH) system (So) becomes
x2 + X3 = P(H)(x1 + x2 + X3 + x4) X1 + X2 = 0 · (xl + X2 + X3 + X4) x2 x2 X1
= 0 · (x2 + x3) =
P(HIE)(xl
+ x2)
+ X2 + X3 + X4 = 1
Xr;:::
0,
= 0; so
157
INFERENCE and we get XI = x 2 second system is
(SI)'
= 0,
X3
= P(H),
x3
Y2 = P(HIE)(yi { YI + Y2 = 1 Yr ~ 0;
+ x4 =
1, so that the
+ Y2)
it follows easily that the posterior P(HIE) can take any value Y2 E [0, 1]. A noticeable consequence of this result concerns the so-called Jeffreys-Lindley paradox, which refers to the Bayesian approach to the classical problem of testing a "sharp" null hypothesis: it goes back to the pioneering work of H. Jeffreys [85] and D. Lindley [95], and it is regarded as a controversial issue, since a sharp null hypothesis may be rejected by a sampling-theory test of significance, and yet a Bayesian analysis may yield high odds in favor of it. (A simple resolution in terms of "vague" -qualitative- distributions through the concept of pseudodensity [110] has been given in [74]). The problem is the following : suppose that the hypothesis H0 = {0 = 00 } (concerning the value of an unknown parameter) is singled-out to be tested, since it is in some way special, against the alternative HI = {0 =/:. 00 } , on the basis of a measurement x of a random variable X (usually a Gaussian density, with unknown mean 0). In the usual Bayesian approach, the prior distribution 1r for 0 assigns a "lump" of probability 7r0 > 0 to the null hypothesis Ho, while the "remainder" 7ri ( 0) of the prior distribution on HI is given a suitable absolutely continuous distribution. A straightforward use of Bayes' theorem leads to a posterior ratio
P(Holx)
P(Hiix) (for details, see [74]) which can take on, for a sufficiently large prior variance, any arbitrary large value, whatever the data and whatever
CHAPTER 16
158
small is 1f0 > 0. We have already pointed out (at the beginning of this Section) the objections that can be raised against arguments based on "improper" mathematical tools, so it is not surprising that they may lead to paradoxical conclusions. Nevertheless, the previous computations in terms of coherence show that it does not make sense to give Ho a positive probability, pretending - on the basis of the evidence { E = x} - to draw conclusions (by Bayes' theorem) on the posterior P(Ho!E), since the latter can take - coherently - any value in [0, 1] , independently of the distribution of all other hypotheses (constituting the event HI). Notice that we have anyway, for the relevant "partial" likelihood, the value P(E!Ho) = 0. A further understanding can be reached by the study of the second case
(2) Prior and evidence both have zero probability The first system, (So)'', in this case gives easily XI + x 2 = 0 , i.e. the solution XI = x 2 = x 3 = 0 , x 4 second system becomes
XI
+ x3
= 0,
= 1 , and the
= P(H!E)(YI + Y2) Y2 = P(E!H)(Y2 + Y3) YI + Y2 + Y3 = 1
Y2
(SI)''
Yr
~
0.
If P(EIH) = 0 (so that y2 = PI(E/\H) = 0 ), we may have different solutions of (SI)": in fact, recalling that P(H) = 0, and hence that for the zero-layer of H we have o(H) > 0, the different solutions may correspond to different choices of o (H) . Take o(H) = 1: this means y2 +y3 > 0 (but recall that y2 = 0 ), and we have a solution with YI + y 2 = 0 (and so o(E) = 2) and
159
INFERENCE y3 = 1 ; then the third system is
(S2)"
z2 = P(HIE)(zl + z2) { Z1 + Z2 = 1 Zr ~
0,
that is P(HIE) = z2 : the posterior can take any value in
[0, 1]. ·Again: the evidence E, with o(E) = 2, has no influence on H, with o(H) = 1 (notice that in (1) we found, analogously, that E, with o(E) = 1, has no influence on H, with o(H) = 0 ). Still assuming o(H) = 1 (i.e. Y2 + y3 > 0 ), another solution of (SI)" is clearly, for 0 < A < 1, Y1 = A, Y2 = 0, y3 = 1- A, which gives P(HIE) = 0 (notice that now o(E) = 1, and the posterior is no more arbitrary). Is this zero posterior "more believable" than the zero prior? We have
o(HIE) = o(H A E) - o(E) = 2 - 1 = o(H) = 1 , that is prior and posterior are on the same layer (but a further comparison could be done "inside" it: see below, under (3)). Consider now the case o(H) = 2: this means, by (S1 )", that y2 = Ya = 0, and so Y1 = 1; it follows P(HIE) = 0 and o(E) = 1. The third system is
(S2 )"'
z2 = 0 · (z2 + za) { Z2 + Z3 = 1 Zr ~
0,
then z2 = 0, z 3 = 1 (it follows o(H A E)= 3 ). Now, a reasonable prior assumption, to distinguish a "sharp" null hypothesis Ho to be tested against the alternative H 1 ::/=Ho, is to choose o(H0 ) = 1 and o(HI) = 2. As we have just seen, we get P(HoiE) = P(H1 IE) = 0, and to compare these two zeros consider
o(HoiE) = 2 - 1 < o(H1IE) = 3 - 1;
160
CHAPTER 16
then the zero posterior P(HoiE) is "more believable" than the zero posterior P(H1IE) . Going on in taking into account all possible combinations of the probability values, we consider now the case P(EIH) > 0: the system (SI)" gives easily, putting a = P(EIH) and b = P(HIE) , with a + b > 0 , a unique solution YI
=
a(1 -b) a(1- b)+ b ' y 2
=
ab a(1- b)+ b' Ya
=
b(1 -a) a(1- b)+ b ·
Since P(H!E)
= y2 + Ya P(E!H) , Y1
+ Y2
the latter equality can be written (if (y2 + y3 )(y1 + y2 ) > 0, that is, if o(H) = o(E) = 1) as P(H!E) = P1(H) P(E!H) P1(E) .
It follows that P(HIE) > 0 if and only if P(EIH) > 0 (even if P(H) = 0 ). In conclusion, since P1(H) P(H!HV E) g(E) - P(EIH V E) '
the values of the posterior P(HIE) fall in a range which depends on the "ratio of the two zero probabilities P(H) = P(E) = 0".
(3) Prior has zero probability (and P(E) > 0) The system (So) gives easily: x 2 + x 3 = 0, x 1 = P(E), and x 2 = P(EIH) (x2 + x 3 ) • It follows P(HIE = 0, and the second system is (S1 )'"
Y2 { Y2
= P(E!H)(Y2 + Ya) + Ya
Yr ~ 0,
= 1
INFERENCE
161
so that P(EIH) = y2 , with o(H) = 1, o(E) = 0, while
Y2
arbitrary in [0, 1]. Notice that
1 if P(EIH) > 0 o(H 1\ E)= { 2 if P(EIH) = 0. It follows 1 if P(EIH) > 0 o(HIE) = o(H 1\ E) - o(E) = { 2 if P(EIH) = 0 . This means that, if the likelihood is zero, the posterior is a "stronger" zero than the zero prior; if the likelihood is positive, prior and posterior lie in the same zero-layer, and they can be compared through their ratio, since Bayes' theorem can be given the form P(HIE) P(H)
P(EIH) P(E) .
Among the results discussed in this Section, we emphasize that priors which belong to different zero-layers produce posteriors still belonging to different layers, independently of the likelihood.
Chapter 17 Stochastic Independence a Coherent Setting
• Ill
As far as stochastic independence is concerned, in a series of papers ([28], (29], (33], (36]) we pointed out (not only for probabilities, but also for their "natural" generalizations, lower and upper probabilities) the shortcomings of classic definitions, which give rise to counterintuitive situations, in particular when the given events have probability equal to 0 or 1. We propose a definition of stochastic independence between two events (which agrees with the classic one and its variations when the probabilities of the relevant events are both different from 0 and 1), but our results can be extended to families of events and to random variables (see (123]). We stress that we have been able to avoid the situations - as those in the framework of classic definitions - where logical dependence does not (contrary to intuition) imply stochastic dependence. Notice that also conditional independence can be framed in our theory, giving rise to an axiomatic characterization in terms of graphoids ; and this can be the starting point leading to graphical models able to represent both conditional (stochastic) independence 163 G. Coletti et al., Probabilistic Logic in a Coherent Setting © Kluwer Academic Publishers 2002
CHAPTER 17
164
and logical dependence relations. This issue has been thoroughly addressed in (124], and so we will not deal here with conditional independence. Finally, we maintain that stochastic independence is a concept that must be clearly kept distinct from any (putative} formalization of the faint concept of "causality". For the sake of brevity, we shall use in this Chapter the loose terminology "precise" and "imprecise" when referring, respectively, to probabilities or to lower (upper} probabilities.
17.1
"Precise" probabilities
We start by discussing stochastic independence for precise probabilities. The classic definition of stochastic independence of two events A, B, that is P(A 1\ B)
= P(A)P(B) ,
may give rise to strange conclusions: for example, an event A with P(A) = 0 or 1 is stochastically independent of itself, while, due to the intuitive meaning of independence (a concept that should catch the idea that being A independent of B entails that assuming the occurrence of B would not make You change the assessment of the probability of A), it is natural to require for any event E to be dependent on itself. Other formulations of the classic definition are P(AIB) = P(A) and that are equivalent to the previous one for events of probability different from 0 and 1: actually, without this assumption the latter
CS-STOCHASTIC INDEPENDENCE
165
two formulations may even lack meaning, due to the usual definition of conditional probability P(EIH), which requires the knowledge (or the assessment) of the "joint" and "marginal" probabilities P(E A H) and P(H) , and the ensuing positivity of the latter. As widely discussed in previous Chapters, in our· approach conditional probability is instead directly introduced as a function whose domain is an arbitrary set of conditional events, bounded to satisfy only the requirement of coherence, so that P(EIH) can be assessed and makes sense for any pair of events E, H, with H ::/= 0; moreover, the given conditional probability can be extended (possibly not uniquely) to any larger set of conditional events preserving coherence. We recall a notation introduced in Chapter 2: given an event E, the symbol E* denotes both E and its contrary Ec; so the notation, for example, A*IB* is a short-cut to denote four conditional events: AIB' AIBC' ACIB' ACIBC.
Here is the definition of stochastic independence between two events:
Definition 13 - Given a set£ of events containing A, B, Ac, Be, with B ::/= n, B ::/= 0, and a coherent conditional probability P, defined on a family C (of conditional events) containing the set V= {A*IB*,B*IA*} and contained in£ x £ 0 , we say that A is cs-stochastically independent of B with respect to P {in symbols AJLcsB, that is: independence in a coherent setting) if both the following conditions hold:
{i) P(AIB) = P(A!Bc) ; {ii) there exists a class P = {Pa} of probabilities agreeing with the restriction of P to the family V, such that
where the symbol o(·l·) denotes the zero-layer of the relevant conditional event.
166
CHAPTER 17
Remark 15 - Notice that if 0 < P(AIB) < 1 {these inequalities imply also 0 < P(AciB) < 1} and if condition {i} holds (so that also 0 < P(AIBc) < 1 and 0 < P(AciBc) < 1}, then both equalities in condition {ii} are trivially (as 0 = 0) satisfied. Therefore in this case condition AJLcsB should coincide with the classic one : nevertheless notice that the latter would require the assumption 0 < P(B) < 1, so that our approach actually covers a wider ambit, since to give sense to the two probabilities under {i} the aforementioned assumption is not needed in our framework. If condition (i) holds with P(AIB) = 0, then the second equality under {ii} is trivially satisfied, so that stochastic independence is ruled by the first one. In other words, equality {i} is not enough to assure independence when both sides are null: it needs to be "reinforced" by the requirement that also their zero-layers (singled-out by the class {Pa}) must be equal. Analogously, if condition {i} holds with P(AIB) = 1 (so that P(AciB) = 0}, independence is ruled by the second equality under {ii). Example 26 - Going back to Example 6 (re-visited also as Examples 15 and 16}, consider A = H 2 1\ S 1 and B = H 2 • Clearly, P(AIB) = P(AIBc) = 0; we seek now for the relevant zero-layers. Since the atoms generated by A and Bare A1 = AI\B, A 2 = Aci\B, A 3 = A c A Be , it is easy to check that every agreeing class gives the zero-layers the following values o(AIB)
=1,
o(AIBc)
= +oo.
Therefore A is not cs-independent of B, a circumstance that makes clear the important role played by the equality of the two zero-layers: in fact A and B are even logically dependent/ The previous example points out the inability of probability (alone) to "detect" logical dependence, which parallels the inability of zerolayers (alone) to "detect" stochastic dependence (in fact, when 0
6], ~~~~
~~~
Lr P~(Ar) ~a Lr P~(Ar) , Ar~AB
Ar~B
Lr P~(Ar) ~a Lr P~(Ar)'
Ar~ABC
Ar~Bc
Lr P~(Ar) ~ (3 Lr P~(Ar) , Ar~AB
(S~)
Ar-.I)
PARADIGMATIC EXAMPLES
193
for any interval I and real>.. By choosing as>. a power of 10, it follows that, for any integer k between 1 and 9, and for any natural number n, P(Ikn) = 0, so {18.2} cannot hold. Instead, in a finitely additive setting, these equalities are compatible with the above value of P(Ek), since, by superadditivity (an elementary property of finitely additive measures on a countable partition}, we have 00
P(Ek) ~
L P(Ikn) · n=O
How to find a suitable {finitely additive) probability distribution satisfying {18.1} is shown, e.g., in {110}.
18.2
Stochastic independence
The first-digit problem is apt also to discuss situations concerning stochastic independence in our coherent setting versus the classic one. Example 31 - With the same notation of the previous Example, for any given natural number n, we have
while Ek and Ikn are clearly not independent {neither logically nor stochastically). In fact, for any given natural number n we have
which is different (referring now to Definition 13, Chapter 17} from
194
18.3
CHAPTER18
A not coherent "Radon-Nikodym" conditional probability
Consider the conditional probability P(EIH): even if allowing the conditioning event H to have zero probability gives rise to subtle problems, nevertheless this conditional probability has, in our framework, all the .. . "civil rights", since it can be directly assessed through the concept of coherence. On the other hand, in Kolmogorov's axiomatic approach, in which the formula P(EIH) = P(E 1\ H) P(H)
(assuming P(H) > 0) is taken as definition of the conditional probability, a difficulty immediately arises when absolutely continuous distributions are considered, since in this case zero probabilities are unavoidable. In order to recall in the shortest and most elementary way the procedure followed (in the usual approach) to cope with these difficulties, we will adopt an informal exposition, sketching the main ideas and avoiding any detailed and rigorous specification. Neither we shall recall each time explicitly that all the probability distributions of the classical framework must verify countable additivity (and not only finite additivity, which is the natural requirement in a framework based on coherence). Let (X, Y) be a random vector and P the relevant probability distribution. Given two Borel sets Ax and By contained respectively in the range of X and in that of Y, by the same symbols we shall denote also the events {X E Ax} and {Y E By}. For any given value x of X, the conditional probability p(Bylx) is defined (see, e.g., [10]) as a function of x such that P(Ax
n By)=
j
Ax
p(Bylx)p.(dx)
(18.3)
PARADIGMATIC EXAMPLES
195
where f.-t is the marginal distribution of X. The existence of such a function p(Bylx) is warranted (under usual regularity conditions) by Radon-Nikodym theorem: in fact
P(Ax
n By)~ J-t(Ax),
so that, putting
it follows that the probability measure {3(·, By) is absolutely continuous with respect to f.-t; therefore (as it is well-known) f3 can be represented as a Lebesgue-Stieltjes integral (with respect to f.-t) of a density, i.e. of the function (of the argument x) denoted by p(Bylx) in eq. (18.3). Is this function entitled to be called "conditional probability"? Of course, in order to interpret p(Bylx) as the conditional probability of By given {X = x} it is necessary that p( ·lx) be a (countably additive) probability measure: this is true under suitable regularity conditions (that hold in the most common situations). Notice that p(·lx), being a density with respect to x, could be arbitrarily modified on a set of zero measure. Moreover, in the particular case that Ax reduces to a singleton {x} with probability J-t( {x}) > 0, we must have
P({x} n By)= J-t({x})p(Bylx);
(18.4)
and in fact in this case eq.(18.3) becomes eq.(18.4). For the sake of simplicity, we have considered a random vector (X, Y), but it should be clear that the previous results could have expressed by referring to two suitable partitions of the certain event !1 and by relying on the relevant extensions of the concept of integral and of the related measure-theoretic tools. Now, the main question is the following: is the above function p( ·I·) a coherent conditional probability? Let us consider the following
196
CHAPTER18
Example 32 - Given a E IR, let Hn = [a, a+~] for every n E IN, with J.L(Hn) > 0, and let H = {a}, with J.L(H) = 0. Given an event E, by Kolmogorov's definition we have
p
(E)H. ) = P(Hn 1\ E).
n
J.L(Hn)
'
then take a density p(E)x) defined by means of {18.3) P(Hn 1\ E) = { p(E)x)J.L(dx). }Hn Under usual regularity and continuity conditions we can write (by "mean-value" theorem) p(E)Hn) = p(E)x 0 ) for a suitable X 0 E Hn, so that lim p(E)Hn) = lim p(E)xo) = p(E)H). (18.5) n-too n-too Now, if we consider the events E' = E V H and E" = E 1\ He (recall that the probability of H is zero), we get P(E' 1\ Hn) = P(E" 1\ Hn) = P(E 1\ Hn) and then P(E')Hn) = P(E")Hn) = P(E)Hn) for every n E IN; it follows that also the three corresponding limits ( 18. 5) are equal, so, in particular, P(E'IH) = P(E"IH). But notice that coherence requires P(E'IH) = 1 and P(E"IH) = 0. The conclusion is that the adoption of the classical RadonNikodym procedure to define conditional probability, while syntactically correct from a pure mathematical point of view, can (easily) give rise to assessments which are not coherent (since they do not satisfy all relevant axioms of a conditional probability). Not to mention that it requires to refer not just to the given elementary conditioning event, but rather it needs the knowledge of the whole conditioning distribution: this circumstance is clearly unsound, especially from an inferential point of view, since P(E)x)
comes out to depend not only on x, but on the whole σ-algebra to which x belongs. A rigorous measure-theoretic approach to the relevant problems concerning a comparison between de Finetti's and Kolmogorov's settings in dealing with null conditioning events is in [11]; for an elementary exposition, see [112]. A complete and exhaustive expository paper (in particular, see its Section 4) is [7].
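A minimal numerical sketch of Example 32 may help; the specific choices below (μ uniform on [0, 1], a = 1/2, E = [0.3, 0.7]) are ours, purely for illustration, and are not in the text.

```python
# Numerical sketch of Example 32 (assumptions: mu = uniform on [0,1],
# a = 1/2, E = [0.3, 0.7]; these specific choices are ours).
a = 0.5
E = (0.3, 0.7)            # the event E, as an interval

def prob(lo, hi):
    # P under the uniform distribution on [0, 1]
    return max(0.0, min(hi, 1.0) - max(lo, 0.0))

for n in (10, 100, 1000, 10000):
    Hn = (a, a + 1.0 / n)                      # H_n = [a, a + 1/n]
    p_E_and_Hn = prob(max(E[0], Hn[0]), min(E[1], Hn[1]))
    # E, E' = E v H and E'' = E ^ H^c intersect H_n in sets of the same
    # measure, since mu({a}) = 0; so the three Kolmogorov values coincide.
    print(n, p_E_and_Hn / prob(*Hn))           # -> 1.0 for every n

# The common limit defines p(E|H) = p(E'|H) = p(E''|H) = 1, whereas
# coherence forces P(E'|H) = 1 (H implies E') but P(E''|H) = 0
# (E'' ^ H is impossible).
```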
18.4 A changing "world"
The situation described in the next example concerns the problem of how to assess "new" conditional probabilities when the set of (conditional) events changes. It has already been discussed (but only from the "logical" point of view concerning the possibility of a "finer" subdivision into atomic events) in Chapter 2, Example 3.

Example 33 - Given an election with three candidates A, B, C, we learn (or we assume) that C withdraws and that all his votes will then go to B: according to Schay [108], this situation involves probabilities for which the product rule

P(B ∧ H) = P(H) P(B|H),   (18.6)

with H = (A ∨ B), does not hold. Assuming that the (initial) probability of either one winning is 1/3, and denoting by the same symbols also the corresponding events, so that

P(A) = P(B) = P(C) = 1/3,

Schay argues as follows: one has P(A ∨ B) = 2/3 and P(B|H) = 2/3 (but notice that the only coherent choice for the latter conditional probability is 1/2, since both B and H have positive probability!); then, taking into account that B ∧ H = B gives for the left-hand side of (18.6) the value P(B) = 1/3, while the right-hand side of the product rule is (2/3)(2/3) = 4/9.
Actually, a careful singling-out of the "right" conditioning event (as discussed in Example 3) shows that it is not the event H = A ∨ B, but the event, outside the initial "space" {A, B, C},

E = C withdraws and all his votes go to B,

with E ⊂ H; so giving P(B|E) the value 2/3 looks like a more "convincing" assignment than giving P(B|H) this (incoherent) value. It is not difficult to prove that the assignment P(B|E) = 2/3 is not only convincing, but also coherent if P(E) ≤ 1/2: more precisely, a cumbersome (but simple) computation shows that the coherent assignments of P(B|E) are those in the interval

1 − 1/(3P(E)) ≤ P(B|E) ≤ 1/(3P(E));

in particular, if P(E) ≤ 1/3 any value (between 0 and 1) is coherent. So we cannot agree with Schay's conclusion that "it may along these lines be possible to incorporate the probabilities of quantum mechanics in our theory". On the contrary, certain paradoxes concerning probabilities that (putatively) do not satisfy the product rule, arising in the statistical description of quantum theory, may depend on the fact that observed frequencies, relative to different (and possibly incompatible) experiments, are arbitrarily identified with the values of a conditional probability on the same given space. Before discussing these aspects in the next example, some remarks are in order, recalling also what has been discussed at the end of Chapter 2 and in Chapter 8 about the careful distinction that is needed between the meaning of probability and the various methods for its evaluation.
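The "cumbersome (but simple) computation" can be delegated to a linear-programming solver: coherent values of P(B|E) are the ratios P(B ∧ E)/P(E) attainable over probability distributions on the atoms compatible with the given constraints. Here is a sketch; the atom decomposition is our own, and we assume scipy is available.

```python
# Coherent bounds for P(B|E) in Example 33, by LP over the atoms
# A^E, A^E^c, B^E, B^E^c, C (E is incompatible with C, since E implies H).
from scipy.optimize import linprog

def bounds_P_B_given_E(pE):
    # unknowns: p1=P(A^E), p2=P(A^Ec), p3=P(B^E), p4=P(B^Ec), p5=P(C)
    A_eq = [[1, 1, 0, 0, 0],    # P(A) = 1/3
            [0, 0, 1, 1, 0],    # P(B) = 1/3
            [0, 0, 0, 0, 1],    # P(C) = 1/3
            [1, 0, 1, 0, 0]]    # P(E) = pE
    b_eq = [1/3, 1/3, 1/3, pE]
    out = []
    for sign in (+1, -1):       # minimize, then maximize P(B^E)
        r = linprog([0, 0, sign, 0, 0], A_eq=A_eq, b_eq=b_eq,
                    bounds=[(0, 1)] * 5)
        out.append(sign * r.fun / pE)
    return out                  # [min, max] of P(B|E)

for pE in (0.5, 0.4, 1/3, 0.2):
    lo, hi = bounds_P_B_given_E(pE)
    print(pE, round(max(lo, 0.0), 3), round(min(hi, 1.0), 3))
# agrees with 1 - 1/(3 P(E)) <= P(B|E) <= 1/(3 P(E)), truncated to [0, 1]
```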
18.5 Frequency vs. probability
Even if it is true that in "many" cases the value of a probability is "very near" to a suitable frequency, in every situation in which
something "very probable" is looked on as "practically certain", there are "small" probabilities that are actually ignored, so making illegitimate also any probabilistic interpretation of physical laws. For example, a probabilistic explanation of the diffusion of heat must take into account the fact that heat could accidentally move from a cold body to a warmer one, making the former even colder and the latter even warmer. This fact is very improbable only because the "unordered" configurations (i.e., heat equally diffused) are far more numerous than the "ordered" ones (i.e., all the heat in one direction), and not because unordered configurations enjoy some special status. Analogously, when pressing "at random" 18 keys on a typewriter and forecasting the occurrence of any sequence different from "to be or not to be", we cannot consider it impossible that that piece of "Hamlet" could come out: if we were arguing in this way, it would mean also denying the possibility of explaining why we got just the sequence we actually got, since it had the same probability as "to be or not to be" of being typed.

So, why is it so difficult to see that piece by Shakespeare coming out - or else: to see water freezing on a fire - even in a long series of repetitions of the relevant procedure? It is just because their (expected) "waiting times" (inversely proportional to the corresponding probabilities) are extremely large (it has been computed that they are much larger than the expected life of our universe!). Notice that the difference between an impossible fact and a possible one - even one with a very small, or zero, probability (it is well known that we may have "many" possible events with zero probability) - is really enormous, since it is not a matter of a numerical difference, but of a qualitative (i.e., logical) one.

Going back to the connections between probability and observed frequency, the classical two-slit experiment, discussed from a probabilistic point of view by Feynman [62], is an interesting illustration
of the quantum mechanical way of computing the relevant probabilities (an interpretation in terms of coherent probability has been given in [113]).
Example 34 - A source emits "identically prepared" particles (in the jargon of the quantum community, preparation is the physical counterpart of the notion of "conditioning") toward a screen with two narrow openings, denoted S_1 and S_2. Behind the screen there is a film which registers the relative frequency of particles hitting a small given region A of the film. Measurements are performed in three different physical situations: both slits open, only slit S_1 open, only slit S_2 open. We introduce, for a given particle, the following event, denoted (by abusing notation) by the same symbol as the corresponding physical device:

A = the particle reaches the region A,

and, for i = 1, 2, the following two events:

S_i = the particle goes through slit S_i.

Moreover, since all the particles are identically prepared, we may omit the further symbol H (referring to preparation) in all conditioning events. The experimentally measured frequencies are usually identified, respectively, with the three probabilities P(A), P(A|S_1) and P(A|S_2). Repeated experiments can be performed letting a particle start from the source, and then measuring its final position on the film, to determine whether it is in the region A or not; moreover, we could "measure" P(A|S_1) or P(A|S_2) by activating an experimental device which allows the particle to reach the region A only through the slit S_1 or only through the slit S_2. The corresponding frequencies (of going through the relevant slit) are also identified with the probabilities P(S_1) and P(S_2). Now, irrespective of whether the device has been activated or not, and of what the issue was in case of activation, we may obviously
write, by the disintegration formula (see (16.3), Chapter 16),

P(A) = P(S_1) P(A|S_1) + P(S_2) P(A|S_2),   (18.7)
since this is an elementary property of conditional probability, an easy consequence of the relevant axioms. Instead, physical experiments give an inequality between the left- and right-hand sides of (18.7). Well, this circumstance cannot be used to "falsify" anything or to introduce a sort of "new kind of probability", since it refers in fact only to observed frequencies. Actually, observed frequencies (pertaining to different experiments) may not necessarily be identified with (and so used to compute) probabilities, and the previous discussion can be seen as an instance of the problem of finding a coherent extension of some beforehand given (conditional) probabilities (see Chapter 13). Interpreting A as A|Ω and S_i as S_i|Ω, the value P(A) given by (18.7) is a coherent extension of the conditional probabilities P(A|S_i) and P(S_i|Ω), while in general a value of P(A) obtained by measuring a relevant frequency may not be. In other words: while a convex combination (a sort of "weighted average") of conditional probabilities can be - as in eq. (18.7) - a probability, there is no guarantee that it could be expressed as a convex combination of conditional frequencies (corresponding to different and incompatible experiments).

In the previous example, the two incompatible experiments are not (so to say) "mentally" incompatible if we argue in terms of the general meaning of probability (for example, P(A|S_1) is the degree of belief in A under the assumption - not necessarily an observation, but just an assumed state of information - "S_1 is true"): then, for a coherent evaluation of P(A) we must necessarily rely only on the above value obtained by resorting to eq. (18.7), even if such a probability does not express any sort of "physical property" of the given event.
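The following toy computation makes the point explicit; the numbers are purely illustrative (they are not Feynman's data).

```python
# Sketch of the point made after (18.7): a coherent P(A) must be a convex
# combination of P(A|S1) and P(A|S2), while observed frequencies coming
# from three *incompatible* experiments are free to violate this.
p_A_given_S1 = 0.30    # frequency with only slit S1 open (illustrative)
p_A_given_S2 = 0.30    # frequency with only slit S2 open (illustrative)
p_S1, p_S2 = 0.5, 0.5  # frequencies of passing through each slit

coherent_P_A = p_S1 * p_A_given_S1 + p_S2 * p_A_given_S2   # eq. (18.7)
print(coherent_P_A)    # 0.30: necessarily between the conditional values

observed_f_A = 0.05    # both slits open: destructive interference at A
# 0.05 lies outside the convex hull {0.30}, so it cannot coherently be
# P(A) for the same assessment: the identification of frequencies across
# incompatible experiments with probabilities fails; (18.7) is untouched.
```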
18.6 Acquired or assumed (again)
The previous remarks pave the way for another important aspect involving the concepts of event and conditioning, and the ensuing "right" interpretation of the conditional probability P(E|H): we refer to the necessity of regarding an event always as an assumed, and not an asserted, proposition, as discussed in Chapter 2 and at the beginning of Chapter 11. The following example has been discussed by de Finetti in Chapter 9 of the book cited under [52].
Example 35 - Consider a set of five balls {1, 2, 3, 4, 5} and the probability of the event E that a number drawn from this set at random is even (which is obviously 2/5): this probability could instead be erroneously assessed (for instance) as equal to 1/3, if we interpret P(E|H) = p as "[the probability of E], given H" (which would literally mean "if H occurs, then the probability of E is p"), and not as a whole, i.e. as "the probability of [E given H]". In fact, putting H_1 = {1, 2, 3} and H_2 = {3, 4, 5}, the probability of E conditionally on the occurrence of each one of the events H_1 and H_2 is 1/3, and one (possibly both) of them will certainly occur.
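A direct enumeration confirms the figures of Example 35:

```python
# Direct check of Example 35 on the set {1, 2, 3, 4, 5}.
balls = {1, 2, 3, 4, 5}
E = {n for n in balls if n % 2 == 0}           # even numbers: {2, 4}
H1, H2 = {1, 2, 3}, {3, 4, 5}

P = lambda S: len(S) / len(balls)              # uniform probability
print(P(E))                                    # P(E) = 2/5
print(P(E & H1) / P(H1), P(E & H2) / P(H2))    # both 1/3
# H1 v H2 is certain, yet neither 1/3 value is "the" probability of E:
# P(E|H) is the probability of [E given H], not "the probability of E,
# given H".
```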
18.7 Choosing the conditioning event
Another illuminating example, concerning the "right" choice of the conditioning event, is the following.
Example 36 - Three balls are given: two of them are white and distinguishable (marked 1 and 2), the third one is black. One of the three corresponding events W_1, W_2, B is the possible outcome of the following experiment: a referee tosses a die and puts in a box the black ball or the two white ones, according to whether the result is "even" (event E) or "odd" (event O). In the former case the final outcome of the experiment is B, whereas in
the latter the referee chooses (as the final outcome of the experiment) one of the two white balls (and we do not know how the choice is made). Then we learn that, if W_1 was not the final outcome, "the referee shows 1 as one of the two remaining balls" (denote by A the event expressed by this statement). Actually, the referee does show that one of the two remaining balls is 1: what is the probability that B was the final outcome of the experiment?

This example is an "abstract" version of a classical one, expressed in various similar forms in the relevant literature (the three prisoners, the two boys in a family with two children one of which is a boy, the car and the goats, the puzzle of the two aces, etc.). Here also the problem is that of correctly expressing the available evidence, which is the event A = "the referee shows 1 as one of the two remaining balls" and not "1 is one of the two remaining balls". Obviously, the conditional probability of B is affected by one or the other choice of the conditioning event. Now, since A = E ∨ (O ∧ W_2) (in words: either the result of the die tossing is E, i.e. the final outcome is B, or the result is O and the referee has chosen, as final outcome, the ball 2; in both cases, W_1 is not the final outcome, and so the referee shows 1 as one of the two remaining balls), it follows that

P(A) = P(E) + P(O) P(W_2|O) = (1/2)[1 + P(W_2|O)] = (1/2)(1 + x),

where x is the probability that the referee chooses the ball 2 when the result of the die tossing is O. Notice that, even if the number x is not (or not yet) determined, it always makes sense, since (in our general framework, in which probability is a degree of belief in a proposition) it refers to the statement "the referee chooses the ball 2", which is a logical entity that can be either true or false.
Then we get

P(B|A) = P(B ∧ A)/P(A) = P(B)/P(A) = 1/(1 + x),

and, since x can be any number between 0 and 1, it follows that a coherent choice of P(B|A) is any number such that

1/2 ≤ P(B|A) ≤ 1.

In conclusion, for a sound interpretation of a conditional event and of conditional probability, a careful examination of subtleties of this kind is also essential. For example, if the referee is "deterministic", in the sense that he always takes the same ball (2 or 1) when the result of the die tossing is O, then P(B|A) = 1/2 or 1 (respectively), while if he chooses between the two balls by tossing a coin (x = 1/2), then P(B|A) = 2/3.
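A Monte Carlo sketch of Example 36 (the encoding of the experiment is our own) reproduces the formula P(B|A) = 1/(1 + x):

```python
# Simulation of Example 36: x is the chance that the referee picks
# ball 2 when the die shows "odd".
import random

def trial(x):
    even = random.random() < 0.5            # die: even vs odd
    outcome = 'B' if even else ('W2' if random.random() < x else 'W1')
    # A = "the referee shows 1 as one of the two remaining balls",
    # i.e. W1 was not the final outcome:
    return outcome, outcome != 'W1'

for x in (0.0, 0.5, 1.0):
    runs = [trial(x) for _ in range(200_000)]
    b_and_a = sum(1 for o, a in runs if o == 'B' and a)
    a_only = sum(1 for o, a in runs if a)
    print(x, round(b_and_a / a_only, 3))    # ~1.0, ~0.667, ~0.5 = 1/(1+x)
```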
18.8 Simpson's paradox
To study the effect of alternative treatments T and T^c on the recovery R from a given illness, a comparison is usually made between the two conditional probabilities P(R|T) and P(R|T^c) (evaluated by means of the relevant frequencies): then, considering two different subpopulations M and M^c (for example, males and females) and the corresponding pairs of conditional probabilities P(R|T ∧ M) and P(R|T^c ∧ M), or P(R|T ∧ M^c) and P(R|T^c ∧ M^c), situations may occur where one gets

P(R|T ∧ M) < P(R|T^c ∧ M)

and

P(R|T ∧ M^c) < P(R|T^c ∧ M^c)

for both subpopulations, while

P(R|T) > P(R|T^c).

This phenomenon is called Simpson's paradox or "confounding effect" (and M is the confounding event). If a confounding event (e.g., M) has been detected, then Simpson's paradox can be ignored by taking as frame of reference either the whole population or the two separate subpopulations, but there are no guidelines for this choice and, anyway, this process may be endless, since there may exist, besides M, many other confounding events not yet detected. A resolution has been given in [4], and it is discussed in the following example.
Example 37 - Referring to the symbols introduced above, the consideration of the conditional events R|T and R|T^c corresponds to conditioning on given (and incompatible) facts (see also the discussion of Example 34); in other words, they try to answer the question "given the treatment T (or T^c), did the patient recover?". It then appears more sensible to refer instead to the conditional events T|R and T^c|R (by the way, the first one is enough), which correspond to the question "given the recovery, has the patient been treated by T or by T^c?". Moreover, with this choice Simpson's paradox is avoided. In fact, suppose we agree that the inequality

P(T|R) > P(T^c|R)   (18.8)
means that the treatment T is more beneficial than T^c (with respect to the recovery R). Then, starting from the analogous inequalities referring to any (even unknown) confounding event C, that is

P(T|R ∧ C) > P(T^c|R ∧ C)  and  P(T|R ∧ C^c) > P(T^c|R ∧ C^c),

we easily get

P(T|R) = P(C|R) P(T|R ∧ C) + P(C^c|R) P(T|R ∧ C^c) >
       > P(C|R) P(T^c|R ∧ C) + P(C^c|R) P(T^c|R ∧ C^c) = P(T^c|R),

that is, formula (18.8).
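For concreteness, here is a classical numerical instance of the paradox (the well-known kidney-stone counts, which are not in the text and serve only as an illustration):

```python
# Numeric instance of Simpson's paradox (classical kidney-stone data,
# used purely as an illustration; M = "small stones").
#                      recovered, total
data = {('T', 'M'): (234, 270), ('Tc', 'M'): (81, 87),
        ('T', 'Mc'): (55, 80),  ('Tc', 'Mc'): (192, 263)}

def p_R(t, m=None):
    cells = [(r, n) for (tt, mm), (r, n) in data.items()
             if tt == t and (m is None or mm == m)]
    rec, tot = map(sum, zip(*cells))
    return rec / tot

print(p_R('T', 'M')  < p_R('Tc', 'M'))    # True: 0.867 < 0.931
print(p_R('T', 'Mc') < p_R('Tc', 'Mc'))   # True: 0.688 < 0.730
print(p_R('T')       > p_R('Tc'))         # True: 0.826 > 0.780 (reversal)

# By contrast, P(T|R) is a convex combination of P(T|R^C) and P(T|R^Cc),
# so subgroup inequalities in *that* direction can never reverse: this is
# exactly the derivation of (18.8) above.
```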
18.9 Belief functions
Finally, we discuss a classical example that is claimed (by Shafer, see [117]) not to be solvable without resorting to belief functions. We show instead that it is possible to find a simple probabilistic solution by means of conditional lower and upper probabilities (for the sake of brevity, we will deal only with lower probability). We start by recalling only the main definitions concerning belief functions and Dempster's rule of combination, making use (as much as possible) of our terminology.

A Dempster space D is a four-tuple D = {S, T, Γ, μ}, where S and T are two different sets of atoms (i.e., two different finite partitions of Ω) and to each element s ∈ S there corresponds an element Γ(s) belonging to the algebra A generated by the elements of T; moreover, μ is a probability distribution on S such that μ(S_0) > 0, where S_0 is the set of regular points s ∈ S, i.e. those such that Γ(s) ≠ ∅ (while an element s ∈ S is called singular if Γ(s) = ∅). For the sake of simplicity, assume that S_0 = S; otherwise, the regularization of μ is defined as
μ_0(s) = μ(s) / Σ_{s∈S_0} μ(s).

Starting from D, a function m : A → [0, 1] is a basic probability assignment if

m(A) = μ{s : Γ(s) = A} = μ(Γ^{-1}(A)),  if A ≠ ∅,
m(A) = 0,  if A = ∅.
Notice that this function is not a probability: for example, it is not monotone with respect to implication between events ("inclusion", in terms of the corresponding sets); in fact, since different elements of A cannot be images (through Γ) of the same element of S, A ⊆ B does not necessarily imply m(A) ≤ m(B). Nevertheless, if the elements of the algebra A are looked on as "points" of a new "space", then m is a probability on A, since m(A) ≥ 0 for every A ∈ A, and

Σ_{A∈A} m(A) = 1.

Then a belief function is defined as

Bel(A) = Σ_{B⊆A} m(B),   (*)
and this function turns out to be n-monotone for every n ∈ ℕ (see the last Section of Chapter 15), satisfying Bel(∅) = 0 and Bel(Ω) = 1. In particular, the function Bel reduces to a probability if and only if the function m is different from 0 only on the atoms of A (that is, on the atoms of T).

Consider now two Dempster spaces D_1 and D_2 relative to the same set T (and so to the same algebra A). The purpose is to find a "common framework" for D_1 and D_2, so the function Γ on the product space S = S_1 × S_2 (with range in A) is defined by putting

Γ(s_1, s_2) = Γ_1(s_1) ∧ Γ_2(s_2).
Then the following condition concerning the probability distributions on S_1 and S_2 is assumed:

μ(s_1, s_2) = μ_1(s_1) μ_2(s_2),

that is, stochastic independence (in the classical sense) of any pair of elements s_1 ∈ S_1 and s_2 ∈ S_2. In conclusion, we get the space

D_1 × D_2 = {S = S_1 × S_2, T, Γ = Γ_1 ∧ Γ_2, μ = μ_1 · μ_2},
but there is no guarantee that Γ_1(s_1) ∧ Γ_2(s_2) ≠ ∅ for some pair (s_1, s_2) ∈ S. This requires the introduction of the regularization of the space D = D_1 × D_2, called the Dempster sum D_1 ⊕ D_2, where the measure μ_0 is defined (as above) by

μ_0(s) = μ(s) / Σ_{s∈S_0} μ(s).

The corresponding basic probability assignment m = m_1 ⊕ m_2 is given, for A ∈ A, by

m(A) = [Σ_{B∧C=A} m_1(B) m_2(C)] / [Σ_{B∧C≠∅} m_1(B) m_2(C)],  with B, C ∈ A,  if A ≠ ∅,
m(A) = 0,  if A = ∅.
Finally, the function Bel relative to D can be deduced from the function m = m_1 ⊕ m_2 in the same way as done above in (*).

Example 38 - In [117] the following example is considered: "Is Fred, who is about to speak to me, going to speak truthfully, or is
he, as he sometimes does, going to speak carelessly, saying whatever comes into his mind?". Shafer denotes "truthful" and "careless" as the possible answers to the above question: since he knows from previous experience that Fred's announcements are truthful reports on what he knows 80% of the time and careless statements the other 20% of the time, he writes

P(truthful) = 0.8,  P(careless) = 0.2.   (18.9)
If we introduce the event E = the streets outside are slippery, and Fred announces that E is true (let us denote by A the latter event, i.e. Fred's announcement), the usual belief function argument gives

Bel(E) = 0.8,  Bel(E^c) = 0.   (18.10)

In fact T is - in this example - {E, E^c}, while

S_1 = {truthful, careless},

μ_1 is the P given by (18.9), and moreover

Γ_1(truthful) = E,  Γ_1(careless) = E ∨ E^c = Ω.

It follows that m_1(∅) = 0, m_1(E) = μ_1(truthful) = 0.8, m_1(E^c) = 0, m_1(Ω) = μ_1(careless) = 0.2.

Then in [117] the putative merits of (18.10) are discussed with respect to what is called "a Bayesian argument" and to its presumed "inability to model Fred when he is being careless" and "to fit him into a chance picture at all". Subsequently, another piece of evidence about whether the streets are slippery is considered, that is, the event

T = a thermometer shows a temperature of 31°F.

It is known that streets are not slippery at this temperature, and there is a 99% chance that the thermometer is working properly;
moreover, Fred's behavior is independent of whether it is working properly or not. In this case we have S_2 = {working, not working}, with μ_2(working) = 0.99 and μ_2(not working) = 0.01; moreover

Γ_2(working) = E^c,  Γ_2(not working) = E ∨ E^c = Ω.

It follows that m_2(Ω) = μ_2(not working) = 0.01 and m_2(E^c) = μ_2(working) = 0.99. Then a belief function is obtained through the procedure of "combination of independent items of evidence" (Dempster's sum), getting a result which should reflect the fact that more trust is put in the thermometer than in Fred, i.e.

Bel(E) = 0.0384 ≅ 0.04,  Bel(E^c) = 0.9519 ≅ 0.95.   (18.11)
In fact

Γ(truthful ∧ working) = ∅,  Γ(truthful ∧ not working) = E,
Γ(careless ∧ working) = E^c,  Γ(careless ∧ not working) = E ∨ E^c = Ω,

so that μ(E) = 0.008, μ(E^c) = 0.198, μ(Ω) = 0.002, and

Bel(E) = 0.008 / (0.008 + 0.198 + 0.002) = 0.008/0.208 = 0.0384,
Bel(E^c) = 0.198/0.208 = 0.9519.
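The same numbers can be reproduced by a direct implementation of Dempster's rule; the encoding of focal elements as sets is our own.

```python
# Sketch of the Dempster-sum computation just carried out; focal elements
# are coded as frozensets over {E, Ec}, with Omega = {E, Ec}.
from itertools import product

E, Ec = frozenset({'E'}), frozenset({'Ec'})
Omega = E | Ec

m1 = {E: 0.8, Omega: 0.2}            # Fred: truthful / careless
m2 = {Ec: 0.99, Omega: 0.01}         # thermometer: working / not working

def dempster(m1, m2):
    raw = {}
    for (B, b), (C, c) in product(m1.items(), m2.items()):
        A = B & C
        if A:                        # discard empty intersections
            raw[A] = raw.get(A, 0.0) + b * c
    k = sum(raw.values())            # normalizing constant (here 0.208)
    return {A: v / k for A, v in raw.items()}

m = dempster(m1, m2)
bel = lambda X: sum(v for A, v in m.items() if A <= X)
print(round(bel(E), 4), round(bel(Ec), 4))   # 0.0385 and 0.9519
```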
Finally, other computations follow concerning the case of so-called "dependent evidence".

Our probabilistic solution of the above example is very simple, and it fits and encompasses the solution obtained via the belief function approach. First of all, we challenge the possibility of defining the probabilities in (18.9), since "truthful" and "careless" cannot be considered events: their truth or falsity cannot be directly verified, while we can instead ascertain, recalling that A = Fred announces that the streets outside are slippery, whether the conditional events
E|A and E^c|A are true or false (assuming A); moreover, the equalities in (18.9) must be replaced by inequalities, since E may be true also in the case that Fred's announcement is a careless statement. So we have

P(E|A) ≥ 0.8,  P(E^c|A) ≤ 0.2.   (18.12)
The belief values (18.10) can be seen, referring to the conditional events considered in (18.12), as lower and upper conditional probability assessments; on the other hand, as far as the belief values (18.11) are concerned, in a probabilistic context we are actually interested in coherently assessing, for example, the lower conditional probabilities of E|A ∧ T and E^c|A ∧ T, which should be consistent with the lower conditional probabilities assigned to E|A, E^c|A, E|T, and E^c|T. Notice that, since there is a 99% chance that the thermometer is working properly, we have P(E|T) ≤ 1/100. Actually, by a simple application of the general theorem concerning lower conditional probabilities (Theorem 9, Chapter 15), we can prove that

P(E|A) = 0.8,  P(E^c|A) = 0,  P(E|T) = 0,  P(E^c|T) = 0.99,
P(E|A ∧ T) = 0.04,  P(E^c|A ∧ T) = 0.95
is a coherent lower conditional probability assessment. We need to write down six systems, one for each conditional event; the unknowns are the probabilities of the atoms (contained in A ∨ T)

A_1 = E ∧ T ∧ A^c,  A_2 = E ∧ T ∧ A,  A_3 = E ∧ T^c ∧ A,
A_4 = E^c ∧ T ∧ A^c,  A_5 = E^c ∧ T ∧ A,  A_6 = E^c ∧ T^c ∧ A.
Consider the system (S_1) referring to the conditional event E|A:
CHAPTER 18
212
x_2 + x_3 = (8/10)(x_2 + x_3 + x_5 + x_6)
x_5 + x_6 ≥ 0 · (x_2 + x_3 + x_5 + x_6)
x_1 + x_2 ≥ 0 · (x_1 + x_2 + x_4 + x_5)
x_4 + x_5 ≥ (99/100)(x_1 + x_2 + x_4 + x_5)
x_2 ≥ (4/100)(x_2 + x_5)
x_5 ≥ (95/100)(x_2 + x_5)
x_1 + x_2 + x_3 + x_4 + x_5 + x_6 = 1
x_i ≥ 0;

a solution is

x_3 = 2/3,  x_4 = x_6 = 1/6  (and x_1 = x_2 = x_5 = 0);

so, since all the atoms contained in the conditioning event A ∧ T have zero probability, we need to consider, for the conditional event E|A, the second system (S'_1)

x'_2 ≥ (4/100)(x'_2 + x'_5)
x'_5 ≥ (95/100)(x'_2 + x'_5)
x'_2 + x'_5 = 1
x'_i ≥ 0,

which has, e.g., the solution x'_2 = 1/25, x'_5 = 24/25.
Solutions of the systems (S_i), i = 2, ..., 6, relative to the other five conditional events are the following: one of (S_2), corresponding to E^c|A, is, e.g.,

y_1 = y_2 = y_5 = y_6 = 0,  y_3 = y_4 = 1/2

(and the second system (S'_2) has the solution y'_2 = 1/25, y'_5 = 24/25). A solution of the system (S_3) corresponding to E|T is, e.g.,

z_1 = z_2 = z_5 = 0,  z_3 = 1/2,  z_4 = 3/8,  z_6 = 1/8

(and a solution z'_2, z'_5 of the second system (S'_3) is again the same as above); one corresponding to E^c|T is, e.g., ... (and the second system has the solution t'_2 = 1/25, t'_5 = 24/25); one corresponding to E|A ∧ T is, e.g., ... (and the second system has again the solution u'_2 = 1/25, u'_5 = 24/25). Finally, a solution of the system (S_6) corresponding to E^c|A ∧ T is, e.g.,

v_3 = v_4 = 1/2

(while the second system has the solution v'_2 = 1/20, v'_5 = 19/20). For the sake of brevity, we did not write down explicitly all the relevant systems (except the one corresponding to the first conditional event). In conclusion, not only do the chosen values of P(E|A), P(E^c|A), P(E|T), P(E^c|T), P(E|A ∧ T), P(E^c|A ∧ T) constitute a coherent lower conditional probability assessment but, since the above systems clearly have many other solutions, we might find other coherent evaluations.
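As a mechanical sanity check, the exhibited first- and second-layer solutions for (S_1) can be verified against all six constraints (a sketch, using the atom indexing of the text):

```python
# Verify the first-layer solution of system (S1) and the second-layer
# solution restricted to the atoms of A ^ T.
x = [0, 0, 2/3, 1/6, 0, 1/6]        # x3 = 2/3, x4 = x6 = 1/6
x1, x2, x3, x4, x5, x6 = x

A_mass = x2 + x3 + x5 + x6           # atoms contained in A
T_mass = x1 + x2 + x4 + x5           # atoms contained in T
AT_mass = x2 + x5                    # atoms contained in A ^ T

assert abs((x2 + x3) - 0.8 * A_mass) < 1e-12   # P(E|A)    = 0.8 (equality)
assert x5 + x6 >= 0.0 * A_mass                 # P(Ec|A)   >= 0
assert x1 + x2 >= 0.0 * T_mass                 # P(E|T)    >= 0
assert x4 + x5 >= 0.99 * T_mass                # P(Ec|T)   >= 0.99
assert x2 >= 0.04 * AT_mass                    # P(E|A^T)  >= 0.04
assert x5 >= 0.95 * AT_mass                    # P(Ec|A^T) >= 0.95
assert abs(sum(x) - 1) < 1e-12

# A ^ T received zero mass, so the second-layer system over x2', x5'
# takes over; its solution x2' = 1/25, x5' = 24/25 satisfies both bounds:
assert 1/25 >= 0.04 and 24/25 >= 0.95
print("coherent at both layers")
```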
Chapter 19

Fuzzy Sets and Possibility as Coherent Conditional Probabilities

Our aim is to expound an interpretation (introduced in [30] and [34]) of fuzzy set theory (both from a semantic and a syntactic point of view) in terms of conditional events and coherent conditional probabilities: a complete account is in [38]. Over the past years, a large number of papers have been devoted to supporting either the thesis that probability theory is all that is required for reasoning about uncertainty, or the negative view maintaining that probability is inadequate to capture what is usually treated by fuzzy theory. In this Chapter we emphasize the role of coherent conditional probabilities in getting rid of many controversial aspects. Moreover, we introduce the operations between fuzzy subsets, looked on as corresponding operations between conditional events endowed with the relevant conditional probability. Finally, we show how the concept of possibility function naturally arises as a coherent conditional probability.
19.1 Fuzzy sets: main definitions
The concept of fuzzy subset goes back to the pioneering work of Zadeh [128]. On this subject there is a vast literature (for an elementary exposition, see [96]; another relevant reference is [86]); so we recall here only the main definitions.

Given a set (universe) Y, any of its ("crisp") subsets A is singled out either by a "well-defined" property or by its characteristic function c_A : Y → {0, 1}, with c_A(x) = 1 for x ∈ A and c_A(x) = 0 for x ∉ A. A fuzzy subset B of Y is defined through a membership function

μ_B : Y → [0, 1],

that is, a function that gives any element x ∈ Y a "measure of its belonging" to B: in particular, the values μ_B(x) = 0 and μ_B(x) = 1 correspond, respectively, to x ∉ B and x ∈ B in the sense of crisp sets. So the role of a membership function is that of interpreting (not uniquely) a property not representable by a (Boolean) proposition.

Example 39 - Let Y = ℝ, and consider the two statements A = "x is greater than or equal to 3" and B = "x is about 10". Clearly, A is a crisp set, singled out by its characteristic function, while the fuzzy subset B can be represented by many different membership functions, according to the different subjective numerical interpretations of the property B.

Remark 25 - Even if it is true - from a syntactic point of view - that membership functions, in a sense, generalize characteristic functions, allowing infinitely many values in [0, 1] and not only the two-valued range {0, 1}, nevertheless there is a strong qualitative jump from an "objective world" to another one in which
a semantic (and "subjective"!) component plays a fundamental role.

Now, before introducing the operations between fuzzy subsets, let us recall that for crisp sets the operations ∪ (union), ∩ (intersection), (·)^c (complement) can be defined through characteristic functions by putting

c_{A∪B}(x) = max{c_A(x), c_B(x)},  c_{A∩B}(x) = min{c_A(x), c_B(x)},  c_{A^c}(x) = 1 − c_A(x),

so it appears quite natural to define similarly the analogous operations for fuzzy subsets, using membership functions in place of characteristic functions, that is

μ_{A∪B}(x) = max{μ_A(x), μ_B(x)},  μ_{A∩B}(x) = min{μ_A(x), μ_B(x)},  μ_{A^c}(x) = 1 − μ_A(x).

The first and most significant difference with respect to crisp sets is that the previous definitions entail

μ_{A∪A^c}(x) = max{μ_A(x), 1 − μ_A(x)},

which is in general different from 1, while for characteristic functions the same operation (obviously) gives the function identically equal to 1. A further generalization for defining composition rules between μ_A and μ_B, in order to get μ_{A∪B} and μ_{A∩B}, is that of introducing suitable binary operations ("triangular" norms, in short T-norms) from [0, 1]² to [0, 1], endowed with significant properties similar to those of max and min.
while for characteristic functions the same operation gives (obviously) the function identically equal to 1. A further generalization for defining composition rules between J..lA and J..lB in order to get J..lAuB and J..lAnB is that of introducing suitable binary operations ("triangular" norms, in short T -norms) from [0, 1] 2 to [0, 1] endowed with similar significant properties as those of max and min. Definition 16 - A T-norm is a function T : [0, 1]2 ~ [0, 1] satisfying the following properties: (1} aTb=bTa (symmetric) (2} (aT b) T c =a T(bT c) (associative) (3) a~ x, b ~ y =} aTb ~ xTy (monotony) (4) aT 1 = a {1 is neutral element)
Examples of T-norms widely used in the relevant literature are the following:

T_M (the minimum): x T_M y = min{x, y},
T_P (the product): x T_P y = xy,
T_L (Lukasiewicz): x T_L y = max{x + y − 1, 0},
T_0 (the weakest): x T_0 y = min{x, y} if max{x, y} = 1, and 0 otherwise.

The T-norm T_0 is the minimum and the T-norm T_M is the maximum in the pointwise ordering (even if the class of T-norms is not linearly ordered by this pointwise relation, since some of them are not comparable). The notion of T-norm plays the role of the intersection, by defining

μ_{A∩B} = μ_A T μ_B.
The role of the union is played by the concept of T-conorm.

Definition 17 - A T-conorm is a function S : [0, 1]² → [0, 1] satisfying properties (1), (2), (3) of Definition 16, with S in place of T, and
(4') a S 0 = a (0 is the neutral element).

Then we define

μ_{A∪B} = μ_A S μ_B.

Examples of T-conorms are

S_M (the maximum): x S_M y = max{x, y},
S_P (probabilistic sum): x S_P y = x + y − xy,
S_L (Lukasiewicz): x S_L y = min{x + y, 1}.
We recall now a generalization of the concept of complement (or negation), given by the following
Definition 18 - A strong negation is a map η : [0, 1] → [0, 1] satisfying the following properties:
(1) η(0) = 1, η(1) = 0,
(2) η is decreasing,
(3) η(η(x)) = x.

Finally, we recall the notion of dual T-norm and T-conorm: T and S are called dual when

x T y = η(η(x) S η(y)),

or vice versa (exchanging T and S). Taking η(x) = 1 − x, the pairs {T_X, S_X}, with X equal, respectively, to M, P, L, are pairs of dual T-norm and T-conorm.
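A compact numerical check of these definitions; the encoding of the three basic pairs is our own.

```python
# The basic T-norms / T-conorms and a grid check of the duality
# x T y = eta(eta(x) S eta(y)) with eta(x) = 1 - x.
import itertools

T = {'M': min,
     'P': lambda x, y: x * y,
     'L': lambda x, y: max(x + y - 1.0, 0.0)}
S = {'M': max,
     'P': lambda x, y: x + y - x * y,
     'L': lambda x, y: min(x + y, 1.0)}
eta = lambda x: 1.0 - x

grid = [i / 10 for i in range(11)]
for name in 'MPL':
    assert all(abs(T[name](x, y) - eta(S[name](eta(x), eta(y)))) < 1e-12
               for x, y in itertools.product(grid, grid))
print("T_M/S_M, T_P/S_P, T_L/S_L are dual pairs under eta(x) = 1 - x")
```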
19.2 Fuzziness and uncertainty

In the literature on fuzzy sets, the suitability of interpreting a statement such as E = "Mary is young" as an event, and the values of the membership function corresponding to the relevant fuzzy set as probabilities, is usually challenged. In fact E is a vague statement, and vagueness is looked on as referring to the intended meaning (i.e., a sort of "linguistic" uncertainty) and not as an uncertainty about facts. The arguments usually brought forward to distinguish grades of membership from probabilities often refer to a restrictive interpretation of event and probability, while the probabilistic approach adopted in this book differs radically from the usual theory based on a measure-theoretic framework, which assumes that a unique probability measure is defined on an algebra (or σ-algebra) of events constituting the so-called sample space Ω. It has been widely discussed that directing attention to events as subsets of the sample space (and to algebras of events) may be unsuitable for many real world situations, which make instead
very significant both giving events a more general meaning and not assuming any specific structure for the set where probability is assessed. Another usual argument against any kind of probabilistic interpretation of fuzzy theory is based on the (putative) non-compositional character of probability. Apart from the fact that in Chapter 9 (with a "relaxed" interpretation of the concept of truth-functional belief) we challenged this view (at least with respect to our approach, based on coherent probability), we underline anyway that our definition of membership function in probabilistic terms will refer to a suitable conditional probability, looked on as a function of the conditioning event, and the relevant operations (which will correspond in a very natural way to the basic T-norms and T-conorms, bound by coherence) come out to be truth-functional in the strict sense. In fact, in our view an essential role is played by conditioning, a concept that is not always sufficiently and properly emphasized, even in those articles (we mention here just Cheeseman [20], Giles [68], Hisdal [83], Dubois, Moral and Prade [57]) based on ideas somehow similar to those expressed here (they refer to terms such as label, context, information, state of mind, likelihood, ...): in fact a clear and precise mathematical frame is often lacking. On the other hand, our approach cannot be compared to those that deal with fuzzy reasoning versus traditional probabilistic reasoning without referring to conditioning: in fact the very concept of conditional probability is deeper than the usual restrictive view emphasizing P(E|H) only as a probability for each given H (looked on as a given fact). Regarding instead also the conditioning event H as a "variable", we get something which is not just a probability: see the (often mentioned) discussion in Chapter 11. We can refer to an event H corresponding to the "crisp part" of a fuzzy property; in this way a conditional event E|H can be seen also as a three-valued logical entity, which reduces to a "crisp"
event when H is true. So the "fuzziness" is driven by suitably interpreting the situation corresponding to the case when it is not known whether H is true. The role of a conditioning event is that of setting out clearly (and in a rigorous way) the pragmatic view that "everything" depends on the relevant state of information, so overcoming loose concepts such as "label", "context", etc.

Let us go back to the intuitive idea of fuzzy subset: where does it come from and what is its "operational" meaning? We start by recalling two examples; the first is a classical one and has already been discussed (mainly from a semantic point of view) in [115] and [30], while the next Example 41 has been the starting point (see [114]) for the interpretation of fuzzy sets as presented in this book.

Example 40 - Is Mary young? From a pragmatic point of view, it is natural to think that You have some information about possible values of Mary's age, which allows You to refer to a suitable membership function of the fuzzy subset of "young people" (or, equivalently, of "young ages"). For example, for You the membership function may be put equal to 1 for values of the age less than 25, while it is put equal to 0 for values greater than 40; then it is taken as decreasing from 1 to 0 in the interval from 25 to 40. One of the merits of the fuzzy approach is that, given the range of values from 0 to 1, there is no restriction on the assignment of a membership function, in contrast to probability, which obeys certain rules such as, for example, the axiom of additivity: it follows that, when You assign a subjective probability of (say) 0.2 to the statement that Mary's age is between 35 and 36, You inescapably must assign a degree of belief of 0.8 to the contrary, and You may not have for the latter fact any justification apart from the consistency argument represented by the additivity rule. In our probabilistic framework the way out is indeed (through conditioning) very simple. Notice that the above choice of the membership function implies that, for You, women whose age is less than
25 are "young", while those with an age greater than 40 are not. So the real problem is that You are uncertain about whether those women having an age between 25 and 40 are "young" or not: the interest is then in fact directed toward conditional events such as E|A_x, with

E = You claim that Mary is young,
A_x = the age of Mary is x,

where x ranges over the interval from 25 to 40. It follows that You may assign a subjective probability P(E|A_x) equal to 0.2 without any need to assign a degree of belief of 0.8 to the event E under the assumption A_x^c (i.e., the age of Mary is not x), since an additivity rule with respect to the conditioning events does not hold. In other words, it seems sensible to identify the values of the membership function with suitable conditional probabilities: in particular, putting

H_0 = Mary's age is greater than 40,
H_1 = Mary's age is less than 25,

we may assume that E and H_0 are incompatible and that H_1 implies E, so that, by the properties of a conditional probability, P(E|H_0) = 0 and P(E|H_1) = 1. Notice that the conditional probability P(E|A_x) has been directly introduced as a function on the set of conditional events (without assuming any given algebraic structure), bound to satisfy only the requirement of coherence, so that it can be assessed and makes sense for any pair of events. Now, given the event E, the value P(E|A_x) is a function μ(x) of x, which can be taken as a membership function. In the usual (Kolmogorovian) approach to conditional probability, the introduction of P(E|A_x) would require the consideration (and the assessment) of P(E ∧ A_x) and P(A_x) (assuming positivity of the latter) - a very difficult task!
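Here is a sketch of Example 40's membership function, read as the coherent conditional probability P(E|A_x); the piecewise-linear decrease between 25 and 40 is one admissible choice among many, not the text's prescription.

```python
# Membership of "young" as P(E|Ax), following Example 40.
def mu_young(age_x: float) -> float:
    if age_x < 25:
        return 1.0                   # H1 implies E: P(E|Ax) = 1
    if age_x > 40:
        return 0.0                   # E and H0 incompatible: P(E|Ax) = 0
    return (40.0 - age_x) / 15.0     # any [0,1] value is coherent here

for age in (20, 25, 30, 35, 36, 40, 45):
    print(age, round(mu_young(age), 3))

# No additivity across conditioning events is required: the value of
# P(E|A35) puts no constraint whatsoever on P(E|A35^c) or on P(E|A36).
```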
Remark 26 - Putting H = H_0 ∨ H_1, the conditional probability P(E|H^c) is a measure of how much You are willing to claim or not that Mary is young, if the only fact You know is that her age is between 25 and 40. And this willingness is "independent" of Your beliefs corresponding to the single ages x: in fact, even if the falsity of H corresponds to the truth of {∨_x A_x : x = 25, ..., 40}, nevertheless there is no additivity requirement, since conditional probability (as already noticed a few lines above) is not additive with respect to the disjunction of conditioning events. These remarks will pave the way for the introduction in our context (in Section 4 of this Chapter) of possibility functions.
Example 41 - This example, taken from [97], concerns the long-term safety assessment of a radioactive waste repository in salt. After the disposal of waste has been finished, "almost impermeable" dams are built at strategic positions within an underground gallery system, in order to prevent the transport of fluid possibly intruding at later times. The problem is to predict the future development of the permeability of these dams for time periods of hundreds or thousands of years. Available information about possible values of the dams' permeability is used to construct a subjective membership function of a fuzzy set (of "almost impermeable" dams): for values of the permeability between 10^{-21} and 10^{-17} the membership function is put equal to 1, while it is put equal to 0 for values greater than 10^{-15}; finally, the membership function is decreasing from 1 to 0 in the interval from 10^{-17} to 10^{-15}. The motivation given by the authors rests on the usual argument that, given the range of values from 0 to 1, there is no restriction on the assignment of a membership function, in contrast to probability: in fact, as soon as You assign a probability of (say) 0.4 to the statement that in the future the permeability of the dam will be between 10^{-17} and 10^{-16}, You must assign a degree of belief of 0.6 to the contrary.
The way out from this putative difficulty is (again, as in Example 40) very simple, since the above choice of the membership function implies that dams whose permeability is less than 10^{-17} are "almost impermeable", while those with a permeability greater than 10^{-15} are not. So the real problem is that You are uncertain about whether those dams having a permeability between 10^{-17} and 10^{-15} are "almost impermeable" or not: the interest is then in fact directed toward the conditional event E|H, with

E = You claim that the dam is "almost impermeable",
H = the permeability of the dam is between 10^{-17} and 10^{-15}.

It follows that You may assign a subjective probability P(E|H) equal to (say) 0.25 without any need to assign a degree of belief of 0.75 to the event E under the assumption H^c (i.e., the permeability of the dam is not between 10^{-17} and 10^{-15}).

In [114] it is shown that a second argument brought forward in [97] to contrast probabilistic methods with the fuzzy approach can also be overcome: it concerns the merits of the rules according to which the possibility of an object belonging to two fuzzy sets is obtained as the minimum of the possibilities that it belongs to either fuzzy set. The issue is the computation of the probability that the value of a safety parameter belongs to a given (dangerous) interval for all four components (grouped according to the similarity of their physico-chemical conditions) of the repository section. For each component this probability is computed as equal to 1/5, and the conclusion is that "in terms of a safety assessment, the fuzzy calculus is more conservative", since in the fuzzy calculus (interpreting those values as values of a membership function) the possibility of a value of the parameter in the given interval for all components is still 1/5 (which is the minimum taken over numbers all equal to 1/5), while the same event is given (under the assumption of independence) the small probability (1/5)^4. Anyway, we do not report here the way out to this problem
suggested in [114], since its general (and rigorous) solution is a trivial consequence of the formal definitions and operations between fuzzy subsets (also in the form we are going to define in the next Section).
19.3 Fuzzy subsets and coherent conditional probability
Before undertaking the task of introducing (from the point of view of our framework) the definitions concerning fuzzy set theory, we need to deepen some further aspects of coherent conditional probabilities. First of all, among the peculiarities (which entail a large flexibility in the management of any kind of uncertainty) of the concept of coherent conditional probability versus the usual one, we recall the interpretation of the extreme values 0 and 1 of P(A|B) for situations which are different, respectively, from the trivial ones A ∧ B = ∅ and B ⊆ A; moreover, we underline the "natural" way of looking at the conditional event A|B as "a whole", and not separately at the two events A and B. Nevertheless, notice the following corollary to Theorem 5 (Chapter 11).
Theorem 20 - Let C be a family of conditional events {E|H_i}_{i∈I}, where card(I) is arbitrary and the events H_i's are a partition of Ω, and let P(·|·) be a coherent conditional probability such that P(E|H_i) ∈ {0, 1}. Then the following two statements are equivalent:
(i) P(·|·) is the only coherent assessment on C;
(ii) H_i ∧ E = ∅ for every H_i ∈ H_0, and H_i ⊆ E for every H_i ∈ H_1, where H_r = {H_i : P(E|H_i) = r}, r = 0, 1.

We are now ready to re-read fuzzy set theory by resorting to our framework.
Let X be a (not necessarily numerical) random quantity with range C_X, and, for any x ∈ C_X, let A_x be the event {X = x}. The family {A_x}_{x∈C_X} is obviously a partition of the certain event Ω. If φ is any property related to the random quantity X, consider the event

E_φ = You claim φ

and a coherent conditional probability P(E_φ|A_x), looked on as a real function μ_{E_φ}(x) defined on C_X. Since the events A_x are incompatible, then (by Theorem 5) every μ_{E_φ}(x) with values in [0, 1] is a coherent conditional probability.
Remark 27 - Given x_1, x_2 ∈ C_X and the corresponding conditional probabilities μ_{E_φ}(x_1) and μ_{E_φ}(x_2), a coherent extension of P to the conditional event E_φ|(A_{x_1} ∨ A_{x_2}) is not necessarily additive with respect to the conditioning events.

Definition 19 - Given a random quantity X with range C_X and a related property φ, a fuzzy subset E*_φ of C_X is the pair

E*_φ = {E_φ, μ_{E_φ}}.
So a coherent conditional probability P(E_φ|A_x) is a measure of how much You, given the event A_x = {X = x}, are willing to claim or not the property φ, and it plays the role of the membership function of the fuzzy subset E*_φ. Notice also that (as already remarked above) the significance of the conditional event E_φ|A_x is reinforced by looking on it as "a whole", avoiding a separate consideration of the two propositions E_φ and A_x.
Remark 28 - It is important to stress that our interpretation of the membership function as a conditional probability P(E_φ|A_x) has little to do both with the "frequentist's temptation" discussed in the Introduction of the book [80] by Hajek, and with the usual distinction made between uncertainty (which could be reduced by "learning") and vagueness (which is, in a sense, unrevisable: there is nothing to be learned about whether, e.g., a 35-year-old person is old or not). In fact, when we put, for example, P(E_φ|A_x) = 0.70, this means that the degree of membership of the element x in the fuzzy subset defined by the property φ is identified with the conditional probability (relative to the conditional event E_φ|A_x seen as a whole) that You claim φ. Then, concerning the first, "frequentist", issue, we are not willing, e.g., to ask n people (knowing her age) "Is Mary young?" and to allow them to answer "yes" or "no", imagining that 70% of them say "yes"; in fact "Mary is young" is not an event, while "You claim that Mary is young" (notice that its negation is not "You claim that Mary is not young") is an event. Rather, we could ask: "Knowing that Mary's age is x, how much are You willing to claim that Mary is young?". As far as the second issue ("learning") is concerned, notice that in probability theory "learning" is usually ruled by conditional probability, so "learning" about the age x would require conditioning with respect to x itself: in our definition, instead, x (that is, A_x) is the conditioning event. Moreover, the assignment 0.70 is given - so to say - once and for all, so that the "updating" (of E_φ!) has already been done at the very moment when (knowing x) the conditional event E_φ|A_x has been taken into account together with the relevant evaluation P(E_φ|A_x); then it makes no sense, with respect to the event E_φ, to learn "more" (i.e., to consider a further conditioning).
Example 42 - Let X = gain in a financial transaction, with range ℝ, and let φ = very convenient. The corresponding fuzzy subset of C_X = ℝ is the set of those gains (in the financial transaction) that You judge to be very convenient, according to the membership function P(E_φ|A_x), with x ∈ ℝ, assigned by You.

Definition 20 - A fuzzy subset E*_φ is a crisp set when the only coherent assessment μ_{E_φ}(x) = P(E_φ|A_x) has range {0, 1}.

By Theorem 20, the following is obvious.
Proposition 5 - A fuzzy subset E*_φ is a crisp set when the property φ is such that, for every x ∈ C_X, either E_φ ∧ A_x = ∅ or A_x ⊆ E_φ.

Example 43 - Consider the same X of the previous example and the property ψ = between 10 and 30 millions. In this case, clearly, P(E_ψ|A_x) (necessarily) assumes only values in {0, 1}, and in fact the subset singled out by ψ is a crisp one.

Given two fuzzy subsets E*_φ, E*_ψ, corresponding to the random quantities X and Y (possibly X = Y), assume that, for every x ∈ C_X and y ∈ C_Y, both the following equalities hold:

P(E_φ|A_x ∧ A_y) = P(E_φ|A_x),   (19.1)

P(E_ψ|A_x ∧ A_y) = P(E_ψ|A_y),   (19.2)

with A_y = {Y = y}. These conditions (which are trivially satisfied for X = Y, which entails x = y: a conditioning event cannot be equal to ∅) are quite natural, since they require, for X ≠ Y, that an event E_φ related to a fuzzy subset E*_φ of C_X is stochastically independent (conditionally
on any element of the partition {A_x}_{x∈C_X}) of every element of the partition {A_y}_{y∈C_Y} relative to a fuzzy subset E*_ψ of C_Y. We now introduce a general definition of the binary operations of union and intersection, and that of complementation.
Definition 21 - Given two fuzzy subsets (respectively, of C_X and C_Y) E*_φ and E*_ψ, define

E*_φ ∪ E*_ψ = {E_φ ∨ E_ψ, μ_{E_φ∨E_ψ}},
E*_φ ∩ E*_ψ = {E_φ ∧ E_ψ, μ_{E_φ∧E_ψ}},
(E*_φ)' = {E_¬φ, μ_{E_¬φ}},

where the functions

μ_{E_φ∨E_ψ}(x, y) = P(E_φ ∨ E_ψ|A_x ∧ A_y),
μ_{E_φ∧E_ψ}(x, y) = P(E_φ ∧ E_ψ|A_x ∧ A_y)

have domain C_XY = C_X × C_Y.
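We note, as an aside, that coherence alone already constrains the conjunction: under conditions (19.1)-(19.2), with a = μ_{E_φ}(x) and b = μ_{E_ψ}(y), every coherent value of μ_{E_φ∧E_ψ}(x, y) lies in the classical Frechet-Hoeffding interval, whose endpoints are precisely the Lukasiewicz and minimum T-norms. A small sketch (the numeric examples are ours):

```python
# Coherence bounds for the conjunction: given mu_phi(x) = a and
# mu_psi(y) = b, any coherent P(E_phi ^ E_psi | Ax ^ Ay) lies between
# the Lukasiewicz and minimum T-norms (Frechet-Hoeffding bounds).
def conjunction_bounds(a: float, b: float):
    lower = max(a + b - 1.0, 0.0)    # T_L: Lukasiewicz
    upper = min(a, b)                # T_M: minimum
    return lower, upper

for a, b in [(0.9, 0.8), (0.5, 0.5), (0.3, 0.9)]:
    lo, hi = conjunction_bounds(a, b)
    print(a, b, '->', lo, hi)

# T_P (the product a*b) always falls inside [lo, hi], which is why the
# basic T-norms can all serve as coherent choices for mu_{E_phi ^ E_psi}.
```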
E,'P
~
(E'P)c,
where (E'P)c denotes the contrary of the event E'P (and the equality holds only for a crisp set); for example, the proposition "You claim not young" implies "You do not claim young', but not conversely. Then, while E'P V (E'P)c = Cx, we have instead E'P V E,'P ~ Cx. Therefore, if we consider the union of a fuzzy subset and its complement
E; U (E;)' = {E'P V E,'P, JlE~pvE~ 0}. On the other hand, if the 'Pa 's are possibilities (recall that in this case E9 is the "max", and take 0 as the "min"), a "good" atom (in the same sense as above for probabilities) is an atom A such that 'Pa(A) = 1.
Nevertheless, there are ⊕-decomposable measures that do not satisfy, for some α, the aforementioned requirement of proper inclusion
between the relevant classes of subsets of atoms. To prove this, consider the following simple
Example 48 - Let E be the algebra spanned by the atoms A, B, C and let H^0 = {H_1 = A ∨ B, H_2 = A ∨ C, H_3 = A ∨ B ∨ C}. Consider now the following ⊕-decomposable measure, with ⊕ the Lukasiewicz T-conorm (that is, x ⊕ y = min{x + y, 1}), and let x ⊙ y = max{x + y − 1, 0} be the Lukasiewicz T-norm:

φ_0(A) = 0,  φ_0(B) = 2/3,  φ_0(C) = 1/2,
φ_0(A ∨ B) = 2/3,  φ_0(A ∨ C) = 1/2,  φ_0(B ∨ C) = φ_0(A ∨ B ∨ C) = 1.
It is easy to prove that the above assessment satisfies the properties of a decomposable measure, but it is not possible to construct a class of almost generating measures. In fact, the equations

φ_0(A) = x ⊙ (φ_0(A) ⊕ φ_0(B))   (21.2)

and

φ_0(A) = x ⊙ (φ_0(A) ⊕ φ_0(C))   (21.3)

have an infinite set of solutions, and so there is no atom A* such that for every H_i ⊇ A* the equation (21.1) has a unique solution for every element of the algebra E.
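A grid check of Example 48 (the encoding is our own) confirms both the decomposability of the assessment and the non-uniqueness of the solutions of (21.2):

```python
# Check of Example 48 with the Lukasiewicz pair: the assessment is
# (+)-decomposable, yet equation (21.2) has infinitely many solutions.
o_plus  = lambda x, y: min(x + y, 1.0)        # Lukasiewicz T-conorm
o_times = lambda x, y: max(x + y - 1.0, 0.0)  # Lukasiewicz T-norm

phi = {'A': 0.0, 'B': 2/3, 'C': 0.5}
assert o_plus(phi['A'], phi['B']) == 2/3      # phi(A v B)
assert o_plus(phi['A'], phi['C']) == 0.5      # phi(A v C)
assert o_plus(phi['B'], phi['C']) == 1.0      # phi(B v C)

# phi(A) = x (.) (phi(A) (+) phi(B)): every x <= 1/3 solves it, e.g.:
sols = [x / 100 for x in range(0, 101)
        if o_times(x / 100, o_plus(phi['A'], phi['B'])) == phi['A']]
print(len(sols), "grid solutions of (21.2), e.g.", sols[:4])
```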
Definition 29 - A (⊕, ⊙)-decomposable conditional measure φ defined on C = E × H^0 is reducible if for any H_i ∈ H^0 there exists an H_s ∈ H^0, with H_s ⊂ H_i, such that for any H_j ∈ H^0, with H_s ⊆ H_j ⊂ H_i, the relevant equation has the unique solution x = φ(E_i|H_i).

Remark 39 - It is easy to prove that the above condition is satisfied (for instance) by conditional probability, but also by a measure φ with ⊕ = max and any ⊙, or by a φ with any ⊕ and with ⊙ strictly increasing.
We state now the following important result (for the proof, see [35]), which extends the analogous characterization theorems for conditional probability (Theorem 4, Chapter 11) and for conditional possibility (proved in [13], [14]).
Theorem 25 - Given a finite family C = E × H^0 of conditional events, with E a Boolean algebra, H an additive set, H ⊆ E and H^0 = H \ {∅}, let A = {A_r} denote the set of atoms of E. If φ is a real function defined on C, and ⊕, ⊙ are two operations from φ(C) × φ(C) to ℝ+, the following two statements are equivalent:
(a) φ is a reducible (⊕, ⊙)-decomposable conditional measure on C;
(b) there exists a (unique) class of generating ⊕-decomposable measures such that, for any E_i|H_i ∈ C, there is a unique α such that x = φ(E_i|H_i) is the unique solution of the equation

⊕_{A_r⊆E_i∧H_i} φ_α(A_r) = x ⊙ ⊕_{A_r⊆H_i} φ_α(A_r).   (21.4)

Notice that the class {φ_α} of ⊕-decomposable measures considered under (b) has a unique element only in the case that there exists no conditional event E_i|H_i ∈ C such that equation (21.4) does not admit φ(E_i|H_i) as its unique solution: for instance, for conditional probability this occurs if there is no H_i with P(H_i|Ω) = 0; for conditional possibility (with the usual max-min operations), this occurs if there is no conditional event E|H such that

φ(H|Ω) = φ((E ∧ H)|Ω) < φ(E|H).
21.3 Weakly decomposable measures

The results discussed until now suggest that the operations ⊕, ⊙ on φ(C) × φ(C), involved in the definition of a conditional measure, should satisfy specific conditions only on suitable subsets of the Cartesian product φ(C) × φ(C).
Definition 30 - If E is a Boolean algebra, a function φ : E → [0, 1] is a weakly ⊕-decomposable measure if

φ(Ω) = 1,  φ(∅) = 0,

and there exists an operation ⊕ from φ(E) × φ(E) to ℝ+ such that the following condition holds: for every E_i, E_j ∈ E with E_i ∧ E_j = ∅,

φ(E_i ∨ E_j) = φ(E_i) ⊕ φ(E_j).

From the above condition it is easily seen that the restriction of ⊕ to the following subset of φ(E) × φ(E),

{(φ(E_i), φ(E_j)) : E_i, E_j ∈ E, E_i ∧ E_j = ∅},

is commutative, associative and increasing, and admits 0 as neutral element. Nevertheless, it need not be extensible to a function defined on the whole φ(E) × φ(E) (and so neither on [0, 1]²) satisfying the same properties. To deepen these aspects, see Examples 2 and 3 in [35], which show that there does not exist a strictly increasing associative extension of ⊕, and that even the existence of an increasing associative extension is not guaranteed.

We now introduce what seems to be the most natural concept of conditional measure.
Definition 31 - Given a family C = E × H^0 of conditional events, where E is a Boolean algebra, H an additive set, with H ⊆ E and H^0 = H \ {∅}, a real function φ defined on C is a weakly (⊕, ⊙)-decomposable conditional measure if
(γ1) φ(E|H) = φ(E ∧ H|H), for every E ∈ E and H ∈ H^0;
(γ2) there exists an operation ⊕ : φ(E) × φ(E) → φ(C) whose restriction to the set

Δ = {(φ(E|H), φ(F|H)) : E, F ∈ E, H ∈ H^0, E ∧ F ∧ H = ∅}

is commutative, associative and increasing, admits 0 as neutral element, and is such that, for any given H ∈ H^0, the function φ(·|H) is a weakly ⊕-decomposable measure;
(γ3) there exists an operation ⊙ : φ(E) × φ(E) → φ(C) whose restriction to the set

Γ = {(φ(E|H), φ(A|E ∧ H)) : A ∈ E; E, H, E ∧ H ∈ H^0}

is commutative, associative and increasing, admits 1 as neutral element, and is such that, for every A ∈ E and E, H ∈ H^0 with E ∧ H ≠ ∅,

φ((E ∧ A)|H) = φ(E|H) ⊙ φ(A|(E ∧ H));

(γ4) the operation ⊙ is distributive over ⊕ only for relations of the kind

φ(H|K) ⊙ (φ(E|H ∧ K) ⊕ φ(F|H ∧ K)),

with K, H ∧ K ∈ H^0 and E ∧ F ∧ H ∧ K = ∅.
Remark 40 - It is easily seen that, with respect to the elements of Δ and Γ, the operations ⊕ and ⊙, respectively, are commutative and associative. On the other hand, it is possible to show (see [81]) that ⊙ is not necessarily extensible as an operation defined on (φ(C))² (and so neither on [0, 1]²) satisfying all the usual properties.

Definition 32 - The elements of a class of almost generating measures will be called weak generating measures if distributivity of ⊙ over ⊕ is required only for relations of the kind (x ⊕ y) ⊙ φ_α(H_i), for all x and y that are unique solutions of the equations (21.1) relative, respectively, to E_i and E_j with E_i ∧ E_j ∧ H_i = ∅.

So we are able to extend the characterization theorem (Theorem 25) to weakly decomposable measures:
Theorem 26 - Let C = E × H^0, with E a Boolean algebra, H an additive set, H ⊆ E and H^0 = H \ {∅}, be a finite family of conditional events, and let A = {A_r} denote the set of atoms of E. If φ is a real function defined on C, and ⊕, ⊙ are two operations from φ(C) × φ(C) to ℝ+, then the following two statements are equivalent:
(a) φ is a reducible weakly (⊕, ⊙)-decomposable conditional measure on the family C;
(b) there exists a (unique) class of weak ⊕-decomposable (generating) measures such that, for any E_i|H_i ∈ C, there is a unique α such that x = φ(E_i|H_i) is the unique solution of the equation

⊕_{A_r⊆E_i∧H_i} φ_α(A_r) = x ⊙ ⊕_{A_r⊆H_i} φ_α(A_r).
For the proof, see [35]. Notice that, due to the relaxation of the associative and distributive properties, the class of weakly decomposable conditional measures is quite large. So the requirement of being reducible, in order to characterize them as in Theorem 26, appears all the more essential. In fact, for an arbitrary decomposable conditional measure, the existence of a class {A_α} and of the relevant class {φ_α} generating it in the sense of Definition 28 is not assured. To prove this, consider the following example (which is a suitable extension of Example 48):
Example 49 - Let E and H^0 be defined as in Example 48. Consider now the following conditional decomposable (⊕, ⊙)-measure, where ⊕, ⊙ are the Lukasiewicz T-conorm and T-norm:

φ(A|H_3) = 0,  φ(B|H_3) = 2/3,  φ(C|H_3) = 1/2,
φ(A ∨ B|H_3) = 2/3,  φ(A ∨ C|H_3) = 1/2,  φ(B ∨ C|H_3) = φ(A ∨ B ∨ C|H_3) = 1,
φ(A|H_2) = 1/4,  φ(C|H_2) = 1,  φ(A ∨ C|H_2) = 1,
φ(A|H_1) = 1/6,  φ(B|H_1) = 1,  φ(A ∨ B|H_1) = 1,
φ(E|H_i) = 0  (E ∈ E, E ∧ H_i = ∅, i = 1, 2, 3).
It is easy to prove that the above assessment satisfies the properties of a decomposable conditional measure and that φ is not reducible. On the other hand, there is no class {A_α} and no relevant almost generating measures {φ_α}: in fact φ_0, defined on A_0 = {A, B, C}, coincides with φ(·|H_3); but, since the equations (21.2) and (21.3) have an infinite set of solutions (i.e., they do not have φ(A|H_1) and φ(A|H_2) as unique solution, respectively), there is no atom A* such that for every H_i ⊇ A* the equation (21.1) has a unique solution for every element of E.
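The non-uniqueness just invoked can be checked numerically, reusing solutions_of_21_1 from the sketch after Theorem 26 and assuming, as in Example 48 (not reproduced here), that H_3 = A ∨ B ∨ C, H_2 = A ∨ C and H_1 = A ∨ B:

```python
# phi_0 on the atoms coincides with phi(.|H_3) in Example 49:
phi0 = {"A": 0.0, "B": 2/3, "C": 0.5}

# Equation (21.1) for A|H_1 (assuming H_1 = A v B): lhs = 0, rhs = 0 (+) 2/3 = 2/3,
# and x (.) 2/3 = max(0, x - 1/3) vanishes for EVERY x <= 1/3 -> no unique solution.
sols_h1 = solutions_of_21_1(phi0, atoms_in_EH=["A"], atoms_in_H=["A", "B"])
print(len(sols_h1), min(sols_h1), max(sols_h1))   # hundreds of grid points, from 0 up to about 1/3

# Equation (21.1) for A|H_2 (assuming H_2 = A v C): every x <= 1/2 is a solution.
sols_h2 = solutions_of_21_1(phi0, atoms_in_EH=["A"], atoms_in_H=["A", "C"])
print(len(sols_h2), min(sols_h2), max(sols_h2))   # hundreds of grid points, from 0 up to about 1/2
```

The assessed values φ(A|H_1) = 1/6 and φ(A|H_2) = 1/4 lie inside these solution intervals but are not singled out by them, which is exactly why no class of (almost) generating measures exists here.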
21.4 Concluding remarks
The class of reducible weakly decomposable conditional measures (which can be "generated" - as in the case of conditional probabilities - by a suitable family of weakly decomposable unconditional measures) is much larger than the class of measures that are continuous transformations of a conditional probability. This is due to the fact that we deal with operations that are not necessarily continuous or strictly increasing (so that also min, for instance, can be taken as the operation ⊙); moreover, we consider operations satisfying the commutative and associative properties only on specific subsets of [0,1]^2, and not necessarily extensible (preserving the same properties) to the whole set. Nevertheless, our results point out that it is not possible to escape every form (even weak) of distributivity of ⊙ over ⊕. Finally, we note that the approach based on a direct assignment of the conditional measure removes any difficulty related, e.g., to the problem of conditioning with respect to events of zero measure.
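As a quick concrete check (ours, not the book's) that min is indeed admissible as ⊙ once strict monotonicity is given up:

```python
# min is commutative, associative, increasing and has 1 as neutral element on [0, 1],
# yet it is not strictly increasing: raising one argument need not change the result.
grid = [i / 10 for i in range(11)]
for x in grid:
    assert min(x, 1.0) == x                                # 1 is neutral
    for y in grid:
        assert min(x, y) == min(y, x)                      # commutative
        for z in grid:
            assert min(min(x, y), z) == min(x, min(y, z))  # associative
print(min(0.3, 0.8), min(0.3, 0.9))   # 0.3 0.3 -- not strictly increasing
```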
Bibliography

[1] E. Adams. The Logic of Conditionals. Reidel, Dordrecht, 1975.

[2] M. Baioletti, A. Capotorti, S. Tulipani, and B. Vantaggi. "Elimination of Boolean variables for probabilistic coherence", Soft Computing 4(2): 81-88, 2000.

[3] M. Baioletti, A. Capotorti, S. Tulipani, and B. Vantaggi. "Simplification rules for the coherent probability assessment problem", Annals of Mathematics and Artificial Intelligence, 35: 11-28, 2002.

[4] B. Barigelli. "Data Exploration and Conditional Probability", IEEE Transactions on Systems, Man, and Cybernetics 24(12): 1764-1766, 1994.

[5] S. Benferhat, D. Dubois and H. Prade. "Nonmonotonic Reasoning, Conditional Objects and Possibility Theory", Artificial Intelligence 92: 259-276, 1997.

[6] P. Benvenuti and R. Mesiar. "Pseudo-additive measures and triangular-norm-based conditioning", Annals of Mathematics and Artificial Intelligence, 35: 63-70, 2002.

[7] P. Berti and P. Rigo. "Conglomerabilità, disintegrabilità e coerenza", Serie Ricerche Teoriche, n. 11, Dip. Statistico Univ. Firenze, 1989.
[8] V. Biazzo, A. Gilio. "A generalization of the fundamental theorem of de Finetti for imprecise conditional probability assessments", International Journal of Approximate Reasoning, 24: 251-272, 2000.
[9] V. Biazzo, A. Gilio, and G. Sanfilippo. "Efficient checking of coherence and propagation of imprecise probability assessments", in: Proceedings IPMU 2000, Madrid, pp. 1973-1976, 2000.

[10] P. Billingsley. Probability and Measure, Wiley, New York, 1995.
[11] D. Blackwell and L.E. Dubins. "On existence and non-existence of proper, regular, conditional distributions", The Annals of Probability, 3: 741-752, 1975.

[12] G. Boole. An investigation of the laws of thought on which are founded the mathematical theories of logic and probability, Macmillan, Cambridge, 1854.

[13] B. Bouchon-Meunier, G. Coletti and C. Marsala. "Possibilistic conditional events", in: Proceedings IPMU 2000, Madrid, pp. 1561-1566, 2000.

[14] B. Bouchon-Meunier, G. Coletti and C. Marsala. "Conditional Possibility and Necessity", in: Technologies for Constructing Intelligent Systems (eds. B. Bouchon-Meunier, J. Gutierrez-Rios, L. Magdalena, and R.R. Yager), Springer, Berlin, 2001.

[15] G. Bruno and A. Gilio. "Applicazione del metodo del simplesso al teorema fondamentale per le probabilità nella concezione soggettivistica", Statistica 40: 337-344, 1980.
[16] G. Bruno and A. Gilio. "Confronto tra eventi condizionati di probabilità nulla nell'inferenza statistica bayesiana", Rivista Matem. Sci. Econ. Soc., 8: 141-152, 1985.

[17] P. Calabrese. "An algebraic synthesis of the foundations of logic and probability", Information Sciences, 42: 187-237, 1987.

[18] A. Capotorti and B. Vantaggi. "Locally Strong Coherence in Inference Processes", Annals of Mathematics and Artificial Intelligence, 35: 125-149, 2002.

[19] A. Capotorti, L. Galli and B. Vantaggi. "Locally Strong Coherence and Inference with Lower-Upper Probabilities", Soft Computing, in press.

[20] P. Cheeseman. "Probabilistic versus fuzzy reasoning", in: Uncertainty in Artificial Intelligence (eds. L.N. Kanal and J.F. Lemmer), pp. 85-102, North-Holland, 1986.

[21] G. Coletti. "Numerical and qualitative judgments in probabilistic expert systems", in: Proc. of the International Workshop on Probabilistic Methods in Expert Systems (ed. R. Scozzafava), SIS, Roma, pp. 37-55, 1993.

[22] G. Coletti. "Coherent Numerical and Ordinal Probabilistic Assessments", IEEE Transactions on Systems, Man, and Cybernetics, 24: 1747-1754, 1994.

[23] G. Coletti. "Coherence Principles for handling qualitative and quantitative partial probabilistic assessments", Mathware & Soft Computing, 3: 159-172, 1996.

[24] G. Coletti, A. Gilio, and R. Scozzafava. "Conditional events with vague information in expert systems", in: Lecture Notes in Computer Science n. 521 (eds. B. Bouchon-Meunier, R.R. Yager, and L.A. Zadeh), Springer-Verlag, Berlin, pp. 106-114, 1991.
[25] G. Coletti and R. Scozzafava. "Characterization of Coherent Conditional Probabilities as a Tool for their Assessment and Extension", International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 4: 103-127, 1996.

[26] G. Coletti and R. Scozzafava. "Exploiting zero probabilities", in: Proc. EUFIT '97, Elite Foundation, Aachen, pp. 1499-1503, 1997.

[27] G. Coletti and R. Scozzafava. "Conditional measures: old and new", in: Proc. of "New Trends in Fuzzy Systems", Napoli 1996 (eds. D. Mancini, M. Squillante, and A. Ventre), World Scientific, Singapore, pp. 107-120, 1998.

[28] G. Coletti and R. Scozzafava. "Null events and stochastical independence", Kybernetika 34(1): 69-78, 1998.

[29] G. Coletti and R. Scozzafava. "Zero probabilities in stochastic independence", in: Information, Uncertainty, Fusion (eds. B. Bouchon-Meunier, R.R. Yager, and L.A. Zadeh), Kluwer, Dordrecht (Selected papers from IPMU 1998, Paris), pp. 185-196, 2000.

[30] G. Coletti and R. Scozzafava. "Conditional Subjective Probability and Fuzzy Theory", in: Proc. of 18th NAFIPS International Conference, New York, IEEE, pp. 77-80, 1999.

[31] G. Coletti and R. Scozzafava. "Conditioning and Inference in Intelligent Systems", Soft Computing, 3: 118-130, 1999.

[32] G. Coletti and R. Scozzafava. "The role of coherence in eliciting and handling "imprecise" probabilities and its application to medical diagnosis", Information Sciences, 130: 41-65, 2000.
[33] G. Coletti and R. Scozzafava. "Stochastic Independence for Upper and Lower Probabilities in a Coherent Setting", in: Technologies for Constructing Intelligent Systems (eds. B. Bouchon-Meunier, J. Gutierrez-Rios, L. Magdalena, and R.R. Yager), Springer, Berlin (Selected papers from IPMU 2000, Madrid), vol. 2, 2001.

[34] G. Coletti and R. Scozzafava. "Fuzzy sets as conditional probabilities: which meaningful operations can be defined?", in: Proc. of 20th NAFIPS International Conference, Vancouver, IEEE, pp. 1892-1895, 2001.

[35] G. Coletti and R. Scozzafava. "From conditional events to conditional measures: a new axiomatic approach", Annals of Mathematics and Artificial Intelligence, 32: 373-392, 2001.

[36] G. Coletti and R. Scozzafava. "Stochastic independence in a coherent setting", Annals of Mathematics and Artificial Intelligence, 35: 151-176, 2002.

[37] G. Coletti and R. Scozzafava. "Bayes' theorem in a coherent setting", in: Fifth World Meeting of the International Society for Bayesian Analysis (ISBA), Istanbul (Abstract), 1997.

[38] G. Coletti and R. Scozzafava. "Conditional probability, fuzzy sets and possibility: a unifying view", Fuzzy Sets and Systems, 2002, to appear.

[39] G. Coletti, R. Scozzafava and B. Vantaggi. "Probabilistic Reasoning as a General Unifying Tool", in: Lecture Notes in Computer Science (eds. S. Benferhat and P. Besnard), Vol. LNAI 2143, pp. 120-131, Springer-Verlag, Berlin, 2001.

[40] G. Coletti, R. Scozzafava and B. Vantaggi. "Coherent Conditional Probability as a Tool for Default Reasoning", in: Proceedings IPMU 2002, Annecy, France, pp. 1663-1670, 2002.
[41] I. Couso, S. Moral, and P. Walley. "Examples of independence for imprecise probabilities", in: Int. Symp. on Imprecise Probabilities and their Applications (ISIPTA '99), Ghent, Belgium, pp. 121-130, 1999.

[42] R. T. Cox. "Probability, frequency and reasonable expectation", American Journal of Physics, 14(1): 1-13, 1946.

[43] A. Császár. "Sur la structure des espaces de probabilité conditionnelle", Acta Mathematica Academiae Scientiarum Hungaricae, 6: 337-361, 1955.

[44] L. M. De Campos and S. Moral. "Independence concepts for convex sets of probabilities", in: Uncertainty in Artificial Intelligence (UAI '95), Morgan Kaufmann, San Mateo, pp. 108-115, 1995.

[45] B. de Finetti. "Sui passaggi al limite nel calcolo delle probabilità", Rend. Reale Ist. Lombardo Scienze e Lettere, 63: 155-156, 1930.

[46] B. de Finetti. "A proposito dell'estensione del teorema delle probabilità totali alle classi numerabili", Rend. Reale Ist. Lombardo Scienze e Lettere, 63: 901-905, 1930.

[47] B. de Finetti. "Ancora sull'estensione alle classi numerabili del teorema delle probabilità totali", Rend. Reale Ist. Lombardo Scienze e Lettere, 63: 1063-1069, 1930.

[48] B. de Finetti. "Sul significato soggettivo della probabilità", Fundamenta Mathematicae, 17: 298-329, 1931. Engl. transl. in: Induction and Probability (eds. P. Monari, D. Cocchi), CLUEB, Bologna: 291-321, 1993.

[49] B. de Finetti. "La logique de la probabilité", in: Actes du Congrès International de Philosophie Scientifique, Paris 1935, Hermann: IV, 1-9, 1936.
[50] B. de Finetti. "Les probabilités nulles", Bull. Sci. Math. 60: 275-288, 1936.

[51] B. de Finetti. "La prévision: ses lois logiques, ses sources subjectives", Ann. Institut H. Poincaré 7: 1-68, 1937.

[52] B. de Finetti. "Sull'impostazione assiomatica del calcolo delle probabilità", Annali Univ. Trieste, 19: 3-55, 1949. Engl. transl. in: Ch. 5 of Probability, Induction, Statistics, Wiley, London, 1972.

[53] B. de Finetti. Teoria della probabilità, Einaudi, Torino, 1970. Engl. transl.: Theory of Probability, Vols. 1 and 2, Wiley, Chichester, 1974.

[54] B. de Finetti. "Probability: beware of falsifications!", in: Studies in Subjective Probability (eds. H.E. Kyburg and H.E. Smokler), Krieger Publ., New York, pp. 193-224, 1980.

[55] A. P. Dempster. "Upper and Lower Probabilities Induced by a Multivalued Mapping", Annals of Mathematical Statistics 38: 325-339, 1967.

[56] L. E. Dubins. "Finitely Additive Conditional Probabilities, Conglomerability and Disintegration", The Annals of Probability 3: 89-99, 1975.

[57] D. Dubois, S. Moral and H. Prade. "A semantics for possibility theory based on likelihoods", Journal of Mathematical Analysis and Applications 205: 359-380, 1997.

[58] D. Dubois and H. Prade. "Conditioning in Possibility and Evidence Theories: a Logical Viewpoint", in: Lecture Notes in Computer Science (eds. B. Bouchon-Meunier, L. Saitta, R.R. Yager), n. 313, pp. 401-408, Springer-Verlag, Berlin, 1988.
[59] D. Dubois and H. Prade. "Conditional Objects as Nonmonotonic Consequence Relationships", IEEE Transactions on Systems, Man, and Cybernetics, 24: 1724-1740, 1994.

[60] D. Dubois and H. Prade. "Possibility theory, probability theory and multiple-valued logics: a clarification", Annals of Mathematics and Artificial Intelligence, 32: 35-66, 2001.

[61] K. Fan. "On Systems of Linear Inequalities", in: Linear Inequalities and Related Systems, Annals of Mathematical Studies, Vol. 38, Princeton University Press, 1956.

[62] R. Feynman. "The concept of probability in quantum mechanics", in: Proc. 2nd Berkeley Symp. on Mathematical Statistics and Probability, University of California Press, Berkeley, pp. 533-541, 1951.

[63] M. Fréchet. "Sur l'extension du théorème des probabilités totales au cas d'une suite infinie d'événements, I", Rend. Reale Ist. Lombardo Scienze e Lettere, 63: 899-900, 1930.

[64] M. Fréchet. "Sur l'extension du théorème des probabilités totales au cas d'une suite infinie d'événements, II", Rend. Reale Ist. Lombardo Scienze e Lettere, 63: 1059-1062, 1930.

[65] A. M. Frisch and P. Haddawy. "Anytime Deduction for Probabilistic Logic", Artificial Intelligence 69(1-2): 93-122, 1993.

[66] D. Gale. The Theory of Linear Economic Models, McGraw-Hill, New York, 1960.

[67] P. Gärdenfors. Knowledge in Flux, MIT Press, Cambridge (Massachusetts), 1988.

[68] R. Giles. "The concept of grade of membership", Fuzzy Sets and Systems, 25: 297-323, 1988.
[69] A. Gilio. "Criterio di penalizzazione e condizioni di coerenza nella valutazione soggettiva della probabilità", Boll. Un. Mat. Ital. (7) 4-B: 645-660, 1990.

[70] A. Gilio. "Probabilistic consistency of knowledge bases in inference systems", in: Lecture Notes in Computer Science (eds. M. Clarke, R. Kruse, S. Moral), Vol. 747, pp. 160-167, Springer-Verlag, Berlin, 1993.
[71] A. Gilio. "Probabilistic consistency of conditional probability bounds", in: Advances in Intelligent Computing (eds. B. Bouchon-Meunier, R.R. Yager, and L.A. Zadeh), Lecture Notes in Computer Science, Vol. 945, Springer-Verlag, Berlin, 1995.

[72] A. Gilio. "Probabilistic reasoning under coherence in System P", Annals of Mathematics and Artificial Intelligence, 34: 5-34, 2002.

[73] A. Gilio and S. Ingrassia. "Totally coherent set-valued probability assessments", Kybernetika 34(1): 3-15, 1998.

[74] A. Gilio and R. Scozzafava. "Vague distributions in Bayesian testing of a null hypothesis", Metron 43: 167-174, 1985.

[75] A. Gilio and R. Scozzafava. "Conditional events in probability assessment and revision", IEEE Transactions on Systems, Man, and Cybernetics 24(12): 1741-1746, 1994.

[76] M. Goldszmidt and J. Pearl. "Qualitative probability for default reasoning, belief revision and causal modeling", Artificial Intelligence 84: 57-112, 1996.

[77] I. R. Goodman and H. T. Nguyen. "Conditional objects and the modeling of uncertainties", in: Fuzzy Computing (eds. M. Gupta, T. Yamakawa), pp. 119-138, North-Holland, Amsterdam, 1988.
[78] I. R. Goodman and H. T. Nguyen. "Mathematical foundations of conditionals and their probabilistic assignments", International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 3: 247-339, 1995.

[79] T. Hailperin. "Best possible inequalities for the probability of a logical function of events", Amer. Math. Monthly 72: 343-359, 1965.

[80] P. Hájek. Metamathematics of Fuzzy Logic, Kluwer, Dordrecht, 1998.

[81] J. Y. Halpern. "A counterexample to theorems of Cox and Fine", J. of Artificial Intelligence Research 10: 67-85, 1999.

[82] P. Hansen, B. Jaumard, and M. Poggi de Aragão. "Column generation methods for probabilistic logic", ORSA Journal on Computing 3: 135-148, 1991.

[83] E. Hisdal. "Are grades of membership probabilities?", Fuzzy Sets and Systems, 25: 325-348, 1988.

[84] S. Holzer. "On coherence and conditional prevision", Boll. Un. Mat. Ital. (6)4: 441-460, 1985.

[85] H. Jeffreys. Theory of Probability, Oxford University Press, Oxford, 1948.

[86] E. P. Klement, R. Mesiar, E. Pap. Triangular Norms, Kluwer, Dordrecht, 2000.

[87] B. O. Koopman. "The Bases of Probability", Bulletin A.M.S., 46: 763-774, 1940.

[88] P. H. Krauss. "Representation of Conditional Probability Measures on Boolean Algebras", Acta Math. Acad. Scient. Hungar., 19: 229-241, 1968.
[89] F. Lad. Operational Subjective Statistical Methods, Wiley, New York, 1996.

[90] F. Lad, J. M. Dickey and M. A. Rahman. "The fundamental theorem of prevision", Statistica 50: 19-38, 1990.

[91] F. Lad, J. M. Dickey and M. A. Rahman. "Numerical application of the fundamental theorem of prevision", J. Statist. Comput. Simul. 40: 135-151, 1992.

[92] F. Lad and R. Scozzafava. "Distributions agreeing with exchangeable sequential forecasting", The American Statistician, 55: 131-139, 2001.

[93] D. Lehmann and M. Magidor. "What does a conditional knowledge base entail?", Artificial Intelligence 55: 1-60, 1992.

[94] R. S. Lehman. "On confirmation and rational betting", The J. of Symbolic Logic 20: 251-262, 1955.

[95] D. V. Lindley. "A statistical paradox", Biometrika 44: 187-192, 1957.

[96] H. T. Nguyen and E. A. Walker. A first course in fuzzy logic, CRC Press, Boca Raton, 1997.

[97] A. Nies and L. Camarinopoulos. "Application of fuzzy set and probability theory to data uncertainty in long term safety assessment of radioactive waste disposal systems", in: Probabilistic Safety Assessment and Management (ed. G. Apostolakis), Vol. 2, pp. 1389-1394, Elsevier, N.Y., 1991.

[98] N. J. Nilsson. "Probabilistic Logic", Artificial Intelligence 28: 71-87, 1986.
[99] N. J. Nilsson. "Probabilistic Logic Revisited", Artificial Intelligence 59: 39-42, 1993.

[100] J. B. Paris. The Uncertain Reasoner's Companion, Cambridge University Press, Cambridge, 1994.

[101] K. R. Popper. The Logic of Scientific Discovery, Routledge, London, 1959.

[102] E. Regazzini. "Finitely additive conditional probabilities", Rend. Sem. Mat. Fis. Milano 55: 69-89, 1985.

[103] R. Reiter. "A Logic for Default Reasoning", Artificial Intelligence, 13(1-2): 81-132, 1980.

[104] A. Rényi. "On a New Axiomatic Theory of Probability", Acta Mathematica Academiae Scientiarum Hungaricae, 6: 285-335, 1955.

[105] P. Rigo. "Un teorema di estensione per probabilità condizionate finitamente additive", in: Atti XXXIV Riunione Scientifica S.I.S., Siena, Vol. 2, pp. 27-34, 1988.

[106] A. Robinson. Nonstandard Analysis, Princeton University Press, Princeton, 1996.

[107] S. J. Russell and P. Norvig. Artificial Intelligence. A Modern Approach, Prentice-Hall, New Jersey, 1995.

[108] G. Schay. "An Algebra of Conditional Events", Journal of Mathematical Analysis and Applications, 24: 334-344, 1968.

[109] R. Scozzafava. "Probabilità σ-additive e non", Boll. Unione Mat. Ital., (6) 1-A: 1-33, 1982.

[110] R. Scozzafava. "A survey of some common misunderstandings concerning the role and meaning of finitely additive probabilities in statistical inference", Statistica, 44: 21-45, 1984.
[111] R. Scozzafava. "A merged approach to stochastics in engineering curricula", European Journal of Engineering Education, 15(3): 241-248, 1990.

[112] R. Scozzafava. "Probabilità condizionate: de Finetti o Kolmogoroff?", in: Scritti in omaggio a L. Daboni, pp. 223-237, LINT, Trieste, 1990.

[113] R. Scozzafava. "The role of probability in statistical physics", Transport Theory and Statistical Physics, 29(1-2): 107-123, 2000.

[114] R. Scozzafava. "How to solve some critical examples by a proper use of coherent probability", in: Uncertainty in Intelligent Systems (eds. B. Bouchon-Meunier, L. Valverde, R.R. Yager), pp. 121-132, Elsevier, Amsterdam, 1993.

[115] R. Scozzafava. "Subjective conditional probability and coherence principles for handling partial information", Mathware & Soft Computing, 3: 183-192, 1996.

[116] G. Shafer. A Mathematical Theory of Evidence, Princeton University Press, Princeton, 1976.

[117] G. Shafer. "Probability judgement in Artificial Intelligence and Expert Systems", Statistical Science, 2: 3-16, 1987.

[118] P. P. Shenoy. "On Spohn's Rule for Revision of Beliefs", International Journal of Approximate Reasoning, 5: 149-181, 1991.

[119] R. Sikorski. Boolean Algebras, Springer, Berlin, 1964.

[120] W. Spohn. "Ordinal conditional functions: a dynamic theory of epistemic states", in: Causation in Decision, Belief Change, and Statistics (eds. W. L. Harper, B. Skyrms), Vol. II, Dordrecht, pp. 105-134, 1988.
[121] W. Spohn. "On the Properties of Conditional Independence", in: Scientific Philosopher 1: Probability and Probabilistic Causality (eds. P. Humphreys and P. Suppes), Kluwer, Dordrecht, pp. 173-194, 1994.

[122] W. Spohn. "Ranking Functions, AGM Style", Research Group "Logic in Philosophy", 28, 1999.

[123] B. Vantaggi. "Conditional Independence in a Finite Coherent Setting", Annals of Mathematics and Artificial Intelligence, 32: 287-313, 2001.

[124] B. Vantaggi. "The L-separation Criterion for Description of cs-independence Models", International Journal of Approximate Reasoning, 29: 291-316, 2002.

[125] P. Walley. Statistical Reasoning with Imprecise Probabilities, Chapman and Hall, London, 1991.

[126] P. M. Williams. "Notes on conditional previsions", School of Mathematical and Physical Sciences, working paper, The University of Sussex, 1975.

[127] P. M. Williams. "Indeterminate probabilities", in: Formal Methods in the Methodology of Empirical Sciences (eds. M. Przelecki, K. Szaniawski, and R. Wójcicki), Reidel, Dordrecht, pp. 229-246, 1976.

[128] L. A. Zadeh. "Fuzzy sets", Information and Control, 8: 338-353, 1965.
Index

additive set, 73, 76, 92, 263, 269
agreeing classes of probabilities, 77, 81, 85, 87, 88, 94, 99, 100, 109, 111, 113, 114, 165, 166, 168, 182, 234, 246
algebra of subsets, 21, 25, 90
almost generating measures, 263
alternative theorem, 37, 42, 53
assumed vs. asserted proposition, 8, 18, 19, 75, 146, 152, 156, 201, 202
atoms, 23, 31-35, 39, 42-44, 49, 51, 69, 80, 109, 111-113, 117, 119, 120, 124, 130, 138, 142-147, 152-154, 211, 212, 232, 234, 238, 263-266, 269
axioms of conditional probability, 73
axioms of subjective probability, 14
basic probability assignment, 206, 208
Bayes' theorem, 75, 107, 140, 143, 144, 153, 155, 157, 158, 161, 189, 190
Bayesian approach, 11, 75, 138, 139, 146, 155, 157, 189, 209
belief functions, 136, 206, 207
belief network, 8
betting interpretation of coherence, 37, 41, 77
Boolean algebra, 10, 20, 33, 76, 85, 92, 232, 262, 263, 267, 269
Boolean support of a conditional event, 64, 67
Borel sets, 194
car and goats paradox, 203
causality, 164
certain and impossible events, 14, 18, 19, 100
characteristic function, 216, 217
checking coherence, 11, 33, 46, 48, 95, 117, 120, 122, 124-126, 133, 139, 140, 142, 143, 145, 147, 150, 152-154, 245, 250, 255
coherence, 9, 11, 15, 24, 31, 32, 40, 42, 103, 189
coherent conditional probability, 11, 70, 72, 76, 102, 119, 251
coherent conditional probability (characterization theorem), 81
coherent extensions, 9, 26, 33, 41, 43, 46, 48, 58, 80, 87, 106, 109-111, 113, 117, 120, 123, 127, 129, 133, 138, 144, 151, 152, 201, 226, 233, 236-238, 246, 247, 253
coherent prevision, 50
combinatorial evaluations, 7, 14, 54, 55
commensurable events, 100
compositional belief, 57, 220
conditional event, 7, 9, 10, 22, 45, 50, 63, 204, 225, 251
conditional independence, 8, 163, 228
conditional measure, 258
conditional possibility, 259, 262, 266
conditional probability, 7, 9, 22, 29, 45, 259
conjunction, 20
consistency of default rules, 245, 251, 253
contrary (of an event), 18
countable additivity, 12, 28, 71, 90, 155, 191, 194
Cox's axioms, 259
crisp subset, 216, 217, 228-230
Császár's dimensionally ordered classes, 91
DAG (directed acyclic graph), 8
de Finetti's axioms, 71
de Finetti's coherence, 15, 32, 37, 80
de Finetti's fundamental theorem of probabilities, 10, 44
De Morgan's laws, 20
decomposable conditional measure, 262
decomposable conditional measure (characterization theorem), 266
decomposable uncertainty measures, 260
default logic, 11
default rules, 241, 244, 245
degree of belief, 8, 13, 24, 47, 54, 201, 203, 221-224, 241
degree of truth, 62
Dempster's rule of combination, 136, 206
Dempster's theory, 134
dilation, 189
disintegration formula, 146, 153, 201
disjunction, 20
dominating family, 129, 130
Dutch-Book, 34, 40
entailment (of default rules), 246, 247, 252, 254
epistemic state, 19, 106
event, 7, 17, 227
exchangeability, 12, 55, 186
exploiting zero probabilities, 112, 117, 120, 122
finitely additive probability, 25, 28, 191, 194
first digit problem, 28, 192
frequentist evaluations, 12, 14, 54-56, 139, 198, 199, 201
frequentist's temptation, 227
fuzziness and uncertainty, 219
fuzzy subset, 215-217, 219, 221, 223-230, 238, 240
fuzzy theory, 11, 215, 225, 232
g-coherence, 133
Gärdenfors approach, 15, 19, 34
graphical models, 163
heads and tails, 17, 18, 28, 95, 101, 166, 167, 204
i-minimal conditional probability, 130, 181, 184
if-then rules, 9
imprecise probabilities, 46, 128, 133, 151
inclusion for conditional events, 68, 88, 114
incompatible events, 18, 20
indicator of an event, 17
indicator vector, 23
inference, 11, 45, 46, 103, 107, 137, 155, 189, 247
infinitesimals, 107, 254
iperreal field ℝ*, 107
iperreal probability, 107
Jeffreys-Lindley paradox, 157
Kolmogorov's approach, 13, 29, 181, 194, 196
likelihood, 11, 75, 147, 161, 220
likelihood principle, 155
linear programming, 10, 46, 112, 134
locally strong coherence, 122-126, 133, 154
logic of certainty, 14, 19
logical dependence, 9, 43, 106, 109, 163, 166
logical implication, 17, 20, 243, 244
logical independence, 23, 32, 167, 172, 179, 182, 183
logical relations, 18, 23, 87, 117, 122, 138, 142, 151, 153, 245
lower and upper coherent conditional probabilities, 128, 130, 131, 147, 149, 206, 211
lower and upper probabilities, 48, 151
medical diagnosis, 10, 23, 33, 138, 139, 141, 144
membership function, 216, 217, 219-224, 226-228, 232
multi-valued logic, 61
n-monotone function, 135, 207
Nilsson's probabilistic logic, 46
nonstandard numbers, 107, 254
Ockham's razor, 242
operations among events, 20
operations between fuzzy subsets, 217, 220, 225, 229
operations with conditional events, 10, 12, 65, 72, 258
partial assessments, 1, 10, 13, 19, 33, 46, 93, 106, 108, 117, 132
partial likelihood, 137, 138, 144, 145, 158
partition of Ω, 11, 24, 26, 49, 55, 64, 65, 89, 129, 135, 137, 138, 143, 144, 146, 152, 225, 226, 229, 233, 234, 238
Popper measure, 13, 71, 106, 107
possibility distribution, 232, 233, 238
possibility functions, 11, 223, 232, 233, 237, 238, 240, 257, 264
possible worlds, 23
prevision, 12, 49, 72, 73
prior and posterior probabilities, 11, 75, 137, 139, 141, 144, 153, 155, 157, 159, 161
probabilified ranks, 106, 107
probability 1 ("full" or "plain" belief), 11, 19, 96, 102, 106, 241-244, 246, 253, 254
probability density, 155, 195
proposition, 7, 17, 61, 62
pseudodensity, 157
quantum theory, 198, 200
quasi conjunction, 251
Radon-Nikodym approach to zero probability conditioning, 77, 194, 196
random gain, 39, 40, 51, 78
random quantity, 12, 49, 77, 226, 233
random variable, 8, 49, 64, 65, 72, 73, 260
Rényi's axioms, 71, 90
sample space, 7, 21
second order probabilities, 11
Simpson's paradox, 204
Spohn's irrelevance (independence), 167
Spohn's ranking function, 101-103, 167, 253
state of information, 7, 56, 63
statistical data, 8, 18, 75
stochastic independence, 9, 108, 163-166, 172, 175, 177, 193
stochastic independence for lower probability, 179, 181, 182, 184-186, 190
Stone's theorem, 21
strong negation, 219
subjective probability, 10, 13, 53
superadditivity, 134, 193
symmetry in stochastic independence, 174, 177
System Z, 253
T-conorm, 218, 219, 231, 240, 265, 269
T-norm, 217-219, 231, 240, 265, 269
three prisoners paradox, 203
three-valued logic, 62
truth-functional belief, 57, 220
truth-value of a conditional event, 63, 69
uncertainty and vagueness, 227
uncertainty measures, 11, 186, 257
updating, 1, 10, 33, 48, 103, 106, 107, 127, 138, 139, 141, 144-147, 150, 155, 189, 227
urn of unknown composition, 186, 187, 189
weak implication, 242
weakly compositional belief, 58
weakly decomposable conditional measure, 267
weakly decomposable conditional measure (characterization theorem), 268
You, 8, 13, 20, 27, 41, 54, 151, 164, 221, 223, 226-228, 237
zero lower probability, 128, 150
zero probability, 9, 11, 12, 25, 28, 29, 34, 35, 71, 76, 77, 87, 95, 100, 107, 112, 119, 120, 123, 150, 156, 158, 160, 174, 190, 194, 196, 199, 212, 235, 252
zero-layer, 94, 99-104, 107, 108, 158, 161, 165, 166, 172, 182, 238, 244, 253
TRENDS IN LOGIC

1. G. Schurz: The Is-Ought Problem. An Investigation in Philosophical Logic. 1997 ISBN 0-7923-4410-3
2. E. Ejerhed and S. Lindström (eds.): Logic, Action and Cognition. Essays in Philosophical Logic. 1997 ISBN 0-7923-4560-6
3. H. Wansing: Displaying Modal Logic. 1998 ISBN 0-7923-5205-X
4. P. Hájek: Metamathematics of Fuzzy Logic. 1998 ISBN 0-7923-5238-6
5. H.J. Ohlbach and U. Reyle (eds.): Logic, Language and Reasoning. Essays in Honour of Dov Gabbay. 1999 ISBN 0-7923-5687-X
6. K. Došen: Cut Elimination in Categories. 2000 ISBN 0-7923-5720-5
7. R.L.O. Cignoli, I.M.L. D'Ottaviano and D. Mundici: Algebraic Foundations of Many-valued Reasoning. 2000 ISBN 0-7923-6009-5
8. E.P. Klement, R. Mesiar and E. Pap: Triangular Norms. 2000 ISBN 0-7923-6416-3
9. V.F. Hendricks: The Convergence of Scientific Knowledge. A View From the Limit. 2001 ISBN 0-7923-6929-7
10. J. Czelakowski: Protoalgebraic Logics. 2001 ISBN 0-7923-6940-8
11. G. Gerla: Fuzzy Logic. Mathematical Tools for Approximate Reasoning. 2001 ISBN 0-7923-6941-6
12. M. Fitting: Types, Tableaus, and Gödel's God. 2002 ISBN 1-4020-0604-7
13. F. Paoli: Substructural Logics: A Primer. 2002 ISBN 1-4020-0605-5
14. S. Ghilardi and M. Zawadowski: Sheaves, Games, and Model Completions. A Categorical Approach to Nonclassical Propositional Logics. 2002 ISBN 1-4020-0660-8
15. G. Coletti and R. Scozzafava: Probabilistic Logic in a Coherent Setting. 2002 ISBN 1-4020-0917-8; Pb: 1-4020-0970-4
16. P. Kawalec: Structural Reliabilism. Inductive Logic as a Theory of Justification. 2002 ISBN 1-4020-1013-3

KLUWER ACADEMIC PUBLISHERS - DORDRECHT / BOSTON / LONDON