
Author:
D. V. Lindley


CBMS-NSF REGIONAL CONFERENCE SERIES IN APPLIED MATHEMATICS

A series of lectures on topics of current research interest in applied mathematics under the direction of the Conference Board of the Mathematical Sciences, supported by the National Science Foundation and published by SIAM.

GARRETT BIRKHOFF, The Numerical Solution of Elliptic Equations
D. V. LINDLEY, Bayesian Statistics, A Review
R. S. VARGA, Functional Analysis and Approximation Theory in Numerical Analysis
R. R. BAHADUR, Some Limit Theorems in Statistics
PATRICK BILLINGSLEY, Weak Convergence of Measures: Applications in Probability
J. L. LIONS, Some Aspects of the Optimal Control of Distributed Parameter Systems
ROGER PENROSE, Techniques of Differential Topology in Relativity
HERMAN CHERNOFF, Sequential Analysis and Optimal Design
J. DURBIN, Distribution Theory for Tests Based on the Sample Distribution Function
SOL I. RUBINOW, Mathematical Problems in the Biological Sciences
P. D. LAX, Hyperbolic Systems of Conservation Laws and the Mathematical Theory of Shock Waves
I. J. SCHOENBERG, Cardinal Spline Interpolation
IVAN SINGER, The Theory of Best Approximation and Functional Analysis
WERNER C. RHEINBOLDT, Methods of Solving Systems of Nonlinear Equations
HANS F. WEINBERGER, Variational Methods for Eigenvalue Approximation
R. TYRRELL ROCKAFELLAR, Conjugate Duality and Optimization
SIR JAMES LIGHTHILL, Mathematical Biofluiddynamics
GERARD SALTON, Theory of Indexing
CATHLEEN S. MORAWETZ, Notes on Time Decay and Scattering for Some Hyperbolic Problems
F. HOPPENSTEADT, Mathematical Theories of Populations: Demographics, Genetics and Epidemics
RICHARD ASKEY, Orthogonal Polynomials and Special Functions
L. E. PAYNE, Improperly Posed Problems in Partial Differential Equations
S. ROSEN, Lectures on the Measurement and Evaluation of the Performance of Computing Systems
HERBERT B. KELLER, Numerical Solution of Two Point Boundary Value Problems
J. P. LASALLE, The Stability of Dynamical Systems - Z. ARTSTEIN, Appendix A: Limiting Equations and Stability of Nonautonomous Ordinary Differential Equations
D. GOTTLIEB AND S. A. ORSZAG, Numerical Analysis of Spectral Methods: Theory and Applications
PETER J. HUBER, Robust Statistical Procedures
HERBERT SOLOMON, Geometric Probability
FRED S. ROBERTS, Graph Theory and Its Applications to Problems of Society
JURIS HARTMANIS, Feasible Computations and Provable Complexity Properties
ZOHAR MANNA, Lectures on the Logic of Computer Programming
ELLIS L. JOHNSON, Integer Programming: Facets, Subadditivity, and Duality for Group and Semi-Group Problems
SHMUEL WINOGRAD, Arithmetic Complexity of Computations
(continued on inside back cover)

BAYESIAN STATISTICS, A REVIEW
D. V. LINDLEY
University College London

SOCIETY for INDUSTRIAL and APPLIED MATHEMATICS
PHILADELPHIA, PENNSYLVANIA 19103

Copyright 1972 by the Society for Industrial and Applied Mathematics. All rights reserved. Second Printing 1978 Third Printing 1980 Fourth Printing 1984 Fifth Printing 1989 Sixth Printing 1995

Printed for the Society for Industrial and Applied Mathematics by Capital City Press, Montpelier, Vermont

SIAM is a registered trademark.

Contents

1. Introduction
2. Coherence
3. Sampling-theory statistics
4. Basic ideas in Bayesian statistics
5. Sequential experimentation
6. Finite population, sampling theory
7. Robustness
8. Multiparameter problems
9. Tolerance regions and predictive distributions
10. Multinomial data
11. Asymptotic results
12.
12.1. Empirical Bayes and multiple decision problems
12.2. Nonparametric statistics
12.3. Multivariate statistics
12.4. Invariance theories
12.5. Comparison of Bayesian and orthodox procedures
12.6. Information
12.7. Probability assessments
Bibliography


Preface

I was invited by the Statistics Department at the Oregon State University to give ten lectures on Bayesian Statistics in July 1970. This monograph is a slightly expanded version of the content of those lectures. An adherent of the school of subjective probability might be forgiven for presenting a subjective view of the subject. Although I have tried to give a reasonably complete account of the present position in the study of statistical procedures that use the notion of a probability distribution over parameter space, both the emphasis and choice of topics reflect my own interests.

I am most grateful to H. D. Brunk, Lyle D. Calvin and Don A. Pierce who suggested the idea and to the National Science Foundation for finance. The encouragement to put my knowledge into some reasonably tidy shape was most welcome.

London
October 1970

D. V. LINDLEY


Bayesian Statistics, A Review
D. V. Lindley

1. Introduction. The mathematical model that has been found convenient for most statistical problems contains a sample space $X$ of elements $x$ endowed with an appropriate $\sigma$-field of sets over which is given a family of probability measures. These measures are indexed by a quantity, $\theta$, called a parameter, belonging to the parameter space $\Theta$. The values $x$ are referred to variously as the sample, observations or data. Notice that $\theta$ is merely an index for the various probabilities and that as a result this model includes nonparametric statistics, the special techniques in that field being necessitated by the complexity of the space $\Theta$ which, for example, may include all distributions on the real line (see § 12.2, below). For almost all problems it is sufficient to suppose that these probability measures are dominated by a $\sigma$-finite measure, so that they may be described through their density functions, $p(x|\theta)$, with respect to this measure, $\mu(x)$, in the sense that

$$P(A|\theta) = \int_A p(x|\theta)\,d\mu(x), \tag{1.1}$$

where $A$ is any member of the $\sigma$-field and $P(A|\theta)$ is the probability of $A$ according to the measure indexed by $\theta$. In this review we shall always suppose this to be so and in (1.1) shall write simply $dx$ for $d\mu(x)$. In practice the dominating measure will usually be either Lebesgue or counting measure; an exception arises towards the end of § 4. Furthermore, we shall not usually distinguish between a random variable and the values it takes. When it is necessary to do so we shall use the tilde notation and write $\tilde{x}$ for the variable. Distinction can then be made between $P(\tilde{x} < \theta)$ and $P(x < \tilde{\theta})$. This, admittedly rather sloppy, notation avoids expressions like $P_{\tilde{x}}(x|\theta)$ which are complicated for both statistician and printer alike and, in my experience, enables the meanings of statements to be more easily appreciated.

It is worth remarking that this model is not used in a few branches of statistics. For example, some aspects of significance testing need only a single parameter value (or, more generally, a subset of $\Theta$) called the null value, $\theta_0$, and the densities $p(x|\theta_0)$, no reference being made to alternative values of $\theta$. The early historical examples of significance tests based on tail-area considerations require only this, but, so far as I am aware, no attempt to formalize the intuitively sensible procedure has been successful. Stone's (1969a) comments are intriguing. Again the likelihood


principle, that will be discussed below, uses only the observed data $x$ and makes no reference to the other elements of the sample space. Despite these qualifications the model described in the first two paragraphs is used in all formal analyses of statistical problems.

Most descriptions go a stage further and introduce a decision space $D$ of elements $d$ and a nonnegative loss function $L(d, \theta)$ on $D \times \Theta$. For example, in estimation problems squared error loss is often assumed, whilst in hypothesis testing a zero-one loss function is used.

The Bayesian argument extends the basic model in a different direction and supposes that $\Theta$ supports a $\sigma$-field and a probability measure over it. This supposition I shall take to be the defining property of what constitutes a Bayesian argument, and a Bayesian solution is one that uses such a distribution. Again it will be convenient to describe the measure through its density function $p(\theta)$ with respect to some dominating ...

... $l_2$, given $A$, if $l_1 > l_2$, provided both lotteries are to be called off if $\bar{A}$ occurs. In other words you can effect the comparison of $l_1$ and $l_2$ either before or after $A$ provided that, in the former case, no change takes place if $A$ does not occur. This leads immediately to Bayes' result.

A recent exposition at text-book level has been provided by De Groot (1970) based on the work of Villegas (1964). He develops an axiom system first for probabilities and then for utilities. He points out that a set of axioms which might appear to be enough is not in fact adequate to derive probabilities and he adds the assumption of the existence of a random variable uniformly distributed in $[0, 1]$ to complete the argument. The point is discussed in detail by Kraft et al. (1959) who demonstrated that a conjecture of de Finetti was wrong.
Fishburn (1969b) argues that it is sometimes difficult to defend the transitivity assumption and explores the possibility of proceeding without it; the result is that only qualitative "probabilities" are obtained.

De Finetti (1964), unaware of Ramsey's work, produced in the mid-1930's an argument which is different in spirit from those so far discussed. Let $E_1, E_2, \cdots, E_n$ be $n$ exclusive and exhaustive events held with beliefs $p_1, p_2, \cdots, p_n$. These are not yet probabilities but merely numerical measures of belief derived from the consideration that the gambles $[(x_i/p_i - x_i)E_i, -x_i\bar{E}_i]$ are all equivalent, at least for reasonable values of $x_i$. Here $x_i$ is a stake which returns a prize, $x_i/p_i$, if $E_i$ obtains but is otherwise lost, the expectation $(x_i/p_i - x_i)p_i - x_i(1 - p_i)$ being zero. Suppose now that a gambler puts stakes $x_i$ on $E_i$, $i = 1, 2, \cdots, n$. If $E_i$ occurs he will win the amount $x_i/p_i - \sum_{h=1}^{n} x_h = g_i$, say. Considered as linear equations in the stakes they will have a solution unless the determinant is zero. Consequently unless this happens we could choose $g_i > 0$ for all $i$ and determine stakes that would be certain to win (not just expected to win) whatever event occurred. Hence the determinant must be zero, and this easily gives $\sum_{h=1}^{n} p_h = 1$. This justifies the addition rule for beliefs. The notion of called-off bets enables the multiplication rule to be derived. We shall see later (§ 8) that de Finetti's notion of a successful gambling system can be used with advantage to criticize some orthodox statistical procedures. His arguments do, however, suffer from the disadvantage that by introducing stakes there is some confusion with utility ideas. A nice treatment has recently been provided by Freedman and Purves (1969) who establish that bookies must be Bayesians. De Finetti's other important argument concerned with exchangeability will be discussed later (§ 6).
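The determinant argument can be checked numerically. The following is a minimal Python sketch and is not part of the original lectures: the function names `sure_win_stakes` and `net_gains`, and the beliefs 0.5, 0.3, 0.1, are illustrative assumptions. It uses the closed-form solution of the equations $x_i/p_i - \sum_h x_h = g$ with a common gain $g$, which exists exactly when $\sum_h p_h \ne 1$.

```python
def sure_win_stakes(beliefs, gain=1.0):
    """Stakes x_i giving the same net gain whichever event occurs.

    Solves x_i / p_i - sum(x) = gain for every i. Writing S = sum(x) and
    P = sum(p), each x_i = p_i * (gain + S), so S = P * (gain + S) and
    x_i = p_i * gain / (1 - P). No solution exists when P = 1.
    """
    total = sum(beliefs)
    if abs(total - 1.0) < 1e-12:
        raise ValueError("beliefs sum to one: the determinant is zero")
    return [p * gain / (1.0 - total) for p in beliefs]


def net_gains(beliefs, stakes):
    """Net gain g_i = x_i / p_i - sum(x) if event i occurs."""
    pot = sum(stakes)
    return [x / p - pot for p, x in zip(beliefs, stakes)]


# Hypothetical incoherent beliefs: they sum to 0.9, not 1.
beliefs = [0.5, 0.3, 0.1]
stakes = sure_win_stakes(beliefs)
gains = net_gains(beliefs, stakes)
print(stakes, gains)
```

If the beliefs sum to more than one the computed stakes are negative, which in this sketch corresponds to the gambler taking the other side of each bet.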
A decision-maker whose actions agree with these axioms has been variously described as rational, consistent or coherent. We shall use the last term because it effectively captures the idea that the basic principle behind the axioms is that our judgements should fit together, or cohere. The axioms do not refer to single decisions or inference but to the way in which separate ones cohere, for example, in the transitivity requirement. The concept of coherence has been discussed


recently within the framework of modern statistics in a particularly illuminating article by Cornfield (1969). A justification for the Bayesian approach of an unusual type that might appeal to an orthodox statistician has been provided by Shubert (1969).

A statistical tradition closely related to, but much weaker than, coherence derives from the work of Wald (1950). Unlike the expositions of Ramsey, de Finetti and others this is expressed in terms of the $(X, D, \Theta)$ model discussed in the introduction. A loss function $L(d, \theta)$ is used, together with the notion of a decision function $\delta$ which maps $X$ onto $D$, $\delta(x)$ being the decision taken if $x$ is observed. The risk function for $\delta$ is defined as

$$R(\delta, \theta) = \int L(\delta(x), \theta)\,p(x|\theta)\,dx.$$

$\delta$ is said to be inadmissible if there exists another decision function $\delta^*$ with $R(\delta^*, \theta) \le R(\delta, \theta)$ for all $\theta$, with strict inequality for some $\theta$; otherwise it is admissible. Wald's major result can be summarized by saying that he proved that $\delta$ is only admissible if it is a Bayes solution for some prior density $p(\theta)$, though the notion of a probability measure has to be extended somewhat to make this statement rigorous: specifically, improper priors, that is those for which

$\int p(\theta)\,d\theta$ diverges,

have to be included. Improper priors will be discussed below (§ 8). Wald's argument is considerably weaker than the others that have been discussed mainly because it assumes the existence of a loss function. (On the other hand the mathematical treatment is commendably complete, though it has been criticized by Stein.) A utility function is derived in such a way that its expectation is the inevitable and sole criterion by which a decision should be judged. It is by no means obvious that such a function should exist, and the precise meaning of a loss function is obscure. In applications it typically seems true that

$$L(d, \theta) = \max_{d'} U(d', \theta) - U(d, \theta), \tag{2.2}$$

and we shall regard it as such. It is often useful to note that in the Bayesian solution of a decision problem it is always permissible to subtract from $U(d, \theta)$ any convenient function of $\theta$ without affecting the result. This is clear from considering $\int U(d, \theta)\,p(\theta|x)\,d\theta$, which will only have a quantity not involving $d$ subtracted from it. The principle applies to (2.2).

The results described in this section can be summarized by saying that any reasonable consideration of the way in which decisions or inferences cohere leads to the existence of $p(\theta)$, $U(d, \theta)$ and the principle of maximization of expected utility. This has been rigorously demonstrated when $D$ and $\Theta$ are finite. Savage's work deals with more complicated spaces but there still remain some points, apparently of detail but possibly of practical importance, that are unclear. Presumably the utility function is bounded since otherwise paradoxes of the St. Petersburg type arise. It is not quite so clear whether the probability measure


should be $\sigma$-additive, as we have required, or whether it is enough to be finitely additive. A recent general approach is that of Fishburn (1969a). He provides a set of axioms, which includes the idea of a canonical experiment (under the name of extraneous measurement probabilities), that place no real restrictions on C and $\Theta$ and establish the existence of a utility function, a finitely-additive probability measure and the principle of maximization of expected utility. A mild restriction on the probability suffices to make the utility bounded. This conclusion of finite-additivity agrees with de Finetti but the situation is unclear to me. We shall see below (§ 12.4) that requirements of invariance, that it seems sensible to impose on some statistical problems, would imply the use of improper probability distributions, but that these can cause difficulties. What does seem clear is that the use of a bounded utility function and a proper $\sigma$-additive density cannot lead to difficulties. Some coherent decisions and inferences may be possible outside these limits.

It should be particularly noticed, since this affects the use of the ideas, that the arguments establish the existence of a distribution over $\Theta$. One often reads statements in the literature to the effect that "a prior distribution does not exist in this problem." Within the framework of coherence this is demonstrably not true. However much a person may rebel at the thought of it, the fact remains that if his statements are not to be found incoherent, then they will be interpretable in terms of such a prior (to misuse the adjective).

The probability that the axiom system imposes is to be interpreted as a subjective probability possessed by the decision-maker, "you," whose judgements cohere. $P(A|B)$ is the degree of belief you have in $A$, given $B$. It should not be confused with the so-called objective probability derived from long-term frequency considerations.
For example, suppose we have a coin that is judged to be fair, then the subjective probability for heads will be $\tfrac{1}{2}$; but on repeated tosses the frequency may be demonstrably not near 0.5. The relationship between the two ideas is explained by de Finetti's notion of exchangeability to be discussed later (§ 6).

The view of probability that emerges from these axiomatic considerations is entirely subjective and the attitude will be adopted in this review that all probabilities are to be so interpreted. Objections to this attitude are numerous but none that I am aware of have gone to the axioms and criticized those. Indeed, it is hard to see how such criticism could be sustained since the requirements imposed by coherence are so modest. An excellent discussion on Bayesian ideas, Savage et al. (1962), includes contributions from speakers with widely differing viewpoints, though, to me, the eight years that have elapsed since then make much of it seem dated. An excellent, up-to-date critique by one of the contributors is Bartlett (1967).

The objections are usually at a nonmathematical level. A common one is that expressed by Le Cam in Barnard et al. (1968) who argues that the results are personalistic and therefore unsuitable for science which is objective. To reply to this, notice that the theory deals with a single decision-maker whom we have called "you" but equally it could be a firm or even a government. If science were really objective, then presumably the results could be described as those held by the scientific community, but surely the scientific community should be just as coherent as a single


individual scientist. If so, the scientific community would act as if it had a prior and a utility. In fact science is not objective as any practising scientist must realize, simply because scientists do not and could not perform as a single decision-maker. The theory does not deal with two or more decision-makers, and does not say how people's ideas should be handled when disagreement exists. It is unreasonable to criticize a theory for not doing what it did not set out to do. My view is that a major gap in our knowledge is the lack of an adequate theory of conflict. Game theory, which only applies to the two-person zero-sum game, and then only to the equilibrium strategy, is not enough. A game should be played to maximize one's expected utility, with the expectation based on one's assessment of the opponent's strategy: thus one should not minimax against an inexperienced player.

Another, though weaker, reply to Le Cam's criticism is that the orthodox methods are also personalistic. Thus in Lehmann's (1959) book there is a discussion of the choice of a risk function on intuitive grounds. This will be considered in § 7.

Dempster, in a series of papers (a convenient reference is (1968)), and Smith (1961, 1965) have made constructive criticisms, concerned particularly with the "firmness" with which a probability statement may be held, and have suggested that a single probability statement over $\Theta$ be replaced by upper and lower probabilities; only if these were equal would an ordinary probability obtain. Smith's theory is not developed at a formal level but Dempster's is, and Aitchison, in the discussion to the paper just referred to, presented the following criticism. Let $X = \{x_1, x_2, x_3\}$, $\Theta = \{\theta_1, \theta_2\}$, let the probabilities $p(x_i|\theta_j)$ be as in the table

and suppose $p_*(\theta_1) = 0$, $p^*(\theta_1) = 1$, where the asterisks denote Dempster's lower and upper probabilities by their positions. Then calculations show that $p_*(\theta_1|x_1) = \ldots$, $p^*(\theta_1|x_1) = \ldots$. Yet intuition suggests that $x_1$ gives no information about whether $\theta_1$ or $\theta_2$ is true.

The coherence arguments provide a complete description of the decision problem in terms of $(X, D, \Theta, p(x|\theta), p(\theta), U(d, \theta))$, the laws of probability and the principle of maximization of expected utility, and the formal framework is there for the resolution of any decision situation. The view will be taken in this review that the inference problem is similarly described in terms of $(X, \Theta, p(\theta), p(x|\theta))$ and solved by calculating $p(\theta|x)$ or some margin thereof. Objections to this last resolution have been made on the grounds that inference is not to be confused with decision-making and that our coherence ideas deal with this latter problem. This is not strictly true since the coherence argument can be applied directly to the events of $\Theta$ (see De Groot or de Finetti). My view is that the purpose of an inference is to enable decision problems to be solved using the data upon which the inference is based, though at the time at which the inference is made no decisions may be envisaged. If this is correct, then the posterior must be quoted since it alone is needed for any decision situation. To quote Ramsey (1964): "A lump


of arsenic is called poisonous not because it actually has killed or will kill anyone, but because it would kill anyone if he ate it." A different distinction between inference and decision-making has been presented by Blyth (1970) without reference to the ideas described in this section. Takeuchi (1970) gives a Bayesian reply.

3. Sampling-theory statistics. In this section the implications of the coherence argument for present day (orthodox) statistics are discussed. The bulk of the material consists of a series of counterexamples designed to demonstrate the incoherence of most statistical procedures.

One immediate deduction from the coherence ideas is the likelihood principle. This says that if $x_1, x_2$ are two data sets with the same likelihood function apart from a multiplicative constant (that is, $p(x_1|\theta) = k\,p(x_2|\theta)$ for all $\theta \in \Theta$, where $k$ does not depend on $\theta$), then inferences and decisions should be identical for $x_1$ and $x_2$. This principle can be defended with its own axiom system: see, for example, Birnbaum (1962) and Barnard et al. (1962). Further discussion of the principle of conditionality used by Birnbaum in his derivation has been given by Durbin (1970), and replied to by Savage (1970) and Birnbaum (1970) (see also Hartigan (1967) and § 12.4 below). The principle follows from the Bayesian argument since equality of the likelihoods implies $p(\theta|x_1) = p(\theta|x_2)$ for all $\theta$. It is surprising that many statistical methods violate the principle; indeed, all methods that necessitate reference to some property of $X$ other than the observed $x$ do so. For example, the requirement that an estimate $t(x)$ be unbiased, that is,

$$\int t(x)\,p(x|\theta)\,dx = \theta \tag{3.1}$$

for all $\theta$, violates the principle, since $t(x)$ will typically depend on $X$ through the integration involved in (3.1). A simple, oft-quoted example is interesting. Consider a sequence of binomial trials, that is,$^3$ $x = (x_1, x_2, \cdots, x_n)$, where, given $\theta$, a real number, the $x_i$, all zero or one, are independent with $p(x_i = 1|\theta) = \theta$. Then if $X$ consists of all such sequences of length $n$ the only reasonable unbiased estimate of $\theta$ is $r/n$, where $r = \sum_{i=1}^{n} x_i$. On the other hand, if $X$ consists of all such sequences with fixed $r$ (inverse binomial sampling), the equivalent estimate of $\theta$ is $(r - 1)/(n - 1)$. Yet, in both cases, the likelihood function is $\theta^r(1 - \theta)^{n-r}$.

$^3$ In describing $X$ or $x$ it will often be convenient to use bold fount, $\mathbf{X}$ or $\mathbf{x}$, and reserve italic face for elements of the description.

Since many statistical procedures utilize the structure of $X$ the specification of $X$ constitutes a problem for the orthodox statistician. (It is because of this reference to the sample space that Box has introduced the adjective "sampling-theory.") Consider the following practical example due to Edwards (1970). In a mathematical model of the mutations that have produced the present distribution of blood groups in the human population of the world it is required to estimate, inter alia, $\theta$, the mutation rate. The data $x$ are the numbers with blood of each group. Analysis seems possible at an intuitive level, but what is $X$? Realization of other possible worlds seems rather strained. My own view is that the orthodox statistician's choice of $X$ has an arbitrariness about it comparable with the arbitrariness in $p(\theta)$ of which the Bayesian is often accused.

Our first set of examples will therefore deal with the choice of $X$. A statistic $t(x)$ (that is, some function of $x$) is called ancillary if its probability distribution, derived from that of $x$, does not depend on $\theta$. Sometimes the additional requirement is added that $t$, when combined with the maximum likelihood estimate, should be sufficient$^4$ (see below). The suggestion is often made to make inferences conditional on the observed value of an ancillary statistic. That is, if $x_0$ is the observed data, restrict $X$ to all $x$ such that $t(x) = t(x_0)$. A standard example is bivariate regression where $\mathbf{x} = (x, y)$ and $\theta$ is the set of regression parameters; then $t(\mathbf{x}) = x$ is ancillary and it is common practice to regard the independent (or regressor) variable $x$ as fixed. The general practice is obvious from the Bayesian viewpoint since $p(x|\theta) = p(t)\,p(x|t, \theta)$ and the two likelihoods are proportional.

Our first example concerns a case where it seems natural to condition on an ancillary and yet the resulting procedures do not have the usual optimum sampling-theory properties. Here (Cox (1958), Basu (1964)) $x = (x_1, x_2)$; $x_1 = 0$ or 1 with equal probabilities; if $x_1 = 0$, then$^5$ $x_2 \sim N(\theta, \sigma_0^2)$; if $x_1 = 1$, then $x_2 \sim N(\theta, \sigma_1^2)$, with $\sigma_1 \gg \sigma_0$. ($\theta$ is measured either by a precise apparatus ($\sigma_0$) or an imprecise one ($\sigma_1$), the choice of apparatus being decided by the flip of a coin.) Clearly $x_1$ is ancillary and yet it can be shown that tests based on restricting $X$ to the observed value of $x_1$ are not the best possible in the Neyman-Pearson sense (Cornfield (1969)). Even the standard error of $x_2$, the natural estimate of $\theta$, is unclear since the computation of a standard error involves $X$ (Buehler (1959)).
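The ambiguity in the standard error for the two-apparatus example can be made concrete with a short calculation. The Python sketch below is an illustration only and is not taken from the original text: the values $\sigma_0 = 1$ and $\sigma_1 = 10$, the 1.96 multiplier and all function names are assumptions. It compares the standard error of $x_2$ referred to the whole sample space with the two standard errors obtained by conditioning on the ancillary $x_1$, and shows how an interval built from the unconditional figure behaves on each apparatus.

```python
import math


def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def coverage(half_width, sigma):
    """P(|x2 - theta| <= half_width) when x2 ~ N(theta, sigma^2)."""
    return 2.0 * normal_cdf(half_width / sigma) - 1.0


# Hypothetical precise and imprecise standard deviations.
sigma0, sigma1 = 1.0, 10.0

# Standard error of x2 over the whole sample space (coin flip included):
# Var(x2) = E[Var(x2 | x1)] since E[x2 | x1] = theta for both apparatuses.
se_unconditional = math.sqrt(0.5 * sigma0 ** 2 + 0.5 * sigma1 ** 2)

# Standard errors after restricting the sample space to the observed x1.
se_conditional = {0: sigma0, 1: sigma1}

# Interval x2 +/- 1.96 * (unconditional standard error), examined per apparatus.
half_width = 1.96 * se_unconditional
cover_given_x1 = {x1: coverage(half_width, s) for x1, s in se_conditional.items()}
cover_overall = 0.5 * cover_given_x1[0] + 0.5 * cover_given_x1[1]

print(se_unconditional, cover_given_x1, cover_overall)
```

Under these assumed values the unconditional interval is far too wide when the precise apparatus was used and too narrow when the imprecise one was used, which is one way of seeing why the choice of reference set matters.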
$^4$ Durbin (1969) has given an example where a natural ancillary is not part of the sufficient statistic. Here $x = (x_1, x_2)$, and $x_1 = 0$ or 1 with equal probabilities. If $x_1 = 0$, $x_2$ is the result of $n$ binomial trials (see above), if $x_1 = 1$, $x_2$ is the result of $r$ inverse binomial trials, $n$ and $r$ both having known (that is, not involving $\theta$) distributions. Then $x_1$ is ancillary but not part of $x_2$, the sufficient statistic.

$^5$ The relation "$\sim$" is to be read: "is distributed as." $N(\mu, \sigma^2)$ refers to the normal distribution of mean $\mu$ and variance $\sigma^2$.

A variant of this example is to let $x_1$ be an integer and to consider $x_2 \sim N(\theta, \sigma^2/n)$ with $n = x_1$. Generalizations to include various distributions for $n$ have been discussed by Cohen (1958). Durbin (1969) shows that either the tests with $n$ held fixed, or the unrestricted tests, can be uniformly most powerful depending on the situation, at least asymptotically.

The most complete study of ancillarity has been made by Basu (1964), and his beautiful counterexamples are worth repeating. Let $x$ be uniformly distributed in the (real) interval $[\theta, 1 + \theta)$. Then it is easy to see that the fractional part of $x$ is ancillary; in fact, it is uniformly distributed in $[0, 1)$. If one was to condition on it, then $x$, given the fractional part, has a one-point distribution with rather limited distributional properties! A second example demonstrates the difficulty that ancillary statistics are typically not unique and consequently it is not clear which one to condition on.$^6$ The following table lists in the first row the six

elements of $X$; the second provides the relevant densities for each $\theta$, $-1 \le \theta \le 1$; the third and fourth give the values of two ancillary statistics. (The reader will be able easily to construct for himself four other ancillaries.) To illustrate the difficulty suppose the data $x = 5$ is observed. If $t_1$ is used as the ancillary statistic, then the maximum likelihood estimate (here $\hat{\theta} = 1$) has a distribution on $(-1, 1)$ with probabilities $[(2 - \theta)/4, (2 + \theta)/4]$. If $t_2$ is used the corresponding distribution is quite different, namely $[(3 - \theta)/5, (2 + \theta)/5]$.

The choice of ancillary, and generally the choice of sample space, presents a major difficulty in orthodox statistics. This difficulty is, from a Bayesian viewpoint, inevitable, since the use of $X$ violates the likelihood principle and is therefore incoherent. An attempt to avoid the difficulty has been made by Fraser (1968 and earlier papers referred to therein) who argues that the model we have used is inadequate and omits certain important requirements. When these are inserted the ancillary is unique and inferences can proceed.$^7$ Fraser's work will be discussed below (§ 8).

Closely related to the likelihood principle is the method of maximum likelihood. Except in a detail to be mentioned in connection with the asymptotic theory, this does not violate the principle, but nevertheless can give rise to difficulties. The following example due to Kiefer and Wolfowitz (1956) is elegant and occurs in practice. Let $x = (x_1, x_2, \cdots, x_n)$ be a random sample of size $n$ from the density

$$p(x|\theta) = \tfrac{1}{2}(2\pi)^{-1/2}\exp\{-\tfrac{1}{2}(x - \mu)^2\} + \tfrac{1}{2}(2\pi\sigma^2)^{-1/2}\exp\{-\tfrac{1}{2}(x - \mu)^2/\sigma^2\},$$

where $\theta = (\mu, \sigma^2)$. ($x$ is either $N(\mu, 1)$ or $N(\mu, \sigma^2)$, each possibility being equally likely, but, unlike Cox's example, we do not know which.) Let $\mu = x_i$; then the likelihood tends to infinity as $\sigma \to 0$, and this for all $i = 1, 2, \cdots, n$. Hence there is no maximum$^8$ in a strict sense, or $n$ in a loose sense.
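The unbounded likelihood in the Kiefer and Wolfowitz example is easy to exhibit numerically. The sketch below is illustrative only: the four data values and the function name `log_likelihood` are assumptions, and the density coded is the equal mixture of $N(\mu, 1)$ and $N(\mu, \sigma^2)$ described above. Setting $\mu$ equal to one observation and letting $\sigma$ shrink makes the log-likelihood grow without limit.

```python
import math


def log_likelihood(sample, mu, sigma):
    """Log-likelihood for an equal mixture of N(mu, 1) and N(mu, sigma^2)."""
    root_two_pi = math.sqrt(2.0 * math.pi)
    total = 0.0
    for x in sample:
        unit_part = math.exp(-0.5 * (x - mu) ** 2) / root_two_pi
        free_part = math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * root_two_pi)
        total += math.log(0.5 * unit_part + 0.5 * free_part)
    return total


# Hypothetical data; mu is placed exactly on the second observation.
sample = [-0.3, 0.4, 1.1, 2.0]
mu = sample[1]

# The log-likelihood keeps increasing as sigma tends to zero.
values = [log_likelihood(sample, mu, s) for s in (1.0, 1e-2, 1e-4, 1e-8)]
print(values)
```

The same behaviour occurs whichever observation is chosen for $\mu$, which is the sense in which there are $n$ "maxima" in the loose sense.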
$^6$ The concept of a maximal ancillary, analogous to a minimal sufficient, statistic does not seem to be realizable.

$^7$ He does not use this language, but his restriction to orbits is mathematically equivalent to the choice of an ancillary.

$^8$ A distribution for $\sigma$ (convergent as $\sigma \to 0$) would resolve the difficulty and typically there would be a unique mode for the distribution of $\theta$ posterior to $x$.

Again suppose $x = (x_{1i}, x_{2i};\ i = 1, 2, \cdots, n)$ with $x_{ji} \sim N(\mu_i, \sigma^2)$, $\theta = (\mu_1, \mu_2, \cdots, \mu_n, \sigma^2)$, all $x_{ji}$ being independent, given $\theta$. (Pairs of measurements of equal precision are made on each $\mu_i$.) The maximum likelihood (m.l.) estimate of $\sigma^2$ is $\sum(x_{1i} - x_{2i})^2/4n$ and converges in probability to $\tfrac{1}{2}\sigma^2$ as $n \to \infty$, which is hardly satisfactory. Barnard (1969) argues that $(x_{1i} + x_{2i})$, $i = 1, 2, \cdots, n$, are "irrelevant" so that using only $d_i = x_{1i} - x_{2i}$ and writing down the likelihood for this, the new m.l. estimate is $\sum d_i^2/2n$ which does tend to $\sigma^2$. This type of argument is typical of the ad hoc procedures that orthodox statisticians have to resort to in default of the Bayesian argument. A systematic study of this particular form, and a serious attempt to remove the improvisation element, has been made in their studies of marginal likelihoods by Kalbfleisch and Sprott (1970). A criticism of the argument that inferences should be based on the likelihood alone (and not in conjunction with the prior) will be postponed until § 6 when sampling from a finite population is discussed.

The phenomenon displayed in the last example is typical of what happens when incidental parameters (like the $\mu$'s) appear. A more extreme case arises when fitting a straight line with both variables subject to error. (The model is described in § 7 below.) There it was thought for a long time that the m.l. estimate of the slope was equal to the ratio of the m.l. estimates of the two standard deviations, an absurd situation. In fact Solari (1969) has shown that this supposed maximum is only a saddle point. The likelihood function has essential singularities and the likelihood can be made to approach any value between plus and minus infinity in any neighborhood of such points.

Continuing with counterexamples, we turn to the topic of significance tests, a branch of statistics which has a more completely developed formal theory than most others (see, for example, Lehmann (1959)). We begin with the test of a simple null hypothesis against a simple alternative where the orthodox theory is most complete. That theory uses $\alpha$ and $\beta$, the errors of the two kinds. Formally, with arbitrary $X$, $\Theta = (\theta_1, \theta_2)$, $D = (d_1, d_2)$ and $L(d_i, \theta_j) = 0$, if $i = j$; 1, if $i \ne j$. Then for a decision function (test) ...
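The behaviour of the two estimates of $\sigma^2$ in the paired-measurement example earlier in this section can be seen in a small simulation. This is a sketch under stated assumptions rather than anything from the original lectures: the incidental means are drawn from an arbitrary uniform range, $\sigma = 2$, the seed and the function name `paired_estimates` are all hypothetical choices.

```python
import random


def paired_estimates(n, sigma, seed=0):
    """Simulate n pairs and return (full m.l. estimate, estimate from differences)."""
    rng = random.Random(seed)
    sum_sq_diff = 0.0
    for _ in range(n):
        mu_i = rng.uniform(-10.0, 10.0)  # hypothetical incidental parameter
        x1 = rng.gauss(mu_i, sigma)
        x2 = rng.gauss(mu_i, sigma)
        sum_sq_diff += (x1 - x2) ** 2
    full_ml = sum_sq_diff / (4 * n)       # m.l. estimate with all mu_i estimated
    from_differences = sum_sq_diff / (2 * n)  # m.l. estimate using only d_i
    return full_ml, from_differences


# With sigma = 2 the true variance is 4.
full_ml, from_differences = paired_estimates(n=20000, sigma=2.0)
print(full_ml, from_differences)
```

With these settings the first figure settles near half the true variance and the second near the true variance, in line with the limits quoted in the text.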
In view of the close connection between tests and confidence intervals (the interval being roughly those null values which the data do not reject) these last two examples are embarrassing to an advocate of such intervals, the interval of smaller content not being included in the larger one. But the main attack on confidence intervals (or sets) lies elsewhere. Let $A$ be a confidence statement, say that $\theta \in I$, an interval of the real line, the interval $I$ being the random quantity; then we have $p(A \mid \theta) = \alpha$ for all $\theta$ (or, more generally, $p(A \mid \theta) \ge \alpha$ for all $\theta$). This is a quasi degree-of-belief statement about $\theta$ and, unless effectively based on a distribution for $\theta$, can be incoherent in a way now to be described.

An important criticism of confidence intervals is due to Fisher (1956). A formal expression of his point appears to run as follows. A subset $C$ of $X$ is relevant¹¹ (or recognizable) if $p(A \mid C, \theta) \ge \alpha + \varepsilon$ for all $\theta$ and for some $\varepsilon > 0$. The importance of a relevant subset is that whenever $x \in C$ we know that the true confidence coefficient is strictly greater than the $\alpha$-value quoted, which seems absurd. The simplest example arises when $A$ sometimes includes the whole real line; taking $C$ to be the set of $x$-values for which this happens, we have $p(A \mid C, \theta) = 1$. Thus let $x = (x_1, x_2)$ with $x_i \sim N(\theta_i, 1)$ and $x_1$, $x_2$ independent. A confidence interval for $\theta_1/\theta_2$ is provided by noting that $(\theta_2 x_1 - \theta_1 x_2)(\theta_1^2 + \theta_2^2)^{-1/2} \sim N(0, 1)$ and depends only on $\theta_1/\theta_2$. If $(x_1^2 + x_2^2) < \lambda_\alpha^2$, where $\lambda_\alpha$ is the upper, two-sided, $\alpha$-point for the standard normal density, the resulting "interval" includes all values of $\theta_1/\theta_2$. Fisher's original idea in introducing the concept seems to have been to criticize Welch's (1947) solution to the Behrens problem by demonstrating that a recognizable subset exists in that situation.
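The ratio example can be checked numerically. The following is only a minimal sketch: the function names, the grid of ratios, and the value 1.96 (taken here as the two-sided point for a 95% coefficient) are illustrative assumptions, not part of the original argument. It evaluates the pivot at $\theta_1 = \rho$, $\theta_2 = 1$, which is legitimate because only the ratio matters.

```python
import math


def ratio_in_confidence_set(x1, x2, rho, lam):
    """Return True if the ratio theta1/theta2 = rho lies in the confidence set.

    The pivot (theta2*x1 - theta1*x2) / sqrt(theta1^2 + theta2^2) is
    evaluated at theta1 = rho, theta2 = 1; rho is accepted when the
    absolute pivot does not exceed the normal point lam.
    """
    pivot = (x1 - rho * x2) / math.sqrt(1.0 + rho * rho)
    return abs(pivot) <= lam


def covers_all_ratios(x1, x2, lam, grid):
    """Return True if every ratio on the (hypothetical) grid is accepted."""
    return all(ratio_in_confidence_set(x1, x2, rho, lam) for rho in grid)


LAM = 1.96  # assumed two-sided point for a 95% confidence coefficient
GRID = [k / 10.0 for k in range(-10000, 10001)]  # ratios from -1000 to 1000

# A point with x1^2 + x2^2 < LAM^2: the "interval" is the whole line.
whole_line = covers_all_ratios(0.5, -0.3, LAM, GRID)

# A point far from the origin: some ratios are excluded.
proper_set = covers_all_ratios(3.0, 1.0, LAM, GRID)
```

By the Cauchy-Schwarz inequality the absolute pivot never exceeds $(x_1^2 + x_2^2)^{1/2}$, which is why the first data point accepts every ratio on the grid while the second does not.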
Buehler (1959) has discussed the ideas in detail and Buehler and Feddersen (1963) have demonstrated the remarkable fact that relevant subsets exist in the common Student-$t$ situation (since this is also a fiducial interval, Fisher's remarks have come full circle). Specifically they show that if $x = (x_1, x_2)$ with the $x_i$ independent $N(\mu, \sigma^2)$, so that a 50% interval for $\mu$ is $x_{\min} \le \mu \le x_{\max}$, and if $C$ is the set $|x_1 - x_2| \ge 4|\bar{x}|/3$ (so that the two readings are rather discrepant), then $p(A \mid C, \theta) \ge 0.5181$. Consequently even the most frequently used confidence statement is unsound and it seems a reasonable conjecture that recognizable subsets exist for almost all situations. Hartigan has pointed out to me that relevant subsets always exist for one-sided confidence intervals on the real line. For let the confidence statement be $p(\theta > t(x) \mid \theta) = \alpha$; then it is easy to demonstrate that the set $t(x) < 0$ is relevant. Peculiar phenomena that can arise with confidence intervals have been expounded by Pratt (1963).

We have already pointed out that in point estimation, unbiased estimates could be incoherent because of their dependence on the sample space. A simple example is provided in Ferguson's textbook (1967). Here $x$ is a Poisson variable of mean $\theta$

11 A (frequency) theory of probability using the notion of relevant subsets has been developed by Kyburg (1969).


and an unbiased estimate of $e^{-2\theta}$ is required. (We observe a Poisson process for, say, an hour, and require to estimate the chance of no events in a subsequent two hour period.) To be unbiased we must have

$$\sum_{x=0}^{\infty} t(x)\, e^{-\theta}\, \theta^{x}/x! = e^{-2\theta}$$

or

$$\sum_{x=0}^{\infty} t(x)\, \theta^{x}/x! = e^{-\theta} = \sum_{x=0}^{\infty} (-1)^{x}\, \theta^{x}/x!$$

on multiplying both sides by $e^{\theta}$ and using the series for $e^{-\theta}$. By the uniform convergence of the series, it follows that the only unbiased estimate is $t(x) = (-1)^x$. The idea of estimating a probability as $-1$ is particularly ludicrous. An indication at a more general level of the conflict between unbiased and Bayes' procedures has been given by Bickel and Blackwell (1967). The theory of unbiased estimation that forms so popular a part of most courses in mathematical statistics is therefore of doubtful value, especially when it is remembered that the final estimate that is produced as the best one is only best because the class of estimates has been so constrained that it has only a single member.

An interesting practical problem in which the use of unbiased estimates and the related concepts of mean square error, or variance, give rise to difficulties, is that of calibration, where a large class of reasonable estimates has infinite mean square error. The reader is referred to Krutchkoff (1969), Williams (1969), and, for a Bayesian reply, to Hoadley (1970).

We have tried, in this section, to show that the principle of coherence has practical implications of considerable importance and that many orthodox statistical ideas are unsatisfactory when judged by this criterion. We know that difficulties of this sort cannot arise if Bayesian methods are used. Furthermore, Bayesian methods provide a general formulation and solution of most statistical problems. The system provides a general method of describing and analyzing any such situation without the appeal to ad hoc procedures or ingenious tricks. In this sense it is more objective than sampling-theory methods. We now examine some of the basic ideas in Bayesian statistics.

4. Basic ideas in Bayesian statistics. Despite the substantial criticisms of the last section, many important sampling-theory ideas do have a Bayesian interpretation.
The most widely used methods are those based on least squares theory and the related technique of the analysis of variance. We begin this section by describing how these ideas can be expressed through posterior distributions. We do not attempt full generality but only aim to illustrate the basic ideas (for details the reader is referred to Jeffreys (1967) and Lindley (1965)). The numerous papers of Good are valuable; convenient references are (1950, 1965), and (1969) provides a bold attempt to apply the ideas.

Let $x = (x_1, x_2, \ldots, x_n)$ with the $x_i$ independent and normally distributed with constant variance, $\phi$, say, which is unknown. Let $E(x) = A\theta$ with $\theta = (\theta_1, \theta_2, \ldots, \theta_s)$


D. V. LINDLEY

and $A$ known. ($\theta$, in the earlier notation, is now $(\theta, \phi)$.) Suppose $A^{T}A$ is nonsingular. For a distribution over parameter space, suppose the $\theta_i$ and $\log \phi$ to be all uniformly and independently distributed. Then it is easy to show that

where $S^2$ is the residual sum of squares, namely,

and $S^2(\theta)$ is a positive-definite quadratic form in $\theta_{r+1}, \theta_{r+2}, \ldots, \theta_s$ whose exact form need not concern us. This density is constant on ellipsoids $S^2(\theta) = \text{const}$, with a maximum at the least squares estimates. The set consisting of the interior of any one of these ellipsoids has the property that the probability density at any point inside the set is greater than that at any point exterior to it. Such sets have been called sets of highest posterior density, Box (1965), Bayesian confidence sets, Lindley (1965), and credible sets, Edwards et al. (1963); we shall use the last term.¹² The probability (posterior to the data) of $\theta_{r+1}, \theta_{r+2}, \ldots, \theta_s$ lying in the credible ellipsoids can easily be found from (4.1) in terms of the $F$-distribution. In fact, $\{S^2(\theta)/(s - r)\}/\{S^2/(n - s)\}$ is $F(s - r, n - s)$. The set, $A_\alpha$, with total probability $\alpha$ is a credible set of credibility $\alpha$. It is easy to see that it has exactly the same form as the confidence set for $\theta_{r+1}, \theta_{r+2}, \ldots, \theta_s$ based on the sampling distributions of $S^2(\theta)$ and $S^2$, with confidence coefficient $\alpha$. In fact we have both $p(A_\alpha \mid x) = \alpha$, where the random elements in $A_\alpha$ are the $s - r$ parameter values, and $p(A_\alpha \mid \theta) = \alpha$, where the random element is $x$. The normal distribution has the remarkable property that equivalent statements can be made with either $X$ or $\Theta$ as the relevant space supporting the probability distributions.

A Bayesian interpretation of the common $F$-test is then available by rephrasing the sampling-theory notion that a null value is significant if the confidence interval does not include it, confidence being replaced by credible. Thus the hypothesis $\theta_{r+1} = \theta_{r+2} = \cdots = \theta_s = 0$ is tested by referring $\{S^2(0)/(s - r)\}/\{S^2/(n - s)\}$ to the $F$-table on $s - r$ and $n - s$ degrees of freedom in the usual way. Essentially, in rejecting the null value we are saying that it has not got high posterior probability (density) in comparison with other values.
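For a concrete instance of the $F$-ratio, consider the simplest case of a straight line $E(x_j) = \theta_1 + \theta_2 t_j$, so that $s = 2$, and take the slope as the only parameter of interest, so that $r = 1$. The sketch below is illustrative only: the function names, the data, and the tabulated value 10.13 (the 95% point of $F(1, 3)$) are assumptions introduced here. It uses the standard fact that in this model the quadratic form for the slope is $S^2(\theta_2) = (\theta_2 - \hat\theta_2)^2 \sum (t_j - \bar{t})^2$.

```python
def fit_line(t, x):
    """Least squares fit of x = intercept + slope * t.

    Returns (intercept, slope, sxx, resid_ss), where sxx is the sum of
    squares of t about its mean and resid_ss is the residual sum of
    squares S^2.
    """
    n = len(t)
    t_bar = sum(t) / n
    x_bar = sum(x) / n
    sxx = sum((ti - t_bar) ** 2 for ti in t)
    sxy = sum((ti - t_bar) * (xi - x_bar) for ti, xi in zip(t, x))
    slope = sxy / sxx
    intercept = x_bar - slope * t_bar
    resid_ss = sum((xi - intercept - slope * ti) ** 2 for ti, xi in zip(t, x))
    return intercept, slope, sxx, resid_ss


def f_ratio(t, x, slope0):
    """{S^2(theta)/(s - r)} / {S^2/(n - s)} evaluated at slope = slope0."""
    _, slope, sxx, resid_ss = fit_line(t, x)
    n, s, r = len(t), 2, 1
    quad = (slope - slope0) ** 2 * sxx  # S^2(theta) for the slope alone
    return (quad / (s - r)) / (resid_ss / (n - s))


def in_credible_set(t, x, slope0, f_crit):
    """True if slope0 lies in the credible set bounded by the F point f_crit."""
    return f_ratio(t, x, slope0) <= f_crit


# Hypothetical data; F_CRIT is the tabulated 95% point of F(1, 3).
T = [0.0, 1.0, 2.0, 3.0, 4.0]
X = [1.1, 1.9, 3.2, 3.8, 5.1]
F_CRIT = 10.13
null_is_credible = in_credible_set(T, X, 0.0, F_CRIT)
```

The same function serves both readings of the set: with the data fixed and the slope varying it describes the credible interval, and with the slope fixed and the data varying it describes the confidence interval.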
Although these ideas enable orthodox practice to be interpreted in probability terms, it does not follow that the practice is to be adopted. Inferences should be expressed in the form of a posterior distribution. Practical circumstances may suggest some summary of the distribution because of the difficulties in describing a density, particularly in more than one dimension, but whether intervals are the most convenient forms of summary is unclear. Posterior means, modes or variances may be preferable. Another difficulty associated with the Bayesian description is that it uses improper prior distributions. We shall see later (§ 8) that there is

12 Even in one dimension such intervals are not always too easy to compute since typically two "tails" with equal bounding ordinates will have to be found. Tiao and Lochner (1967) discuss this for $F$. An example of the use of these interval estimates in assessing the reliability of systems is provided by Springer and Thompson (1966, 1968), a problem also considered by Bhattacharya (1967).


reason to suspect these, yet a reanalysis using a proper prior will not give orthodox results.

The above discussion of least squares ideas can be extended to other orthodox practices. For example, maximum likelihood methods are often sensible for a Bayesian, at least asymptotically, though the posterior mode is perhaps a more reasonable substitute. The usual $\chi^2$-tests for goodness-of-fit and for the analysis of contingency tables may also be justified asymptotically, though again, as we shall see below, other methods are more advantageous.

We now turn from sampling-theory concepts to an honest Bayesian analysis of a decision problem (and hence of an associated inference problem). There are two ways to proceed.

1. Normal form. Let $\delta$ be a decision function mapping $X$ into $D$ and describing the decision $\delta(x)$ to be adopted when $x$ is observed. The performance of $\delta$ (prior to the data being available) may be assessed for any value of $\theta$ by calculating the expected utility conditional on $\theta$; that is, by

$$\int U(\delta(x), \theta)\, p(x \mid \theta)\, dx.$$

(Compare the definition of a risk-function, equation (2.1).) Denote this by $U_\delta(\theta)$. The Bayesian argument says that $\delta$ should be selected by maximizing the expected value of $U_\delta(\theta)$, the expectation being with respect to the distribution of $\theta$ prior to $x$, that is, by

$$\max_{\delta} \int U_\delta(\theta)\, p(\theta)\, d\theta. \tag{4.4}$$

Essentially this is the Bayesian solution to a decision problem when it is expressed in the sampling-theory form in which the distribution over $X$ is paramount. A simpler analysis is possible.

2. Extensive form. This is the form already given in (1.4) and consists in evaluating

$$\max_{d} \int U(d, \theta)\, p(\theta \mid x)\, d\theta,$$

the posterior expected utility. At least if utility is bounded and $p(\theta)$ proper the two forms are equivalent. For (4.4) is

$$
\begin{aligned}
\max_{\delta} \int \Big\{ \int U(\delta(x), \theta)\, p(x \mid \theta)\, dx \Big\}\, p(\theta)\, d\theta
&= \max_{\delta} \iint U(\delta(x), \theta)\, p(x \mid \theta)\, p(\theta)\, dx\, d\theta \\
&= \max_{\delta} \iint U(\delta(x), \theta)\, p(\theta \mid x)\, p(x)\, dx\, d\theta \\
&= \int \Big\{ \max_{d} \int U(d, \theta)\, p(\theta \mid x)\, d\theta \Big\}\, p(x)\, dx,
\end{aligned}
\tag{4.6}
$$


where Fubini's theorem has been used twice to interchange double and repeated integrals, and the passage from the second to third lines has been effected by Bayes' theorem, (1.2). The main difference between the normal and extensive forms is that in the former the decision-maker considers the situation before the data is available, whereas in the latter only the decision for that $x$ observed is contemplated. The basic idea of "called-off" bets is relevant. The extensive form is simpler. The terminology is due to Raiffa and Schlaifer (1961), as are most of the ideas which follow in this section. An elementary exposition of some of them is given by Raiffa (1968).

In the extensive form no expectation over $X$ is required and the likelihood principle obtains. In the design of experiments, however, $X$ can be selected and expectations are required. A triplet $e = (X, \Theta, p(x \mid \theta))$ is called an experiment. Consider a collection, $E$, of experiments $e$ having a common $\Theta$, together with a decision space $D$. Prior to having selected $e$ and observed $x$, we ask which is the best $e$ to choose from $E$. The decision is now in two parts, the selection of $e$ and the choice of $d$, and a general utility function will be of the form¹³ $U(d, \theta, e, x)$, allowing for the fact that some experiments will cost more than others. For any $e$ the expected performance of the best decision function is given by one of the equivalent forms in (4.6) and the best $e$ maximizes these. Hence the formal Bayesian solution to the experimental design problem¹⁴ is provided by

$$\max_{e} \int \Big\{ \max_{d} \int U(d, \theta, e, x)\, p(\theta \mid x, e)\, d\theta \Big\}\, p(x \mid e)\, dx. \tag{4.7}$$

This is perhaps most easily appreciated by using a decision tree (see Fig. 3). The sequence of events in time order is that $e$ is selected; on performance it yields data $x$, whereupon $d$ is chosen; and finally $\theta$ yields the utility $U(d, \theta, e, x)$. A decision tree is analyzed in reverse time order. We first average over $\theta$, the appropriate distribution being $p(\theta \mid x, e)$ since, at that time, $e$ and $x$ are available. Then $d$ is selected to maximize the resulting average (or expectation). Next we average over $x$, the relevant density being $p(x \mid e)$, and finally $e$ is selected to maximize the resulting expectation. Notice that the operations of expectation and maximization alternate in the sequence. In the decision tree the points where expectation is relevant

13 There is no difficulty in including $x$ in the utility function. In the extensive form the quantity to be maximized is then $\int U(d, \theta, x)\, p(\theta \mid x)\, d\theta$. In most applications $U$ does not depend on $x$.

14 An alternative approach to experimental design, more in the spirit of inference than decision theory, uses the concept of information (see § 12.6).


have been indicated by circles (and are called random nodes); the others are shown as rectangles, termed decision nodes, and maximization is required. These simple ideas are extremely general and enable the Bayesian ideas to be extended to sequential experimentation, to be described later. The analysis at the last two nodes, $\max_d \int U(d, \theta, e, x)\, p(\theta \mid x, e)\, d\theta$, is called terminal analysis; the rest, $\max_e \int \{\cdot\}\, p(x \mid e)\, dx$, is called preposterior analysis. Preposterior analysis involves the sample space; terminal analysis does not and uses the likelihood principle.

Despite this very general formal solution to the problem of experimental design, few explicit results¹⁵ are available in the field that this title ordinarily covers. However one important consequence is immediately apparent and we pause to discuss this.

Randomization. Let $E_0$, a subset of $E$, be the set of experiments¹⁶ satisfying (4.7). If it contains a single member, then this is the best experiment to perform. If it contains more than one member then all $e \in E_0$ are equivalent from a Bayesian viewpoint and any may be selected. Consequently it is never necessary to randomize in experimental design, though randomization over $E_0$ would not do any harm (nor any good). This goes counter to a popular sampling-theory canon. On reflection the Bayesian conclusion seems correct to me. Certainly I find it hard to see how the fact that a result was obtained by randomization rather than by deliberate choice can have any effect on the subsequent analysis; in particular, the randomization theory of tests seems unconvincing. How can the fact that a different result might have been obtained, but was not, influence you once the data is on view? The point has been well argued by Jeffreys (1967). There might, nevertheless, be some sense in randomizing but then using an orthodox or Bayesian argument. However it is clear that randomization can only be a last resort. If some factor is present which is thought likely to influence the data, then this should be allowed for in the design, for example, by using blocking devices. Randomization, therefore, even to an orthodox statistician, is only used to guard against the unforeseen. The Bayesian could therefore select a haphazard sample: that is, one which, as far as he can see, will provide a good inference and not be disturbed by other effects.
At best randomization can only be a convenient device to simplify the subsequent calculations. Stone (1969b) disagrees. We shall return to this topic when discussing sampling from a finite population in § 6.

Sufficiency. One topic on which all statisticians seem to be in complete agreement is that of sufficiency. The Bayesian definition is that $t(x)$ is sufficient if $p(\theta \mid t) = p(\theta \mid x)$ for every $x$ and every distribution, $p(\theta)$, prior to $x$: that is, if the posterior given $x$ is the same as given the statistic. This is easily seen to be equivalent to the orthodox definition. The extension to minimal sufficient, in terms of sub-$\sigma$-fields over $X$, proceeds exactly as in the sampling theory. Like most writers we shall use sufficiency when, strictly, minimal sufficiency is meant. In the important

15 Draper and Hunter (1966, 1967a, 1967b) have discussed the design problem from a Bayesian viewpoint but not using the formal loss structure here described.

16 We suppose $E_0$ is not empty.


case of random sampling it is necessary to include the sample size as part of the (minimal) sufficient statistic. Notice that if $\theta = (\theta_1, \theta_2)$, marginal sufficiency for $\theta_1$ is, in general, undefined. (For example, is $s^2$ marginally sufficient for $\sigma^2$ in sampling from a normal distribution? The answer would appear to be, "no".) If $p(x \mid \theta) = p(t_1 \mid \theta_1)\, p(t_2 \mid \theta_2)$ and the prior similarly factors, then $t_1(x)$ is marginally sufficient, but this is a very special case. The point arises in discussing robustness (§ 7).

Exponential family. The case where $x = (x_1, x_2, \ldots, x_n)$ and $p(x \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta)$, so that $x$ is a random sample of size $n$ from the distribution of density¹⁷ $p(x_i \mid \theta)$, is of common occurrence. A special case, where the Bayesian (and orthodox) arguments are rather simpler, arises when the distribution is a member of the exponential family, that is,

Here the $\phi_i(\theta)$ are $k$ real functions of the parameter, the $t_i(x_j)$ and $H(x_j)$ are $k + 1$ statistics and $G(\theta)$ is a normalizing factor defined in terms of the $\phi$'s, $t$'s and $H$ to make the density have integral (over $X$) equal to unity. It is immediately apparent that for $x$, $\sum_{j=1}^{n} t_i(x_j)$, $i = 1, 2, \ldots, k$, and $n$, are sufficient for $\theta$. Consequently, whatever be the size of sample the dimensionality of a sufficient statistic is constant, at $k + 1$. The importance of this remark in a Bayesian analysis is that the posterior distribution of $\theta$ given $x$ will, under these circumstances, depend only on $k + 1$ values however large the sample is. In fact, if $p(\theta)$ is the distribution prior to $x$, the posterior will be proportional to

with $\alpha_i = \sum_{j=1}^{n} t_i(x_j)$, $i = 1, 2, \ldots, k$, and $\beta = n$. As $x$ ranges over $X$ this generates a family of densities all of the form (4.9) depending on hyperparameters $\alpha_1, \alpha_2, \ldots, \alpha_k, \beta$. Consequently not only is the density of $x$ finitely parameterized, so is that of $\theta$. This would not be true without the existence of sufficient statistics of fixed dimensionality.

In this connection an important concept is due to Barnard (see Wetherill (1961)). A family $\mathcal{F}$ of distributions over $\Theta$ is closed under sampling from the distribution with density $p(x_i \mid \theta)$ if whenever $p(\theta) \in \mathcal{F}$, $p(\theta \mid x) \in \mathcal{F}$ for every $x$ (and $n$). This means that provided the prior belongs to $\mathcal{F}$ any data will result in a posterior distribution in $\mathcal{F}$. If $p(x_i \mid \theta)$ is a member of the exponential family, then $\mathcal{F}$ will depend on a finite number of hyperparameters. In connection with (4.8) the family with densities proportional to

17 The same symbol $p$ has been used for the density of $x$ and for any component $x_i$.


is called the natural conjugate family (to $p(x_i \mid \theta)$). Here $a_1, a_2, \ldots, a_k$ and $b$ are hyperparameters, with possible restrictions on their values in order that the integral of (4.10) over $\Theta$ converges. If $p(\theta)$ has this form then, by (4.9), the posterior is of the same form with hyperparameters $\alpha_i + a_i$, $i = 1, 2, \ldots, k$, and $\beta + b$ replacing $a_i$, $b$ in (4.10). The natural conjugate family is closed under sampling. It occupies an important role in current Bayesian research for no other reason than mathematical convenience. Two examples follow.

Example 1. If $x_i \sim N(\mu, \sigma^2)$ the likelihood for $x_1, x_2, \ldots, x_n$ is

$$(2\pi\sigma^2)^{-n/2} \exp\big[-\{\nu s^2 + n(\bar{x} - \mu)^2\}/2\sigma^2\big],$$

where, as usual, $\bar{x} = \sum x_i/n$, $\nu s^2 = \sum (x_i - \bar{x})^2$ and $\nu = n - 1$. If the prior is proportional to

the correspondence between the two functions is: $n \to n'$, $\bar{x} \to m$, $s^2 \to t^2$, $\nu \to \nu'$, except in the power of $\sigma$. Clearly as $(n', m, t^2, \nu')$ vary this gives a family closed under sampling. A convenient interpretation is that $\nu' t^2/\tilde{\sigma}^2$ is $\chi^2$ on $\nu'$ degrees of freedom and, conditional on $\sigma$, $\tilde{\mu} \sim N(m, \sigma^2/n')$. Here tildes have been used to indicate the random quantities (and thereby prevent confusion with sampling-theory ideas). The power of $\sigma$ has been arranged to make this interpretation possible. These ideas extend to the multivariate case and a comprehensive account of the distributional theory has been provided by Ando and Kaufmann (1965).

Example 2. If $x_i$ is 0 or 1, with $p(x_i = 1 \mid \theta) = \theta$, the likelihood (see above) is $\theta^{r}(1 - \theta)^{n - r}$, with $r = \sum x_i$, the usual combinatorial being unnecessary. The natural conjugate family is the Beta distribution with density proportional to $\theta^{a-1}(1 - \theta)^{b-1}$, with $a, b > 0$. The extension to the case where $x_i$ takes $k\,(> 2)$ distinct values leads to the Dirichlet family discussed by Dickey (1968b). Hald (1968a) has studied the dichotomy as $n \to \infty$ with $h = r/n$ fixed for a general prior $p(\theta)$. To quote a typical result, he shows that

to order $n^{-1}$.

Noninformative stopping. Continuing with the case of $x$, a random sample from $p(x_i \mid \theta)$, we have seen that Bayesian (terminal) analysis uses only the likelihood function and that the usual orthodox restriction to fixed $n$ (in order to define $X$) is not needed. However, care is needed to ensure that the sampling rule does not itself contain information about $\theta$. The following analysis is due to Raiffa and Schlaifer (1961). Define $q(n \mid x_1, x_2, \ldots, x_{n-1}, \theta, \psi)$ to be the chance, given $x_1, x_2, \ldots, x_{n-1}$, $\theta$ and a nuisance parameter $\psi$, of observing another sample, so that $q$ defines the rule for stopping sampling. If $x = (x_1, x_2, \ldots, x_n)$, then


In an obvious notation this expression may be written $Q\, p(x \mid \theta)$, (4.13), where $Q$ is the product of all the $q$-factors and $p$ is as usual. The sampling rule is said to be noninformative if the $Q$-factor in (4.13) can be ignored: that is, if the posterior for $\theta$ given $x$ is unaffected by its exclusion. Sufficient conditions are that $Q$ does not depend on $\theta$ and that $\theta$ and $\psi$ are independent prior to $x$. Two examples follow.

Example 1. Suppose $x_i \sim N(\theta, 1)$ and the stopping rule is to continue sampling until $|\bar{x}| > 2n^{-1/2}$. (Sample until the null hypothesis that $\theta = 0$ is conventionally rejected at the 5% level.) This has been discussed by Armitage (1963). Here, perhaps surprisingly, the sampling rule is noninformative and the likelihood is as usual, though, at least when $n$ (now $\tilde{n}$, a random quantity) is large, almost all the information is contained in it.

Example 2. The following practical application is due to Roberts (1967). The situation is the capture-recapture analysis that is presumably familiar enough to omit a detailed description. The marriage between the natural notation in this context and that of this review is as follows: $\theta \to N$, the size of the population, of which $R$ are tagged; $x \to r$, the number found to be tagged in a second sample of $n$; $\psi \to p$, the chance of catching a fish (say) in that sample. We make all the usual assumptions; for example, that all fish have the same chance of capture irrespective of whether or not they have been tagged in the first sample. Roberts points out that the sampling rule may reasonably be informative. As usual, we have the likelihood

where $s = n - r$, $S = N - R$. But reasonably it might also be true that

corresponding to the Q-factor in (4.13). If so the full likelihood is proportional to

18 Notice that in writing down this formula it has been assumed that $p(x_i \mid \theta) = p(x_i \mid \theta, \psi)$; that is, given $\theta$, $x_i$ is independent of $\psi$. In Bayesian statistics all quantities are random variables and care is needed in making the probability specification. Usually the most convenient method is through a sequence of conditional probability statements: here $p(\theta, \psi)$ and so on in the natural order.


with $S$ and $p$ as the two parameters. Roberts supposes $S$ to be uniform over the nonnegative integers and $p$ to have the conjugate Beta density $p^{r'-1}(1 - p)^{R'-r'-1}$, the distributions being independent and prior to the data. Integration with respect to $p$ gives

with mean $R + (R + R' - 2)(s + 1)/(r + r' - 2) - 1$, compared with the m.l. estimate $R + Rs/r$. Notice that $r'$ and $R$ may be related to the experience gained in capturing the first sample of $R$ for tagging.

The value of experiments. In expression (4.7) we saw how to solve the experimental design problem within the Bayesian framework. This expression is now studied further in order to assess the value of an experiment $e$. We suppose $U(d, \theta, e, x) = U(d, \theta) + U(x, e)$ so that the terminal utility and experimental costs are additive. The expected utility of $e$ before it is performed is

Consider the second of the two terms in the braces. It equals the expected utility of the best decision from $e$, given that $x$ is observed. Hence the expectation of the utility from $e$ will be the average of this over $X$. Whereas if $e$ is not performed the best that can be obtained is $\max_d \int U(d, \theta)\, p(\theta)\, d\theta$. The difference of these two expressions, namely,

is called the expected value of $e$, denoted $v(e)$. (Raiffa and Schlaifer call it the expected value of sample information, EVSI.) The expression is clearly nonnegative, since on reversing the orders of integration over $X$ and maximization over $d$ in the first term, an operation which can only decrease the value, the first and second terms become equal, by Bayes' theorem, and the difference is zero; hence $v(e) \ge 0$. Hence any experiment is expected to be of value. Of course, when realized the value of $x$ may result in a loss of utility. Writing $U(x, e) = -c(x, e)$, the cost of $e$ and $x$ (in units of utility), the experiment is only worth performing if

on comparing with the first term in (4.16). A special case is where $e$ is a perfect experiment; that is, an experiment which is certain to inform you of the correct value of $\theta$. Here $p(\theta \mid x, e)$ becomes a Dirac $\delta$-function and the integration over $\Theta$ in (4.16) gives just $U(d, \theta')$, where $\theta'$ is the "revealed" value of $\theta$, so that one obtains $\max_d U(d, \theta')$. But $\theta'$ has density $p(\theta')$


prior to the perfect experiment e*. Hence,

in terms of the loss function (2.2). This expression, $v(e^*)$, is called the expected value of perfect information, EVPI. Reversal of the orders of integration over $\Theta$ and maximization over $d$ in the first term of (4.17) clearly shows that $v(e^*) \ge v(e)$ (which is intuitively obvious). Hence the EVPI is a (useful) upper bound to the value of any experiment. It should be remembered that the exact connection between utility and experimental cost has to be considered carefully and involves considerations of the utility of money (see end of § 7). A detailed discussion has been given by LaValle (1968a, b, c) who discusses, inter alia, the buying and selling prices of a lottery.

We next provide some examples designed to illustrate the above ideas.

Example 1. This is a no-data decision problem with $\Theta$ the real line and $D = (d_1, d_2)$, the loss functions being linear in $\theta$. Specifically, we suppose that

and otherwise zero, $b_1$ and $b_2$ being nonnegative. The value $\theta_0$ is therefore the "break-even" value; for $\theta > \theta_0$, $d_1$ is optimum, for $\theta < \theta_0$, $d_2$ is the better. This is the most general linear-loss form, though without loss of generality we could put $a_1 = 0$, $b_1 = 1$. The optimum decision is to select $d_1$ if it has smaller expected loss, that is, if

(If data were available, $p(\theta)$ would be replaced here by $p(\theta \mid x)$.) Write $p(\theta) = f_0(\theta)$ and define

Integration by parts enables (4.20) to be written where $E(\theta)$ is the expected value of $\theta$. Evaluation of $f_1(\theta)$ (recognizable as the distribution function) and $f_2(\theta)$ are necessary for solution of the decision problem. Had the loss functions been polynomials of degree $m$, then the $f_i(\theta)$ would be required up to degree $m + 1$, in general. (If $b_1 = b_2$, $f_2(\theta)$ is not needed.) Notice that the normal distribution is particularly simple since $f_0(\theta)$ and $f_1(\theta)$ can be expressed in terms of $\phi(t)$ and $\Phi(t)$, the density and distribution functions of the


standardized normal curve, and the integral of

$\infty$; only the ratio $b/c$ is relevant, so this is equivalent to $c \to 0$.

20 It is disappointing that Bayesian decision theory has had so little impact on the whole field of quality control, which is still dependent upon sampling-theory ideas, though there are exceptions; for example, the comprehensive paper by Wetherill and Campling (1966) and Campling (1968).


An interesting problem that arises in medical statistics has been discussed by Anscombe (1963), Colton (1963) and Canner (1970). Here $N$ patients have a disease and two treatments $T_1$ and $T_2$ are available. A clinical trial is performed in which $n$ patients are given $T_1$ and $n$, $T_2$. On the basis of the results of the trial the remaining $N - 2n$ patients are treated with what appears to be the better treatment. The result of a trial is either success or failure and beta-priors are appropriate. The problems are how to select $n$ and then $T_1$ or $T_2$ for the remaining patients. The loss (or utility) function naturally needs careful consideration. Canner solves the problem by the usual inverse method corresponding to (4.7). He shows, for example, that the optimum value of $n$ is about $\{(N + 2)/(12c + 2)\}^{1/2}$, where $c$ is the cost of each patient in the trial. Guthrie and Johns (1959) made an early Bayesian study of sampling from a batch of size $N$ with a single sample of $n$ and discuss the optimum sample size and decision procedure for large $N$.

We conclude this material on basic ideas by discussing a Bayesian method of hypothesis testing different from those indicated at the beginning of the section and in Example 1. Let $H$ be a subset of $\Theta$ and suppose that we wish to see, in the light of the data, whether it is reasonable to suppose $\theta \in H$. It is customary to speak of this as testing the null hypothesis, $H$, that $\theta \in H$, where $H$ has been used to denote both the hypothesis and the subset of $\Theta$. The alternative hypothesis, $\bar{H}$, is that $\theta \notin H$, that is, $\theta \in \bar{H}$. One way of testing is to calculate $P(H \mid x)$, the probability that $\theta \in H$, given the data; or, more conveniently, $P(H \mid x)/P(\bar{H} \mid x)$, the posterior odds in favor of $H$. Now

$$P(H \mid x) = P(H) \int_{H} p(x \mid \theta)\, p(\theta \mid H)\, d\theta \Big/ p(x),$$

with a similar expression for $\bar{H}$. The posterior odds are therefore given by

$$\frac{P(H \mid x)}{P(\bar{H} \mid x)} = \frac{P(H)}{P(\bar{H})} \cdot \frac{\int_{H} p(x \mid \theta)\, p(\theta \mid H)\, d\theta}{\int_{\bar{H}} p(x \mid \theta)\, p(\theta \mid \bar{H})\, d\theta}$$

and do not involve $p(x)$. A still more convenient expression is the ratio of posterior to prior odds which is easily seen to be given by

$$\frac{\int_{H} p(x \mid \theta)\, p(\theta \mid H)\, d\theta}{\int_{\bar{H}} p(x \mid \theta)\, p(\theta \mid \bar{H})\, d\theta}. \tag{4.27}$$

This expression has the advantage that it does not depend on $P(H)$. However, it does involve the distributions of $\theta$ conditional on $H$ and on $\bar{H}$ prior to $x$. Its use first seems to have been suggested by Jeffreys (1967).

A common special case is that of a sharp hypothesis. This arises when $\theta = (\xi, \eta)$, say, and $H$ specifies the value of $\xi = \xi_0$, say, without specifying $\eta$. $\bar{H}$ is simply $\xi \neq \xi_0$. Then $\eta$ is a nuisance parameter. An obvious example is where we wish to test whether the mean $\xi$ of a normal distribution is $\xi_0$ without specifying the variance $\eta$. It has been shown by Dickey and Lientz (1970), in an elegant paper that develops a general treatment, that in this case, under a reasonable additional assumption, (4.27) takes on a simple form.


Let us write $p(\theta \mid \bar{H}) = p(\xi, \eta \mid \bar{H}) = f(\xi, \eta)$, say, where $f(\xi, \eta)$ is defined²¹ as the elementary derivative of the distribution of $\theta$ obtained by taking a sphere of radius $\rho$ about (

$\int f(\xi_0, \eta)\, d\eta$, the usual conditional form. In words, the

conditional distribution of $\eta$, given $\xi$, considered as a function of $\xi$, is smooth around $\xi = \xi_0$, so that the only discontinuity in the joint distribution occurs in $\xi$, having a "concentration" of (prior) probability at the null value. The additional assumption is that

Returning then to (4.27), the denominator is simply p(x\H) and the numerator may be rewritten, using (4.28), giving the ratio to be

But

by Bayes' theorem

and

Consequently the ratio of posterior to prior odds is simply According to Dickey and Lientz this result is due to L. J. Savage. The simplicity of (4.29) is due to its containing only the marginal densities of £ (at £0) before and after the data. If conjugate densities can be used, then these are very simple to calculate, far simpler than the form (4.27). A simple example is where x — (rl,r2), 0 = (di,Q2) an^ >",• is the number of successes in nt binomial trials with probability of success 0;, i = 1, 2, the two sets of trials being independent, and we wish to test d1 — 82- If ^i ar|d #2 have, 2 ' A density is not unique since it may be changed on any set of dominating measure zero. Its definition here becomes critical since H is such a set.


D. V. LINDLEY

under $\bar{H}$, independent prior beta distributions with parameters $a_i$, $b_i$, the posterior distributions will also be beta. The above result can be applied with$^{22}$ $\xi = \theta_1 - \theta_2$, $\eta = \tfrac{1}{2}(\theta_1 + \theta_2)$. The calculations of $p(\xi_0|\bar{H})$ and $p(\xi_0|x, \bar{H})$ with $\xi_0 = 0$ follow easily from properties of the beta distributions. If the problem of testing $H$ against $\bar{H}$ is regarded as a decision one with two decisions $d_H$ and $d_{\bar{H}}$ we may, without loss of generality (see the remark after (2.2)) suppose

The expected utilities are then

and similarly,

and the posterior odds are directly relevant to the solution of the decision problem. If $R$ is the ratio of the two integrals, then $H$ is accepted if $RLO$ exceeds unity, where $O$ is the prior odds. The asymptotic theory will be discussed in § 11 below.

5. Sequential experimentation. In the last section the choice of a single experiment was discussed; we now consider the selection of a sequence of experiments. This is a field in which the Bayesian approach promises to be more successful than standard theory, partly because it does not involve the complicated sample space in the same way, and partly because probabilities prior to $x_n$ seem more acceptable when data $x_1, x_2, \cdots, x_{n-1}$ are already available. Consider a finite sequence of possible experimental choices; let $E = E_1 \times E_2 \times \cdots \times E_n$ with $E_i = (X_i, \Theta, p(x_i|\theta))$ so that $\Theta$ is fixed throughout. Let the cost function be additive: that is, $c(\mathbf{x}, \mathbf{e}) = \sum_{i=1}^{n} c_i(x_i, e_i)$. The (terminal) decision space is $D$ and the loss function is $L(d, \theta)$, supposed added to the experimental cost. We shall write $\mathbf{x}_i = (x_1, x_2, \cdots, x_i)$. The idea is that $e_1$ is selected from $E_1$, $x_1$ observed, then $e_2$ chosen from $E_2$, and so on, up to $x_n$, when finally $d$ is chosen from $D$. Typically each $E_i$ will include a null experiment, that is, one in which no further data is collected, so that $d$ is immediately taken. We saw that, even in the case of a single experimental choice, the analysis proceeds in a reverse time order (see (4.7) and the related decision tree). Consequently suppose that $\mathbf{e}_{n-1} = (e_1, e_2, \cdots, e_{n-1})$ has been performed, with result $\mathbf{x}_{n-1}$, so that it is only necessary to consider the choice of $e_n$, the value of $x_n$ and the terminal decision. Then (4.7)

22 There are other possibilities: for example, $\xi = \log(\theta_1/\theta_2)$, $\eta = \log \theta_1\theta_2$, but the results are invariant.


may be applied with the result

Write this $L_{n-1}(\mathbf{x}_{n-1}, \mathbf{e}_{n-1})$; it is the expected loss of the best choice of $e_n$ and $d$, given results $\mathbf{x}_{n-1}$ from $\mathbf{e}_{n-1}$. The same principle can now be applied to the choice of

BAYESIAN STATISTICS, A REVIEW D. V. LINDLEY University College London

SOCIETY for INDUSTRIAL and APPLIED MATHEMATICS PHILADELPHIA, PENNSYLVANIA 19103

Copyright 1972 by the Society for Industrial and Applied Mathematics. All rights reserved. Second Printing 1978 Third Printing 1980 Fourth Printing 1984 Fifth Printing 1989 Sixth Printing 1995

Printed for the Society for Industrial and Applied Mathematics by Capital City Press, Montpelier, Vermont

SIAM is a registered trademark.

Contents

1. Introduction
2. Coherence
3. Sampling-theory statistics
4. Basic ideas in Bayesian statistics
5. Sequential experimentation
6. Finite population, sampling theory
7. Robustness
8. Multiparameter problems
9. Tolerance regions and predictive distributions
10. Multinomial data
11. Asymptotic results
12.
12.1. Empirical Bayes and multiple decision problems
12.2. Nonparametric statistics
12.3. Multivariate statistics
12.4. Invariance theories
12.5. Comparison of Bayesian and orthodox procedures
12.6. Information
12.7. Probability assessments
Bibliography


Preface I was invited by the Statistics Department at the Oregon State University to give ten lectures on Bayesian Statistics in July 1970. This monograph is a slightly expanded version of the content of those lectures. An adherent of the school of subjective probability might be forgiven for presenting a subjective view of the subject. Although I have tried to give a reasonably complete account of the present position in the study of statistical procedures that use the notion of a probability distribution over parameter space, both the emphasis and choice of topics reflect my own interests. I am most grateful to H. D. Brunk, Lyle D. Calvin and Don A. Pierce who suggested the idea and to the National Science Foundation for finance. The encouragement to put my knowledge into some reasonably tidy shape was most welcome. London October 1970

D. V. LINDLEY


Bayesian Statistics A Review D. V. Lindley

1. Introduction. The mathematical model that has been found convenient for most statistical problems contains a sample space $X$ of elements $x$ endowed with an appropriate $\sigma$-field of sets over which is given a family of probability measures. These measures are indexed by a quantity, $\theta$, called a parameter, belonging to the parameter space $\Theta$. The values $x$ are referred to variously as the sample, observations or data. Notice that $\theta$ is merely an index for the various probabilities and that as a result this model includes nonparametric statistics, the special techniques in that field being necessitated by the complexity of the space $\Theta$ which, for example, may include all distributions on the real line (see § 12.2, below). For almost all problems it is sufficient to suppose that these probability measures are dominated by a $\sigma$-finite measure, so that they may be described through their density functions, $p(x|\theta)$, with respect to this measure, $\mu(x)$, in the sense that

$$P(A|\theta) = \int_A p(x|\theta)\,d\mu(x), \tag{1.1}$$

where $A$ is any member of the $\sigma$-field and $P(A|\theta)$ is the probability of $A$ according to the measure indexed by $\theta$. In this review we shall always suppose this to be so and in (1.1) shall write simply $dx$ for $d\mu(x)$. In practice the dominating measure will usually be either Lebesgue or counting measure; an exception arises towards the end of § 4. Furthermore, we shall not usually distinguish between a random variable and the values it takes. When it is necessary to do so we shall use the tilde notation and write $\tilde{x}$ for the variable. Distinction can then be made between $P(\tilde{x} < \theta)$ and $P(x < \tilde{\theta})$. This, admittedly rather sloppy, notation avoids expressions like $P_{\tilde{x}}(x|\theta)$ which are complicated for both statistician and printer alike and, in my experience, enables the meanings of statements to be more easily appreciated.

It is worth remarking that this model is not used in a few branches of statistics. For example, some aspects of significance testing need only a single parameter value (or, more generally, a subset of $\Theta$) called the null value, $\theta_0$, and the densities $p(x|\theta_0)$, no reference being made to alternative values of $\theta$. The early historical examples of significance tests based on tail-area considerations require only this, but, so far as I am aware, no attempt to formalize the intuitively sensible procedure has been successful. Stone's (1969a) comments are intriguing. Again the likelihood


principle, that will be discussed below, uses only the observed data $x$ and makes no reference to the other elements of the sample space. Despite these qualifications the model described in the first two paragraphs is used in all formal analyses of statistical problems. Most descriptions go a stage further and introduce a decision space $D$ of elements $d$ and a nonnegative loss function $L(d, \theta)$ on $D \times \Theta$. For example, in estimation problems squared error loss is often assumed, whilst in hypothesis testing a zero-one loss function is used. The Bayesian argument extends the basic model in a different direction and supposes that $\Theta$ supports a $\sigma$-field and a probability measure over it. This supposition I shall take to be the defining property of what constitutes a Bayesian argument, and a Bayesian solution is one that uses such a distribution. Again it will be convenient to describe the measure through its density function $p(\theta)$ with respect to some dominating

... $l_2$, given $A$, if $l_1 > l_2$, provided both lotteries are to be called off if $A$ occurs. In other words you can effect the comparison of $l_1$ and $l_2$ either before or after $A$ provided that, in the former case, no change takes place if $A$ does not occur. This leads immediately to Bayes' result. A recent exposition at text-book level has been provided by De Groot (1970) based on the work of Villegas (1964). He develops an axiom system first for probabilities and then for utilities. He points out that a set of axioms which might appear to be enough is not in fact adequate to derive probabilities and he adds the assumption of the existence of a random variable uniformly distributed in $[0, 1]$ to complete the argument. The point is discussed in detail by Kraft et al. (1959) who demonstrated that a conjecture of de Finetti was wrong.
Fishburn (1969b) argues that it is sometimes difficult to defend the transitivity assumption and explores the possibility of proceeding without it; the result is that only qualitative "probabilities" are obtained. De Finetti (1964), unaware of Ramsey's work, produced in the mid-1930's an argument which is different in spirit from those so far discussed. Let $E_1, E_2, \cdots, E_n$ be $n$ exclusive and exhaustive events held with beliefs $p_1, p_2, \cdots, p_n$. These are not yet probabilities but merely numerical measures of belief derived from the consideration that the gambles $[(x_i/p_i - x_i)E_i, -x_i\bar{E}_i]$ are all equivalent, at least for reasonable values of $x_i$. Here $x_i$ is a stake which returns a prize, $x_i/p_i$, if $E_i$ obtains but is otherwise lost, the expectation $(x_i/p_i - x_i)p_i - x_i(1 - p_i)$ being zero. Suppose now that a gambler puts stakes $x_i$ on $E_i$, $i = 1, 2, \cdots, n$. If $E_i$ occurs he will win the amount $x_i/p_i - \sum_{h=1}^{n} x_h = g_i$, say. Considered as linear equations in the stakes they will have a solution unless the determinant is zero. Consequently unless this happens we could choose $g_i > 0$ for all $i$ and determine stakes that would be certain to win (not just expected to win) whatever event occurred. Hence the determinant must be zero, and this easily gives $\sum_{h=1}^{n} p_h = 1$. This justifies the addition rule for beliefs. The notion of called-off bets enables the multiplication rule to be derived. We shall see later (§ 8) that de Finetti's notion of a successful gambling system can be used with advantage to criticize some orthodox statistical procedures. His arguments do, however, suffer from the disadvantage that by introducing stakes there is some confusion with utility ideas. A nice treatment has recently been provided by Freedman and Purves (1969) who establish that bookies must be Bayesians. De Finetti's other important argument concerned with exchangeability will be discussed later (§ 6).
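De Finetti's sure-win argument can be made concrete with a small sketch. It assumes the gain equations quoted above, $g_i = x_i/p_i - \sum_h x_h$, and asks for the same gain on every event; solving by hand gives $x_i = p_i g/(1 - \sum_h p_h)$ whenever the beliefs do not sum to one. The function names are hypothetical, and a negative stake is read here as taking the opposite side of that bet.

```python
def sure_win_stakes(beliefs, gain=1.0):
    """Stakes x_i that yield the same gain whichever event occurs.

    Solves g_i = x_i / p_i - sum_h x_h = gain for every i, which gives
    x_i = p_i * gain / (1 - sum_h p_h).  No solution exists when the
    beliefs sum to one, which is the coherent case.
    """
    total = sum(beliefs)
    if abs(total - 1.0) < 1e-12:
        raise ValueError("beliefs sum to one: no sure-win stakes exist")
    return [p * gain / (1.0 - total) for p in beliefs]


def gains(beliefs, stakes):
    """Net gain g_i if event i occurs, for each i."""
    outlay = sum(stakes)
    return [x / p - outlay for p, x in zip(beliefs, stakes)]


incoherent = [0.5, 0.3, 0.1]            # beliefs summing to 0.9
stakes = sure_win_stakes(incoherent)
print(stakes)                           # stakes placed on each event
print(gains(incoherent, stakes))        # the same positive gain in every case
```

When the beliefs sum to more than one the computed stakes are negative, so the sure gain is made by acting as the bookmaker rather than the bettor.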
A decision-maker whose actions agree with these axioms has been variously described as rational, consistent or coherent. We shall use the last term because it effectively captures the idea that the basic principle behind the axioms is that our judgements should fit together, or cohere. The axioms do not refer to single decisions or inference but to the way in which separate ones cohere, for example, in the transitivity requirement. The concept of coherence has been discussed


recently within the framework of modern statistics in a particularly illuminating article by Cornfield (1969). A justification for the Bayesian approach of an unusual type that might appeal to an orthodox statistician has been provided by Shubert (1969).

A statistical tradition closely related to, but much weaker than, coherence derives from the work of Wald (1950). Unlike the expositions of Ramsey, de Finetti and others this is expressed in terms of the $(X, D, \Theta)$ model discussed in the introduction. A loss function $L(d, \theta)$ is used, together with the notion of a decision function $\delta$ which maps $X$ onto $D$, $\delta(x)$ being the decision taken if $x$ is observed. The risk function for $\delta$ is defined as

$$R(\delta, \theta) = \int L(\delta(x), \theta)\,p(x|\theta)\,dx.$$

$\delta$ is said to be inadmissible if there exists another decision function $\delta^*$ with $R(\delta^*, \theta) \le R(\delta, \theta)$ for all $\theta$, with strict inequality for some $\theta$; otherwise it is admissible. Wald's major result can be summarized by saying that he proved that $\delta$ is only admissible if it is a Bayes solution for some prior density $p(\theta)$, though the notion of a probability measure has to be extended somewhat to make this statement rigorous: specifically, improper priors, that is those for which $\int p(\theta)\,d\theta$ diverges, have to be included. Improper priors will be discussed below (§ 8). Wald's argument is considerably weaker than the others that have been discussed mainly because it assumes the existence of a loss function. (On the other hand the mathematical treatment is commendably complete, though it has been criticized by Stein.) A utility function is derived in such a way that its expectation is the inevitable and sole criterion by which a decision should be judged. It is by no means obvious that such a function should exist, and the precise meaning of a loss function is obscure. In applications it typically seems true that

and we shall regard it as such. It is often useful to note that in the Bayesian solution of a decision problem it is always permissible to subtract from $U(d, \theta)$ any convenient function of $\theta$ without affecting the result. This is clear from considering $\int U(d, \theta)\,p(\theta|x)\,d\theta$, which will only have a quantity not involving $d$ subtracted from it. The principle applies to (2.2).

The results described in this section can be summarized by saying that any reasonable consideration of the way in which decisions or inferences cohere leads to the existence of $p(\theta)$, $U(d, \theta)$ and the principle of maximization of expected utility. This has been rigorously demonstrated when $D$ and $\Theta$ are finite. Savage's work deals with more complicated spaces but there still remain some points, apparently of detail, but possibly of practical importance, that remain unclear. Presumably the utility function is bounded since otherwise paradoxes of the St. Petersburg type arise. It is not quite so clear whether the probability measure


should be $\sigma$-additive, as we have required, or whether it is enough to be finitely additive. A recent general approach is that of Fishburn (1969a). He provides a set of axioms, which includes the idea of a canonical experiment (under the name of extraneous measurement probabilities), that place no real restrictions on $C$ and $\Theta$ and establish the existence of a utility function, a finitely-additive probability measure and the principle of maximization of expected utility. A mild restriction on the probability suffices to make the utility bounded. This conclusion of finite-additivity agrees with de Finetti but the situation is unclear to me. We shall see below (§ 12.4) that requirements of invariance, that it seems sensible to impose on some statistical problems, would imply the use of improper probability distributions, but that these can cause difficulties. What does seem clear is that the use of a bounded utility function and a proper $\sigma$-additive density cannot lead to difficulties. Some coherent decisions and inferences may be possible outside these limits.

It should be particularly noticed, since this affects the use of the ideas, that the arguments establish the existence of a distribution over $\Theta$. One often reads statements in the literature to the effect that "a prior distribution does not exist in this problem." Within the framework of coherence this is demonstrably not true. However much a person may rebel at the thought of it, the fact remains that if his statements are not to be found incoherent, then they will be interpretable in terms of such a prior (to misuse the adjective). The probability that the axiom system imposes is to be interpreted as a subjective probability possessed by the decision-maker, "you," whose judgements cohere. $P(A|B)$ is the degree of belief you have in $A$, given $B$. It should not be confused with the so-called objective probability derived from long-term frequency considerations.
For example, suppose we have a coin that is judged to be fair, then the subjective probability for heads will be $\tfrac{1}{2}$; but on repeated tosses the frequency may be demonstrably not near 0.5. The relationship between the two ideas is explained by de Finetti's notion of exchangeability to be discussed later (§ 6). The view of probability that emerges from these axiomatic considerations is entirely subjective and the attitude will be adopted in this review that all probabilities are to be so interpreted.

Objections to this attitude are numerous but none that I am aware of have gone to the axioms and criticized those. Indeed, it is hard to see how such criticism could be sustained since the requirements imposed by coherence are so modest. An excellent discussion on Bayesian ideas, Savage et al. (1962), includes contributions from speakers with widely differing viewpoints, though, to me, the eight years that have elapsed since then make much of it seem dated. An excellent, up-to-date critique by one of the contributors is Bartlett (1967). The objections are usually at a nonmathematical level. A common one is that expressed by Le Cam in Barnard et al. (1968) who argues that the results are personalistic and therefore unsuitable for science which is objective. To reply to this, notice that the theory deals with a single decision-maker whom we have called "you" but equally it could be a firm or even a government. If science were really objective, then presumably the results could be described as those held by the scientific community, but surely the scientific community should be just as coherent as a single


individual scientist. If so, the scientific community would act as if it had a prior and a utility. In fact science is not objective as any practising scientist must realize, simply because scientists do not and could not perform as a single decision-maker. The theory does not deal with two or more decision-makers, and does not say how people's ideas should be handled when disagreement exists. It is unreasonable to criticize a theory for not doing what it did not set out to do. My view is that a major gap in our knowledge is the lack of an adequate theory of conflict. Game theory, which only applies to the two-person zero-sum game, and then only to the equilibrium strategy, is not enough. A game should be played to maximize one's expected utility and the expectation based on one's assessment of the opponent's strategy: thus one should not minimax against an inexperienced player. Another, though weaker, reply to Le Cam's criticism is that the orthodox methods are also personalistic. Thus in Lehmann's (1959) book there is a discussion of the choice of a risk function on intuitive grounds. This will be considered in § 7.

Dempster in a series of papers, a convenient reference is (1968), and Smith (1961, 1965) have made constructive criticisms, concerned particularly with the "firmness" with which a probability statement may be held, and have suggested that a single probability statement over $\Theta$ be replaced by upper and lower probabilities; only if these were equal would an ordinary probability obtain. Smith's theory is not developed at a formal level but Dempster's is and Aitchison, in the discussion to the paper just referred to, presented the following criticism. Let $X = \{x_1, x_2, x_3\}$, $\Theta = \{\theta_1, \theta_2\}$, let the probabilities $p(x_i|\theta_j)$ be as in the table

and suppose $p_*(\theta_1) = 0$, $p^*(\theta_1) = 1$, where the asterisks denote Dempster's lower and upper probabilities by their positions. Then calculations show that p^O^x^ — 15 P*($il x i) = ?> Yet intuition suggests that $x_1$ gives no information about whether $\theta_1$ or $\theta_2$ is true.

The coherence arguments provide a complete description of the decision problem in terms of $(X, D, \Theta, p(x|\theta), p(\theta), U(d, \theta))$, the laws of probability and the principle of maximization of expected utility, and the formal framework is there for the resolution of any decision situation. The view will be taken in this review that the inference problem is similarly described in terms of $(X, \Theta, p(\theta), p(x|\theta))$ and solved by calculating $p(\theta|x)$ or some margin thereof. Objections to this last resolution have been made on the grounds that inference is not to be confused with decision-making and that our coherence ideas deal with this latter problem. This is not strictly true since the coherence argument can be applied directly to the events of $\Theta$ (see De Groot or de Finetti). My view is that the purpose of an inference is to enable decision problems to be solved using the data upon which the inference is based, though at the time at which the inference is made no decisions may be envisaged. If this is correct, then the posterior must be quoted since it alone is needed for any decision situation. To quote Ramsey (1964): "A lump


of arsenic is called poisonous not because it actually has killed or will kill anyone, but because it would kill anyone if he ate it." A different distinction between inference and decision-making has been presented by Blyth (1970) without reference to the ideas described in this section. Takeuchi (1970) gives a Bayesian reply.

3. Sampling-theory statistics. In this section the implications of the coherence argument for present day (orthodox) statistics are discussed. The bulk of the material consists of a series of counterexamples designed to demonstrate the incoherence of most statistical procedures.

One immediate deduction from the coherence ideas is the likelihood principle. This says that if $x_1$, $x_2$ are two data sets with the same likelihood function apart from a multiplicative constant (that is, $p(x_1|\theta) = k\,p(x_2|\theta)$ for all $\theta \in \Theta$, where $k$ does not depend on $\theta$), then inferences and decisions should be identical for $x_1$ and $x_2$. This principle can be defended with its own axiom system: see, for example, Birnbaum (1962) and Barnard et al. (1962). Further discussion of the principle of conditionality used by Birnbaum in his derivation has been given by Durbin (1970), and replied to by Savage (1970) and Birnbaum (1970) (see also Hartigan (1967) and § 12.4 below). The principle follows from the Bayesian argument since equality of the likelihoods implies $p(\theta|x_1) = p(\theta|x_2)$ for all $\theta$. It is surprising that many statistical methods violate the principle; indeed, all methods that necessitate reference to some property of $X$ other than the observed $x$ do so. For example, the requirement that an estimate $t(x)$ be unbiased, that is,

$$\int t(x)\,p(x|\theta)\,dx = \theta \tag{3.1}$$

for all $\theta$, violates the principle, since $t(x)$ will typically depend on $X$ through the integration involved in (3.1). A simple, oft-quoted example is interesting. Consider a sequence of binomial trials, that is,$^{3}$ $\mathbf{x} = (x_1, x_2, \cdots, x_n)$, where, given $\theta$, a real number, the $x_i$, all zero or one, are independent with $p(x_i = 1|\theta) = \theta$. Then if $X$ consists of all such sequences of length $n$ the only reasonable unbiased estimate of $\theta$ is $r/n$, where $r = \sum_{i=1}^{n} x_i$. On the other hand, if $X$ consists of all such sequences with fixed $r$ (inverse binomial sampling), the equivalent estimate of $\theta$ is $(r - 1)/(n - 1)$. Yet, in both cases, the likelihood function is $\theta^r(1 - \theta)^{n-r}$.

Since many statistical procedures utilize the structure of $X$ the specification of $X$ constitutes a problem for the orthodox statistician. (It is because of this reference to the sample space that Box has introduced the adjective "sampling-theory.") Consider the following practical example due to Edwards (1970). In a mathematical model of the mutations that have produced the present distribution of blood groups in the human population of the world at the present time it is required to estimate, inter alia, $\theta$, the mutation rate. The data $x$ are the numbers with blood of each group. Analysis seems possible at an intuitive level, but what

3 In describing $X$ or $x$ it will often be convenient to use bold fount, $\mathbf{X}$ or $\mathbf{x}$, and reserve italic face for elements of the description.


is $X$? Realization of other possible worlds seems rather strained. My own view is that the orthodox statistician's choice of $X$ has an arbitrariness about it comparable with the arbitrariness in $p(\theta)$ of which the Bayesian is often accused. Our first set of examples will therefore deal with the choice of $X$.

A statistic $t(x)$ (that is, some function of $x$) is called ancillary if its probability distribution, derived from that of $x$, does not depend on $\theta$. Sometimes the additional requirement is added that $t$, when combined with the maximum likelihood estimate, should be sufficient$^{4}$ (see below). The suggestion is often made to make inferences conditional on the observed value of an ancillary statistic. That is, if $x_0$ is the observed data, restrict $X$ to all $x$ such that $t(x) = t(x_0)$. A standard example is bivariate regression where $\mathbf{x} = (x, y)$, $\theta$ being the set of regression parameters; then $t(\mathbf{x}) = x$ is ancillary and it is common practice to regard the independent (or regressor) variable $x$ as fixed. The general practice is obvious from the Bayesian viewpoint since $p(x, y|\theta) = p(y|x, \theta)\,p(x)$ and the two likelihoods are proportional.

Our first example concerns a case where it seems natural to condition on an ancillary and yet the resulting procedures do not have the usual optimum sampling-theory properties. Here (Cox (1958), Basu (1964)) $\mathbf{x} = (x_1, x_2)$; $x_1 = 0$ or 1 with equal probabilities; if $x_1 = 0$, then$^{5}$ $x_2 \sim N(\theta, \sigma_0^2)$; if $x_1 = 1$, then $x_2 \sim N(\theta, \sigma_1^2)$, with $\sigma_1 \gg \sigma_0$. ($\theta$ is measured either by a precise apparatus ($\sigma_0$) or an imprecise one ($\sigma_1$), the choice of apparatus being decided by the flip of a coin.) Clearly $x_1$ is ancillary and yet it can be shown that tests based on restricting $X$ to the observed value of $x_1$ are not the best possible in the Neyman-Pearson sense (Cornfield (1969)). Even the standard error of $x_2$, the natural estimate of $\theta$, is unclear since the computation of a standard error involves $X$ (Buehler (1959)).
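The ambiguity in the standard error of $x_2$ in the two-apparatus example can be seen with a little arithmetic. The sketch below is illustrative only: the function name and the values $\sigma_0 = 1$, $\sigma_1 = 7$ are assumptions, not figures from the text. Conditioning on the coin gives $\sigma_0$ or $\sigma_1$, whereas averaging over the whole sample space gives $\{(\sigma_0^2 + \sigma_1^2)/2\}^{1/2}$, a value that describes neither apparatus.

```python
from math import sqrt


def standard_errors(sigma0, sigma1):
    """Standard errors of x2 in the two-apparatus example.

    Returns (conditional, unconditional), where `conditional` maps the
    observed coin x1 to the standard error given that apparatus, and
    `unconditional` averages the variance over the fair coin.
    """
    conditional = {0: sigma0, 1: sigma1}
    # x2 has mean theta under either apparatus, so the unconditional
    # variance is just the average of the two conditional variances.
    unconditional = sqrt(0.5 * sigma0**2 + 0.5 * sigma1**2)
    return conditional, unconditional


conditional, unconditional = standard_errors(sigma0=1.0, sigma1=7.0)
print(conditional)     # standard error once the apparatus is known
print(unconditional)   # standard error referred to the full sample space
```

With these assumed values the unconditional figure is 5, far too pessimistic when the precise apparatus was used and too optimistic when it was not.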
A variant of this example is to let $x_1$ be an integer and to consider $x_2 \sim N(\theta, \sigma^2/n)$ with $n = x_1$. Generalizations to include various distributions for $n$ have been discussed by Cohen (1958). Durbin (1969) shows that either the tests with $n$ held fixed, or the unrestricted tests, can be uniformly most powerful depending on the situation, at least asymptotically.

The most complete study of ancillarity has been made by Basu (1964), and his beautiful counterexamples are worth repeating. Let $x$ be uniformly distributed in the (real) interval $[\theta, 1 + \theta)$. Then it is easy to see that the fractional part of $x$ is ancillary; in fact, it is uniformly distributed in $[0, 1)$. If one was to condition on it, then $x$, given the fractional part, has a one-point distribution with rather

4 Durbin (1969) has given an example where a natural ancillary is not part of the sufficient statistic. Here $\mathbf{x} = (x_1, x_2)$, and $x_1 = 0$ or 1 with equal probabilities. If $x_1 = 0$, $x_2$ is the result of $n$ binomial trials (see above), if $x_1 = 1$, $x_2$ is the result of $r$ inverse binomial trials, $n$ and $r$ both having known (that is, not involving $\theta$) distributions. Then $x_1$ is ancillary but not part of $x_2$, the sufficient statistic.

5 The relation "$\sim$" is to be read: "is distributed as." $N(\mu, \sigma^2)$ refers to the normal distribution of mean $\mu$ and variance $\sigma^2$.


limited distributional properties! A second example demonstrates the difficulty that ancillary statistics are typically not unique and consequently it is not clear which one to condition on.$^{6}$ The following table lists in the first row the six elements of $X$; the second provides the relevant densities for each $\theta$, $-1 \le \theta \le 1$; the third and fourth give the values of two ancillary statistics. (The reader will be able easily to construct for himself four other ancillaries.) To illustrate the difficulty suppose the data $x = 5$ is observed. If $t_1$ is used as the ancillary statistic, then the maximum likelihood estimate (here $\hat{\theta} = 1$) has a distribution on $(-1, 1)$ with probabilities $[(2 - \theta)/4, (2 + \theta)/4]$. If $t_2$ is used the corresponding distribution is quite different, namely $[(3 - \theta)/5, (2 + \theta)/5]$.

The choice of ancillary, and generally the choice of sample space, presents a major difficulty in orthodox statistics. This difficulty is, from a Bayesian viewpoint, inevitable, since the use of $X$ violates the likelihood principle and is therefore incoherent. An attempt to avoid the difficulty has been made by Fraser (1968 and earlier papers referred to therein) who argues that the model we have used is inadequate and omits certain important requirements. When these are inserted the ancillary is unique and inferences can proceed.$^{7}$ Fraser's work will be discussed below (§ 8).

Closely related to the likelihood principle is the method of maximum likelihood. Except in a detail to be mentioned in connection with the asymptotic theory, this does not violate the principle, but nevertheless can give rise to difficulties. The following example due to Kiefer and Wolfowitz (1956) is elegant and occurs in practice. Let $\mathbf{x} = (x_1, x_2, \cdots, x_n)$ be a random sample of size $n$ from the density

$$p(x|\theta) = \tfrac{1}{2}(2\pi)^{-1/2}\exp\{-\tfrac{1}{2}(x - \mu)^2\} + \tfrac{1}{2}(2\pi\sigma^2)^{-1/2}\exp\{-(x - \mu)^2/2\sigma^2\},$$

where $\theta = (\mu, \sigma^2)$. ($x$ is either $N(\mu, 1)$ or $N(\mu, \sigma^2)$, each possibility being equally likely, but, unlike Cox's example, we do not know which.) Let $\mu = x_i$; then the likelihood tends to infinity as $\sigma \to 0$, and this for all $i = 1, 2, \cdots, n$. Hence there is no maximum$^{8}$ in a strict sense, or $n$ in a loose sense.
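The unbounded likelihood in the Kiefer and Wolfowitz example is easy to exhibit numerically. The following sketch assumes the equal-weight mixture of $N(\mu, 1)$ and $N(\mu, \sigma^2)$ described in the parenthesis above; the data values and function names are invented for illustration. Setting $\mu$ equal to one observation and shrinking $\sigma$ makes the log-likelihood grow roughly like $-\log\sigma$.

```python
from math import exp, log, pi, sqrt


def mixture_density(x, mu, sigma):
    """Equal-weight mixture of N(mu, 1) and N(mu, sigma^2) evaluated at x."""
    unit = exp(-0.5 * (x - mu) ** 2) / sqrt(2 * pi)
    scaled = exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))
    return 0.5 * unit + 0.5 * scaled


def log_likelihood(data, mu, sigma):
    """Log-likelihood of an independent sample under the mixture."""
    return sum(log(mixture_density(x, mu, sigma)) for x in data)


data = [0.0, 1.0, -0.5, 2.0]   # hypothetical sample
mu = data[0]                   # place mu exactly on one observation
for sigma in (1e-2, 1e-4, 1e-8):
    # The observation at mu contributes a term of order 1/sigma, while the
    # N(mu, 1) component keeps every other term bounded away from zero.
    print(sigma, log_likelihood(data, mu, sigma))
```

The same behaviour occurs with $\mu$ set to any of the observations, which is the sense in which there are $n$ "maxima" in a loose sense.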
6 The concept of a maximal ancillary, analogous to a minimal sufficient, statistic does not seem to be realizable.

7 He does not use this language, but his restriction to orbits is mathematically equivalent to the choice of an ancillary.

8 A distribution for $\sigma$ (convergent as $\sigma \to 0$) would resolve the difficulty and typically there would be a unique mode for the distribution of $\theta$ posterior to $x$.

Again suppose $\mathbf{x} = (x_{1i}, x_{2i};\ i = 1, 2, \cdots, n)$ with $x_{ti} \sim N(\mu_i, \sigma^2)$, $\theta = (\mu_1, \mu_2, \cdots, \mu_n, \sigma^2)$, all $x_{ti}$ being independent, given $\theta$. (Pairs of measurements of equal precision are made on each $\mu_i$.) The maximum likelihood (m.l.) estimate of $\sigma^2$ is $\sum (x_{1i} - x_{2i})^2/4n$ and converges in probability to $\tfrac{1}{2}\sigma^2$ as $n \to \infty$, which is hardly satisfactory. Barnard (1969) argues that $(x_{1i} + x_{2i})$, $i = 1, 2, \cdots, n$, are "irrelevant" so that using only $d_i = x_{1i} - x_{2i}$ and writing down the likelihood for this, the new m.l. estimate is $\sum d_i^2/2n$ which does tend to $\sigma^2$. This type of argument is typical of the ad hoc procedures that orthodox statisticians have to resort to in default of the Bayesian argument. A systematic study of this particular form, and a serious attempt to remove the improvisation element, has been made in their studies of marginal likelihoods by Kalbfleisch and Sprott (1970). A criticism of the argument that inferences should be based on the likelihood alone (and not in conjunction with the prior) will be postponed until § 6 when sampling from a finite population is discussed.

The phenomenon displayed in the last example is typical of what happens when incidental parameters (like the $\mu_i$'s) appear. A more extreme case arises when fitting a straight line with both variables subject to error. (The model is described in § 7 below.) There it was thought for a long time that the m.l. estimate of the slope was equal to the ratio of the m.l. estimates of the two standard deviations, an absurd situation. In fact Solari (1969) has shown that this supposed maximum is only a saddle point. The likelihood function has essential singularities and the likelihood can be made to approach any value between plus and minus infinity in any neighborhood of such points.

Continuing with counterexamples, we turn to the topic of significance tests, a branch of statistics which has a more completely developed formal theory than most others (see, for example, Lehmann (1959)). We begin with the test of a simple null hypothesis against a simple alternative where the orthodox theory is most complete. That theory uses $\alpha$ and $\beta$, the errors of the two kinds. Formally, with arbitrary $X$, $\Theta = (\theta_1, \theta_2)$, $D = (d_1, d_2)$ and $L(d_i, \theta_j) = 0$, if $i = j$; 1, if $i \neq j$. Then for a decision function (test) a.
In view of the close connection between tests and confidence intervals (the interval being roughly those null values which the data do not reject) these last two examples are embarrassing to an advocate of such intervals, the interval of smaller content not being included in the larger one. But the main attack on confidence intervals (or sets) lies elsewhere. Let $A$ be a confidence statement, say that $\theta \in I$, an interval of the real line, $I$, or $A$, being the random quantity; then we have $p(A|\theta) = \alpha$ for all $\theta$ (or, more generally, $p(A|\theta) \ge \alpha$ for all $\theta$). This is a quasi degree-of-belief statement about $\theta$ and unless effectively based on a distribution for $\theta$ can be incoherent in a way now to be described.

An important criticism of confidence intervals is due to Fisher (1956). A formal expression of his point appears to run as follows. A subset $C$ of $X$ is relevant$^{11}$ (or recognizable) if $p(A|C, \theta) \ge \alpha + \epsilon$ for all $\theta$ and for some $\epsilon > 0$. The importance of a relevant subset is that whenever $x \in C$ we know that the true confidence coefficient is strictly greater than the $\alpha$-value quoted, which seems absurd. The simplest example arises when $A$ sometimes includes the whole real line, taking $C$ to be the set of x-values for which this happens; then $p(A|C, \theta) = 1$. Thus let $x = (x_1, x_2)$ with $x_i \sim N(\theta_i, 1)$ and $x_1, x_2$ independent. A confidence interval for $\theta_1/\theta_2$ is provided by noting that $(\theta_2 x_1 - \theta_1 x_2)(\theta_1^2 + \theta_2^2)^{-1/2} \sim N(0, 1)$ and depends only on $\theta_1/\theta_2$. If $(x_1^2 + x_2^2) < \lambda_\alpha^2$, where $\lambda_\alpha$ is the upper, two-sided, $\alpha$-point for the standard normal density, the resulting "interval" includes all values of $\theta_1/\theta_2$. Fisher's original idea in introducing the concept seems to have been to criticize Welch's (1947) solution to Behrens' problem by demonstrating that a recognizable subset exists in that situation.
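The claim that the "interval" for the ratio can swallow the whole line follows from the Cauchy-Schwarz inequality, and a short numerical check may make it concrete. The sketch below is illustrative only: the cut-off 1.96, the data point and the helper names are assumptions, not values from the text. It scans many directions $(\theta_1, \theta_2)$ and confirms that the pivot never exceeds $(x_1^2 + x_2^2)^{1/2}$ in absolute value, so no value of the ratio is rejected when that quantity lies below the cut-off.

```python
import math
import random


def pivot(x1, x2, theta1, theta2):
    """The pivot of the text; it is N(0, 1) when x_i ~ N(theta_i, 1)."""
    return (theta2 * x1 - theta1 * x2) / math.hypot(theta1, theta2)


def interval_is_whole_line(x1, x2, lam):
    """True when x1^2 + x2^2 < lam^2, i.e. when x lies in the set C."""
    return x1 * x1 + x2 * x2 < lam * lam


lam = 1.96            # assumed cut-off for the standard normal
x1, x2 = 0.9, -1.2    # hypothetical data with x1^2 + x2^2 = 2.25 < lam^2

rng = random.Random(1)
worst = 0.0
for _ in range(10_000):
    # The pivot depends on (theta1, theta2) only through their direction.
    angle = rng.uniform(0.0, 2.0 * math.pi)
    worst = max(worst, abs(pivot(x1, x2, math.cos(angle), math.sin(angle))))

print("x in C:", interval_is_whole_line(x1, x2, lam))
print("largest |pivot| found:", round(worst, 4), "cut-off:", lam)
```

Because the largest value found stays below the cut-off, every ratio is "accepted" for this x, which is exactly the situation in which the conditional coverage equals one.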
Buehler (1959) has discussed the ideas in detail and Buehler and Feddersen (1963) have demonstrated the remarkable fact that relevant subsets exist in the common Student-t situation (since this is also a fiducial interval, Fisher's remarks have come full circle). Specifically they show that if $x = (x_1, x_2)$ with $x_i$ independent, $N(\mu, \sigma^2)$, so that a 50% interval for $\mu$ is $x_{\min} \le \mu \le x_{\max}$, and if $C$ is the set $|x_1 - x_2| \ge 4|\bar{x}|/3$ (so that the two readings are rather discrepant), then $p(A|C, \theta) \ge 0.5181$. Consequently even the most frequently used confidence statement is unsound and it seems a reasonable conjecture that recognizable subsets exist for almost all situations. Hartigan has pointed out to me that relevant subsets always exist for one-sided confidence intervals on the real line. For let the confidence statement be $p(\theta > t(x)|\theta) = \alpha$; then it is easy to demonstrate that the set $t(x) < 0$ is relevant. Peculiar phenomena that can arise with confidence intervals have been expounded by Pratt (1963).

We have already pointed out that in point estimation, unbiased estimates could be incoherent because of their dependence on the sample space. A simple example is provided in Ferguson's text book (1967). Here x is a Poisson variable of mean $\theta$

11 A (frequency) theory of probability using the notion of relevant subsets has been developed by Kyburg (1969).


and an unbiased estimate of $e^{-2\theta}$ is required. (We observe a Poisson process for, say, an hour, and require to estimate the chance of no events in a subsequent two hour period.) To be unbiased we must have

$$\sum_{x=0}^{\infty} t(x)\, e^{-\theta}\theta^{x}/x! = e^{-2\theta},$$

or

$$\sum_{x=0}^{\infty} t(x)\, \theta^{x}/x! = e^{-\theta} = \sum_{x=0}^{\infty} (-1)^{x}\theta^{x}/x!,$$

on multiplying both sides by $e^{\theta}$ and using the series for $e^{-\theta}$. By the uniform convergence of the series, it follows that the only unbiased estimate is $t(x) = (-1)^x$. The idea of estimating a probability as $-1$ is particularly ludicrous. An indication at a more general level of the conflict between unbiased and Bayes' procedures has been given by Bickel and Blackwell (1967). The theory of unbiased estimation that forms so popular a part of most courses in mathematical statistics is therefore of doubtful value, especially when it is remembered that the final estimate that is produced as the best one is only best because the class of estimates has been so constrained that it has only a single member. An interesting practical problem in which the use of unbiased estimates and the related concepts of mean square error, or variance, give rise to difficulties, is that of calibration, where a large class of reasonable estimates has infinite mean square error. The reader is referred to Krutchkoff (1969), Williams (1969), and, for a Bayesian reply, to Hoadley (1970).

We have tried, in this section, to show that the principle of coherence has practical implications of considerable importance and that many orthodox statistical ideas are unsatisfactory when judged by this criterion. We know that difficulties of this sort cannot arise if Bayesian methods are used. Furthermore, Bayesian methods provide a general formulation and solution of most statistical problems. The system provides a general method of describing and analyzing any such situation without the appeal to ad hoc procedures or ingenious tricks. In this sense it is more objective than sampling-theory methods. We now examine some of the basic ideas in Bayesian statistics.

4. Basic ideas in Bayesian statistics. Despite the substantial criticisms of the last section, many important sampling-theory ideas do have a Bayesian interpretation.
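Looking back at the Poisson example that closed the previous section, the unbiasedness of the estimate that alternates between $+1$ and $-1$ can be confirmed by summing the series directly. The sketch below is only a numerical check; the function names and the chosen values of $\theta$ are illustrative assumptions.

```python
import math


def expected_value(t, theta, terms=200):
    """E[t(x)] for x ~ Poisson(theta), by direct summation of the series."""
    total = 0.0
    pmf = math.exp(-theta)          # P(x = 0)
    for x in range(terms):
        total += t(x) * pmf
        pmf *= theta / (x + 1)      # P(x + 1) obtained from P(x)
    return total


def alternating_estimate(x):
    """The estimate t(x) = (-1)^x discussed in the text."""
    return (-1) ** x


for theta in (0.5, 1.0, 2.0):
    lhs = expected_value(alternating_estimate, theta)
    rhs = math.exp(-2.0 * theta)
    print(f"theta={theta}: E[t(x)]={lhs:.10f}  exp(-2 theta)={rhs:.10f}")
```

The two columns agree, yet every individual value of the estimate is $+1$ or $-1$, neither of which is a sensible guess for a probability strictly between 0 and 1.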
The most widely used methods are those based on least squares theory and the related technique of the analysis of variance. We begin this section by describing how these ideas can be expressed through posterior distributions. We do not attempt full generality but only aim to illustrate the basic ideas (for details the reader is referred to Jeffreys (1967) and Lindley (1965)). The numerous papers of Good are valuable; convenient references are (1950, 1965), and (1969) provides a bold attempt to apply the ideas. Let $x = (x_1, x_2, \ldots, x_n)$ with $x_i$ independent and normally distributed with constant variance, $\phi$, say, which is unknown. Let $E(x) = A\theta$ with $\theta = (\theta_1, \theta_2, \ldots, \theta_s)$


D. V. LINDLEY

and $A$ known. ($\theta$, in the earlier notation, is now $(\theta, \phi)$.) Suppose $A^{T}A$ is nonsingular. For a distribution over parameter space, suppose the $\theta_i$ and $\log \phi$ to be all uniformly and independently distributed. Then it is easy to show that

where S2 is the residual sum of squares, namely,

and $S^2(\theta)$ is a positive-definite quadratic form in $\theta_{r+1}, \theta_{r+2}, \ldots, \theta_s$ whose exact form need not concern us. This density is constant on ellipsoids $S^2(\theta) = \text{const}$, with a maximum at the least squares estimates. The set consisting of the interior of any one of these ellipsoids has the property that the probability for any point inside the set is greater than that for any point exterior to it. Such sets have been called sets of highest posterior density, Box (1965), Bayesian confidence sets, Lindley (1965), and credible sets, Edwards et al. (1963); we shall use the last term.$^{12}$ The probability (posterior to the data) of $\theta_{r+1}, \theta_{r+2}, \ldots, \theta_s$ lying in the credible ellipsoids can easily be found from (4.1) in terms of the F-distribution. In fact, $\{S^2(\theta)/(s - r)\}/\{S^2/(n - s)\}$ is $F(s - r, n - s)$. The set, $A_\alpha$, with total probability $\alpha$ is a credible set of credibility $\alpha$. It is easy to see that it has exactly the same form as the confidence set for $\theta_{r+1}, \theta_{r+2}, \ldots, \theta_s$ based on the sampling distributions of $S^2(\theta)$ and $S^2$, with confidence coefficient $\alpha$. In fact we have both $p(A_\alpha|x) = \alpha$, where the random elements in $A_\alpha$ are the $s - r$ parameter values, and $p(A_\alpha|\theta) = \alpha$, where the random element is x. The normal distribution has the remarkable property that equivalent statements can be made with either $X$ or $\Theta$ as the relevant space supporting the probability distributions.

A Bayesian interpretation of the common F-test is then available by rephrasing the sampling-theory notion that a null value is significant if the confidence interval does not include it, confidence being replaced by credible. Thus the hypothesis $\theta_{r+1} = \theta_{r+2} = \cdots = \theta_s = 0$ is tested by referring $\{S^2(0)/(s - r)\}/\{S^2/(n - s)\}$ to the F-table on $s - r$ and $n - s$ degrees of freedom in the usual way. Essentially in rejecting the null value we are saying that it has not got high posterior probability (density) in comparison with other values.
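As a small worked instance of the F ratio just described, consider a straight line, so that $s = 2$ and $r = 1$ and the single parameter under test is the slope. The sketch below assumes the standard least squares result that, for the slope, the quadratic form is $(\theta_2 - \hat\theta_2)^2 \sum (z_i - \bar z)^2$; that formula, the variable names and the data are assumptions made for illustration and do not come from the text.

```python
def f_ratio_for_slope(z, x):
    """F ratio {S^2(0)/(s - r)} / {S^2/(n - s)} for the slope of a line.

    Assumed model: E(x_i) = theta_1 + theta_2 * z_i, so s = 2 and r = 1.
    """
    n = len(z)
    z_bar = sum(z) / n
    x_bar = sum(x) / n
    s_zz = sum((zi - z_bar) ** 2 for zi in z)
    s_zx = sum((zi - z_bar) * (xi - x_bar) for zi, xi in zip(z, x))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope_hat = s_zx / s_zz                  # least squares estimate of theta_2
    resid_ss = s_xx - slope_hat * s_zx       # S^2, the residual sum of squares
    quad_at_zero = slope_hat ** 2 * s_zz     # S^2(theta) at the null value 0
    return (quad_at_zero / 1) / (resid_ss / (n - 2))


z = [0.0, 1.0, 2.0, 3.0, 4.0]    # hypothetical design points
x = [1.0, 3.0, 2.0, 5.0, 4.0]    # hypothetical observations
ratio = f_ratio_for_slope(z, x)
print("F ratio on 1 and", len(z) - 2, "degrees of freedom:", round(ratio, 4))
```

The same number serves both readings: a sampling-theory significance test of zero slope, and a statement about whether the null value falls inside a credible interval of given posterior probability.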
Although these ideas enable orthodox practice to be interpreted in probability terms, it does not follow that the practice is to be adopted. Inferences should be expressed in the form of a posterior distribution. Practical circumstances may suggest some summary of the distribution because of the difficulties in describing a density, particularly in more than one dimension, but whether intervals are the most convenient forms of summary is unclear. Posterior means, modes or variances may be preferable. Another difficulty associated with the Bayesian description is that it uses improper prior distributions. We shall see later (§ 8) that there is

12 Even in one dimension such intervals are not always too easy to compute since typically two "tails" with equal bounding ordinates will have to be found. Tiao and Lochner (1967) discuss this for F. An example of the use of these interval estimates in assessing the reliability of systems is provided by Springer and Thompson (1966, 1968), a problem also considered by Bhattacharya (1967).


reason to suspect these, yet a reanalysis using a proper prior will not give orthodox results. The above discussion of least squares ideas can be extended to other orthodox practices. For example, maximum likelihood methods are often sensible for a Bayesian, at least asymptotically, though the posterior mode is perhaps a more reasonable substitute. The usual $\chi^2$-tests for goodness-of-fit and for the analysis of contingency tables may also be justified asymptotically, though again, as we shall see below, other methods are more advantageous.

We now turn from sampling-theory concepts to an honest Bayesian analysis of a decision problem (and hence of an associated inference problem). There are two ways to proceed.

1. Normal form. Let $\delta$ be a decision function mapping $X$ into $D$ and describing the decision $\delta(x)$ to be adopted when x is observed. The performance of $\delta$ (prior to the data being available) may be assessed for any value of $\theta$ by calculating the expected utility conditional on $\theta$; that is, by

(Compare the definition of a risk-function, equation (2.1).) Denote this by $U_\delta(\theta)$. The Bayesian argument says that $\delta$ should be selected by maximizing the expected value of $U_\delta(\theta)$, the expectation being with respect to the distribution of $\theta$ prior to x, that is, by

Essentially this is the Bayesian solution to a decision problem when it is expressed in the sampling-theory form in which the distribution over $X$ is paramount. A simpler analysis is possible.

2. Extensive form. This is the form already given in (1.4) and consists in evaluating

the posterior expected utility. At least if utility is bounded and $p(\theta)$ proper the two forms are equivalent. For (4.4) is


where Fubini's theorem has been used twice to interchange double and repeated integrals, and the passage from the second to third lines has been effected by Bayes' theorem, (1.2). The main difference between the normal and extensive forms is that in the former the decision-maker considers the situation before the data is available, whereas in the latter only the decision for that x observed is contemplated. The basic idea of "called-off" bets is relevant. The extensive form is simpler. The terminology is due to Raiffa and Schlaifer (1961), as are most of the ideas which follow in this section. An elementary exposition of some of them is given by Raiffa (1968).

In the extensive form no expectation over $X$ is required and the likelihood principle obtains. In the design of experiments, however, $X$ can be selected and expectations are required. A triplet $e = (X, \Theta, p(x|\theta))$ is called an experiment. Consider a collection, $E$, of experiments $e$ having a common $\Theta$, together with a decision space $D$. Prior to having selected $e$ and observed x, we ask which is the best $e$ to choose from $E$. The decision is now in two parts, the selection of $e$ and the choice of $d$, and a general utility function will be of the form$^{13}$ $U(d, \theta, e, x)$, allowing for the fact that some experiments will cost more than others. For any $e$ the expected performance of the best decision function is given by one of the equivalent forms in (4.6) and the best $e$ maximizes these. Hence the formal Bayesian solution to the experimental design problem$^{14}$ is provided by

This is perhaps most easily appreciated by using a decision tree (see Fig. 3). The sequence of events in time order is that $e$ is selected, on performance it yields data x, when $d$ is chosen and finally $\theta$ yields the utility $U(d, \theta, e, x)$. A decision tree is analyzed in reverse time order. We first average over $\theta$, the appropriate distribution being $p(\theta|x, e)$ since, at that time, $e$ and x are available. Then $d$ is selected to maximize the resulting average (or expectation). Next we average over x, the relevant density being $p(x|e)$, and finally $e$ is selected to maximize the resulting expectation. Notice that the operations of expectation and maximization alternate in the sequence. In the decision tree the points where expectation is relevant

13 There is no difficulty in including x in the utility function. In the extensive form the quantity to be maximized is then $\int U(d, \theta, x)p(\theta|x)\,d\theta$. In most applications $U$ does not depend on x.

14 An alternative approach to experimental design, more in the spirit of inference than decision theory, uses the concept of information (see § 12.6).


have been indicated by circles (and are called random nodes); the others are shown as rectangles, termed decision nodes, and maximization is required. These simple ideas are extremely general and enable the Bayesian ideas to be extended to sequential experimentation, to be described later. The analysis at the last two nodes, $\max_d \int \cdots \, d\theta$, is called terminal analysis; the rest, $\max_e \int \cdots \, dx$, is called preposterior analysis. Preposterior analysis involves the sample space; terminal analysis does not and uses the likelihood principle. Despite this very general formal solution to the problem of experimental design, few explicit results$^{15}$ are available in the field that this title ordinarily covers. However one important consequence is immediately apparent and we pause to discuss this.

Randomization. Let $E_0$, a subset of $E$, be the set of experiments$^{16}$ satisfying (4.7). If it contains a single member, then this is the best experiment to perform. If it contains more than one member then all $e \in E_0$ are equivalent from a Bayesian viewpoint and any may be selected. Consequently it is never necessary to randomize in experimental design, though randomization over $E_0$ would not do any harm (nor any good). This goes counter to a popular sampling-theory canon. On reflection the Bayesian conclusion seems correct to me. Certainly I find it hard to see how the fact that a result was obtained by randomization rather than by deliberate choice can have any effect on the subsequent analysis; in particular, the randomization theory of tests seems unconvincing. How can the fact that a different result might have been obtained, but was not, influence you once the data is on view? The point has been well argued by Jeffreys (1967). There might, nevertheless, be some sense in randomizing but then using an orthodox or Bayesian argument. However it is clear that randomization can only be a last resort. If some factor is present which is thought likely to influence the data, then this should be allowed for in the design, for example, by using blocking devices. Randomization, therefore, even to an orthodox statistician, is only used to guard against the unforeseen. The Bayesian could therefore select a haphazard sample: that is, one which, as far as he can see, will provide a good inference and not be disturbed by other effects.
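The backward analysis of the decision tree, and the remark that randomizing over experiments cannot improve on the best single experiment, can be illustrated with a toy discrete problem. Everything in the sketch below is a hypothetical assumption: the two-point parameter space, the three experiments, their costs and the 0/1 utility. It is a minimal sketch of averaging over $\theta$, maximizing over $d$, averaging over x and maximizing over $e$, and it assumes the randomization is independent of everything else, so that a randomized choice earns the weighted average of the individual values.

```python
def value_of_experiment(prior, likelihood, utility, cost):
    """Expected utility of one experiment e under the best terminal decisions.

    prior[theta] = p(theta); likelihood[theta][x] = p(x | theta, e);
    utility[d][theta] = U(d, theta); cost is in units of utility.
    """
    n_theta = len(prior)
    n_x = len(likelihood[0])
    total = 0.0
    for x in range(n_x):
        # p(x, theta | e); dividing by p(x | e) would give the posterior.
        joint = [likelihood[th][x] * prior[th] for th in range(n_theta)]
        # Terminal analysis at this x, already weighted by p(x | e).
        total += max(
            sum(u_d[th] * joint[th] for th in range(n_theta))
            for u_d in utility
        )
    return total - cost


prior = [0.5, 0.5]
utility = [[1.0, 0.0], [0.0, 1.0]]     # utility 1 for a correct decision
experiments = {                         # name: (p(x | theta, e), cost)
    "null": ([[1.0], [1.0]], 0.00),
    "noisy": ([[0.8, 0.2], [0.3, 0.7]], 0.05),
    "accurate": ([[0.95, 0.05], [0.05, 0.95]], 0.30),
}

values = {
    name: value_of_experiment(prior, lik, utility, cost)
    for name, (lik, cost) in experiments.items()
}
best = max(values, key=values.get)

# A randomized design earns the weighted average of the individual values.
weights = {"null": 0.2, "noisy": 0.5, "accurate": 0.3}
randomized = sum(weights[name] * values[name] for name in values)

print(values)
print("best experiment:", best)
print("randomized design:", round(randomized, 4))
```

Since a weighted average can never exceed its largest term, the randomized design does no better than the best experiment, which is the point made in the text.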
At best randomization can only be a convenient device to simplify the subsequent calculations. Stone (1969b) disagrees. We shall return to this topic when discussing sampling from a finite population in § 6.

Sufficiency. One topic on which all statisticians seem to be in complete agreement is that of sufficiency. The Bayesian definition is that $t(x)$ is sufficient if $p(\theta|t) = p(\theta|x)$ for every x and every distribution, $p(\theta)$, prior to x: that is, if the posterior given x is the same as given the statistic. This is easily seen to be equivalent to the orthodox definition. The extension to minimal sufficient, in terms of sub-$\sigma$-fields over $X$, proceeds exactly as in the sampling theory. Like most writers we shall use sufficiency when strictly minimal sufficiency is meant. In the important

15 Draper and Hunter (1966, 1967a, 1967b) have discussed the design problem from a Bayesian viewpoint but not using the formal loss structure here described.

16 We suppose $E_0$ is not empty.


case of random sampling it is necessary to include the sample size as part of the (minimal) sufficient statistic. Notice that if $\theta = (\theta_1, \theta_2)$, marginal sufficiency for $\theta_1$ is, in general, undefined. (For example, is $s^2$ marginally sufficient for $\sigma^2$ in sampling from a normal distribution? The answer would appear to be, "no".) If $p(x|\theta) = p(t_1|\theta_1)p(t_2|\theta_2)$ and the prior similarly factors, then $t_1(x)$ is marginally sufficient, but this is a very special case. The point arises in discussing robustness (§ 7).

Exponential family. The case where $x = (x_1, x_2, \ldots, x_n)$ and $p(x|\theta) = \prod_{i=1}^{n} p(x_i|\theta)$, so that x is a random sample of size n from the distribution of density$^{17}$ $p(x_i|\theta)$, is of common occurrence. A special case, where the Bayesian (and orthodox) arguments are rather simpler, arises when the distribution is a member of the exponential family, that is,

Here $\phi_i(\theta)$ are k real functions of the parameter, $t_i(x_j)$ and $H(x_j)$ are $k + 1$ statistics and $G(\theta)$ is a normalizing factor defined in terms of the $\phi$'s, $t$'s and $H$ to make the density have integral (over $X$) equal to unity. It is immediately apparent that for x, $\sum_{j=1}^{n} t_i(x_j)$, $i = 1, 2, \ldots, k$, and n, are sufficient for $\theta$. Consequently, whatever be the size of sample the dimensionality of a sufficient statistic is constant, at $k + 1$. The importance of this remark in a Bayesian analysis is that the posterior distribution of $\theta$ given x will, under these circumstances, depend only on $k + 1$ values however large the sample is. In fact, if $p(\theta)$ is the distribution prior to x, the posterior will be proportional to

with $\alpha_i = \sum_{j=1}^{n} t_i(x_j)$, $i = 1, 2, \ldots, k$, and $\beta = n$. As x ranges over $X$ this generates a family of densities all of the form (4.9) depending on hyperparameters $\alpha_1, \alpha_2, \ldots, \alpha_k, \beta$. Consequently not only is the density of x finitely parameterized, so is that of $\theta$. This would not be true without the existence of sufficient statistics of fixed dimensionality. In this connection an important concept is due to Barnard (see Wetherill (1961)). A family $\mathcal{F}$ of distributions over $\Theta$ is closed under sampling from the distribution with density $p(x_i|\theta)$ if whenever $p(\theta) \in \mathcal{F}$, $p(\theta|x) \in \mathcal{F}$ for every x (and n). This means that provided the prior belongs to $\mathcal{F}$ any data will result in a posterior distribution in $\mathcal{F}$. If $p(x_i|\theta)$ is a member of the exponential family, then $\mathcal{F}$ will depend on a finite number of hyperparameters. In connection with (4.8) the family with densities proportional to

17 The same symbol p has been used for the density of x and for any component $x_i$.


is called the natural conjugate family (to $p(x_i|\theta)$). Here $a_1, a_2, \ldots, a_k$ and $b$ are hyperparameters, with possible restrictions on their values in order that the integral of (4.10) over $\Theta$ converges. If $p(\theta)$ has this form then, by (4.9), the posterior is of the same form with hyperparameters $a_i + \alpha_i$, $i = 1, 2, \ldots, k$, and $\beta + b$ replacing $a_i$, $b$ in (4.10). The natural conjugate family is closed under sampling. It occupies an important role in current Bayesian research for no other reason than mathematical convenience. Two examples follow.

Example 1. If $x_i \sim N(\mu, \sigma^2)$ the likelihood for $x_1, x_2, \ldots, x_n$ is

where, as usual, $\bar{x} = \sum x_i/n$, $\nu s^2 = \sum (x_i - \bar{x})^2$ and $\nu = n - 1$. If the prior is proportional to

the correspondence between the two functions is: $n \to n'$, $\bar{x} \to m$, $s^2 \to t^2$, $\nu \to \nu'$, except in the power of $\sigma$. Clearly as $(n', m, t^2, \nu')$ vary this gives a family closed under sampling. A convenient interpretation is that $\nu' t^2/\tilde{\sigma}^2$ is $\chi^2$ on $\nu'$ degrees of freedom and, conditional on $\sigma$, $\tilde{\mu} \sim N(m, \sigma^2/n')$. Here tildes have been used to indicate the random quantities (and thereby prevent confusion with sampling theory ideas). The power of $\sigma$ has been arranged to make this interpretation possible. These ideas extend to the multivariate case and a comprehensive account of the distributional theory has been provided by Ando and Kaufmann (1965).

Example 2. If $x_i$ is 0 or 1, with $p(x_i = 1|\theta) = \theta$, the likelihood (see above) is $\theta^{r}(1 - \theta)^{n-r}$, with $r = \sum x_i$, the usual combinatorial being unnecessary. The natural conjugate family is the Beta distribution with density proportional to $\theta^{a-1}(1 - \theta)^{b-1}$, with $a, b > 0$. The extension to the case where $x_i$ takes $k\,(> 2)$ distinct values leads to the Dirichlet family discussed by Dickey (1968b). Hald (1968a) has studied the dichotomy as $n \to \infty$ with $h = r/n$ fixed for a general prior $p(\theta)$. To quote a typical result, he shows that

to order $n^{-1}$.

Noninformative stopping. Continuing with the case of x, a random sample from $p(x_i|\theta)$, we have seen that Bayesian (terminal) analysis uses only the likelihood function and that the usual orthodox restriction to fixed n (in order to define $X$) is not needed. However, care is needed to ensure that the sampling rule does not itself contain information about $\theta$. The following analysis is due to Raiffa and Schlaifer (1961). Define $q(n|x_1, x_2, \ldots, x_{n-1}, \theta, \psi)$ to be the chance, given $x_1, x_2, \ldots, x_{n-1}$, $\theta$ and a nuisance parameter $\psi$, of observing another sample, so that q defines the rule for stopping sampling. If $x = (x_1, x_2, \ldots, x_n)$, then


In an obvious notation this expression may be written

where $Q$ is the product of all the q-factors and p is as usual. The sampling rule is said to be noninformative if the Q-factor in (4.13) can be ignored: that is, if the posterior for $\theta$ given x is unaffected by its exclusion. Sufficient conditions are that $Q$ does not depend on $\theta$, and that $\theta$ and $\psi$ are independent prior to x. Two examples follow.

Example 1. Suppose $x_i \sim N(\theta, 1)$ and the stopping rule is to continue sampling until $|\bar{x}| > 2n^{-1/2}$. (Sample until the null hypothesis that $\theta = 0$ is conventionally rejected at the 5% level.) This has been discussed by Armitage (1963). Here, perhaps surprisingly, the sampling rule is noninformative and the likelihood is as usual, though, at least when n (now $\tilde{n}$) is large, almost all the information is contained in it.

Example 2. The following practical application is due to Roberts (1967). The situation is the capture-recapture analysis that is presumably familiar enough to omit a detailed description. The marriage between the natural notation in this context and that of this review is as follows: $\theta \to N$, the size of the population, of which $R$ are tagged; $x \to r$, the number found to be tagged in a second sample of n; $\psi \to p$, the chance of catching a fish (say) in that sample. We make all the usual assumptions; for example, that all fish have the same chance of capture irrespective of whether or not they have been tagged in the first sample. Roberts points out that the sampling rule may reasonably be informative. As usual, we have the likelihood

where $s = n - r$, $S = N - R$. But reasonably it might also be true that

corresponding to the Q-factor in (4.13). If so the full likelihood is proportional to

18 Notice that in writing down this formula it has been assumed that $p(x_i|\theta) = p(x_i|\theta, \psi)$; that is, given $\theta$, $x_i$ is independent of $\psi$. In Bayesian statistics all quantities are random variables and care is needed in making the probability specification. Usually the most convenient method is through a sequence of conditional probability statements: here $p(\theta, \psi)$ and so on in the natural order.


with $S$ and $p$ as the two parameters. Roberts supposes $S$ to be uniform over the nonnegative integers and $p$ to have the conjugate Beta density $p^{r'-1}(1 - p)^{R'-r'-1}$, the distributions being independent and prior to the data. Integration with respect to $p$ gives

with mean $R + (R + R' - 2)(s + 1)/(r + r' - 2) - 1$, compared with the m.l. estimate $R + Rs/r$. Notice that $r'$ and $R'$ may be related to the experience gained in capturing the first sample of $R$ for tagging.

The value of experiments. In expression (4.7) we saw how to solve the experimental design problem within the Bayesian framework. This expression is now studied further in order to assess the value of an experiment $e$. We suppose $U(d, \theta, e, x) = U(d, \theta) + U(x, e)$ so that the terminal utility and experimental costs are additive. The expected utility of $e$ before it is performed is

Consider the second of the two terms in the braces. It equals the expected utility of the best decision from $e$, given that x is observed. Hence the expectation of the utility from $e$ will be the average of this over $X$. Whereas if $e$ is not performed the best that can be obtained is $\max_d \int U(d, \theta)p(\theta)\,d\theta$. The difference of these two expressions, namely,

is called the expected value of $e$, denoted $v(e)$. (Raiffa and Schlaifer call it the expected value of sample information, EVSI.) The expression is clearly nonnegative, since on reversing the orders of integration over $X$ and maximization over $d$ in the first term, an operation which can only decrease the value, the first and second terms become equal, by Bayes' theorem, and the difference is zero; hence $v(e) \ge 0$. Hence any experiment is expected to be of value. Of course, when realized the value of x may result in a loss of utility. Writing $U(x, e) = -c(x, e)$, the cost of $e$ and x (in units of utility), the experiment is only worth performing if

on comparing with the first term in (4.16). A special case is where $e$ is a perfect experiment; that is, an experiment which is certain to inform you of the correct value of $\theta$. Here $p(\theta|x, e)$ becomes a Dirac $\delta$-function and the integration over $\Theta$ in (4.16) gives just $U(d, \theta')$, where $\theta'$ is the "revealed" value of $\theta$, so that one obtains $\max_d U(d, \theta')$. But $\theta'$ has density $p(\theta')$


prior to the perfect experiment e*. Hence,

in terms of the loss function (2.2). This expression, $v(e^*)$, is called the expected value of perfect information, EVPI. Reversal of the orders of integration over $\Theta$ and maximization over $d$ in the first term of (4.17) clearly shows that $v(e^*) \ge v(e)$ (which is intuitively obvious). Hence the EVPI is a (useful) upper bound to the value of any experiment. It should be remembered that the exact connection between utility and experimental cost has to be considered carefully and involves considerations of the utility of money (see end of § 7). A detailed discussion has been given by LaValle (1968a, b, c) who discusses, inter alia, the buying and selling prices of a lottery.

We next provide some examples designed to illustrate the above ideas.

Example 1. This is a no-data decision problem with $\Theta$ the real line and $D = (d_1, d_2)$, the loss functions being linear in $\theta$. Specifically, we suppose that

and otherwise zero, $b_1$ and $b_2$ being nonnegative. The value $\theta_0$ is therefore the "break-even" value; for $\theta > \theta_0$, $d_1$ is optimum, for $\theta < \theta_0$, $d_2$ is the better. This is the most general linear-loss form, though without loss of generality we could put $a_1 = 0$, $b_1 = 1$. The optimum decision is to select $d_1$ if it has smaller expected loss, that is, if

(If data were available, $p(\theta)$ would be replaced here by $p(\theta|x)$.) Write $p(\theta) = f_0(\theta)$ and define

Integration by parts enables (4.20) to be written

where $E(\theta)$ is the expected value of $\theta$. Evaluation of $f_1(\theta)$ (recognizable as the distribution function) and $f_2(\theta)$ are necessary for solution of the decision problem. Had the loss functions been polynomials of degree m, then the $f_i(\theta)$ would be required up to degree $m + 1$, in general. (If $b_1 = b_2$, $f_2(\theta)$ is not needed.) Notice that the normal distribution is particularly simple since $f_0(\theta)$ and $f_1(\theta)$ can be expressed in terms of $\phi(t)$ and $\Phi(t)$, the density and distribution functions of the


standardized normal curve, and the integral of

$\infty$; only the ratio $b/c$ is relevant, so this is equivalent to $c \to 0$.

20 It is disappointing that Bayesian decision theory has had so little impact on the whole field of quality control, which is still dependent upon sampling-theory ideas, though there are exceptions; for example, the comprehensive paper by Wetherill and Campling (1966) and Campling (1968).


An interesting problem that arises in medical statistics has been discussed by Anscombe (1963), Colton (1963) and Canner (1970). Here $N$ patients have a disease and two treatments $T_1$ and $T_2$ are available. A clinical trial is performed in which n patients are given $T_1$ and n, $T_2$. On the basis of the results of the trial the remaining $N - 2n$ patients are treated with what appears to be the better treatment. The result of a trial is either success or failure and beta-priors are appropriate. The problems are how to select n and then $T_1$ or $T_2$ for the remaining patients. The loss (or utility) function naturally needs careful consideration. Canner solves the problem by the usual inverse method corresponding to (4.7). He shows, for example, that the optimum value of n is about $\{(N + 2)/(12c + 2)\}^{1/2}$, where c is the cost of each patient in the trial. Guthrie and Johns (1959) made an early Bayesian study of sampling from a batch of size $N$ with a single sample of n and discuss the optimum sample size and decision procedure for large $N$.

We conclude this material on basic ideas by discussing a Bayesian method of hypothesis testing different from those indicated at the beginning of the section and in Example 1. Let $H$ be a subset of $\Theta$ and suppose that we wish to see, in the light of the data, whether it is reasonable to suppose $\theta \in H$. It is customary to speak of this as testing the null hypothesis, $H$, that $\theta \in H$, where $H$ has been used to denote both the hypothesis and the subset of $\Theta$. The alternative hypothesis, $\bar{H}$, is that $\theta \notin H$, that is, $\theta \in \bar{H}$. One way of testing is to calculate $P(H|x)$, the probability that $\theta \in H$, given the data; or, more conveniently, $P(H|x)/P(\bar{H}|x)$, the posterior odds in favor of $H$. Now

with a similar expression for $\bar{H}$. The posterior odds are therefore given by

and do not involve p(x). A still more convenient expression is the ratio of posterior to prior odds which is easily seen to be given by

This expression has the advantage that it does not depend on $p(H)$. However, it does involve the distributions of $\theta$ conditional on $H$ and on $\bar{H}$ prior to x. Its use first seems to have been suggested by Jeffreys (1967). A common special case is that of a sharp hypothesis. This arises when $\theta = (\xi, \eta)$, say, and $H$ specifies the value of $\xi = \xi_0$, say, without specifying $\eta$. $\bar{H}$ is simply $\xi \ne \xi_0$. Then $\eta$ is a nuisance parameter. An obvious example is where we wish to test whether the mean $\xi$ of a normal distribution is $\xi_0$ without specifying the variance $\eta$. It has been shown by Dickey and Lientz (1970) in an elegant paper that develops a general treatment, that in this case, under a reasonable additional assumption, (4.27) takes on a simple form.
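To give a feel for the kind of simplification involved, the sketch below treats the simplest sharp hypothesis, a binomial probability with no nuisance parameter, so it is an analogue of the Dickey and Lientz setting rather than an instance of it. It assumes that the ratio of posterior to prior odds is the ratio of the probabilities of the data under the hypothesis and under its alternative, with a Beta(a, b) prior under the alternative; the data, the prior and the function names are hypothetical. The same ratio is then recomputed as the posterior density at the null value divided by the prior density there, both taken under the alternative.

```python
import math


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_pdf(t, a, b):
    """Density of the Beta(a, b) distribution at t, 0 < t < 1."""
    log_density = (a - 1) * math.log(t) + (b - 1) * math.log(1 - t) - log_beta(a, b)
    return math.exp(log_density)


def odds_ratio_direct(r, n, theta0, a, b):
    """p(x | H) / p(x | not H); the binomial coefficient cancels."""
    log_null = r * math.log(theta0) + (n - r) * math.log(1 - theta0)
    log_alt = log_beta(a + r, b + n - r) - log_beta(a, b)
    return math.exp(log_null - log_alt)


def odds_ratio_density(r, n, theta0, a, b):
    """Posterior over prior density at theta0, both under the alternative."""
    return beta_pdf(theta0, a + r, b + n - r) / beta_pdf(theta0, a, b)


r, n, theta0, a, b = 7, 20, 0.5, 1.0, 1.0   # hypothetical data and prior
print("direct ratio of marginal likelihoods:", odds_ratio_direct(r, n, theta0, a, b))
print("ratio of densities at the null value:", odds_ratio_density(r, n, theta0, a, b))
```

The two computations agree, and the second needs only the conjugate update of the Beta parameters, which is why a form involving densities at the null value is so much easier to use.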


Let us write $p(\theta|\bar{H}) = p(\xi, \eta|\bar{H}) = f(\xi, \eta)$, say, where $f(\xi, \eta)$ is defined$^{21}$ as the elementary derivative of the distribution of $\theta$ obtained by taking a sphere of radius $\rho$ about (

$\int f(\xi_0, \eta)\,d\eta$, the usual conditional form. In words, the conditional distribution of $\eta$, given $\xi$, considered as a function of $\xi$, is smooth around $\xi = \xi_0$, so that the only discontinuity in the joint distribution occurs in $\xi$, having a "concentration" of (prior) probability at the null value. The additional assumption is that

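Returning then to (4.27), the denominator is simply $p(x \mid \bar H)$ and the numerator may be rewritten, using (4.28), giving the ratio to be

$$\frac{\int p(x \mid \xi_0, \eta)\, f(\xi_0, \eta)\, d\eta}{p(x \mid \bar H) \int f(\xi_0, \eta)\, d\eta}.$$

But

$$p(x \mid \xi_0, \eta)\, f(\xi_0, \eta) / p(x \mid \bar H) = p(\xi_0, \eta \mid x, \bar H)$$

by Bayes' theorem,

$$\int p(\xi_0, \eta \mid x, \bar H)\, d\eta = p(\xi_0 \mid x, \bar H)$$

and

$$\int f(\xi_0, \eta)\, d\eta = p(\xi_0 \mid \bar H).$$
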
Consequently the ratio of posterior to prior odds is simply

$$p(\xi_0 \mid x, \bar H) / p(\xi_0 \mid \bar H). \tag{4.29}$$

According to Dickey and Lientz this result is due to L. J. Savage. The simplicity of (4.29) is due to its containing only the marginal densities of $\xi$ (at $\xi_0$) before and after the data. If conjugate densities can be used, then these are very simple to calculate, far simpler than the form (4.27).

A simple example is where $x = (r_1, r_2)$, $\theta = (\theta_1, \theta_2)$ and $r_i$ is the number of successes in $n_i$ binomial trials with probability of success $\theta_i$, $i = 1, 2$, the two sets of trials being independent, and we wish to test $\theta_1 = \theta_2$. If $\theta_1$ and $\theta_2$ have, under $\bar H$, independent prior beta distributions with parameters $a_i$, $b_i$, the posterior distributions will also be beta. The above result can be applied with²² $\xi = \theta_1 - \theta_2$, $\eta = \tfrac{1}{2}(\theta_1 + \theta_2)$. The calculations of $p(\xi_0 \mid \bar H)$ and $p(\xi_0 \mid x, \bar H)$ with $\xi_0 = 0$ follow easily from properties of the beta distributions.

²¹ A density is not unique since it may be changed on any set of dominating measure zero. Its definition here becomes critical since $H$ is such a set.

If the problem of testing $H$ against $\bar H$ is regarded as a decision one with two decisions $d_H$ and $d_{\bar H}$ we may, without loss of generality (see the remark after (2.2)), suppose

The expected utilities are then

and similarly,

and the posterior odds are directly relevant to the solution of the decision problem. If $R$ is the ratio of the two integrals, then $H$ is accepted if $RLO$ exceeds unity, where $O$ is the prior odds. The asymptotic theory will be discussed in § 11 below.

²² There are other possibilities: for example, $\xi = \log(\theta_1/\theta_2)$, $\eta = \log \theta_1\theta_2$, but the results are invariant.

5. Sequential experimentation. In the last section the choice of a single experiment was discussed; we now consider the selection of a sequence of experiments. This is a field in which the Bayesian approach promises to be more successful than standard theory, partly because it does not involve the complicated sample space in the same way, and partly because probabilities prior to $x_n$ seem more acceptable when data $x_1, x_2, \ldots, x_{n-1}$ are already available.

Consider a finite sequence of possible experimental choices; let $E = E_1 \times E_2 \times \cdots \times E_n$ with $E_i = (X_i, \Theta, p(x_i \mid \theta))$ so that $\Theta$ is fixed throughout. Let the cost function be additive: that is, $c(x, e) = \sum_{i=1}^{n} c_i(x_i, e_i)$. The (terminal) decision space is $D$ and the loss function is $L(d, \theta)$, supposed added to the experimental cost. We shall write $\mathbf{x}_i = (x_1, x_2, \ldots, x_i)$. The idea is that $e_1$ is selected from $E_1$, $x_1$ observed, then $e_2$ chosen from $E_2$, and so on, up to $x_n$, when finally $d$ is chosen from $D$. Typically each $E_i$ will include a null experiment, that is, one in which no further data is collected, so that $d$ is immediately taken.

We saw that, even in the case of a single experimental choice, the analysis proceeds in a reverse time order (see (4.7) and the related decision tree). Consequently suppose that $\mathbf{e}_{n-1} = (e_1, e_2, \ldots, e_{n-1})$ has been performed, with result $\mathbf{x}_{n-1}$, so that it is only necessary to consider the choice of $e_n$, the value of $x_n$ and the terminal decision. Then (4.7) may be applied with the result

Write this $L_{n-1}(\mathbf{x}_{n-1}, \mathbf{e}_{n-1})$; it is the expected loss of the best choice of $e_n$ and $d$, given results $\mathbf{x}_{n-1}$ from $\mathbf{e}_{n-1}$. The same principle can now be applied to the choice of
