Construct (on an entirely irrelevant probability space) an auxiliary random variable Z such that

(9.13)    P[Z = y_j] = e^{τ y_j} P[Y = y_j] / ρ

for each y_j in the range of Y. Note that the probabilities on the right do add to 1, since Σ_j e^{τ y_j} P[Y = y_j] = M(τ) = ρ. The moment generating function of Z is

(9.14)    Σ_j e^{t y_j} e^{τ y_j} P[Y = y_j] / ρ = M(τ + t)/ρ,

and therefore

(9.15)    E[Z] = M′(τ)/ρ = 0,

because τ minimizes M. For all positive t, P[Y > 0] = P[e^{tY} > 1] ≤ M(t) by Markov's inequality (5.31), and hence

(9.16)    P[Y > 0] ≤ ρ.
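The change of measure in (9.13)–(9.15) can be checked numerically. The following is a small sketch (an illustration added here, not part of the original text): for a concrete two-valued Y with E[Y] < 0 and P[Y > 0] > 0, it locates the minimizing τ by ternary search, builds the tilted variable Z, and confirms that the tilted probabilities add to 1 and that E[Z] = 0. The particular values taken by Y are arbitrary choices.

```python
from math import exp

# Y takes value ys[i] with probability ps[i]; E[Y] < 0, P[Y > 0] > 0.
ys = [2.0, -1.0]
ps = [0.2, 0.8]

def M(t):
    """Moment generating function of Y."""
    return sum(p * exp(t * y) for y, p in zip(ys, ps))

# Locate tau minimizing the convex function M by ternary search on [0, 10].
lo, hi = 0.0, 10.0
for _ in range(200):
    m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
    if M(m1) < M(m2):
        hi = m2
    else:
        lo = m1
tau = (lo + hi) / 2
rho = M(tau)

# Tilted variable Z of (9.13): P[Z = y] = e^{tau*y} P[Y = y] / rho.
tilted = [p * exp(tau * y) / rho for y, p in zip(ys, ps)]

total = sum(tilted)                              # adds to 1, since M(tau) = rho
mean_Z = sum(y * q for y, q in zip(ys, tilted))  # 0, since tau minimizes M
```

For this Y, the minimizer solves .4e^{2t} = .8e^{−t}, so τ = (log 2)/3.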
Inequalities in the other direction are harder to obtain. If Σ′ denotes summation over those indices j for which y_j > 0, then

(9.17)    P[Y > 0] = Σ′ P[Y = y_j] = ρ Σ′ e^{−τ y_j} P[Z = y_j].

Put the final sum here in the form e^{−θ}, and let p = P[Z > 0]. By (9.16), θ ≥ 0. Since log x is concave, Jensen's inequality (5.33) gives

−θ = log Σ′ e^{−τ y_j} p^{−1} P[Z = y_j] + log p
   ≥ Σ′ (−τ y_j) p^{−1} P[Z = y_j] + log p
   = −τ p^{−1} Σ′ y_j P[Z = y_j] + log p.

By (9.15) and Lyapounov's inequality (5.37), Σ′ y_j P[Z = y_j] ≤ E[|Z|] ≤ s, where s² = E[Z²].

150    PROBABILITY

The last two inequalities give

(9.18)    0 ≤ θ ≤ τ s p^{−1} − log p.

Thus P[Y > 0] = ρ e^{−θ}, where θ satisfies (9.18). To use (9.18) requires a lower bound for P[Z > 0].
Theorem 9.2. If E[Z] = 0, E[Z²] = s², and E[Z⁴] = ξ, where s² > 0, then P[Z > 0] ≥ s⁴/4ξ.†

PROOF. Let Z⁺ = Z·I[Z ≥ 0] and Z⁻ = −Z·I[Z < 0]. Then Z⁺ and Z⁻ are nonnegative, Z = Z⁺ − Z⁻, Z² = (Z⁺)² + (Z⁻)², and

(9.19)    s² = E[Z²] = E[(Z⁺)²] + E[(Z⁻)²].

Let p = P[Z > 0]. By Schwarz's inequality (5.36),

E[(Z⁺)²] = E[I[Z > 0]·Z²] ≤ E^{1/2}[I[Z > 0]]·E^{1/2}[Z⁴] = p^{1/2} ξ^{1/2}.

By Hölder's inequality (5.35) (for p = 3/2 and q = 3),

E[(Z⁻)²] = E[(Z⁻)^{2/3}(Z⁻)^{4/3}] ≤ E^{2/3}[Z⁻]·E^{1/3}[(Z⁻)⁴] ≤ E^{2/3}[Z⁻]·ξ^{1/3}.

Since E[Z] = 0, another application of Hölder's inequality (for p = 4 and q = 4/3) gives

E[Z⁻] = E[Z⁺] = E[Z·I[Z > 0]] ≤ E^{1/4}[Z⁴]·E^{3/4}[I[Z > 0]] = ξ^{1/4} p^{3/4}.

Combining these three inequalities with (9.19) gives

s² ≤ p^{1/2} ξ^{1/2} + (ξ^{1/4} p^{3/4})^{2/3} ξ^{1/3} = 2 p^{1/2} ξ^{1/2},

and squaring out this inequality gives p ≥ s⁴/4ξ. ∎

†For a related result, see Problem 25.19.
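Theorem 9.2 is easy to verify on a concrete distribution. The sketch below (an added illustration; the two-point distribution is an arbitrary choice) computes s², ξ, and P[Z > 0] for a centered Z and checks the bound P[Z > 0] ≥ s⁴/4ξ.

```python
# Check P[Z > 0] >= s^4 / (4 xi) for a simple centered two-point Z.
zs = [3.0, -1.0]
qs = [0.25, 0.75]          # E[Z] = 0.25*3 - 0.75 = 0

mean = sum(z * q for z, q in zip(zs, qs))
s2 = sum(z ** 2 * q for z, q in zip(zs, qs))   # E[Z^2]
xi = sum(z ** 4 * q for z, q in zip(zs, qs))   # E[Z^4]
p_pos = sum(q for z, q in zip(zs, qs) if z > 0)

bound = s2 ** 2 / (4 * xi)   # here 9/84, while P[Z > 0] = 0.25
```

The bound is far from tight here, but it is dimensionally right: it is invariant under rescaling Z.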
SECTION 9. LARGE DEVIATIONS AND THE ITERATED LOGARITHM
Chernoff's Theorem†

Theorem 9.3. Let X₁, X₂, … be independent, identically distributed simple random variables satisfying E[X_n] < 0 and P[X_n > 0] > 0, let M(t) be their common moment generating function, and put ρ = inf_t M(t). Then

(9.20)    lim_{n→∞} (1/n) log P[X₁ + ⋯ + X_n > 0] = log ρ.

PROOF. Put Y_n = X₁ + ⋯ + X_n. Then E[Y_n] < 0 and P[Y_n > 0] ≥ Pⁿ[X₁ > 0] > 0, and so the hypotheses of Theorem 9.1 are satisfied. Define ρ_n and τ_n by inf_t M_n(t) = M_n(τ_n) = ρ_n, where M_n(t) is the moment generating function of Y_n. Since M_n(t) = Mⁿ(t), it follows that ρ_n = ρⁿ and τ_n = τ, where M(τ) = ρ.

Let Z_n be the analogue for Y_n of the Z described by (9.13). Its moment generating function (see (9.14)) is M_n(τ + t)/ρⁿ = (M(τ + t)/ρ)ⁿ. This is also the moment generating function of V₁ + ⋯ + V_n for independent random variables V₁, …, V_n each having moment generating function M(τ + t)/ρ. Now each V_i has (see (9.15)) mean 0 and some positive variance σ² and fourth moment ξ independent of i. Since Z_n must have the same moments as V₁ + ⋯ + V_n, it has mean 0, variance s_n² = nσ², and fourth moment ξ_n = nξ + 3n(n − 1)σ⁴ = O(n²) (see (6.2)). By Theorem 9.2, P[Z_n > 0] ≥ s_n⁴/4ξ_n ≥ a for some positive a independent of n. By Theorem 9.1, then,

P[Y_n > 0] = ρⁿ e^{−θ_n},

where 0 ≤ θ_n ≤ τ_n s_n a^{−1} − log a = τ a^{−1} σ √n − log a. This gives (9.20) and shows, in fact, that the rate of convergence is O(n^{−1/2}). ∎
This result is important in the theory of statistical hypothesis testing. An informal treatment of the Bernoulli case will illustrate the connection.

Suppose S_n = X₁ + ⋯ + X_n, where the X_i are independent and assume the values 1 and 0 with probabilities p and q = 1 − p. Now P[S_n ≥ na] = P[Σ_{k≤n}(X_k − a) ≥ 0], and Chernoff's theorem applies if p < a < 1. In this case M(t) = E[e^{t(X₁ − a)}] = e^{−at}(pe^t + q), and minimizing over t shows that ρ = e^{−K(a, p)}, where

K(a, p) = a log(a/p) + (1 − a) log((1 − a)/q).

Hence

(9.21)    P[S_n ≥ na] ≈ e^{−nK(a, p)}

in the sense that (1/n) log P[S_n ≥ na] → −K(a, p).

Suppose now that p is known to be either p₁ or p₂, where p₁ < p₂, and the problem is to decide between the hypotheses H₁: p = p₁ and H₂: p = p₂ on the basis of S_n. Decide in favor of H₂ if S_n ≥ na and in favor of H₁ if S_n < na, where a is some number satisfying p₁ < a < p₂. The problem is to find an advantageous value for the threshold a.

By (9.21),

(9.22)    P[S_n ≥ na | H₁] ≈ e^{−nK(a, p₁)},

where the notation indicates that the probability is calculated for p = p₁, that is, under the assumption of H₁. By symmetry,

(9.23)    P[S_n < na | H₂] ≈ e^{−nK(a, p₂)}.

The left sides of (9.22) and (9.23) are the probabilities of erroneously deciding in favor of H₂ when H₁ is, in fact, true and of erroneously deciding in favor of H₁ when H₂ is, in fact, true: the probabilities describing the level and power of the test. Suppose a is chosen so that K(a, p₁) = K(a, p₂), which makes the two error probabilities approximately equal. This constraint gives for a a linear equation with solution

(9.24)    a(p₁, p₂) = log(q₁/q₂) / [log(p₂/p₁) + log(q₁/q₂)],

where q_i = 1 − p_i. The common error probability is approximately e^{−nK(a, p₁)} for this value of a, and so the larger K(a, p₁) is, the easier it is to distinguish statistically between p₁ and p₂.

Although K(a(p₁, p₂), p₁) is a complicated function, it has a simple approximation for p₁ near p₂. As x → 0, log(1 + x) = x − ½x² + O(x³). Using this in the definition of K and collecting terms gives

(9.25)    K(p + x, p) = x²/(2pq) + O(x³),    x → 0.

Fix p₁ = p, and let p₂ = p + t; (9.24) becomes a function φ(t) of t, and expanding the logarithms gives

(9.26)    φ(t) = p + ½t + O(t²),    t → 0,

after some reductions. Finally, (9.25) and (9.26) together imply that

(9.27)    K(φ(t), p) = t²/(8pq) + O(t³),    t → 0.

In distinguishing p₁ = p from p₂ = p + t for small t, if a is chosen to equalize the two error probabilities, then their common value is about e^{−nt²/8pq}. For t fixed, the nearer p is to ½, the larger this probability is and the more difficult it is to distinguish p from p + t. As an example, compare p = .1 with p = .5. Now .36nt²/(8(.1)(.9)) = nt²/(8(.5)(.5)). With a sample only 36 percent as large, .1 can therefore be distinguished from .1 + t with about the same precision as .5 can be distinguished from .5 + t.
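The formulas (9.21)–(9.27) can be exercised numerically. Below is a small sketch (an added illustration; the values p₁ = .5, p₂ = .55 are arbitrary) computing K and the equalizing threshold (9.24), then checking that K(a, p₁) = K(a, p₂), that a ≈ p₁ + t/2 as in (9.26), and that the common rate is close to t²/(8p₁q₁) as in (9.27).

```python
from math import log

def K(a, p):
    """K(a, p) = a log(a/p) + (1-a) log((1-a)/(1-p)), the rate in (9.21)."""
    return a * log(a / p) + (1 - a) * log((1 - a) / (1 - p))

def threshold(p1, p2):
    """The a of (9.24), which equalizes K(a, p1) = K(a, p2)."""
    q1, q2 = 1 - p1, 1 - p2
    return log(q1 / q2) / (log(p2 / p1) + log(q1 / q2))

p1, p2 = 0.5, 0.55
a = threshold(p1, p2)
# K(a, p1) == K(a, p2) by construction; for t = p2 - p1 small,
# a is close to p1 + t/2, and K(a, p1) is close to t^2 / (8 p1 q1).
```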
The Law of the Iterated Logarithm
The analysis of the rate at which S_n/n approaches the mean depends on the following variant of the theorem on large deviations.

Theorem 9.4. Let S_n = X₁ + ⋯ + X_n, where the X_n are independent and identically distributed simple random variables with mean 0 and variance 1. If the constants a_n satisfy

(9.28)    a_n → ∞,    a_n/√n → 0,

then

(9.29)    P[S_n > a_n √n] = e^{−a_n²(1 + ζ_n)/2}

for a sequence ζ_n going to 0.
PROOF. Put Y_n = S_n − a_n√n = Σ_{k=1}^n (X_k − a_n/√n). Then E[Y_n] < 0. Since X₁ has mean 0 and variance 1, P[X₁ > 0] > 0, and it follows by (9.28) that P[X₁ > a_n/√n] > 0 for n sufficiently large, in which case P[Y_n > 0] ≥ Pⁿ[X₁ − a_n/√n > 0] > 0. Thus Theorem 9.1 applies to Y_n for all large enough n.

Let M_n(t), ρ_n, τ_n, and Z_n be associated with Y_n as in the theorem. If m(t) and c(t) are the moment and cumulant generating functions of the X_n, then M_n(t) is the nth power of the moment generating function e^{−t a_n/√n} m(t) of X₁ − a_n/√n, and so Y_n has cumulant generating function

(9.30)    C_n(t) = −t a_n √n + n c(t).

Since τ_n is the unique minimum of C_n(t), and since C_n′(t) = −a_n√n + n c′(t), τ_n is determined by the equation c′(τ_n) = a_n/√n. Since X₁ has mean 0 and variance 1, it follows by (9.6) that

(9.31)    c(0) = c′(0) = 0,    c″(0) = 1.

Now c′(t) is nondecreasing because c(t) is convex, and since c′(τ_n) = a_n/√n goes to 0, τ_n must therefore go to 0 as well and must in fact be O(a_n/√n). By the second-order mean-value theorem for c′(t), a_n/√n = c′(τ_n) = τ_n + O(τ_n²), from which follows

(9.32)    τ_n = (a_n/√n)(1 + o(1)).

By the third-order mean-value theorem for c(t),

log ρ_n = C_n(τ_n) = −τ_n a_n √n + n c(τ_n) = −τ_n a_n √n + n[½τ_n² + O(τ_n³)].

Applying (9.32) gives

(9.33)    log ρ_n = −½ a_n² (1 + o(1)).

Now (see (9.14)) Z_n has moment generating function M_n(τ_n + t)/ρ_n and (see (9.30)) cumulant generating function

D_n(t) = C_n(τ_n + t) − log ρ_n = −(τ_n + t) a_n √n + n c(t + τ_n) − log ρ_n.

The mean of Z_n is D_n′(0) = 0. Its variance s_n² is D_n″(0); by (9.31) this is

(9.34)    s_n² = n c″(τ_n) = n(c″(0) + O(τ_n)) = n(1 + o(1)).

The fourth cumulant of Z_n is D_n⁗(0) = n c⁗(τ_n) = O(n). By the formula (9.9) relating moments and cumulants (applicable because E[Z_n] = 0), E[Z_n⁴] = 3s_n⁴ + D_n⁗(0). Therefore, E[Z_n⁴]/s_n⁴ → 3, and it follows by Theorem 9.2 that there exists an a such that P[Z_n > 0] ≥ a > 0 for all sufficiently large n.

By Theorem 9.1, P[Y_n > 0] = ρ_n e^{−θ_n} with 0 ≤ θ_n ≤ τ_n s_n a^{−1} − log a. By (9.28), (9.32), and (9.34), θ_n = O(a_n) = o(a_n²), and it follows by (9.33) that

P[Y_n > 0] = e^{−a_n²(1 + o(1))/2}. ∎
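Theorem 9.4 can be checked against exact binomial tails. In the sketch below (an added illustration; the choice a_n = n^{1/4} is arbitrary but satisfies (9.28)), the X_i are symmetric ±1 steps, so P[S_n > a_n√n] is computable exactly, and the ratio −2 log P[S_n > a_n√n] / a_n² should drift toward 1, as (9.29) asserts.

```python
from math import comb, log

def lil_ratio(n):
    """-2 log P[S_n > a_n sqrt(n)] / a_n^2 for a_n = n**0.25 and +/-1 steps."""
    a = n ** 0.25
    kmin = int((n + a * n ** 0.5) / 2) + 1        # S_n = 2k - n > a sqrt(n)
    prob = sum(comb(n, k) for k in range(kmin, n + 1)) / 2 ** n
    return -2 * log(prob) / a ** 2

# lil_ratio(n) is above 1 and decreases toward 1 as n grows.
```

The convergence is slow because ζ_n in (9.29) decays only like log n / a_n², which is the reason the iterated-logarithm scale below is so delicate.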
The law of the iterated logarithm is this:

Theorem 9.5. Let S_n = X₁ + ⋯ + X_n, where the X_n are independent, identically distributed simple random variables with mean 0 and variance 1. Then

(9.35)    P[lim sup_n S_n/(2n log log n)^{1/2} = 1] = 1.

Equivalent to (9.35) is the assertion that for positive ε,

(9.36)    P[S_n > (1 + ε)(2n log log n)^{1/2} i.o.] = 0

and

(9.37)    P[S_n > (1 − ε)(2n log log n)^{1/2} i.o.] = 1.

The set in (9.35) is, in fact, the intersection over positive rational ε of the sets in (9.37) minus the union over positive rational ε of the sets in (9.36).
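A simulation gives a feel for (9.35), though of course it proves nothing. The sketch below (an added illustration; the seed and horizon are arbitrary) follows one ±1 random walk and records the largest observed value of S_n/√(2n log log n); by the theorem the lim sup of this ratio is 1 with probability 1, so over a long finite stretch the running maximum is typically a moderate number somewhat below 1.

```python
import random
from math import log, sqrt

# One sample path of a +/-1 walk; track sup of S_n / sqrt(2 n log log n).
random.seed(7)                    # arbitrary fixed seed for reproducibility
N = 100_000
s = 0
ratio_sup = 0.0
for n in range(1, N + 1):
    s += random.choice((-1, 1))
    if n >= 100:                  # log log n is only sensible for n > e^e
        ratio_sup = max(ratio_sup, s / sqrt(2 * n * log(log(n))))
```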
The idea of the proof is this. Write

(9.38)    A_n^± = [S_n > (1 ± ε)φ(n)],    where φ(n) = (2n log log n)^{1/2}.

By (9.29), P(A_n^±) is near (log n)^{−(1±ε)²}. If n_k increases exponentially, say n_k ~ θ^k for θ > 1, then P(A_{n_k}^±) is of the order k^{−(1±ε)²}. Now Σ_k k^{−(1±ε)²} converges if the sign is + and diverges if the sign is −. It will follow by the first Borel–Cantelli lemma that there is probability 0 that A_{n_k}^+ occurs for infinitely many k. In proving (9.36), an extra argument is required to get around the fact that the A_n^+ for n ≠ n_k must also be accounted for (this requires choosing θ near 1). If the A_{n_k}^− were independent, it would follow by the second Borel–Cantelli lemma that with probability 1, A_{n_k}^− occurs for infinitely many k, which would in turn imply (9.37). An extra argument is required to get around the fact that the A_{n_k}^− are dependent (this requires choosing θ large).

For the proof of (9.36) a preliminary result is needed. Put M_k = max{S₀, S₁, …, S_k}, where S₀ = 0.

Theorem 9.6. If the X_k are independent simple random variables with mean 0 and variance 1, then for α > √2,

(9.39)    P[M_n ≥ α√n] ≤ 2 P[S_n ≥ (α − √2)√n].

PROOF. Let A_j = [M_{j−1} < α√n ≤ S_j], j = 1, …, n; these sets are disjoint, and their union is [M_n ≥ α√n]. If S_j ≥ α√n and S_n − S_j ≥ −√(2n), then S_n ≥ (α − √2)√n, and hence

P[M_n ≥ α√n] ≤ P[S_n ≥ (α − √2)√n] + Σ_{j=1}^n P(A_j ∩ [S_n − S_j < −√(2n)]).

Since S_n − S_j has variance n − j, it follows by independence and Chebyshev's inequality that the probability in the sum is at most

P(A_j) · (n − j)/(2n) ≤ ½ P(A_j),

and the sum is therefore at most ½ P[M_n ≥ α√n]. ∎
PROOF OF (9.36). Given ε, choose θ so that θ > 1 but

θ² < 1 + ε.

Let n_k = ⌊θ^k⌋ and x_k = θ(2 log log n_k)^{1/2}. By (9.29) and (9.39),

P[M_{n_k} ≥ x_k √(n_k)] ≤ 2 P[S_{n_k} ≥ (x_k − √2)√(n_k)] = 2 e^{−(x_k − √2)²(1 + ζ_k)/2},

where ζ_k → 0. The negative of the exponent is asymptotically θ² log k and hence for large k exceeds θ log k, so that

P[M_{n_k} ≥ x_k √(n_k)] ≤ 2 k^{−θ}.

Since θ > 1, it follows by the first Borel–Cantelli lemma that there is probability 0 that (see (9.38))

(9.40)    M_{n_k} ≥ x_k √(n_k)

for infinitely many k. Suppose that n_{k−1} < n ≤ n_k and that

(9.41)    S_n > (1 + ε)φ(n);

since 1 + ε > θ² > θ^{3/2}, it follows that for large k, (1 + ε)φ(n) ≥ (1 + ε)φ(n_{k−1}) ≥ x_k√(n_k), so that (9.41) implies M_{n_k} ≥ S_n > x_k√(n_k), which is (9.40). With probability 1, therefore, (9.41) holds for only finitely many n, which proves (9.36).

PROBLEMS

… P[S_n > n^{3/4}(log n)^{(1+ε)/4} i.o.] = 0. Use (9.29) to give a simple proof that P[S_n > (3n log n)^{1/2} i.o.] = 0.
9.7. Show that (9.35) is true if S_n is replaced by |S_n| or max_{k≤n} S_k or max_{k≤n} |S_k|.
CHAPTER 2

Measure
SECTION 10. GENERAL MEASURES

Lebesgue measure on the unit interval was central to the ideas in Chapter 1. Lebesgue measure on the entire real line is important in probability as well as in analysis generally, and a uniform treatment of this and other examples requires a notion of measure for which infinite values are possible. The present chapter extends the ideas of Sections 2 and 3 to this more general setting.

Classes of Sets
The σ-field of Borel sets in (0, 1] played an essential role in Chapter 1, and it is necessary to construct the analogous classes for the entire real line and for k-dimensional Euclidean space.

Example 10.1. Let x = (x₁, …, x_k) be the generic point of Euclidean k-space R^k. The bounded rectangles

(10.1)    [x = (x₁, …, x_k): a_i < x_i ≤ b_i, i = 1, …, k]

will play in R^k the role the intervals (a, b] played in (0, 1]. Let ℛ^k be the σ-field generated by these rectangles. This is the analogue of the class ℬ of Borel sets in (0, 1]; see Example 2.6. The elements of ℛ^k are the k-dimensional Borel sets. For k = 1 they are also called the linear Borel sets.

Call the rectangle (10.1) rational if the a_i and b_i are all rational. If G is an open set in R^k and y ∈ G, then there is a rational rectangle A_y such that y ∈ A_y ⊂ G. But then G = ∪_{y∈G} A_y, and since there are only countably many rational rectangles, this is a countable union. Thus ℛ^k contains the open sets. Since a closed set has an open complement, ℛ^k also contains the closed sets. Just as ℬ contains all the sets in (0, 1] that actually arise in ordinary analysis and probability theory, ℛ^k contains all the sets in R^k that actually arise.

The σ-field ℛ^k is generated by subclasses other than the class of rectangles. If A_n is the x-set where a_i < x_i < b_i + n^{−1}, i = 1, …, k, then A_n is open and (10.1) is ∩_n A_n. Thus ℛ^k is generated by the open sets. Similarly, it is generated by the closed sets. Now an open set is a countable union of rational rectangles. Therefore, the (countable) class of rational rectangles generates ℛ^k. ∎
The σ-field ℛ¹ on the line R¹ is by definition generated by the finite intervals. The σ-field ℬ in (0, 1] is generated by the subintervals of (0, 1]. The question naturally arises whether the elements of ℬ are the elements of ℛ¹ that happen to lie inside (0, 1], and the answer is yes. If 𝒜 is a class of sets in a space Ω and Ω₀ is a subset of Ω, let 𝒜 ∩ Ω₀ = [A ∩ Ω₀: A ∈ 𝒜].

Theorem 10.1. (i) If ℱ is a σ-field in Ω, then ℱ ∩ Ω₀ is a σ-field in Ω₀.
(ii) If 𝒜 generates the σ-field ℱ in Ω, then 𝒜 ∩ Ω₀ generates the σ-field ℱ ∩ Ω₀ in Ω₀: σ(𝒜 ∩ Ω₀) = σ(𝒜) ∩ Ω₀.

PROOF. Of course Ω₀ = Ω ∩ Ω₀ lies in ℱ ∩ Ω₀. If B lies in ℱ ∩ Ω₀, so that B = A ∩ Ω₀ for an A ∈ ℱ, then Ω₀ − B = (Ω − A) ∩ Ω₀ lies in ℱ ∩ Ω₀. If B_n lies in ℱ ∩ Ω₀ for all n, so that B_n = A_n ∩ Ω₀ for an A_n ∈ ℱ, then ∪_n B_n = (∪_n A_n) ∩ Ω₀ lies in ℱ ∩ Ω₀. Hence part (i).

Let ℱ₀ be the σ-field 𝒜 ∩ Ω₀ generates in Ω₀. Since 𝒜 ∩ Ω₀ ⊂ ℱ ∩ Ω₀ and ℱ ∩ Ω₀ is a σ-field by part (i), ℱ₀ ⊂ ℱ ∩ Ω₀. Now ℱ ∩ Ω₀ ⊂ ℱ₀ will follow if it is shown that A ∈ ℱ implies A ∩ Ω₀ ∈ ℱ₀, or, to put it another way, if it is shown that ℱ is contained in 𝒢 = [A ⊂ Ω: A ∩ Ω₀ ∈ ℱ₀]. Since A ∈ 𝒜 implies that A ∩ Ω₀ lies in 𝒜 ∩ Ω₀ and hence in ℱ₀, it follows that 𝒜 ⊂ 𝒢. It is therefore enough to show that 𝒢 is a σ-field in Ω. Since Ω ∩ Ω₀ = Ω₀ lies in ℱ₀, it follows that Ω ∈ 𝒢. If A ∈ 𝒢, then (Ω − A) ∩ Ω₀ = Ω₀ − (A ∩ Ω₀) lies in ℱ₀ and hence Ω − A ∈ 𝒢. If A_n ∈ 𝒢 for all n, then (∪_n A_n) ∩ Ω₀ = ∪_n (A_n ∩ Ω₀) lies in ℱ₀ and hence ∪_n A_n ∈ 𝒢. ∎

If Ω₀ ∈ ℱ, then ℱ ∩ Ω₀ = [A: A ⊂ Ω₀, A ∈ ℱ]. If Ω = R¹, Ω₀ = (0, 1], and ℱ = ℛ¹, and if 𝒜 is the class of finite intervals on the line, then 𝒜 ∩ Ω₀ is the class of subintervals of (0, 1], and ℬ = σ(𝒜 ∩ Ω₀) is given by

(10.2)    ℬ = ℛ¹ ∩ (0, 1].

A subset of (0, 1] is thus a Borel set (lies in ℬ) if and only if it is a linear Borel set (lies in ℛ¹), and the distinction in terminology can be dropped.
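Theorem 10.1(ii) can be verified by brute force on a small finite space, where a generated σ-field is obtained by closing under complement and union. The sketch below (an added illustration; the space and generators are arbitrary choices) checks that σ(𝒜 ∩ Ω₀) = σ(𝒜) ∩ Ω₀.

```python
from itertools import combinations

def generate_sigma_field(space, gens):
    """Smallest class containing gens, closed under complement and union
    (on a finite space this is the generated sigma-field)."""
    field = {frozenset(), frozenset(space)} | {frozenset(g) for g in gens}
    grew = True
    while grew:
        grew = False
        for a in list(field):
            c = frozenset(space) - a
            if c not in field:
                field.add(c); grew = True
        for a, b in combinations(list(field), 2):
            if a | b not in field:
                field.add(a | b); grew = True
    return field

omega = {1, 2, 3, 4, 5}
omega0 = {1, 2, 3}
gens = [{1, 2}, {2, 3, 4}]

F = generate_sigma_field(omega, gens)                        # sigma(A)
lhs = generate_sigma_field(omega0, [set(g) & omega0 for g in gens])
rhs = {a & frozenset(omega0) for a in F}                     # sigma(A) trace
```

Here the atoms of σ(𝒜) are {1}, {2}, {3, 4}, {5}, so σ(𝒜) has 16 elements, and both sides of the identity are the full power set of Ω₀.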
Conventions Involving ∞

Measures assume values in the set [0, ∞] consisting of the ordinary nonnegative reals and the special value ∞, and some arithmetic conventions are called for. For x, y ∈ [0, ∞], x ≤ y means that y = ∞ or else x and y are finite (that is, are ordinary real numbers) and x ≤ y holds in the usual sense. Similarly, x < y means that y = ∞ and x is finite or else x and y are both finite and x < y holds in the usual sense.

For a finite or infinite sequence x, x₁, x₂, … in [0, ∞],

(10.3)    x = Σ_k x_k

means that either (i) x = ∞ and x_k = ∞ for some k, or (ii) x = ∞ and x_k < ∞ for all k and Σ_k x_k is an ordinary divergent infinite series, or (iii) x < ∞ and x_k < ∞ for all k and (10.3) holds in the usual sense for Σ_k x_k an ordinary finite sum or convergent infinite series. By these conventions and Dirichlet's theorem [A26], the order of summation in (10.3) has no effect on the sum.

For an infinite sequence x, x₁, x₂, … in [0, ∞],

(10.4)    x_k ↑ x

means in the first place that x_k ≤ x_{k+1} ≤ x and in the second place that either (i) x < ∞ and there is convergence in the usual sense, or (ii) x_k = ∞ for some k, or (iii) x = ∞ and the x_k are finite reals converging to infinity in the usual sense.

Measures
A set function μ on a field ℱ in Ω is a measure if it satisfies these conditions:

(i) μ(A) ∈ [0, ∞] for A ∈ ℱ;
(ii) μ(∅) = 0;
(iii) if A₁, A₂, … is a disjoint sequence of ℱ-sets and if ∪_{k=1}^∞ A_k ∈ ℱ, then (see (10.3))

μ(∪_{k=1}^∞ A_k) = Σ_{k=1}^∞ μ(A_k).

The measure μ is finite or infinite as μ(Ω) < ∞ or μ(Ω) = ∞; it is a probability measure if μ(Ω) = 1, as in Chapter 1. If Ω = A₁ ∪ A₂ ∪ ⋯ for some finite or countable sequence of ℱ-sets satisfying μ(A_k) < ∞, then μ is σ-finite. The significance of this concept will be seen later. A finite measure is by definition σ-finite; a σ-finite measure may be finite or infinite. If 𝒜 is a subclass of ℱ, then μ is σ-finite on 𝒜 if Ω = ∪_k A_k for some finite or infinite sequence of 𝒜-sets satisfying μ(A_k) < ∞. It is important to understand that σ-finiteness is a joint property of the space Ω, the measure μ, and the class 𝒜.

If μ is a measure on a σ-field ℱ in Ω, the triple (Ω, ℱ, μ) is a measure space. (This term is not used if ℱ is merely a field.) It is an infinite, a σ-finite, a finite, or a probability measure space according as μ has the corresponding property. If μ(A^c) = 0 for an ℱ-set A, then A is a support of μ, and μ is concentrated on A. For a finite measure, A is a support if and only if μ(A) = μ(Ω).

The pair (Ω, ℱ) itself is a measurable space if ℱ is a σ-field in Ω. To say that μ is a measure on (Ω, ℱ) indicates clearly both the space and the class of sets involved.

As in the case of probability measures, (iii) above is the condition of countable additivity, and it implies finite additivity: if A₁, …, A_n are disjoint ℱ-sets, then

μ(A₁ ∪ ⋯ ∪ A_n) = Σ_{k=1}^n μ(A_k).

As in the case of probability measures, if this holds for n = 2, then it extends inductively to all n.
Example 10.2. A measure μ on (Ω, ℱ) is discrete if there are finitely or countably many points ω_i in Ω and masses m_i in [0, ∞] such that μ(A) = Σ_{ω_i ∈ A} m_i for A ∈ ℱ. It is an infinite, a finite, or a probability measure as Σ_i m_i diverges, or converges, or converges to 1; the last case was treated in Example 2.9. If ℱ contains each singleton {ω_i}, then μ is σ-finite if and only if m_i < ∞ for all i. ∎

Example 10.3. Let ℱ be the σ-field of all subsets of an arbitrary Ω, and let μ(A) be the number of points in A, where μ(A) = ∞ if A is not finite. This μ is counting measure; it is finite if and only if Ω is finite, and is σ-finite if and only if Ω is countable. Even if ℱ does not contain every subset of Ω, counting measure is well defined on ℱ. ∎

Example 10.4. Specifying a measure includes specifying its domain. If μ is a measure on a field ℱ and ℱ₀ is a field contained in ℱ, then the restriction μ₀ of μ to ℱ₀ is also a measure. Although often denoted by the same symbol, μ₀ is really a different measure from μ unless ℱ₀ = ℱ. Its properties may be different: if μ is counting measure on the σ-field ℱ of all subsets of a countably infinite Ω, then μ is σ-finite, but its restriction to the σ-field ℱ₀ = {∅, Ω} is not σ-finite. ∎
Certain properties of probability measures carry over immediately to the general case. First, μ is monotone: μ(A) ≤ μ(B) if A ⊂ B. This is derived, just like its special case (2.5), from μ(A) + μ(B − A) = μ(B). But it is possible to go on and write μ(B − A) = μ(B) − μ(A) only if μ(B) < ∞. If μ(B) = ∞ and μ(A) < ∞, then μ(B − A) = ∞; but for every a ∈ [0, ∞] there are cases where μ(A) = μ(B) = ∞ and μ(B − A) = a. The inclusion-exclusion formula (2.9) also carries over without change to ℱ-sets of finite measure:

(10.5)    μ(∪_{k=1}^n A_k) = Σ_i μ(A_i) − Σ_{i<j} μ(A_i ∩ A_j) + ⋯ + (−1)^{n+1} μ(A₁ ∩ ⋯ ∩ A_n).

The proof of finite subadditivity also goes through just as before:

μ(∪_{k=1}^n A_k) ≤ Σ_{k=1}^n μ(A_k);

here the A_k need not have finite measure.

Theorem 10.2.
Let μ be a measure on a field ℱ.

(i) Continuity from below: If A_n and A lie in ℱ and A_n ↑ A, then† μ(A_n) ↑ μ(A).
(ii) Continuity from above: If A_n and A lie in ℱ and A_n ↓ A, and if μ(A₁) < ∞, then μ(A_n) ↓ μ(A).
(iii) Countable subadditivity: If A₁, A₂, … and ∪_{k=1}^∞ A_k lie in ℱ, then

μ(∪_{k=1}^∞ A_k) ≤ Σ_{k=1}^∞ μ(A_k).

(iv) If μ is σ-finite on ℱ, then ℱ cannot contain an uncountable, disjoint collection of sets of positive μ-measure.

†See (10.4).

PROOF. The proofs of (i) and (iii) are exactly as for the corresponding parts of Theorem 2.1. The same is essentially true of (ii): if μ(A₁) < ∞, subtraction is possible, and A₁ − A_n ↑ A₁ − A implies that μ(A₁) − μ(A_n) = μ(A₁ − A_n) ↑ μ(A₁ − A) = μ(A₁) − μ(A).

There remains (iv). Let [B_θ: θ ∈ Θ] be a disjoint collection of ℱ-sets satisfying μ(B_θ) > 0. Consider an ℱ-set A for which μ(A) < ∞. If θ₁, …, θ_n are distinct indices satisfying μ(A ∩ B_{θ_i}) ≥ ε > 0, then nε ≤ Σ_{i=1}^n μ(A ∩ B_{θ_i}) ≤ μ(A), and so n ≤ μ(A)/ε. Thus the index set [θ: μ(A ∩ B_θ) ≥ ε] is finite,
and hence (take the union over positive rational ε) [θ: μ(A ∩ B_θ) > 0] is countable. Since μ is σ-finite, Ω = ∪_k A_k for some finite or countable sequence of ℱ-sets A_k satisfying μ(A_k) < ∞. But then Θ_k = [θ: μ(A_k ∩ B_θ) > 0] is countable for each k. Since μ(B_θ) > 0, for each θ there is a k for which μ(A_k ∩ B_θ) > 0, and so Θ = ∪_k Θ_k: Θ is indeed countable. ∎

Uniqueness
According to Theorem 3.3, probability measures agreeing on a π-system 𝒫 agree on σ(𝒫). There is an extension to the general case.

Theorem 10.3. Suppose that μ₁ and μ₂ are measures on σ(𝒫), where 𝒫 is a π-system, and suppose they are σ-finite on 𝒫. If μ₁ and μ₂ agree on 𝒫, then they agree on σ(𝒫).

PROOF. Suppose that B ∈ 𝒫 and μ₁(B) = μ₂(B) < ∞, and let ℒ_B be the class of sets A in σ(𝒫) for which μ₁(B ∩ A) = μ₂(B ∩ A). Then ℒ_B is a λ-system containing 𝒫 and hence (Theorem 3.2) containing σ(𝒫).

By σ-finiteness there exist 𝒫-sets B_k satisfying Ω = ∪_k B_k and μ₁(B_k) = μ₂(B_k) < ∞. By the inclusion-exclusion formula (10.5),

μ_α(∪_{i=1}^n (B_i ∩ A)) = Σ_{1≤i≤n} μ_α(B_i ∩ A) − Σ_{1≤i<j≤n} μ_α(B_i ∩ B_j ∩ A) + ⋯

for α = 1, 2 and all n. Since 𝒫 is a π-system containing the B_i, it contains the B_i ∩ B_j, and so on. For each σ(𝒫)-set A, the terms on the right above are therefore the same for α = 1 as for α = 2. The left side is then the same for α = 1 as for α = 2; letting n → ∞ gives μ₁(A) = μ₂(A). ∎

Theorem 10.4. Suppose that μ₁ and μ₂ are finite measures on σ(𝒫), where 𝒫 is a π-system and Ω is a finite or countable union of sets in 𝒫. If μ₁ and μ₂ agree on 𝒫, then they agree on σ(𝒫).

PROOF. By hypothesis, Ω = ∪_k B_k for 𝒫-sets B_k, and of course μ_α(B_k) ≤ μ_α(Ω) < ∞, α = 1, 2. Thus μ₁ and μ₂ are σ-finite on 𝒫, and Theorem 10.3 applies. ∎
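Theorem 10.3 can be illustrated on a finite space, where every measure is trivially finite and σ-finite. In the sketch below (an added illustration; the masses are arbitrary choices), two different mass functions agree on a π-system 𝒫 and therefore agree on every set of σ(𝒫), while differing on a set outside σ(𝒫).

```python
from itertools import combinations

def close_to_field(space, gens):
    """Generate the (sigma-)field on a finite space by closing under
    complement and pairwise union."""
    field = {frozenset(), frozenset(space)} | {frozenset(g) for g in gens}
    grew = True
    while grew:
        grew = False
        for a in list(field):
            c = frozenset(space) - a
            if c not in field:
                field.add(c); grew = True
        for a, b in combinations(list(field), 2):
            if a | b not in field:
                field.add(a | b); grew = True
    return field

omega = {1, 2, 3, 4}
P = [{1}, {1, 2}]                       # a pi-system: {1} & {1, 2} = {1} is in P
m1 = {1: 1.0, 2: 2.0, 3: 1.0, 4: 3.0}  # two mass functions agreeing on P
m2 = {1: 1.0, 2: 2.0, 3: 2.0, 4: 2.0}

def mu(masses, a):
    return sum(masses[w] for w in a)

sigma_P = close_to_field(omega, P)
agree_on_sigma = all(mu(m1, a) == mu(m2, a) for a in sigma_P)
differ_somewhere = mu(m1, {3}) != mu(m2, {3})   # {3} is not in sigma(P)
```

The atoms of σ(𝒫) are {1}, {2}, {3, 4}; the two measures agree on each atom, hence on all of σ(𝒫), but they split the mass of {3, 4} differently.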
Example 10.5. If 𝒫 consists of the empty set alone, then it is a π-system and σ(𝒫) = {∅, Ω}. Any two finite measures agree on 𝒫, but of course they need not agree on σ(𝒫). Theorem 10.4 does not apply in this case, because Ω is not a countable union of sets in 𝒫. For the same reason, no measure on σ(𝒫) is σ-finite on 𝒫, and hence Theorem 10.3 does not apply either. ∎
Example 10.6. Suppose that (Ω, ℱ) = (R¹, ℛ¹) and 𝒫 consists of the half-infinite intervals (−∞, x]. By Theorem 10.4, two finite measures on ℱ that agree on 𝒫 also agree on ℱ. The 𝒫-sets of finite measure required in the definition of σ-finiteness cannot in this example be made disjoint. ∎
Example 10.7. If a measure on (Ω, ℱ) is σ-finite on a subfield ℱ₀ of ℱ, then Ω = ∪_k B_k for disjoint ℱ₀-sets B_k of finite measure: if they are not disjoint, replace B_k by B_k ∩ B₁^c ∩ ⋯ ∩ B_{k−1}^c. ∎
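The disjointification in Example 10.7 is mechanical; here is a sketch in code (an added illustration, with arbitrary sample sets):

```python
def disjointify(sets):
    """Replace B_1, B_2, ... by D_k = B_k - (B_1 u ... u B_{k-1})."""
    seen, out = set(), []
    for b in sets:
        out.append(set(b) - seen)   # the part of B_k not covered earlier
        seen |= set(b)
    return out

B = [{1, 2, 3}, {2, 3, 4}, {4, 5}, {1, 5, 6}]
D = disjointify(B)
# The D_k are pairwise disjoint and have the same union as the B_k.
```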
The proof of Theorem 10.3 simplifies slightly if Ω = ∪_k B_k for disjoint 𝒫-sets with μ₁(B_k) = μ₂(B_k) < ∞, because additivity itself can be used in place of the inclusion-exclusion formula.

PROBLEMS
10.1. Show that if conditions (i) and (iii) in the definition of measure hold, and if μ(A) < ∞ for some A ∈ ℱ, then condition (ii) holds.

10.2. On the σ-field of all subsets of Ω = {1, 2, …} put μ(A) = Σ_{k∈A} 2^{−k} if A is finite and μ(A) = ∞ otherwise. Is μ finitely additive? Countably additive?
10.3. (a) In connection with Theorem 10.2(ii), show that if A_n ↓ A and μ(A_k) < ∞ for some k, then μ(A_n) ↓ μ(A).
(b) Find an example in which A_n ↓ A, μ(A_n) = ∞, and A = ∅.
10.4. The natural generalization of (4.9) is

(10.6)    μ(lim inf_n A_n) ≤ lim inf_n μ(A_n) ≤ lim sup_n μ(A_n) ≤ μ(lim sup_n A_n).

Show that the left-hand inequality always holds. Show that the right-hand inequality holds if μ(∪_{k≥n} A_k) < ∞ for some n but can fail otherwise.
10.5. 3.10 A measure space (Ω, ℱ, μ) is complete if A ⊂ B, B ∈ ℱ, and μ(B) = 0 together imply that A ∈ ℱ; the definition is just as in the probability case. Use the ideas of Problem 3.10 to construct a complete measure space (Ω, ℱ⁺, μ⁺) such that ℱ ⊂ ℱ⁺ and μ and μ⁺ agree on ℱ.

10.6. The condition in Theorem 10.2(iv) essentially characterizes σ-finiteness.
(a) Suppose that (Ω, ℱ, μ) has no "infinite atoms," in the sense that for every A in ℱ, if μ(A) = ∞, then there is in ℱ a B such that B ⊂ A and 0 < μ(B) < ∞. Show that if ℱ does not contain an uncountable, disjoint collection of sets of positive measure, then μ is σ-finite. (Use Zorn's lemma.)
(b) Show by example that this is false without the condition that there are no "infinite atoms."
10.7. Example 10.5 shows that Theorem 10.3 fails without the σ-finiteness condition. Construct other examples of this kind.
SECTION 11. OUTER MEASURE

Outer Measure

An outer measure is a set function μ* that is defined for all subsets of a space Ω and has these four properties:

(i) μ*(A) ∈ [0, ∞] for every A ⊂ Ω;
(ii) μ*(∅) = 0;
(iii) μ* is monotone: A ⊂ B implies μ*(A) ≤ μ*(B);
(iv) μ* is countably subadditive: μ*(∪_n A_n) ≤ Σ_n μ*(A_n).

The set function P* defined by (3.1) is an example, one which generalizes:

Example 11.1. Let ρ be a set function on a class 𝒜 in Ω. Assume that ∅ ∈ 𝒜 and ρ(∅) = 0, and that ρ(A) ∈ [0, ∞] for A ∈ 𝒜; ρ and 𝒜 are otherwise arbitrary. Put

(11.1)    μ*(A) = inf Σ_n ρ(A_n),

where the infimum extends over all finite and countable coverings of A by 𝒜-sets A_n. If no such covering exists, take μ*(A) = ∞ in accordance with the convention that the infimum over an empty set is ∞.

That μ* satisfies (i), (ii), and (iii) is clear. If μ*(A_n) = ∞ for some n, then obviously μ*(∪_n A_n) ≤ Σ_n μ*(A_n). Otherwise, given ε, cover each A_n by 𝒜-sets B_{nk} satisfying Σ_k ρ(B_{nk}) ≤ μ*(A_n) + ε/2^n; then μ*(∪_n A_n) ≤ Σ_{n,k} ρ(B_{nk}) ≤ Σ_n μ*(A_n) + ε. Thus μ* is an outer measure. ∎

Define A to be μ*-measurable if

(11.2)    μ*(A ∩ E) + μ*(A^c ∩ E) = μ*(E)

for every E. This is the general version of the definition (3.4) used in Section 3. By subadditivity it is equivalent to

(11.3)    μ*(A ∩ E) + μ*(A^c ∩ E) ≤ μ*(E).

Denote by ℳ(μ*) the class of μ*-measurable sets. The extension property for probability measures in Theorem 3.1 was proved by a sequence of lemmas, the first three of which carry over directly to the case of the general outer measure: if P* is replaced by μ* and ℳ by ℳ(μ*) at each occurrence, the proofs hold word for word, symbol for symbol.
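On a finite space the infimum in (11.1) and the criterion (11.2) can be evaluated by brute force. The sketch below (an added illustration; the class and the "length" ρ are arbitrary choices) builds μ* from coverings and then tests Carathéodory measurability: {2} splits every E additively, while {0} does not, since the only way to cover {0} is by the larger set {0, 1}.

```python
from itertools import chain, combinations

omega = frozenset({0, 1, 2, 3})
# A small class containing the empty set, with rho(a) = number of points of a.
cls = [frozenset(), frozenset({0, 1}), frozenset({2}), frozenset({3}), omega]
rho = {a: len(a) for a in cls}

def mu_star(A):
    """(11.1): inf over coverings of A by sets from cls of the total rho."""
    best = float("inf")
    for r in range(len(cls) + 1):
        for cover in combinations(cls, r):
            if set(A) <= set(chain.from_iterable(cover)):
                best = min(best, sum(rho[c] for c in cover))
    return best

def measurable(A):
    """Caratheodory criterion (11.2), tested against every E in omega."""
    A = frozenset(A)
    subsets = [frozenset(s) for r in range(5) for s in combinations(omega, r)]
    return all(mu_star(E & A) + mu_star(E - A) == mu_star(E) for E in subsets)
```

Here μ*({0}) = 2 even though {0} has one point, and the failure of (11.2) at E = {0, 1} is exactly what excludes {0} from ℳ(μ*).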
In particular, an examination of the arguments shows that ∞ as a possible value for μ* does not require any changes. Lemma 3 in Section 3 becomes this:

Theorem 11.1. If μ* is an outer measure, then ℳ(μ*) is a σ-field, and μ* restricted to ℳ(μ*) is a measure.

This will be used to prove an extension theorem, but it has other applications as well.

Extension

Theorem 11.2. A measure on a field has an extension to the generated σ-field.
If the original measure on the field is σ-finite, then it follows by Theorem 10.3 that the extension is unique. Theorem 11.2 can be deduced from Theorem 11.1 by the arguments used in the proof of Theorem 3.1.† It is unnecessary to retrace the steps, however, because the ideas will appear in stronger form in the proof of the next result, which generalizes Theorem 11.2.

Define a class 𝒜 of subsets of Ω to be a semiring if

(i) ∅ ∈ 𝒜;
(ii) A, B ∈ 𝒜 implies A ∩ B ∈ 𝒜;
(iii) if A, B ∈ 𝒜 and A ⊂ B, then there exist disjoint 𝒜-sets C₁, …, C_n such that B − A = ∪_{k=1}^n C_k.

The class of finite intervals in Ω = R¹ and the class of subintervals of Ω = (0, 1] are the simplest examples of semirings. Note that a semiring need not contain Ω.

Theorem 11.3. Suppose that μ is a set function on a semiring 𝒜. Suppose that μ has values in [0, ∞], that μ(∅) = 0, and that μ is finitely additive and countably subadditive. Then μ extends to a measure on σ(𝒜).

This contains Theorem 11.2, because the conditions are all satisfied if 𝒜 is a field and μ is a measure on it. If Ω = ∪_k A_k for a sequence of 𝒜-sets satisfying μ(A_k) < ∞, then it follows by Theorem 10.3 that the extension is unique.

†See also Problem 11.1.
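Condition (iii) is concrete for intervals: removing a contained (a₁, a₂] from (b₁, b₂] leaves at most two disjoint intervals. A minimal sketch (an added illustration; intervals are encoded as endpoint pairs):

```python
def interval_diff(b, a):
    """For intervals (x, y] with a contained in b, write b - a as a
    disjoint union of at most two intervals."""
    (b1, b2), (a1, a2) = b, a
    assert b1 <= a1 <= a2 <= b2, "a must be contained in b"
    pieces = [(b1, a1), (a2, b2)]
    return [p for p in pieces if p[0] < p[1]]   # drop empty intervals

# (0, 10] minus (3, 7] = (0, 3] together with (7, 10].
parts = interval_diff((0, 10), (3, 7))
```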
PROOF. If A, B, and the C_k are related as in condition (iii) in the definition of semiring, then by finite additivity μ(B) = μ(A) + Σ_{k=1}^n μ(C_k) ≥ μ(A). Thus μ is monotone.

Define an outer measure μ* by (11.1) for ρ = μ:

(11.4)    μ*(A) = inf Σ_n μ(A_n),

the infimum extending over coverings of A by 𝒜-sets.

The first step is to show that 𝒜 ⊂ ℳ(μ*). Suppose that A ∈ 𝒜. If μ*(E) = ∞, then (11.3) holds trivially. If μ*(E) < ∞, for given ε choose 𝒜-sets A_n such that E ⊂ ∪_n A_n and Σ_n μ(A_n) < μ*(E) + ε. Since 𝒜 is a semiring, B_n = A ∩ A_n lies in 𝒜, and A^c ∩ A_n = A_n − B_n has the form ∪_{k=1}^{m_n} C_{nk} for disjoint 𝒜-sets C_{nk}. Note that A_n = B_n ∪ ∪_{k=1}^{m_n} C_{nk}, where the union is disjoint, and that A ∩ E ⊂ ∪_n B_n and A^c ∩ E ⊂ ∪_n ∪_{k=1}^{m_n} C_{nk}. By the definition of μ* and the assumed finite additivity of μ,

μ*(A ∩ E) + μ*(A^c ∩ E) ≤ Σ_n μ(B_n) + Σ_n Σ_{k=1}^{m_n} μ(C_{nk}) = Σ_n μ(A_n) < μ*(E) + ε.

Since ε is arbitrary, (11.3) follows. Thus 𝒜 ⊂ ℳ(μ*).

The next step is to show that μ* and μ agree on 𝒜. If A ⊂ ∪_n A_n for 𝒜-sets A and A_n, then by the assumed countable subadditivity of μ and the monotonicity established above, μ(A) ≤ Σ_n μ(A ∩ A_n) ≤ Σ_n μ(A_n). Therefore, A ∈ 𝒜 implies that μ(A) ≤ μ*(A) and hence, since the reverse inequality is an immediate consequence of (11.4), μ(A) = μ*(A). Thus μ* agrees with μ on 𝒜.

Since 𝒜 ⊂ ℳ(μ*) and ℳ(μ*) is a σ-field (Theorem 11.1), σ(𝒜) ⊂ ℳ(μ*). Since μ* is countably additive when restricted to ℳ(μ*) (Theorem 11.1 again), μ* further restricted to σ(𝒜) is an extension of μ on 𝒜, as required. ∎
Example 11.2. For 𝒜 take the semiring of subintervals of Ω = (0, 1] (together with the empty set). For μ take length λ: λ(a, b] = b − a. The finite additivity and countable subadditivity of λ follow by Theorem 1.3.† By Theorem 11.3, λ extends to a measure on the class σ(𝒜) = ℬ of Borel sets in (0, 1]. ∎

†On a field, countable additivity implies countable subadditivity, and λ is in fact countably additive on 𝒜; but 𝒜 is merely a semiring. Hence the separate consideration of additivity and subadditivity; but see Problem 11.2.
This gives a second construction of Lebesgue measure in the unit interval. In the first construction λ was extended first from the class of intervals to the field ℬ₀ of finite disjoint unions of intervals (see Theorem 2.2) and then by Theorem 11.2 (in its special form Theorem 3.1) from ℬ₀ to ℬ = σ(ℬ₀). Using Theorem 11.3 instead of Theorem 11.2 effects a slight economy, since the extension then goes from 𝒜 directly to ℬ without the intermediate stop at ℬ₀, and the arguments involving (2.13) and (2.14) become unnecessary.

Example 11.3. In Theorem 11.3 take for 𝒜 the semiring of finite intervals on the real line R¹, and consider λ₁(a, b] = b − a. The arguments for Theorem 1.3 in no way require that the (finite) intervals in question be contained in (0, 1], and so λ₁ is finitely additive and countably subadditive on this class 𝒜. Hence λ₁ extends to the σ-field ℛ¹ of linear Borel sets, which is by definition generated by 𝒜. This defines Lebesgue measure λ₁ over the whole real line. ∎
A subset of (0, 1] lies in ℬ if and only if it lies in ℬ¹ (see (10.2)). Now λ₁(A) = λ(A) for subintervals A of (0, 1], and it follows by uniqueness (Theorem 3.3) that λ₁(A) = λ(A) for all A in ℬ. Thus there is no inconsistency in dropping λ₁ and using λ to denote Lebesgue measure on ℬ¹ as well as on ℬ.
Example 11.4. The class of bounded rectangles in Rᵏ is a semiring, a fact needed in the next section. Suppose that A = [x: xᵢ ∈ Iᵢ, i ≤ k] and B = [x: xᵢ ∈ Jᵢ, i ≤ k] are nonempty rectangles, the Iᵢ and Jᵢ being finite intervals. If A ⊂ B, then Iᵢ ⊂ Jᵢ, so that Jᵢ − Iᵢ is a disjoint union Iᵢ′ ∪ Iᵢ″ of intervals (possibly empty). Consider the 3ᵏ disjoint rectangles [x: xᵢ ∈ Uᵢ, i ≤ k], where for each i, Uᵢ is Iᵢ or Iᵢ′ or Iᵢ″. One of these rectangles is A itself, and B − A is the union of the others. The rectangles thus form a semiring. ∎

An Approximation Theorem
If 𝒜 is a semiring, then by Theorem 10.3 a measure on σ(𝒜) is determined by its values on 𝒜 if it is σ-finite there. Theorem 11.4 shows more explicitly how the measure of a σ(𝒜)-set can be approximated by the measures of 𝒜-sets.
Lemma 1. If A, A₁, …, Aₙ are sets in a semiring 𝒜, then there are disjoint 𝒜-sets C₁, …, Cₘ such that

A ∩ A₁ᶜ ∩ ⋯ ∩ Aₙᶜ = C₁ ∪ ⋯ ∪ Cₘ.

PROOF. The case n = 1 follows from the definition of semiring applied to A ∩ A₁ᶜ = A − (A ∩ A₁). If the result holds for n, then A ∩ A₁ᶜ ∩ ⋯ ∩ Aₙ₊₁ᶜ = ∪ⱼ₌₁ᵐ (Cⱼ ∩ Aₙ₊₁ᶜ); apply the case n = 1 to each set in the union. ∎
Theorem 11.4. Suppose that 𝒜 is a semiring, μ is a measure on ℱ = σ(𝒜), and μ is σ-finite on 𝒜.

(i) If B ∈ ℱ and ε > 0, there exists a finite or infinite disjoint sequence A₁, A₂, … of 𝒜-sets such that B ⊂ ∪ₖAₖ and μ((∪ₖAₖ) − B) < ε.
(ii) If B ∈ ℱ and ε > 0, and if μ(B) < ∞, then there exists a finite disjoint sequence A₁, …, Aₙ of 𝒜-sets such that μ(B △ (∪ₖ₌₁ⁿ Aₖ)) < ε.
PROOF. Return to the proof of Theorem 11.3. If μ* is the outer measure defined by (11.4), then ℱ ⊂ ℳ(μ*) and μ* agrees with μ on 𝒜, as was shown. Since μ* restricted to ℱ is a measure, it follows by Theorem 10.3 that μ* agrees with μ on ℱ as well.
Suppose now that B lies in ℱ and μ(B) = μ*(B) < ∞. There exist 𝒜-sets Aₖ such that B ⊂ ∪ₖAₖ and μ(∪ₖAₖ) ≤ Σₖ μ(Aₖ) < μ(B) + ε; but then μ((∪ₖAₖ) − B) < ε. To make the sequence {Aₖ} disjoint, replace Aₖ by Aₖ ∩ A₁ᶜ ∩ ⋯ ∩ Aₖ₋₁ᶜ; by Lemma 1, each of these sets is a finite disjoint union of sets in 𝒜.
Next suppose that B lies in ℱ and μ(B) = μ*(B) = ∞. By σ-finiteness there exist 𝒜-sets Cₘ such that Ω = ∪ₘCₘ and μ(Cₘ) < ∞. By what has just been shown, there exist 𝒜-sets Aₘₖ such that B ∩ Cₘ ⊂ ∪ₖAₘₖ and μ((∪ₖAₘₖ) − (B ∩ Cₘ)) < ε/2ᵐ. The sets Aₘₖ taken all together provide a sequence A₁, A₂, … of 𝒜-sets satisfying B ⊂ ∪ₖAₖ and μ((∪ₖAₖ) − B) < ε. As before, the Aₖ can be made disjoint.
To prove part (ii), consider the Aₖ of part (i). If B has finite measure, so has A = ∪ₖAₖ, and hence by continuity from above (Theorem 10.2(ii)), μ(A − ∪ₖ≤ₙAₖ) < ε for some n. But then μ(B △ (∪ₖ₌₁ⁿ Aₖ)) < 2ε. ∎
If, for example, B is a linear Borel set of finite Lebesgue measure, then λ(B △ (∪ₖ₌₁ⁿ Aₖ)) < ε for some disjoint collection of finite intervals A₁, …, Aₙ.
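Part (ii) can be tried numerically. In the sketch below (our own illustration in Python; the set B and all names are our assumptions, not the text's), B is the union of the disjoint intervals (1/(2n+1), 1/(2n)], n ≥ 1, with λ(B) = 1 − log 2; keeping the largest component intervals until the tail measure drops below ε produces the finite disjoint collection of part (ii):

```python
import math

# B = union over n >= 1 of the disjoint intervals (1/(2n+1), 1/(2n)],
# a Borel set of finite Lebesgue measure lambda(B) = 1 - log 2.
LAM_B = 1.0 - math.log(2.0)

def approx(eps):
    """Disjoint intervals A_1, ..., A_N with lambda(B sym-diff union A_k) < eps.

    Because the A_k are taken from B itself, the symmetric difference is
    just the tail of B: its measure is lambda(B) minus the length covered."""
    picked, covered = [], 0.0
    n = 1
    while LAM_B - covered >= eps:     # tail measure still at least eps
        a, b = 1.0 / (2 * n + 1), 1.0 / (2 * n)
        picked.append((a, b))
        covered += b - a
        n += 1
    return picked, LAM_B - covered    # the intervals and the leftover measure

As, err = approx(0.01)
```

Here err = λ(B △ ∪ₖAₖ) < 0.01 is achieved with only a few dozen intervals, since the tail of B beyond the first N components has measure of order 1/(4N).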
Corollary 1. If μ is a finite measure on a σ-field ℱ generated by a field ℱ₀, then for each ℱ-set A and each positive ε there is an ℱ₀-set B such that μ(A △ B) < ε.
PROOF. This is of course an immediate consequence of part (ii) of the theorem, but there is a simple direct argument. Let 𝒢 be the class of ℱ-sets with the required property. Since Aᶜ △ Bᶜ = A △ B, 𝒢 is closed under complementation. If A = ∪ₙAₙ, where Aₙ ∈ 𝒢, given ε choose n₀ so that μ(A − ∪ₙ≤ₙ₀Aₙ) < ε, and then choose ℱ₀-sets Bₙ, n ≤ n₀, so that μ(Aₙ △ Bₙ) < ε/n₀. Since (∪ₙ≤ₙ₀Aₙ) △ (∪ₙ≤ₙ₀Bₙ) ⊂ ∪ₙ≤ₙ₀(Aₙ △ Bₙ), the ℱ₀-set B = ∪ₙ≤ₙ₀Bₙ satisfies μ(A △ B) < 2ε. Of course ℱ₀ ⊂ 𝒢; since 𝒢 is a σ-field, ℱ ⊂ 𝒢, as required. ∎
Corollary 2. Suppose that 𝒜 is a semiring, Ω is a countable union of 𝒜-sets, and μ₁, μ₂ are measures on ℱ = σ(𝒜). If μ₁(A) = μ₂(A) < ∞ for A ∈ 𝒜, then μ₁(B) = μ₂(B) for B ∈ ℱ.
… ≥ 0 (pointwise) implies Λ(f) ≥ 0, and continuous from above at 0 in the sense that fₙ ↓ 0 (pointwise) implies Λ(fₙ) → 0.
(a) If f ≤ g (f, g ∈ ℒ), define in Ω × R¹ an "interval"

(11.5)  (f, g] = [(ω, t): f(ω) < t ≤ g(ω)].

Show that these sets form a semiring 𝒜₀.
(b) Define a set function ν₀ on 𝒜₀ by

(11.6)  ν₀(f, g] = Λ(g − f).

Show that ν₀ is finitely additive and countably subadditive on 𝒜₀.
11.5. ↑ (a) Assume f ∈ ℒ and let fₙ = (n(f − f ∧ 1)) ∧ 1. Show that f(ω) ≤ 1 implies fₙ(ω) = 0 for all n and f(ω) > 1 implies fₙ(ω) = 1 for all sufficiently large n. Conclude that for x > 0,

(11.7)  (0, xfₙ] ↑ [ω: f(ω) > 1] × (0, x].

(b) Let ℱ be the smallest σ-field with respect to which every f in ℒ is measurable: ℱ = σ[f⁻¹H: f ∈ ℒ, H ∈ ℬ¹]. Let 𝒜₁ be the class of A in ℱ for which A × (0, 1] ∈ σ(𝒜₀). Show that 𝒜₁ is a semiring and that ℱ = σ(𝒜₁).
(c) Let ν be the extension of ν₀ (see (11.6)) to σ(𝒜₀), and for A ∈ 𝒜₁ define μ₀(A) = ν(A × (0, 1]). Show that μ₀ is finitely additive and countably subadditive on the semiring 𝒜₁.

SECTION 12. MEASURES IN EUCLIDEAN SPACE

Lebesgue Measure
In Example 11.3 Lebesgue measure λ was constructed on the class ℬ¹ of linear Borel sets. By Theorem 10.3, λ is the only measure on ℬ¹ satisfying λ(a, b] = b − a for all intervals. There is an analogous k-dimensional Lebesgue measure λₖ in k-space, on the class ℬᵏ of k-dimensional Borel sets (Example 10.1). It is specified by the requirement that bounded rectangles have measure

(12.1)  λₖ[x: aᵢ < xᵢ ≤ bᵢ, i = 1, …, k] = ∏ᵢ₌₁ᵏ (bᵢ − aᵢ).

This is ordinary volume, that is, length (k = 1), area (k = 2), volume (k = 3), or hypervolume (k ≥ 4).
Since an intersection of rectangles is again a rectangle, the uniqueness theorem shows that (12.1) completely determines λₖ. That there does exist such a measure on ℬᵏ can be proved in several ways. One is to use the ideas involved in the case k = 1. A second construction is given in Theorem 12.5. A third, independent construction uses the general theory of product measures; this is carried out in Section 18.† For the moment, assume the existence on ℬᵏ of a measure λₖ satisfying (12.1). Of course, λₖ is σ-finite.
A basic property of λₖ is translation invariance.‡
Theorem 12.1. If A ∈ ℬᵏ, then A + x = [a + x: a ∈ A] ∈ ℬᵏ and λₖ(A) = λₖ(A + x) for all x.

PROOF. If 𝒢 is the class of A such that A + x is in ℬᵏ for all x, then 𝒢 is a σ-field containing the bounded rectangles, and so 𝒢 ⊃ ℬᵏ. Thus A + x ∈ ℬᵏ for A ∈ ℬᵏ.
For fixed x define a measure μ on ℬᵏ by μ(A) = λₖ(A + x). Then μ and λₖ agree on the π-system of bounded rectangles and so agree for all Borel sets. ∎
If A is a (k − 1)-dimensional subspace and x lies outside A, the hyperplanes A + tx for real t are disjoint, and by Theorem 12.1 all have the same measure. Since only countably many disjoint sets can have positive measure (Theorem 10.2(iv)), the measure common to the A + tx must be 0. Every (k − 1)-dimensional hyperplane has k-dimensional Lebesgue measure 0.
The Lebesgue measure of a rectangle is its ordinary volume. The following theorem makes it possible to calculate the measures of simple figures.

Theorem 12.2. If T: Rᵏ → Rᵏ is linear and nonsingular, then A ∈ ℬᵏ implies that TA ∈ ℬᵏ and

(12.2)  λₖ(TA) = |det T| · λₖ(A).

Since a parallelepiped is the image of a rectangle under a linear transformation, (12.2) can be used to compute its volume. If T is a rotation or a reflection (an orthogonal or a unitary transformation), then det T = ±1, and so λₖ(TA) = λₖ(A). Hence every rigid transformation or isometry (an orthogonal transformation followed by a translation) preserves Lebesgue measure. An affine transformation has the form Fx = Tx + x₀ (the general

† See also Problems 17.14 and 20.4.
‡ An analogous fact was used in the construction of a nonmeasurable set on p. 45.
linear transformation T followed by a translation); it is nonsingular if T is. It follows by Theorems 12.1 and 12.2 that λₖ(FA) = |det T| · λₖ(A) in the nonsingular case.

PROOF OF THE THEOREM. Since T∪ₙAₙ = ∪ₙTAₙ and TAᶜ = (TA)ᶜ because of the assumed nonsingularity of T, the class 𝒢 = [A: TA ∈ ℬᵏ] is a σ-field. Since TA is open for open A, it follows again by the assumed nonsingularity of T that 𝒢 contains all the open sets and hence (Example 10.1) all the Borel sets. Therefore, A ∈ ℬᵏ implies TA ∈ ℬᵏ.
For A ∈ ℬᵏ, set μ₁(A) = λₖ(TA) and μ₂(A) = |det T| · λₖ(A). Then μ₁ and μ₂ are measures, and by Theorem 10.3 they will agree on ℬᵏ (which is the assertion (12.2)) if they agree on the π-system consisting of the rectangles [x: aᵢ < xᵢ ≤ bᵢ, i = 1, …, k] for which the aᵢ and the bᵢ are all rational (Example 10.1). It suffices therefore to prove (12.2) for rectangles with sides of rational length. Since such a rectangle is a finite disjoint union of cubes and λₖ is translation-invariant, it is enough to check (12.2) for cubes
(12.3)  A = [x: 0 < xᵢ ≤ c, i = 1, …, k]

that have their lower corner at the origin.
Now the general T can by elementary row and column operations† be represented as a product of linear transformations of these three special forms:
(1°) T(x₁, …, xₖ) = (x_{π1}, …, x_{πk}), where π is a permutation of the set {1, 2, …, k};
(2°) T(x₁, …, xₖ) = (ax₁, x₂, …, xₖ);
(3°) T(x₁, …, xₖ) = (x₁ + x₂, x₂, …, xₖ).
Because of the rule for multiplying determinants, it suffices to check (12.2) for T of these three forms. And, as observed, for each such T it suffices to consider cubes (12.3).
(1°): Such a T is a permutation matrix, and so det T = ±1. Since (12.3) is invariant under T, (12.2) is in this case obvious.
(2°): Here det T = a, and TA = [x: x₁ ∈ H, 0 < xᵢ ≤ c, i = 2, …, k], where H = (0, ac] if a > 0, H = {0} if a = 0 (although a cannot in fact be 0 if T is nonsingular), and H = [ac, 0) if a < 0. In each case, λₖ(TA) = |a| · cᵏ = |a| · λₖ(A).

† BIRKHOFF & MAC LANE, Section 8.9.
(3°): Here det T = 1. Let B = [x: 0 < xᵢ ≤ c, i = 3, …, k], where B = Rᵏ if k < 3, and define

B₁ = [x: 0 < x₁ ≤ x₂ ≤ c] ∩ B,
B₂ = [x: 0 < x₂ < x₁ ≤ c] ∩ B,
B₃ = [x: c < x₁ ≤ c + x₂, 0 < x₂ ≤ c] ∩ B.

Then A = B₁ ∪ B₂, TA = B₂ ∪ B₃, and B₁ + (c, 0, …, 0) = B₃. Since λₖ(B₁) = λₖ(B₃) by translation invariance, (12.2) follows by additivity. ∎
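Formula (12.2) lends itself to a quick numerical check. The sketch below (our own Python illustration; the particular T and the grid size are arbitrary choices, not from the text) estimates λ₂(TA) for the unit square A = (0, 1]² by counting grid-cell centers x of a bounding box whose preimages T⁻¹x land in A:

```python
# Check lambda_2(TA) = |det T| * lambda_2(A) for A = (0,1]^2 and a fixed T.

def det2(T):
    return T[0][0] * T[1][1] - T[0][1] * T[1][0]

def inv2(T):
    d = det2(T)
    return [[T[1][1] / d, -T[0][1] / d],
            [-T[1][0] / d, T[0][0] / d]]

def image_area(T, N=800):
    """Grid estimate of lambda_2(TA): scan centers of N*N cells covering
    [0,3]x[0,3] and count those whose preimage lies in the unit square."""
    Ti = inv2(T)
    h = 3.0 / N
    hits = 0
    for i in range(N):
        x1 = (i + 0.5) * h
        for j in range(N):
            x2 = (j + 0.5) * h
            y1 = Ti[0][0] * x1 + Ti[0][1] * x2
            y2 = Ti[1][0] * x1 + Ti[1][1] * x2
            if 0.0 < y1 <= 1.0 and 0.0 < y2 <= 1.0:
                hits += 1
    return hits * h * h

T = [[2.0, 1.0], [0.0, 3.0]]   # det T = 6; TA is a parallelogram inside [0,3]^2
est = image_area(T)
```

The estimate agrees with |det T| · λ₂(A) = 6 up to the boundary error of the grid, which is bounded by the cell area times the number of cells meeting the boundary of the parallelogram.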
If T is singular, then det T = 0 and TA lies in a (k − 1)-dimensional subspace. Since such a subspace has measure 0, (12.2) holds if A and TA lie in ℬᵏ. The surprising thing is that A ∈ ℬᵏ need not imply that TA ∈ ℬᵏ if T is singular. Even for a very simple transformation such as the projection T(x₁, x₂) = (x₁, 0) in the plane, there exist Borel sets A for which TA is not a Borel set.†

Regularity

Important among measures on ℬᵏ are those assigning finite measure to bounded sets. They share with λₖ the property of regularity:

Theorem 12.3. Suppose that μ is a measure on ℬᵏ such that μ(A) < ∞ if A is bounded.

(i) For A ∈ ℬᵏ and ε > 0, there exist a closed C and an open G such that C ⊂ A ⊂ G and μ(G − C) < ε.
(ii) If μ(A) < ∞, then μ(A) = sup μ(K), the supremum extending over the compact subsets K of A.

PROOF. The second part of the theorem follows from the first: μ(A) < ∞ implies that μ(A − A₀) < ε for a bounded subset A₀ of A, and it then follows from the first part that μ(A₀ − K) < ε for a closed and hence compact subset K of A₀.
† See HAUSDORFF, p. 241.
To prove (i), consider first a bounded rectangle A = [x: aᵢ < xᵢ ≤ bᵢ, i ≤ k]. The set Gₙ = [x: aᵢ < xᵢ < bᵢ + n⁻¹, i ≤ k] is open and Gₙ ↓ A. Since μ(G₁) is finite by hypothesis, it follows by continuity from above that μ(Gₙ − A) < ε for large n. A bounded rectangle can therefore be approximated from the outside by open sets.
The rectangles form a semiring (Example 11.4). For an arbitrary set A in ℬᵏ, by Theorem 11.4(i) there exist bounded rectangles Aₖ such that A ⊂ ∪ₖAₖ and μ((∪ₖAₖ) − A) < ε. Choose open sets Gₖ such that Aₖ ⊂ Gₖ and μ(Gₖ − Aₖ) < ε/2ᵏ. Then G = ∪ₖGₖ is open and μ(G − A) < 2ε. Thus the general k-dimensional Borel set can be approximated from the outside by open sets. To approximate from the inside by closed sets, pass to complements. ∎
Specifying Measures on the Line

There are on the line many measures other than λ that are important for probability theory. There is a useful way to describe the collection of all measures on ℬ¹ that assign finite measure to each bounded set. If μ is such a measure, define a real function F by

(12.4)  F(x) = μ(0, x] if x ≥ 0;  F(x) = −μ(x, 0] if x < 0.
It is because μ(A) < ∞ for bounded A that F is a finite function. Clearly, F is nondecreasing. Suppose that xₙ ↓ x. If x ≥ 0, apply part (ii) of Theorem 10.2, and if x < 0, apply part (i); in either case, F(xₙ) ↓ F(x) follows. Thus F is continuous from the right. Finally,
J.L ( a , b]
= F(b)
-
F( a )
for every bou nded interval (a, b ]. If J.L is Lebesgue measure, then (1 2.4) gives F(x ) = x. The finite intervals form a 1r-system generating .9i' 1 , and therefore by Theorem 10.3 the function F completely determines J.L through the relation (12.5)." But (1 2.5) and J.L do not determine F: if F( x ) satisfies (1 2.5), then so does F(x ) + c. On the other hand, for a given J.L, (12.5) certainly determines F to within such an additive constant. For finite J.L , it is customary to standardize F by defining it not by (1 2.4) but by
( 12.6 )
F( x) = 11- (
-
oo , x ;
]
then limx .... - oo F( x ) = O and limx --+ oo F( x ) = J.L(R 1 ). If J.L is a probability measure, F is called a distribution function (the adjective cumulative is sometimes added).
176
M EASURE
Measures J.L are often specified by means of the function theorem ensures that to each F there does exist a J.L ·
F. The following
Theorem 12.4. If F is a nondecreasing, right-continuous real function on the line, then there exists on ℬ¹ a unique measure μ satisfying (12.5) for all a and b.

As noted above, uniqueness is a simple consequence of Theorem 10.3. The proof of existence is almost the same as the construction of Lebesgue measure, the case F(x) = x. This proof is not carried through at this point, because it is contained in a parallel, more general construction for k-dimensional space in the next theorem. For a very simple argument establishing Theorem 12.4, see the second proof of Theorem 14.1.
Specifying Measures in Rᵏ
The σ-field ℬᵏ of k-dimensional Borel sets is generated by the class of bounded rectangles

(12.7)  A = [x: aᵢ < xᵢ ≤ bᵢ, i = 1, …, k]

(Example 10.1). If Iᵢ = (aᵢ, bᵢ], A has the form of a Cartesian product

(12.8)  A = I₁ × I₂ × ⋯ × Iₖ.

Consider the sets of the special form
(12.9)  Sₓ = [y: yᵢ ≤ xᵢ, i = 1, …, k];

Sₓ consists of the points "southwest" of x = (x₁, …, xₖ); in the case k = 1 it is the closed half-infinite interval (−∞, x]. Now (12.7) has the form

(12.10)  A = S_b − ∪ᵢ₌₁ᵏ S_{b(i)},

where b = (b₁, …, bₖ) and b(i) is the vertex obtained from b by replacing its ith coordinate with aᵢ.
Therefore, the class of sets (12.9) generates ℬᵏ. This class is a π-system.
The objective is to find a version of Theorem 12.4 for k-space. This will in particular give k-dimensional Lebesgue measure. The first problem is to find the analogue of (12.5). A bounded rectangle (12.7) has 2ᵏ vertices, the points x = (x₁, …, xₖ) for which each xᵢ is either aᵢ or bᵢ. Let sgn_A x, the signum of the vertex, be +1 or −1, according as the number of i (1 ≤ i ≤ k) satisfying xᵢ = aᵢ is even or odd. For a real function F on Rᵏ, the difference of F around the vertices of A is Δ_A F = Σ sgn_A x · F(x), the sum extending over the 2ᵏ vertices x of A. In the case k = 1, A = (a, b] and Δ_A F = F(b) − F(a). In the case k = 2,

Δ_A F = F(b₁, b₂) − F(b₁, a₂) − F(a₁, b₂) + F(a₁, a₂).
Since the k-dimensional analogue of (12.4) is complicated, suppose at first that μ is a finite measure on ℬᵏ and consider instead the analogue of (12.6), namely

(12.11)  F(x) = μ[y: yᵢ ≤ xᵢ, i = 1, …, k].

Suppose that Sₓ is defined by (12.9) and A is a bounded rectangle (12.7). Then

(12.12)  μ(A) = Δ_A F.

To see this, apply to the union on the right in (12.10) the inclusion-exclusion formula (10.5). The k sets in the union give 2ᵏ − 1 intersections, and these are the sets Sₓ for x ranging over the vertices of A other than (b₁, …, bₖ). Taking into account the signs in (10.5) leads to (12.12).
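The difference Δ_A F is easy to compute mechanically. The sketch below (our own Python illustration) sums sgn_A x · F(x) over the 2ᵏ vertices and checks (12.12) against the "southwest" function of the uniform distribution on the unit square, for which μ(A) is the area of A ∩ (0, 1)²:

```python
from itertools import product

def delta(F, a, b):
    """Delta_A F for the rectangle A = prod (a_i, b_i]: sum of sgn_A(x) F(x)
    over the vertices x, with sign +1 when the number of coordinates equal
    to a_i is even and -1 when it is odd."""
    k = len(a)
    total = 0.0
    for choice in product((0, 1), repeat=k):          # 0 -> a_i, 1 -> b_i
        x = [b[i] if choice[i] else a[i] for i in range(k)]
        sgn = -1.0 if (k - sum(choice)) % 2 else 1.0
        total += sgn * F(x)
    return total

def F_unif(x):
    """The function of (12.11) for the uniform distribution on the unit square."""
    p = 1.0
    for t in x:
        p *= min(max(t, 0.0), 1.0)
    return p

d = delta(F_unif, (0.2, 0.1), (0.7, 0.4))   # rectangle (0.2,0.7] x (0.1,0.4]
```

Here d = 0.5 · 0.3 = 0.15, the area of the rectangle, as (12.12) requires; in the case k = 1 the same routine reduces to F(b) − F(a).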
Suppose that x⁽ⁿ⁾ ↓ x in the sense that xᵢ⁽ⁿ⁾ ↓ xᵢ as n → ∞ for each i = 1, …, k. Then S_{x⁽ⁿ⁾} ↓ Sₓ …

… if k ≥ 3, and if A and B are bounded sets in Rᵏ that have nonempty interiors, then A and B are congruent by dissection. (The result does not hold if k is 1 or 2.) This is the Banach-Tarski paradox. It is usually illustrated in 3-space this way: it is possible to break a solid ball the size of a pea into finitely many pieces and then put them back together again in such a way as to get a solid ball the size of the sun.†
PROBLEMS

12.1. Suppose that μ is a measure on ℬ¹ that is finite for bounded sets and is translation-invariant: μ(A + x) = μ(A). Show that μ(A) = aλ(A) for some a ≥ 0. Extend to Rᵏ.
12.2. Suppose that A ∈ ℬ¹, λ(A) > 0, and 0 < θ < 1. Show that there is a bounded open interval I such that λ(A ∩ I) > θλ(I). Hint: Show that λ(A) may be assumed finite, and choose an open G such that A ⊂ G and λ(A) > θλ(G). Now G = ∪ₙIₙ for disjoint open intervals Iₙ [A12], and Σₙλ(A ∩ Iₙ) > θΣₙλ(Iₙ); use an Iₙ.
12.3. ↑ If A ∈ ℬ¹ and λ(A) > 0, then the origin is interior to the difference set D(A) = [x − y: x, y ∈ A]. Hint: Choose a bounded open interval I as in Problem 12.2 for θ = 3/4. Suppose that |z| < λ(I)/2; since A ∩ I and (A ∩ I) + z are contained in an interval of length less than 3λ(I)/2 and hence cannot be disjoint, z ∈ D(A).
12.4. ↑ The following construction leads to a subset H of the unit interval that is nonmeasurable in the extreme sense that its inner and outer Lebesgue measures are 0 and 1: λ⋆(H) = 0 and λ*(H) = 1 (see (3.9) and (3.10)). Complete the details. The ideas are those in the construction of a nonmeasurable set at the end of Section 3. It will be convenient to work in G = [0, 1); let ⊕ and ⊖ denote addition and subtraction modulo 1 in G, which is a group with identity 0.
(a) Fix an irrational θ in G and for n = 0, ±1, ±2, … let θₙ be nθ reduced modulo 1. Show that θₙ ⊕ θₘ = θₙ₊ₘ, θₙ ⊖ θₘ = θₙ₋ₘ, and the θₙ are distinct. Show that {θ₂ₙ: n = 0, ±1, …} and {θ₂ₙ₊₁: n = 0, ±1, …} are dense in G.
(b) Take x and y to be equivalent if x ⊖ y lies in {θₙ: n = 0, ±1, …}, which is a subgroup. Let S contain one representative from each equivalence class (each coset). Show that G = ∪ₙ(S ⊕ θₙ), where the union is disjoint. Put H = ∪ₙ(S ⊕ θ₂ₙ) and show that G − H = H ⊕ θ.
(c) Suppose that A is a Borel set contained in H. If λ(A) > 0, then D(A) contains an interval (0, ε); but then some θ₂ₖ₊₁ lies in (0, ε) ⊂ D(A) ⊂ D(H), and so θ₂ₖ₊₁ = h₁ − h₂ = h₁ ⊖ h₂ = (s₁ ⊕ θ₂ₘ) ⊖ (s₂ ⊕ θ₂ₙ) for some h₁, h₂ in H and some s₁, s₂ in S. Deduce that s₁ = s₂ and obtain a contradiction. Conclude that λ⋆(H) = 0.
(d) Show that λ⋆(H ⊕ θ) = 0 and λ*(H) = 1.
† See WAGON for an account of these prodigies.
12.5. ↑ The construction here gives sets Hₙ such that Hₙ ↑ G and λ⋆(Hₙ) = 0. If Jₙ = G − Hₙ, then Jₙ ↓ ∅ and λ*(Jₙ) = 1.
(a) Let Hₙ = ∪ₖ₌₋ₙⁿ (S ⊕ θₖ), so that Hₙ ↑ G. Show that the sets Hₙ ⊕ θ₍₂ₙ₊₁₎ₗ are disjoint for different l.
(b) Suppose that A is a Borel set contained in Hₙ. Show that A, and indeed all the A ⊕ θ₍₂ₙ₊₁₎ₗ, have Lebesgue measure 0.
12.6. Suppose that μ is nonnegative and finitely additive on ℬᵏ and that μ(Rᵏ) < ∞. Suppose further that μ(A) = sup μ(K), where K ranges over the compact subsets of A. Show that μ is countably additive. (Compare Theorem 12.3(ii).)

12.7. Suppose μ is a measure on ℬᵏ such that bounded sets have finite measure. Given A, show that there exist an F_σ-set U (a countable union of closed sets) and a G_δ-set V (a countable intersection of open sets) such that U ⊂ A ⊂ V and μ(V − U) = 0.

12.8. 2.19 ↑ Suppose that μ is a nonatomic probability measure on (Rᵏ, ℬᵏ) and that μ(A) > 0. Show that there is an uncountable compact set K such that K ⊂ A and μ(K) = 0.

12.9. The minimal closed support of a measure μ on ℬᵏ is a closed set C_μ such that C_μ ⊂ C for closed C if and only if C supports μ. Prove its existence and uniqueness. Characterize the points of C_μ as those x such that μ(U) > 0 for every neighborhood U of x. If k = 1 and if μ and the function F are related by (12.5), the condition is F(x − ε) < F(x + ε) for all ε; x is in this case called a point of increase of F.

12.10. Of minor interest is the k-dimensional analogue of (12.4). Let Iₜ be (0, t] for t ≥ 0 and (t, 0] for t < 0, and let Aₓ = I_{x₁} × ⋯ × I_{xₖ}. Let φ(x) be +1 or −1 according as the number of i, 1 ≤ i ≤ k, for which xᵢ < 0 is even or odd. Show that, if F(x) = φ(x)μ(Aₓ), then (12.12) holds for bounded rectangles A. Call F degenerate if it is a function of some k − 1 of the coordinates, the requirement in the case k = 1 being that F is constant. Show that Δ_A F = 0 for every bounded rectangle if and only if F is a finite sum of degenerate functions; (12.12) determines F to within addition of a function of this sort.

12.11. Let G be a nondecreasing, right-continuous function on the line, and put F(x, y) = min{G(x), y}. Show that F satisfies the conditions of Theorem 12.5, that the curve C = [(x, G(x)): x ∈ R¹] supports the corresponding measure, and that λ₂(C) = 0.
12.12. Let F₁ and F₂ be nondecreasing, right-continuous functions on the line, and put F(x₁, x₂) = F₁(x₁)F₂(x₂). Show that F satisfies the conditions of Theorem 12.5. Let μ, μ₁, μ₂ be the measures corresponding to F, F₁, F₂, and prove that μ(A₁ × A₂) = μ₁(A₁)μ₂(A₂) for intervals A₁ and A₂. This μ is the product of μ₁ and μ₂; products are studied in a general setting in Section 18.
SECTION 13. MEASURABLE FUNCTIONS AND MAPPINGS
If a real function X on Ω has finite range, it is by the definition in Section 5 a simple random variable if [ω: X(ω) = x] lies in the basic σ-field ℱ for each x. The requirement appropriate for the general real function X is stronger: namely, [ω: X(ω) ∈ H] must lie in ℱ for each linear Borel set H. An abstract version of this definition greatly simplifies the theory of such functions.

Measurable Mappings
Let (Ω, ℱ) and (Ω′, ℱ′) be two measurable spaces. For a mapping T: Ω → Ω′, consider the inverse images T⁻¹A′ = [ω ∈ Ω: Tω ∈ A′] for A′ ⊂ Ω′. (See [A7] for the properties of inverse images.) The mapping T is measurable ℱ/ℱ′ if T⁻¹A′ ∈ ℱ for each A′ ∈ ℱ′.
For a real function f, the image space Ω′ is the line R¹, and in this case ℬ¹ is always tacitly understood to play the role of ℱ′. A real function f on Ω is thus measurable ℱ (or simply measurable, if it is clear from the context what ℱ is involved) if it is measurable ℱ/ℬ¹, that is, if f⁻¹H = [ω: f(ω) ∈ H] ∈ ℱ for every H ∈ ℬ¹. In probability contexts, a real measurable function is called a random variable. The point of the definition is to ensure that [ω: f(ω) ∈ H] has a measure or probability for all sufficiently regular sets H of real numbers, that is, for all Borel sets H.
Example 13.1. A real function f with finite range is measurable if f⁻¹{x} ∈ ℱ for each singleton {x}, but this is too weak a condition to impose on the general f. (It is satisfied if (Ω, ℱ) = (R¹, ℬ¹) and f is any one-to-one map of the line into itself; but in this case f⁻¹H, even for so simple a set H as an interval, can for an appropriately chosen f be any uncountable set, say the non-Borel set constructed in Section 3.) On the other hand, for a measurable f with finite range, f⁻¹H ∈ ℱ for every H ⊂ R¹; but this is too strong a condition to impose on the general f. (For (Ω, ℱ) = (R¹, ℬ¹), even f(x) = x fails to satisfy it.) Notice that nothing is required of fA; it need not lie in ℬ¹ for A in ℱ. ∎
If in addition to (Ω, ℱ), (Ω′, ℱ′), and the map T: Ω → Ω′, there is a third measurable space (Ω″, ℱ″) and a map T′: Ω′ → Ω″, the composition T′T = T′ ∘ T is the mapping Ω → Ω″ that carries ω to T′(T(ω)).

Theorem 13.1. (i) If T⁻¹A′ ∈ ℱ for each A′ ∈ 𝒜′ and 𝒜′ generates ℱ′, then T is measurable ℱ/ℱ′.
(ii) If T is measurable ℱ/ℱ′ and T′ is measurable ℱ′/ℱ″, then T′T is measurable ℱ/ℱ″.
PROOF. Since T⁻¹(Ω′ − A′) = Ω − T⁻¹A′ and T⁻¹(∪ₙA′ₙ) = ∪ₙT⁻¹A′ₙ, and since ℱ is a σ-field in Ω, the class [A′: T⁻¹A′ ∈ ℱ] is a σ-field in Ω′. If this σ-field contains 𝒜′, it must also contain σ(𝒜′), and (i) follows.
As for (ii), it follows by the hypotheses that A″ ∈ ℱ″ implies that (T′)⁻¹A″ ∈ ℱ′, which in turn implies that (T′T)⁻¹A″ = [ω: T′Tω ∈ A″] = [ω: Tω ∈ (T′)⁻¹A″] = T⁻¹((T′)⁻¹A″) ∈ ℱ. ∎

By part (i), if f is a real function such that [ω: f(ω) ≤ x] lies in ℱ for all x, then f is measurable ℱ. This condition is usually easy to check.
Mappings into Rᵏ
For a mapping f: Ω → Rᵏ carrying Ω into k-space, ℬᵏ is always understood to be the σ-field in the image space. In probabilistic contexts, a measurable mapping into Rᵏ is called a random vector. Now f must have the form

(13.1)  f(ω) = (f₁(ω), …, fₖ(ω))

for real functions fⱼ(ω). Since the sets (12.9) (the "southwest regions") generate ℬᵏ, Theorem 13.1(i) implies that f is measurable ℱ if and only if the set
(13.2)  [ω: f₁(ω) ≤ x₁, …, fₖ(ω) ≤ xₖ] = ∩ⱼ₌₁ᵏ [ω: fⱼ(ω) ≤ xⱼ]

lies in ℱ for each (x₁, …, xₖ). This condition holds if each fⱼ is measurable ℱ. On the other hand, if xⱼ = x is fixed and x₁ = ⋯ = xⱼ₋₁ = xⱼ₊₁ = ⋯ = xₖ = n goes to ∞, the sets (13.2) increase to [ω: fⱼ(ω) ≤ x]; the condition thus implies that each fⱼ is measurable. Therefore, f is measurable ℱ if and only if each component function fⱼ is measurable ℱ. This provides a practical criterion for mappings into Rᵏ.
A mapping f: Rⁱ → Rᵏ is defined to be measurable if it is measurable ℬⁱ/ℬᵏ. Such functions are often called Borel functions. To sum up: T: Ω → Ω′ is measurable ℱ/ℱ′ if T⁻¹A′ ∈ ℱ for all A′ ∈ ℱ′; f: Ω → Rᵏ is measurable ℱ if it is measurable ℱ/ℬᵏ; and f: Rⁱ → Rᵏ is measurable (a Borel function) if it is measurable ℬⁱ/ℬᵏ. If H lies outside ℬ¹, then I_H (i = k = 1) is not a Borel function.
Theorem 13.2. If f: Rⁱ → Rᵏ is continuous, then it is measurable.

PROOF. As noted above, it suffices to check that each set (13.2) lies in ℬⁱ. But each is closed because of continuity. ∎
Theorem 13.3. If fⱼ: Ω → R¹ is measurable ℱ, j = 1, …, k, then g(f₁(ω), …, fₖ(ω)) is measurable ℱ if g: Rᵏ → R¹ is measurable, in particular, if it is continuous.
PROOF. If the fⱼ are measurable, then so is (13.1), so that the result follows by Theorem 13.1(ii). ∎
Taking g(x₁, …, xₖ) to be Σⱼ₌₁ᵏ xⱼ, ∏ⱼ₌₁ᵏ xⱼ, and max{x₁, …, xₖ} in turn shows that sums, products, and maxima of measurable functions are measurable. If f(ω) is real and measurable, then so are sin f(ω), e^{f(ω)}, and so on, and if f(ω) never vanishes, then 1/f(ω) is measurable as well.

Limits and Measurability
For a real function f it is often convenient to admit the artificial values ∞ and −∞, that is, to work with the extended real line [−∞, ∞]. Such an f is by definition measurable ℱ if [ω: f(ω) ∈ H] lies in ℱ for each Borel set H of (finite) real numbers and if [ω: f(ω) = ∞] and [ω: f(ω) = −∞] both lie in ℱ. This extension of the notion of measurability is convenient in connection with limits and suprema, which need not be finite.
Theorem 13.4. Suppose that f₁, f₂, … are real functions measurable ℱ.

(i) The functions supₙ fₙ, infₙ fₙ, lim supₙ fₙ, and lim infₙ fₙ are measurable ℱ.
(ii) If limₙ fₙ exists everywhere, then it is measurable ℱ.
(iii) The ω-set where {fₙ(ω)} converges lies in ℱ.
(iv) If f is measurable ℱ, then the ω-set where fₙ(ω) → f(ω) lies in ℱ.
PROOF. Clearly, [supₙ fₙ ≤ x] = ∩ₙ[fₙ ≤ x] lies in ℱ even for x = ∞ and x = −∞, and so supₙ fₙ is measurable. The measurability of infₙ fₙ follows in the same way, and hence lim supₙ fₙ = infₙ supₖ≥ₙ fₖ and lim infₙ fₙ = supₙ infₖ≥ₙ fₖ are measurable. If limₙ fₙ exists, it coincides with these last two functions and hence is measurable. Finally, the set in (iii) is the set where lim supₙ fₙ(ω) = lim infₙ fₙ(ω), and that in (iv) is the set where this common value is f(ω). ∎

Special cases of this theorem have been encountered before, part (iv), for example, in connection with the strong law of large numbers. The last three parts of the theorem obviously carry over to mappings into Rᵏ.
A simple real function is one with finite range; it can be put in the form

(13.3)  f = Σᵢ₌₁ⁿ xᵢ I_{Aᵢ},
where the Aᵢ form a finite decomposition of Ω. It is measurable ℱ if each Aᵢ lies in ℱ. The simple random variables of Section 5 have this form. Many results concerning measurable functions are most easily proved first for simple functions and then, by an appeal to the next theorem and a passage to the limit, for the general measurable function.
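The passage to the limit rests on the dyadic truncation used to prove Theorem 13.5 below; a sketch of that standard construction in code (our own illustration, not the text's; the function f is an arbitrary choice):

```python
import math

def f_n(f, n):
    """The n-th dyadic approximant: truncate f at +-n and round |f| toward 0
    to a multiple of 2^{-n}.  Then 0 <= f_n, increasing to f on [f >= 0],
    and 0 >= f_n, decreasing to f on [f <= 0], as in (13.4) and (13.5)."""
    def fn(w):
        v = f(w)
        if v >= 0:
            return min(math.floor(v * 2 ** n) / 2 ** n, float(n))
        return max(-math.floor(-v * 2 ** n) / 2 ** n, float(-n))
    return fn

f = lambda w: w * w - 1.0        # any measurable f will do
f3, f6 = f_n(f, 3), f_n(f, 6)
```

Each fₙ has finite range (multiples of 2⁻ⁿ between −n and n), so it is simple, and at every ω the sequence is monotone in n and converges to f(ω).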
Theorem 13.5. If f is real and measurable ℱ, there exists a sequence {fₙ} of simple functions, each measurable ℱ, such that

(13.4)  0 ≤ fₙ(ω) ↑ f(ω)  if f(ω) ≥ 0

and

(13.5)  0 ≥ fₙ(ω) ↓ f(ω)  if f(ω) ≤ 0.

PROOF.
Define

fₙ(ω) = −n if −∞ ≤ f(ω) ≤ −n; fₙ(ω) = −(k − 1)/2ⁿ if −k/2ⁿ < f(ω) ≤ −(k − 1)/2ⁿ, 1 ≤ k ≤ n2ⁿ; fₙ(ω) = (k − 1)/2ⁿ if (k − 1)/2ⁿ ≤ f(ω) < k/2ⁿ, 1 ≤ k ≤ n2ⁿ; fₙ(ω) = n if n ≤ f(ω) ≤ ∞.

… λ(A) > 1 − n⁻¹ and that B contains at most n points. Show that some rotation carries B into A: B ⊕ θ ⊂ A for some θ in C.
13.14. Show by example that μ σ-finite does not imply μT⁻¹ σ-finite.
13.15. Consider Lebesgue measure λ restricted to the class ℬ of Borel sets in (0, 1]. For a fixed permutation n₁, n₂, … of the positive integers, if x has dyadic expansion .x₁x₂…, take Tx = .x_{n₁}x_{n₂}…. Show that T is measurable ℬ/ℬ and that λT⁻¹ = λ.
… Let Hₖ be the union of the intervals ((i − 1)/2ᵏ, i/2ᵏ] for i even, 1 ≤ i ≤ 2ᵏ. Show that if …, and conversely.

Weak Convergence
Random variables X₁, …, Xₙ are defined to be independent if the events [X₁ ∈ A₁], …, [Xₙ ∈ Aₙ] are independent for all Borel sets A₁, …, Aₙ, so that P[Xᵢ ∈ Aᵢ, i = 1, …, n] = ∏ᵢ₌₁ⁿ P[Xᵢ ∈ Aᵢ]. To find the distribution function of the maximum Mₙ = max{X₁, …, Xₙ}, take A₁ = ⋯ = Aₙ = (−∞, x]. This gives P[Mₙ ≤ x] = ∏ᵢ₌₁ⁿ P[Xᵢ ≤ x]. If the Xᵢ are independent and have common distribution function G and Mₙ has distribution function Fₙ, then

(14.8)  Fₙ(x) = Gⁿ(x).

It is possible without any appeal to measure theory to study the real function Fₙ solely by means of the relation (14.8), which can indeed be taken as defining Fₙ. It is possible in particular to study the asymptotic properties of Fₙ:
Example 14.1. Consider a stream or sequence of events, say arrivals of calls at a telephone exchange. Suppose that the times between successive events, the interarrival times, are independent and that each has the exponential form (14.7) with a common value of α. By (14.8) the maximum Mₙ among the first n interarrival times has distribution function Fₙ(x) = (1 − e^(−αx))ⁿ, x ≥ 0. For each x, limₙ Fₙ(x) = 0, which means that Mₙ tends to be large for n large. But P[Mₙ − α⁻¹ log n ≤ x] = Fₙ(x + α⁻¹ log n). This is the distribution function of Mₙ − α⁻¹ log n, and it satisfies

(14.9)  Fₙ(x + α⁻¹ log n) = (1 − e^(−αx)/n)ⁿ → e^(−e^(−αx))

as n → ∞; the equality here holds if log n ≥ −αx, and so the limit holds for all x. This gives for large n the approximate distribution of the normalized random variable Mₙ − α⁻¹ log n. ∎

If Fₙ and F are distribution functions, then by definition, Fₙ converges weakly to F, written Fₙ ⇒ F, if

(14.10)  limₙ Fₙ(x) = F(x)

for each x at which F is continuous.†

To study the approximate distribution of a random variable Yₙ it is often necessary to study instead the normalized or rescaled random variable (Yₙ − bₙ)/aₙ for appropriate constants aₙ and bₙ. If Yₙ has distribution function Fₙ and if aₙ > 0, then P[(Yₙ − bₙ)/aₙ ≤ x] = P[Yₙ ≤ aₙx + bₙ], and therefore (Yₙ − bₙ)/aₙ has distribution function Fₙ(aₙx + bₙ). For this reason weak convergence often appears in the form‡

(14.11)  Fₙ(aₙx + bₙ) ⇒ F(x).

An example of this is (14.9): there aₙ = 1, bₙ = α⁻¹ log n, and F(x) = e^(−e^(−αx)).
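The convergence in (14.9) is easy to check numerically. The sketch below (plain Python; the choice α = 1 and the particular values of n and x are illustrative, not from the text) evaluates Fₙ(x + α⁻¹ log n) = (1 − e^(−αx)/n)ⁿ and compares it with the limit e^(−e^(−αx)).

```python
import math

def F_n_shifted(x, n, alpha=1.0):
    # F_n(x + log n / alpha) = (1 - e^{-alpha x}/n)^n, valid once log n >= -alpha*x
    return (1.0 - math.exp(-alpha * x) / n) ** n

def gumbel(x, alpha=1.0):
    # the limit in (14.9)
    return math.exp(-math.exp(-alpha * x))

for x in (-1.0, 0.0, 2.0):
    approx, limit = F_n_shifted(x, 10**6), gumbel(x)
    print(f"x={x:+.1f}  F_n={approx:.6f}  limit={limit:.6f}")
```

The agreement is already close for moderate n, since (1 − c/n)ⁿ differs from e^(−c) by O(c²/n).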
Example 14.2. Consider again the distribution function (14.8) of the maximum, but suppose that G has the form

G(x) = 0 if x < 1,  G(x) = 1 − x^(−α) if x ≥ 1,

where α > 0. Then Fₙ(n^(1/α)x) = (1 − x^(−α)/n)ⁿ for x ≥ n^(−1/α), and therefore

Fₙ(n^(1/α)x) → 0 if x ≤ 0,  Fₙ(n^(1/α)x) → e^(−x^(−α)) if x > 0.

This is an example of (14.11) in which aₙ = n^(1/α) and bₙ = 0. ∎
†For the role of continuity, see Example 14.4.
‡To write Fₙ(aₙx + bₙ) ⇒ F(x) ignores the distinction between a function and its value at an unspecified value of its argument, but the meaning of course is that Fₙ(aₙx + bₙ) → F(x) at continuity points x of F.
MEASURE

Example 14.3.
Consider (14.8) once more, but for

G(x) = 0 if x < 0,  G(x) = 1 − (1 − x)^α if 0 ≤ x ≤ 1,  G(x) = 1 if x > 1,

where α > 0. This time

Fₙ(n^(−1/α)x + 1) = (1 − n⁻¹(−x)^α)ⁿ for −n^(1/α) ≤ x ≤ 0.

Therefore,

Fₙ(n^(−1/α)x + 1) → e^(−(−x)^α) if x ≤ 0,  Fₙ(n^(−1/α)x + 1) → 1 if x > 0,

a case of (14.11) in which aₙ = n^(−1/α) and bₙ = 1. ∎
Let Δ be the distribution function with a unit jump at the origin:

(14.12)  Δ(x) = 0 if x < 0,  Δ(x) = 1 if x ≥ 0.
If X(ω) = 0 identically, then X has distribution function Δ.

Example 14.4. Let X₁, X₂, … be independent random variables for which P[Xₖ = 1] = P[Xₖ = −1] = ½, and put Sₙ = X₁ + ⋯ + Xₙ. By the weak law of large numbers,

(14.13)  P[|n⁻¹Sₙ| ≥ ε] → 0  for ε > 0.

Let Fₙ be the distribution function of n⁻¹Sₙ. If x > 0, then Fₙ(x) = 1 − P[n⁻¹Sₙ > x] → 1; if x < 0, then Fₙ(x) ≤ P[|n⁻¹Sₙ| ≥ |x|] → 0. As this accounts for all the continuity points of Δ, Fₙ ⇒ Δ. It is easy to turn the argument around and deduce (14.13) from Fₙ ⇒ Δ. Thus the weak law of large numbers is equivalent to the assertion that the distribution function of n⁻¹Sₙ converges weakly to Δ.

If n is odd, so that Sₙ = 0 is impossible, then by symmetry the events [Sₙ < 0] and [Sₙ > 0] each have probability ½ and hence Fₙ(0) = ½. Thus Fₙ(0) does not converge to Δ(0) = 1, but because Δ is discontinuous at 0, the definition of weak convergence does not require this. ∎
Allowing (14.10) to fail at discontinuity points x of F thus makes it possible to bring the weak law of large numbers under the theory of weak convergence. But if (14.10) need hold only for certain values of x, there
arises the question of whether weak limits are unique. Suppose that Fₙ ⇒ F and Fₙ ⇒ G. Then F(x) = limₙ Fₙ(x) = G(x) if F and G are both continuous at x. Since F and G each have only countably many points of discontinuity,† the set of common continuity points is dense, and it follows by right continuity that F and G are identical. A sequence can thus have at most one weak limit. Convergence of distribution functions is studied in detail in Chapter 5. The remainder of this section is devoted to some weak-convergence theorems which are interesting both for themselves and for the reason that they require so little technical machinery.

Convergence of Types*
Distribution functions F and G are of the same type if there exist constants a and b, a > 0, such that F(ax + b) = G(x) for all x. A distribution function is degenerate if it has the form Δ(x − x₀) (see (14.12)) for some x₀; otherwise, it is nondegenerate.
Theorem 14.2. Suppose that Fₙ(uₙx + vₙ) ⇒ F(x) and Fₙ(aₙx + bₙ) ⇒ G(x), where uₙ > 0, aₙ > 0, and F and G are nondegenerate. Then there exist a and b, a > 0, such that aₙ/uₙ → a, (bₙ − vₙ)/uₙ → b, and F(ax + b) = G(x).
Thus there can be only one possible limit type and essentially only one possible sequence of norming constants. The proof of the theorem is for clarity set out in a sequence of lemmas. In all of them, a and the aₙ are assumed to be positive.
Lemma 1. If Fₙ ⇒ F, aₙ → a, and bₙ → b, then Fₙ(aₙx + bₙ) ⇒ F(ax + b).

PROOF. If x is a continuity point of F(ax + b) and ε > 0, choose continuity points u and v of F so that u < ax + b < v and F(v) − F(u) < ε; this is possible because F has only countably many discontinuities. For large enough n, u < aₙx + bₙ < v, |Fₙ(u) − F(u)| < ε, and |Fₙ(v) − F(v)| < ε; but then F(ax + b) − 2ε < F(u) − ε < Fₙ(u) ≤ Fₙ(aₙx + bₙ) ≤ Fₙ(v) < F(v) + ε < F(ax + b) + 2ε. ∎
Lemma 2. If Fₙ ⇒ F and aₙ → ∞, then limₙ Fₙ(aₙx) = 1 for x > 0 and limₙ Fₙ(aₙx) = 0 for x < 0.

PROOF. Given ε, choose a continuity point u of F so large that F(u) > 1 − ε. If x > 0, then for all large enough n, aₙx > u and |Fₙ(u) − F(u)| < ε, so that Fₙ(aₙx) ≥ Fₙ(u) > F(u) − ε > 1 − 2ε. Thus limₙ Fₙ(aₙx) = 1 for x > 0; similarly, limₙ Fₙ(aₙx) = 0 for x < 0. ∎

Lemma 3. If Fₙ ⇒ F and bₙ is unbounded, then Fₙ(x + bₙ) cannot converge weakly.

PROOF. Suppose that bₙ is unbounded and that bₙ → ∞ along some subsequence (the case bₙ → −∞ is similar). Suppose that Fₙ(x + bₙ) ⇒ G(x). Given ε, choose a continuity point u of F so that F(u) > 1 − ε. Whatever x may be, for n far enough out in the subsequence, x + bₙ > u and Fₙ(u) > 1 − 2ε, so that Fₙ(x + bₙ) > 1 − 2ε. Thus G(x) = limₙ Fₙ(x + bₙ) = 1 for all continuity points x of G, which is impossible. ∎

†The proof following (14.3) uses measure theory, but this is not necessary: if the saltus σ(x) = F(x) − F(x−) exceeds ε at x₁ < ⋯ < xₙ, then F(xᵢ) − F(xᵢ₋₁) ≥ ε (take x₀ < x₁), and so nε ≤ F(xₙ) − F(x₀) ≤ 1; hence [x: σ(x) > ε] is finite and [x: σ(x) > 0] is countable.
*This topic may be omitted.
Lemma 4. If Fₙ(x) ⇒ F(x) and Fₙ(aₙx + bₙ) ⇒ G(x), where F and G are nondegenerate, then

(14.14)  0 < infₙ aₙ ≤ supₙ aₙ < ∞,  supₙ |bₙ| < ∞.

PROOF. Suppose that aₙ is not bounded above. Arrange by passing to a subsequence that aₙ → ∞. Then by Lemma 2,

(14.15)  limₙ Fₙ(aₙx) = 1 for x > 0,  limₙ Fₙ(aₙx) = 0 for x < 0.

Since

(14.16)  Fₙ(aₙ(x + bₙ/aₙ)) = Fₙ(aₙx + bₙ) ⇒ G(x),

it follows by Lemma 3 that bₙ/aₙ is bounded along this subsequence. By passing to a further subsequence, arrange that bₙ/aₙ converges to some c. By (14.15) and Lemma 1, Fₙ(aₙ(x + bₙ/aₙ)) ⇒ Δ(x + c) along this subsequence. But (14.16) now implies that G is degenerate, contrary to hypothesis. Thus aₙ is bounded above.

If Gₙ(x) = Fₙ(aₙx + bₙ), then Gₙ(x) ⇒ G(x) and Gₙ(aₙ⁻¹x − aₙ⁻¹bₙ) = Fₙ(x) ⇒ F(x). The result just proved shows that aₙ⁻¹ is bounded. Thus aₙ is bounded away from 0 and ∞.

If bₙ is not bounded, neither is bₙ/aₙ; pass to a subsequence along which bₙ/aₙ → ±∞ and aₙ converges to a positive a. Since, by Lemma 1, Fₙ(aₙx) ⇒ F(ax) along the subsequence, (14.16) and bₙ/aₙ → ±∞ stand in contradiction (Lemma 3 again). Therefore bₙ is bounded. ∎
Lemma 5. If F(x) = F(ax + b) for all x and F is nondegenerate, then a = 1 and b = 0.

PROOF. Since F(x) = F(aⁿx + (aⁿ⁻¹ + ⋯ + a + 1)b), it follows by Lemma 4 that aⁿ is bounded away from 0 and ∞, so that a = 1, and it then follows that nb is bounded, so that b = 0. ∎
PROOF OF THEOREM 14.2. Suppose first that uₙ = 1 and vₙ = 0. Then (14.14) holds. Fix any subsequence along which aₙ converges to some positive a and bₙ converges to some b. By Lemma 1, Fₙ(aₙx + bₙ) ⇒ F(ax + b) along this subsequence, and the hypothesis gives F(ax + b) = G(x). Suppose that along some other subsequence, aₙ → u > 0 and bₙ → v. Then F(ux + v) = G(x) and F(ax + b) = G(x) both hold, so that u = a and v = b by Lemma 5. Every convergent subsequence of {(aₙ, bₙ)} thus converges to (a, b), and so the entire sequence does.

For the general case, let Hₙ(x) = Fₙ(uₙx + vₙ). Then Hₙ(x) ⇒ F(x) and Hₙ(aₙuₙ⁻¹x + (bₙ − vₙ)uₙ⁻¹) ⇒ G(x), and so by the case already treated, aₙuₙ⁻¹ converges to some positive a and (bₙ − vₙ)uₙ⁻¹ to some b, and as before, F(ax + b) = G(x). ∎
Extremal Distributions*

A distribution function F is extremal if it is nondegenerate and if, for some distribution function G and constants aₙ (aₙ > 0) and bₙ,

(14.17)  Gⁿ(aₙx + bₙ) ⇒ F(x).

These are the possible limiting distributions of normalized maxima (see (14.8)), and Examples 14.1, 14.2, and 14.3 give three specimens. The following analysis shows that these three examples exhaust the possible types.

Assume that F is extremal. From (14.17) follow Gⁿᵏ(aₙx + bₙ) ⇒ Fᵏ(x) and Gⁿᵏ(aₙₖx + bₙₖ) ⇒ F(x), and so by Theorem 14.2 there exist constants cₖ and dₖ such that cₖ is positive and

(14.18)  Fᵏ(x) = F(cₖx + dₖ).

From F(cⱼₖx + dⱼₖ) = Fʲᵏ(x) = Fʲ(cₖx + dₖ) = F(cⱼcₖx + cⱼdₖ + dⱼ) follow (Lemma 5) the relations

(14.19)  cⱼₖ = cⱼcₖ,  dⱼₖ = cⱼdₖ + dⱼ.

Of course, c₁ = 1 and d₁ = 0. There are three cases to be considered separately.
CASE 1. Suppose that cₖ = 1 for all k. Then

(14.20)  F^(1/k)(x) = F(x − dₖ).

This implies that F^(j/k)(x) = F(x + dⱼ − dₖ). For positive rational r = j/k, put δᵣ = dⱼ − dₖ; (14.19) implies that the definition is consistent, and Fʳ(x) = F(x + δᵣ). Since F is nondegenerate, there is an x such that 0 < F(x) < 1, and it follows by (14.20) that dₖ is decreasing in k, so that δᵣ is strictly decreasing in r.

*This topic may be omitted.
For positive real t, let φ(t) = inf{δᵣ: 0 < r ≤ t} (r rational in the infimum). Then φ(t) is decreasing in t, and

(14.21)  Fᵗ(x) = F(x + φ(t))

for all x and all positive t. Further, (14.19) implies that φ(st) = φ(s) + φ(t), so that by the theorem on Cauchy's equation [A20] applied to φ(eˣ), φ(t) = −β log t, where β > 0 because φ(t) is strictly decreasing. Now (14.21) with x = 0 gives F(φ(t)) = exp{t log F(0)}, that is, F(x) = exp{e^(−x/β) log F(0)}, and so F must be of the same type as

(14.22)  F₁(x) = e^(−e^(−x)).
Example 14.1 shows that this distribution function can arise as a limit of distributions of maxima; that is, F₁ is indeed extremal.

CASE 2. Suppose that cₖ₀ ≠ 1 for some k₀, which necessarily exceeds 1. Then there exists an x′ such that cₖ₀x′ + dₖ₀ = x′; but (14.18) then gives F^(k₀)(x′) = F(x′), so that F(x′) is 0 or 1. (In Case 1, F has the type (14.22) and so never assumes the values 0 and 1.) Now suppose further that, in fact, F(x′) = 0. Let x₀ be the supremum of those x for which F(x) = 0. By passing to a new F of the same type one can arrange that x₀ = 0; then F(x) = 0 for x ≤ 0 and F(x) > 0 for x > 0. The new F will satisfy (14.18), but with new constants dₖ. If a (new) dₖ is distinct from 0, then there is an x near 0 for which the arguments on the two sides of (14.18) have opposite signs. Therefore, dₖ = 0 for all k, and

(14.23)  Fᵏ(x) = F(cₖx)

for all k and x. This implies that F^(j/k)(x) = F(xcⱼ/cₖ). For positive rational r = j/k, put γᵣ = cⱼ/cₖ. The definition is again consistent by (14.19), and Fʳ(x) = F(γᵣx). Since 0 < F(x) < 1 for some x, necessarily positive, it follows by (14.23) that cₖ is decreasing in k, so that γᵣ is strictly decreasing in r. Put ψ(t) = inf{γᵣ: 0 < r ≤ t} for positive real t. From (14.19) follows ψ(st) = ψ(s)ψ(t), and by the corollary to the theorem on Cauchy's equation [A20] applied to ψ(eˣ), it follows that ψ(t) = t^(−ξ) for some ξ > 0. Since Fᵗ(x) = F(ψ(t)x) for all x and for t positive, F(x) = exp{x^(−1/ξ) log F(1)} for x > 0. Thus (take α = 1/ξ) F is of the same type as

(14.24)  F₂(x) = 0 if x ≤ 0,  F₂(x) = e^(−x^(−α)) if x > 0.
Example 14.2 shows that this case can arise.

CASE 3. Suppose as in Case 2 that cₖ₀ ≠ 1 for some k₀, so that F(x′) is 0 or 1 for some x′, but this time suppose that F(x′) = 1. Let x₁ be the infimum of those x for which F(x) = 1. By passing to a new F of the same type, arrange that x₁ = 0; then F(x) < 1 for x < 0 and F(x) = 1 for x ≥ 0. If dₖ ≠ 0, then for some x near 0, one side of (14.18) is 1 and the other is not. Thus dₖ = 0 for all k, and (14.23) again holds. And again γⱼ/ₖ = cⱼ/cₖ consistently defines a function satisfying Fʳ(x) = F(γᵣx). Since F is nondegenerate, 0 < F(x) < 1 for some x, but this time x is necessarily negative, so that cₖ is increasing. The same analysis as before shows that there is a positive ξ such that Fᵗ(x) = F(t^ξ x) for all x and for t positive. Thus F(x) = exp{(−x)^(1/ξ) log F(−1)} for x < 0, and F is of the type (take α = 1/ξ)

(14.25)  F₃(x) = e^(−(−x)^α) if x ≤ 0,  F₃(x) = 1 if x > 0.
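The three types can be checked directly against the stability relation (14.18), Fᵏ(x) = F(cₖx + dₖ): for (14.22), cₖ = 1 and dₖ = −log k; for (14.24), cₖ = k^(−1/α) and dₖ = 0; for (14.25), cₖ = k^(1/α) and dₖ = 0. A quick numerical confirmation (plain Python; α = 2 and the sample points are illustrative choices):

```python
import math

def F1(x):           # (14.22), the type of Example 14.1
    return math.exp(-math.exp(-x))

def F2(x, a=2.0):    # (14.24), the type of Example 14.2
    return 0.0 if x <= 0 else math.exp(-(x ** (-a)))

def F3(x, a=2.0):    # (14.25), the type of Example 14.3
    return 1.0 if x >= 0 else math.exp(-((-x) ** a))

a, k = 2.0, 7
for x in (-1.5, -0.5, 0.5, 1.5):
    assert abs(F1(x) ** k - F1(x - math.log(k))) < 1e-12
    assert abs(F2(x, a) ** k - F2(k ** (-1 / a) * x, a)) < 1e-12
    assert abs(F3(x, a) ** k - F3(k ** (1 / a) * x, a)) < 1e-12
print("stability relation (14.18) holds for all three types")
```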
Example 14.3 shows that this distribution function is indeed extremal. This completely characterizes the class of extremal distributions:
Theorem 14.3. The class of extremal distribution functions consists exactly of the distribution functions of the types (14.22), (14.24), and (14.25).
It is possible to go on and characterize the domains of attraction. That is, it is possible for each extremal distribution function F to describe the class of G satisfying (14.17) for some constants aₙ and bₙ, the class of G attracted to F.†

PROBLEMS

14.1. The general nondecreasing function F has at most countably many discontinuities. Prove this by considering the open intervals (sup{F(u): u < x}, inf{F(u): u > x}): each nonempty one contains a rational.

14.2. For distribution functions F, the second proof of Theorem 14.1 shows how to construct a measure μ on (R¹, 𝓡¹) such that μ(a, b] = F(b) − F(a).
(a) Extend to the case of bounded F.
(b) Extend to the general case. Hint: Let Fₙ(x) be −n or F(x) or n as F(x) < −n or −n ≤ F(x) ≤ n or n < F(x). Construct the corresponding μₙ and define μ(A) = limₙ μₙ(A).
14.3. (a) Suppose that X has a continuous, strictly increasing distribution function F. Show that the random variable F(X) is uniformly distributed over the unit interval in the sense that P[F(X) ≤ u] = u for 0 ≤ u ≤ 1. Passing from X to F(X) is called the probability transformation.
(b) Show that the function φ(u) defined by (14.5) satisfies F(φ(u)−) ≤ u ≤ F(φ(u)) and that, if F is continuous (but not necessarily strictly increasing), then F(φ(u)) = u for 0 < u < 1.
(c) Show that P[F(X) < u] = F(φ(u)−) and hence that the result in part (a) holds as long as F is continuous.
†This theory is associated with the names of Fisher, Fréchet, Gnedenko, and Tippett. For further information, see GALAMBOS.
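Problem 14.3 is easy to try out concretely. The sketch below (plain Python) computes the quantile function φ(u) = inf[x: u ≤ F(x)] of (14.5) by bisection, for the exponential distribution function as an arbitrary continuous, strictly increasing example; the bisection bracket and tolerance are implementation choices, not part of the problem.

```python
import math

def F(x):
    # exponential distribution function, an arbitrary continuous example
    return 1.0 - math.exp(-x) if x > 0 else 0.0

def phi(u, lo=0.0, hi=50.0):
    # phi(u) = inf[x: u <= F(x)], computed by bisection on [lo, hi]
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if u <= F(mid):
            hi = mid
        else:
            lo = mid
    return hi

for u in (0.1, 0.5, 0.9):
    print(u, phi(u), F(phi(u)))   # F(phi(u)) = u, as in part (b)
```

Since P[F(X) ≤ u] = P[X ≤ φ(u)] = F(φ(u)) = u here, this also illustrates part (a).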
14.4. ↑ Let C be the set of continuity points of F.
(a) Show that for every Borel set A, P[F(X) ∈ A, X ∈ C] is at most the Lebesgue measure of A.
(b) Show that if F is continuous at each point of F⁻¹A, then P[F(X) ∈ A] is at most the Lebesgue measure of A.
14.5. The Lévy distance d(F, G) between two distribution functions is the infimum of those ε such that G(x − ε) − ε ≤ F(x) ≤ G(x + ε) + ε for all x. Verify that this is a metric on the set of distribution functions. Show that a necessary and sufficient condition for Fₙ ⇒ F is that d(Fₙ, F) → 0.
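The Lévy distance of Problem 14.5 can be approximated in a few lines. The sketch below (plain Python) checks the defining inequalities on a finite grid, which only approximates the "for all x" of the definition; the grid, tolerance, and the test distributions (unit jumps, as in (14.12)) are illustrative choices.

```python
def levy_distance(F, G, grid, tol=1e-6):
    # smallest eps (within tol) with G(x-eps)-eps <= F(x) <= G(x+eps)+eps on the grid
    def ok(eps):
        return all(G(x - eps) - eps <= F(x) <= G(x + eps) + eps for x in grid)
    lo, hi = 0.0, 1.0     # eps = 1 always works for distribution functions
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if ok(mid):
            hi = mid
        else:
            lo = mid
    return hi

def jump_at(c):
    # distribution function with a unit jump at c, as in (14.12)
    return lambda x: 1.0 if x >= c else 0.0

grid = [i / 100.0 - 2.0 for i in range(401)]    # [-2, 2] in steps of 0.01
print(levy_distance(jump_at(0.0), jump_at(0.25), grid))   # close to 1/4
```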
14.6. 12.3 ↑ A Borel function satisfying Cauchy's equation [A20] is automatically bounded in some interval and hence satisfies f(x) = xf(1). Hint: Take K large enough that λ[x: x > s, |f(x)| ≤ K] > 0. Apply Problem 12.3 and conclude that f is bounded in some interval to the right of 0.
14.7. ↑ Consider sets S of reals that are linearly independent over the field of rationals in the sense that n₁x₁ + ⋯ + nₖxₖ = 0 for distinct points xᵢ in S and integers nᵢ (positive or negative) is impossible unless nᵢ = 0.
(a) By Zorn's lemma find
INTEGRATION

SECTION 15. THE INTEGRAL
The reasons for these conventions will become clear later. Also in force are the conventions of Section 10 for sums and limits involving infinity; see (10.3) and (10.4). If Aᵢ is empty, the infimum in (15.1) is by the standard convention ∞; but then μ(Aᵢ) = 0, so that by the convention (15.2), this term makes no contribution to the sum (15.1). The integral of f is defined as the supremum of the sums (15.1):

(15.3)  ∫f dμ = sup Σᵢ [inf{f(ω): ω ∈ Aᵢ}] μ(Aᵢ).

The supremum here extends over all finite decompositions {Aᵢ} of Ω into 𝓕-sets.

For general f, consider its positive part,

(15.4)  f⁺(ω) = f(ω) if 0 ≤ f(ω) ≤ ∞,  f⁺(ω) = 0 if −∞ ≤ f(ω) < 0,

and its negative part,

(15.5)  f⁻(ω) = −f(ω) if −∞ ≤ f(ω) < 0,  f⁻(ω) = 0 if 0 ≤ f(ω) ≤ ∞.

These functions are nonnegative and measurable, and f = f⁺ − f⁻. The general integral is defined by

(15.6)  ∫f dμ = ∫f⁺ dμ − ∫f⁻ dμ,

unless ∫f⁺ dμ = ∫f⁻ dμ = ∞, in which case f has no integral. If ∫f⁺ dμ and ∫f⁻ dμ are both finite, then f is integrable, or integrable μ, or summable, and has (15.6) as its definite integral. If ∫f⁺ dμ = ∞ and ∫f⁻ dμ < ∞, then f is not integrable but in accordance with (15.6) is assigned ∞ as its definite integral. Similarly, if ∫f⁺ dμ < ∞ and ∫f⁻ dμ = ∞, then f is not integrable but has definite integral −∞. Note that f can have a definite integral without being integrable; it fails to have a definite integral if and only if its positive and negative parts both have infinite integrals.

The really important case of (15.6) is that in which ∫f⁺ dμ and ∫f⁻ dμ are both finite. Allowing infinite integrals is a convention that simplifies the statements of various theorems, especially theorems involving nonnegative functions. Note that (15.6) is defined unless it involves "∞ − ∞"; if one term on the right is ∞ and the other is a finite real x, the difference is defined by the conventions ∞ − x = ∞ and x − ∞ = −∞.

The extension of the integral from the nonnegative case to the general case is consistent: (15.6) agrees with (15.3) if f is nonnegative, because then f⁻ = 0.
Nonnegative Functions
It is convenient first to analyze nonnegative functions.
Theorem 15.1. (i) If f = Σᵢ xᵢ I(Aᵢ) is a nonnegative simple function, {Aᵢ} being a finite decomposition of Ω into 𝓕-sets, then ∫f dμ = Σᵢ xᵢ μ(Aᵢ).
(ii) If 0 ≤ f(ω) ≤ g(ω) for all ω, then ∫f dμ ≤ ∫g dμ.
(iii) If 0 ≤ fₙ(ω) ↑ f(ω) for all ω, then 0 ≤ ∫fₙ dμ ↑ ∫f dμ.
(iv) For nonnegative functions f and g and nonnegative constants α and β, ∫(αf + βg) dμ = α∫f dμ + β∫g dμ.
In part (iii) the essential point is that ∫f dμ = limₙ ∫fₙ dμ, and it is important to understand that both sides of this equation may be ∞. If fₙ = I(Aₙ) and f = I(A), where Aₙ ↑ A, the conclusion is that μ is continuous from below (Theorem 10.2(i)): limₙ μ(Aₙ) = μ(A); this equation often takes the form ∞ = ∞.
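Part (i) makes the integral of a simple function directly computable. A minimal sketch (plain Python; the function `conv_mult` and the particular simple function are illustrative, and the conventions 0 · ∞ = 0 and x · ∞ = ∞ for x > 0 are those of (15.2)):

```python
import math

def conv_mult(x, m):
    # the conventions (15.2): 0 * inf = 0; x * inf = inf for x > 0
    if x == 0.0 or m == 0.0:
        return 0.0
    return x * m    # math.inf propagates correctly when x > 0

def integral_simple(values_and_measures):
    # Theorem 15.1(i): the integral of sum_i x_i I_{A_i} is sum_i x_i mu(A_i)
    return sum(conv_mult(x, m) for x, m in values_and_measures)

# f = 2 on a set of measure 3, 5 on a set of measure 1/2, and 0 on a set of
# infinite measure; the last term contributes nothing by (15.2)
print(integral_simple([(2.0, 3.0), (5.0, 0.5), (0.0, math.inf)]))   # -> 8.5
```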
PROOF OF (i). Let {Bⱼ} be a finite decomposition of Ω and let βⱼ be the infimum of f over Bⱼ. If Aᵢ ∩ Bⱼ ≠ ∅, then βⱼ ≤ xᵢ; therefore, Σⱼ βⱼ μ(Bⱼ) = Σᵢⱼ βⱼ μ(Aᵢ ∩ Bⱼ) ≤ Σᵢⱼ xᵢ μ(Aᵢ ∩ Bⱼ) = Σᵢ xᵢ μ(Aᵢ). On the other hand, there is equality here if {Bⱼ} coincides with {Aᵢ}. ∎

PROOF OF (ii). The sums (15.1) obviously do not decrease if f is replaced by g. ∎
PROOF OF (iii). By (ii) the sequence ∫fₙ dμ is nondecreasing and bounded above by ∫f dμ. It therefore suffices to show that ∫f dμ ≤ limₙ ∫fₙ dμ, or that

(15.7)  limₙ ∫fₙ dμ ≥ S = Σᵢ₌₁ᵐ vᵢ μ(Aᵢ),

if A₁, …, Aₘ is any decomposition of Ω into 𝓕-sets and vᵢ = inf{f(ω): ω ∈ Aᵢ}.

In order to see the essential idea of the proof, which is quite simple, suppose first that S is finite and all the vᵢ and μ(Aᵢ) are positive and finite. Fix an ε that is positive and less than each vᵢ, and put Aᵢₙ = [ω ∈ Aᵢ: fₙ(ω) > vᵢ − ε]. Since fₙ ↑ f, Aᵢₙ ↑ Aᵢ. Decompose Ω into A₁ₙ, …, Aₘₙ and the complement of their union, and observe that, since μ is continuous from below,

(15.8)  ∫fₙ dμ ≥ Σᵢ₌₁ᵐ (vᵢ − ε) μ(Aᵢₙ) → Σᵢ₌₁ᵐ (vᵢ − ε) μ(Aᵢ) = S − ε Σᵢ₌₁ᵐ μ(Aᵢ).

Since the μ(Aᵢ) are all finite, letting ε → 0 gives (15.7).
Now suppose only that S is finite. Each product vᵢμ(Aᵢ) is then finite; suppose it is positive for i ≤ m₀ and 0 for i > m₀. (Here m₀ ≤ m; if the product is 0 for all i, then S = 0 and (15.7) is trivial.) Now vᵢ and μ(Aᵢ) are positive and finite for i ≤ m₀ (one or the other may be ∞ for i > m₀). Define Aᵢₙ as before, but only for i ≤ m₀. This time decompose Ω into A₁ₙ, …, Aₘ₀ₙ and the complement of their union. Replace m by m₀ in (15.8) and complete the proof as before.

Finally, suppose that S = ∞. Then vᵢ₀μ(Aᵢ₀) = ∞ for some i₀, so that vᵢ₀ and μ(Aᵢ₀) are both positive and at least one is ∞. Suppose 0 < x < vᵢ₀ ≤ ∞ and 0 < y < μ(Aᵢ₀) ≤ ∞, and put Aᵢ₀ₙ = [ω ∈ Aᵢ₀: fₙ(ω) > x]. From fₙ ↑ f follows Aᵢ₀ₙ ↑ Aᵢ₀; hence μ(Aᵢ₀ₙ) > y for n exceeding some n₀. But then (decompose Ω into Aᵢ₀ₙ and its complement) ∫fₙ dμ ≥ x μ(Aᵢ₀ₙ) > xy for n > n₀, and therefore limₙ ∫fₙ dμ ≥ xy. If vᵢ₀ = ∞, let x → ∞, and if μ(Aᵢ₀) = ∞, let y → ∞. In either case (15.7) follows: limₙ ∫fₙ dμ = ∞. ∎
PROOF OF (iv). Suppose at first that f = Σᵢ xᵢ I(Aᵢ) and g = Σⱼ yⱼ I(Bⱼ) are simple. Then αf + βg = Σᵢⱼ (αxᵢ + βyⱼ) I(Aᵢ ∩ Bⱼ), and so

∫(αf + βg) dμ = Σᵢⱼ (αxᵢ + βyⱼ) μ(Aᵢ ∩ Bⱼ) = α Σᵢ xᵢ μ(Aᵢ) + β Σⱼ yⱼ μ(Bⱼ) = α ∫f dμ + β ∫g dμ.

Note that the argument is valid if some of α, β, xᵢ, yⱼ are infinite. Apart from this possibility, the ideas are as in the proof of (5.21). For general nonnegative f and g, there exist by Theorem 13.5 simple functions fₙ and gₙ such that 0 ≤ fₙ ↑ f and 0 ≤ gₙ ↑ g. But then 0 ≤ αfₙ + βgₙ ↑ αf + βg and ∫(αfₙ + βgₙ) dμ = α∫fₙ dμ + β∫gₙ dμ, so that (iv) follows from (iii). ∎
By part (i) of Theorem 15.1, the expected values of simple random variables in Chapter 1 are integrals: E[X] = ∫X(ω)P(dω). This also covers the step functions in Section 1 (see (1.6)). The relation between the Riemann integral and the integral as defined here will be studied in Section 17.

Example 15.1. Consider the line (R¹, 𝓡¹, λ) with Lebesgue measure. Suppose that −∞ < a₀ < a₁ < ⋯ < aₘ < ∞, and let f be the function with nonnegative value xᵢ on (aᵢ₋₁, aᵢ], i = 1, …, m, and value 0 on (−∞, a₀] and (aₘ, ∞). By part (i) of Theorem 15.1, ∫f dλ = Σᵢ₌₁ᵐ xᵢ(aᵢ − aᵢ₋₁) because of the convention 0 · ∞ = 0; see (15.2). If the "area under the curve" to the left of a₀ and to the right of aₘ is to be 0, this convention is inevitable. From ∞ · 0 = 0 it follows that ∫f dλ = 0 if f is ∞ at a single point (say) and 0 elsewhere.
If f = I((a, ∞)), the area-under-the-curve point of view makes ∫f dλ = ∞ natural. Hence the second convention in (15.2), which also requires that the integral be infinite if f is ∞ on a nonempty interval and 0 elsewhere. ∎

Recall that almost everywhere means outside a set of measure 0.

Theorem 15.2. Suppose that f and g are nonnegative.
(i) If f = 0 almost everywhere, then ∫f dμ = 0.
(ii) If μ[ω: f(ω) > 0] > 0, then ∫f dμ > 0.
(iii) If ∫f dμ < ∞, then f < ∞ almost everywhere.
(iv) If f ≤ g almost everywhere, then ∫f dμ ≤ ∫g dμ.
(v) If f = g almost everywhere, then ∫f dμ = ∫g dμ.
=
•
[ ]
L��f f ] JL( A [ ;] < I: [ inf g ] JL( AJl G) < jgdJL,
I: i�iff JL ( A ; ) = I: i� f JL ( A ; n G) < I:
a
;
•
•
n G)
kI n G
where the last inequality comes from a consideration of the decomposition • A 1 n G, . . . , A m n G, G c. Th is proves (iv), and (v) follows immediately. Suppose that f = g almost everywhere, where f and g need not be nonnegative. If f has a definite integral, then since r = g + and r = g almost everywhere, it follows by Theorem 15.2(v) that g also has a definite integral and ffdJL = fgdJL. Uniqueness Although there are various ways to frame the definition of the integral, they are all equivalent-they all assign the same value to ffdJL . This is because the integral is uniquely determined by certain simple properties it is natural to require of it. It is natural to want the integral to have properties (i) and (iii) of Theorem 15.1. But these uniquely determine the integral for nonnegative functions: For f nonnega tive, there exist by Theorem 13.5 simple functions [, such that 0 < fn i f; by (iii), ffdJL must be lim , ffn dJL , and (i) determines the value of each ffn dJL.
Property (i) can itself be derived from (iv) (linearity) together with the assumption that ∫I(A) dμ = μ(A) for indicators I(A): ∫(Σᵢ xᵢ I(Aᵢ)) dμ = Σᵢ xᵢ ∫I(Aᵢ) dμ = Σᵢ xᵢ μ(Aᵢ). If (iv) of Theorem 15.1 is to persist when the integral is extended beyond the class of nonnegative functions, ∫f dμ must be ∫(f⁺ − f⁻) dμ = ∫f⁺ dμ − ∫f⁻ dμ, which makes the definition (15.6) inevitable.
PROBLEMS

These problems outline alternative definitions of the integral and clarify the role measurability plays. Call (15.3) the lower integral, and write it as

(15.9)  ∫_* f dμ = sup Σᵢ [inf{f(ω): ω ∈ Aᵢ}] μ(Aᵢ),

to distinguish it from the upper integral

(15.10)  ∫* f dμ = inf Σᵢ [sup{f(ω): ω ∈ Aᵢ}] μ(Aᵢ).

The infimum in (15.10), like the supremum in (15.9), extends over all finite partitions {Aᵢ} of Ω into 𝓕-sets.
f is measurable and nonnegative. Show that f *fdp. = oo if p.[w: f(w) 0 = or if p.[w: f(w) 0 for all
15. 1. Suppose that
>]
oo
> a] >
a
There are many functions familiar from calculus that ought to be integrable but are of the types in the preceding problem and hence have infinite upper integral. Examples are x⁻² I((1, ∞))(x) and x^(−1/2) I((0, 1))(x). Therefore, (15.10) is inappropriate as a definition of ∫f dμ for nonnegative f. The only problem with (15.10), however, is that it treats infinity the wrong way. To see this, and to focus on essentials, assume that μ(Ω) < ∞ and that f is bounded, although not necessarily nonnegative or measurable 𝓕.
15.2. ↑
(a) Show that

Σᵢ [inf{f(ω): ω ∈ Aᵢ}] μ(Aᵢ) ≤ Σⱼ [inf{f(ω): ω ∈ Bⱼ}] μ(Bⱼ)

if {Bⱼ} refines {Aᵢ}. Prove a dual relation for the sums in (15.10) and conclude that

(15.11)  ∫_* f dμ ≤ ∫* f dμ.

(b) Now assume that f is measurable 𝓕 and let M be a bound for |f|. Consider the partition Aᵢ = [ω: (i − 1)ε < f(ω) ≤ iε], |i| ≤ M/ε. Show that ∫* f dμ − ∫_* f dμ ≤ ε μ(Ω), and conclude that

(15.12)  ∫_* f dμ = ∫* f dμ.

To define the integral as the common value in (15.12) is the Darboux-Young approach. The advantage of (15.3) as a definition is that (in the nonnegative case) it applies at once to unbounded f and infinite μ(Ω).

15.3. 3.2, 15.2 ↑ For A ⊂ Ω, define μ*(A) and μ_*(A) by (3.9) and (3.10) with μ in place of P. Show that ∫* I(A) dμ = μ*(A) and ∫_* I(A) dμ = μ_*(A) for every A. Therefore, (15.12) can fail if f is not measurable 𝓕. (Where was measurability used in the proof of (15.12)?)

The definitions (15.3) and (15.6) always make formal sense (for finite μ(Ω) and sup|f|), but they are reasonable, and accord with intuition, only if (15.12) holds. Under what conditions does it hold?

15.4. 10.5, 15.3 ↑
(a) Suppose of f that there exist a function g, measurable 𝓕, and an 𝓕-set A such that μ(A) = 0 and [f ≠ g] ⊂ A. This is the same thing as assuming that μ*[f ≠ g] = 0, or assuming that f is measurable with respect to 𝓕 completed with respect to μ. Show that (15.12) holds.
(b) Show that if (15.12) holds, then so does the italicized condition in part (a).
Rather than assume that f is measurable 𝓕, one can assume that it satisfies the italicized condition in Problem 15.4(a), which in case (Ω, 𝓕, μ) is complete is the same thing anyway. For the next three problems, assume that μ(Ω) < ∞ and that f is measurable 𝓕 and bounded.
15.5. ↑ Show that for positive ε there exists a finite partition {Aᵢ} such that, if {Bⱼ} is any finer partition and ωⱼ ∈ Bⱼ, then

|∫f dμ − Σⱼ f(ωⱼ) μ(Bⱼ)| < ε.

15.6. ↑ Show that

∫f dμ = limₙ Σ{|k| ≤ n2ⁿ} ((k − 1)/2ⁿ) μ[ω: (k − 1)/2ⁿ ≤ f(ω) < k/2ⁿ].

The limit on the right here is Lebesgue's definition of the integral.
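Lebesgue's definition in Problem 15.6 can be tried out concretely. The sketch below (plain Python) takes f(x) = x² on Ω = [0, 1] with μ = Lebesgue measure, an illustrative choice for which μ[ω: a ≤ f(ω) < b] = min(√b, 1) − min(√a, 1) can be written down directly, and compares the Lebesgue sums with ∫₀¹ x² dx = 1/3.

```python
def mu_level_set(a, b):
    # Lebesgue measure of [x in [0,1]: a <= x^2 < b], for 0 <= a < b
    lo = min(max(a, 0.0), 1.0) ** 0.5
    hi = min(max(b, 0.0), 1.0) ** 0.5
    return hi - lo

def lebesgue_sum(n):
    # sum over |k| <= n 2^n of ((k-1)/2^n) mu[(k-1)/2^n <= f < k/2^n];
    # f = x^2 is nonnegative, so only k >= 1 contributes
    h = 2.0 ** n
    return sum(((k - 1) / h) * mu_level_set((k - 1) / h, k / h)
               for k in range(1, int(n * h) + 1))

for n in (2, 5, 10):
    print(n, lebesgue_sum(n))   # increases toward 1/3
```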
15.7. ↑ Suppose that the integral is defined for simple nonnegative functions by ∫(Σᵢ xᵢ I(Aᵢ)) dμ = Σᵢ xᵢ μ(Aᵢ). Suppose that fₙ and gₙ are simple and nondecreasing and have a common limit: 0 ≤ fₙ ↑ f and 0 ≤ gₙ ↑ f. Adapt the arguments used to prove Theorem 15.1(iii) and show that limₙ ∫fₙ dμ = limₙ ∫gₙ dμ. Thus, in the nonnegative case, ∫f dμ can (Theorem 13.5) consistently be defined as limₙ ∫fₙ dμ for simple functions fₙ for which 0 ≤ fₙ ↑ f.
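A standard choice for the simple functions of Theorem 13.5 invoked in Problem 15.7 is the dyadic approximation fₙ(ω) = min(⌊2ⁿf(ω)⌋/2ⁿ, n). The sketch below (plain Python; the test function eˣ is an arbitrary nonnegative example) checks 0 ≤ fₙ ↑ f pointwise.

```python
import math

def dyadic_approx(f, n):
    # the standard simple functions of Theorem 13.5: round f down to the
    # nearest multiple of 2^{-n} and cap the result at n
    def fn(w):
        return min(math.floor(2.0 ** n * f(w)) / 2.0 ** n, float(n))
    return fn

f = lambda w: math.exp(w)       # an arbitrary nonnegative example
for w in (0.0, 0.7, 2.3):
    vals = [dyadic_approx(f, n)(w) for n in range(1, 12)]
    assert all(a <= b for a, b in zip(vals, vals[1:]))   # nondecreasing in n
    print(w, vals[-1], f(w))    # f_11(w) is already close to f(w)
```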
SECTION 16. PROPERTIES OF THE INTEGRAL

Equalities and Inequalities

By definition, the requirement for integrability of f is that ∫f⁺ dμ and ∫f⁻ dμ both be finite, which is the same as the requirement that ∫f⁺ dμ + ∫f⁻ dμ < ∞ and hence is the same as the requirement that ∫|f| dμ < ∞.

If f ≤ g almost everywhere, then f⁺ ≤ g⁺ and f⁻ ≥ g⁻ almost everywhere, and so (16.2) follows by the definition (15.6). ∎

PROOF OF (ii). First, αf + βg is integrable because, by Theorem 15.1,

∫|αf + βg| dμ ≤ ∫(|α| · |f| + |β| · |g|) dμ = |α| ∫|f| dμ + |β| ∫|g| dμ < ∞.

That ∫αf dμ = α∫f dμ can be checked separately for the cases α ≥ 0 and α < 0. Therefore, it is enough to check (16.3) for the case α = β = 1. By definition, (f + g)⁺ − (f + g)⁻ = f + g = f⁺ − f⁻ + g⁺ − g⁻, and therefore (f + g)⁺ + f⁻ + g⁻ = (f + g)⁻ + f⁺ + g⁺. All these functions being nonnegative, ∫(f + g)⁺ dμ + ∫f⁻ dμ + ∫g⁻ dμ = ∫(f + g)⁻ dμ + ∫f⁺ dμ + ∫g⁺ dμ, which can be rearranged to give ∫(f + g)⁺ dμ − ∫(f + g)⁻ dμ = ∫f⁺ dμ − ∫f⁻ dμ + ∫g⁺ dμ − ∫g⁻ dμ. But this reduces to (16.3). ∎
Since −|f| ≤ f ≤ |f|, it follows that |∫f dμ| ≤ ∫|f| dμ.

Example 16.1. Let Ω be the set of positive integers, let 𝓕 consist of all subsets, and let μ be counting measure, so that a real function f on Ω is a sequence {xₘ}. The function corresponding to x₁, …, xₙ, 0, 0, … has integral Σₘ₌₁ⁿ xₘ by Theorem 15.1(i) (consider the decomposition {1}, …, {n}, {n + 1, n + 2, …}). It follows by Theorem 15.1(iii) that in the nonnegative case the integral of the function given by {xₘ} is the sum Σₘ xₘ (finite or infinite) of the corresponding infinite series. In the general case the function is integrable if and only if Σₘ₌₁^∞ |xₘ| is a convergent infinite series, in which case the integral is Σₘ₌₁^∞ xₘ⁺ − Σₘ₌₁^∞ xₘ⁻.

The function xₘ = (−1)^(m+1) m⁻¹ is not integrable by this definition and even fails to have a definite integral, since Σₘ₌₁^∞ xₘ⁺ = Σₘ₌₁^∞ xₘ⁻ = ∞. This invites comparison with the ordinary theory of infinite series, according to which the alternating harmonic series does converge in the sense that lim_M Σₘ₌₁^M (−1)^(m+1) m⁻¹ = log 2. But since this says that the sum of the first M terms has a limit, it requires that the elements of the space Ω be ordered. If Ω consists not of the positive integers but, say, of the integer lattice points in 3-space, it has no canonical linear ordering. And if Σₘ xₘ is to have the same finite value no matter what the order of summation, the series must be absolutely convergent.† This helps to explain why f is defined to be integrable only if ∫f⁺ dμ and ∫f⁻ dμ are both finite. ∎
Example 16.2. In connection with Example 15.1, consider the function f = 3I((a, ∞)) − 2I((−∞, a)). There is no natural value for ∫f dλ (it is "∞ − ∞"), and none is assigned by the definition. ∎

†RUDIN, p. 76.
If a function f is bounded on bounded intervals, then each function fₙ = f I((−n, n]) is integrable with respect to λ. Since f = limₙ fₙ, the limit of ∫fₙ dλ, if it exists, is sometimes called the "principal value" of the integral of f. Although it is natural for some purposes to integrate symmetrically about the origin, this is not the right definition of the integral in the context of general measure theory. The functions gₙ = f I((−n, n + 1]), for example, also converge to f, and ∫gₙ dλ may have some other limit, or none at all; f(x) = x is a case in point. There is no general reason why fₙ should take precedence over gₙ. As in the preceding example, f = Σₖ₌₁^∞ (−1)ᵏ k⁻¹ I((k, k + 1]) has no integral, even though the ∫fₙ dλ above converge. ∎
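For f(x) = x the contrast between fₙ and gₙ is easy to compute. The sketch below (plain Python; the midpoint rule and step count are implementation choices) approximates the two integrals numerically: ∫fₙ dλ vanishes for every n, while ∫gₙ dλ grows like n + 1/2.

```python
def integrate(f, a, b, steps=100000):
    # midpoint-rule approximation to the integral of f over (a, b];
    # exact (up to rounding) for linear f
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

f = lambda x: x
for n in (1, 5, 20):
    print(n, integrate(f, -n, n), integrate(f, -n, n + 1))
    # first column stays 0; second column is n + 1/2
```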
Integration to the Limit
The first result, the monotone convergence theorem, essentially restates Theorem 15.1(iii).

Theorem 16.2. If 0 ≤ fₙ ↑ f almost everywhere, then ∫fₙ dμ ↑ ∫f dμ.

If fₙ ≥ 0, then ∫Σₙ fₙ dμ = Σₙ ∫fₙ dμ. The members of this last equation are both equal either to ∞ or to the same finite, nonnegative real number.
Theorem 16.7. If Σ_n f_n converges almost everywhere and |Σ_{k=1}^n f_k| ≤ g almost everywhere, where g is integrable, then Σ_n f_n and the f_n are integrable and ∫ Σ_n f_n dμ = Σ_n ∫f_n dμ.
Corollary. If Σ_n ∫|f_n| dμ < ∞, then Σ_n f_n converges absolutely almost everywhere and is integrable, and ∫ Σ_n f_n dμ = Σ_n ∫f_n dμ.
PROOF. The function g = Σ_n |f_n| is integrable by Theorem 16.6 and is finite almost everywhere by Theorem 15.2(iii). Hence Σ_n |f_n| and Σ_n f_n converge almost everywhere, and Theorem 16.7 applies. ∎
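The corollary can be illustrated numerically with counting measure on {1, 2, …}, where integration is summation: when Σ_n Σ_m |f_n(m)| < ∞, the two iterated sums must agree. The particular array and the truncation levels below are illustrative choices, not from the text.

```python
def f(n, m):
    # A doubly indexed array with sum_n sum_m |f(n, m)| finite.
    return (-1) ** n / (2 ** n * m ** 2)

N, M = 30, 2000  # truncation levels; the discarded tails are provably tiny

abs_total = sum(abs(f(n, m)) for n in range(1, N) for m in range(1, M))
sum_n_first = sum(sum(f(n, m) for n in range(1, N)) for m in range(1, M))
sum_m_first = sum(sum(f(n, m) for m in range(1, M)) for n in range(1, N))

assert abs_total < 2.0                                # absolute summability
assert abs(sum_n_first - sum_m_first) < 1e-9          # iterated sums agree
```

The exact value here is (Σ_n (−1/2)^n)(Σ_m m⁻²) = −π²/18 ≈ −0.5483, which the truncated sums approach.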
In place of a sequence {f_n} of real measurable functions on (Ω, 𝓕, μ), consider a family [f_t: t > 0] indexed by a continuous parameter t. Suppose of a measurable f that

(16.8) lim_{t→∞} f_t(ω) = f(ω)

on a set A, where

(16.9) A ∈ 𝓕, μ(Ω − A) = 0.
A technical point arises here, since 𝓕 need not contain the ω-set where (16.8) holds:
Example 16.8. Let 𝓕 consist of the Borel subsets of Ω = [0, 1), and let H be a nonmeasurable set, a subset of Ω that does not lie in 𝓕 (see the end of Section 3). Define f_t(ω) = 1 if ω equals the fractional part t − ⌊t⌋ of t and their common value lies in H^c; define f_t(ω) = 0 otherwise. Each f_t is measurable 𝓕, but if f(ω) ≡ 0, then the ω-set where (16.8) holds is exactly H. ∎
Because of such examples, the set A above must be assumed to lie in 𝓕. (Because of Theorem 13.4, no such assumption is necessary in the case of sequences.) Suppose that f and the f_t are integrable. If I_t = ∫f_t dμ converges to I = ∫f dμ as t → ∞, then certainly I_{t_n} → I for each sequence {t_n} going to infinity. But the converse holds as well: if I_t does not converge to I, then there is a positive ε such that |I_{t_n} − I| ≥ ε for a sequence {t_n} going to infinity. To the question of whether I_{t_n} converges to I the previous theorems apply. Suppose that (16.8) and |f_t(ω)| ≤ g(ω) both hold for ω ∈ A, where A satisfies (16.9) and g is integrable. By the dominated convergence theorem, f and the f_t must then be integrable and I_{t_n} → I for each sequence {t_n} going to infinity. It follows that ∫f_t dμ → ∫f dμ. In this result t could go continuously to 0 or to some other value instead of to infinity.
Theorem 16.8. Suppose that f(ω, t) is a measurable and integrable function of ω for each t in (a, b). Let φ(t) = ∫f(ω, t) μ(dω).
(i) Suppose that for ω ∈ A, where A satisfies (16.9), f(ω, t) is continuous in t at t₀; suppose further that |f(ω, t)| ≤ g(ω) for ω ∈ A and |t − t₀| < δ, where δ is independent of ω and g is integrable. Then φ(t) is continuous at t₀.
(ii) Suppose that for ω ∈ A, where A satisfies (16.9), f(ω, t) has in (a, b) a derivative f′(ω, t); suppose further that |f′(ω, t)| ≤ g(ω) for ω ∈ A and t ∈ (a, b), where g is integrable. Then φ(t) has derivative ∫f′(ω, t) μ(dω) on (a, b).†
PROOF. Part (i) is an immediate consequence of the preceding discussion. To prove part (ii), consider a fixed t. If ω ∈ A, then by the mean-value theorem,

(f(ω, t + h) − f(ω, t))/h = f′(ω, s),

where s lies between t and t + h. The ratio on the left goes to f′(ω, t) as h → 0 and is by hypothesis dominated by the integrable function g(ω). Therefore,

(φ(t + h) − φ(t))/h = ∫ (f(ω, t + h) − f(ω, t))/h μ(dω) → ∫ f′(ω, t) μ(dω). ∎
The condition involving g in part (ii) can be weakened. It suffices to assume that for each t there is an integrable g such that |f′(ω, s)| ≤ g(ω) for ω ∈ A and all s in some neighborhood of t.
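Theorem 16.8(ii) can be checked numerically. A minimal sketch, assuming f(ω, t) = e^{tω} on Ω = [0, 1] with Lebesgue measure (my own choice of example, with |f′(ω, t)| = ωe^{tω} ≤ e² an integrable dominating function on (0, 2)); a midpoint rule stands in for the integral:

```python
import math

def quad(h, n=20000):
    # midpoint rule for the integral of h over [0, 1]
    return sum(h((i + 0.5) / n) for i in range(n)) / n

t = 1.0
phi = lambda s: quad(lambda w: math.exp(s * w))        # phi(t) = integral of f
dphi_direct = quad(lambda w: w * math.exp(t * w))      # differentiate inside
eps = 1e-6
dphi_limit = (phi(t + eps) - phi(t - eps)) / (2 * eps) # differentiate outside

assert abs(dphi_direct - dphi_limit) < 1e-5
```

Here the exact value is φ′(1) = ∫₀¹ ωe^ω dω = 1, and the two ways of differentiating agree to quadrature accuracy.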
Integration over Sets

The integral of f over a set A in 𝓕 is defined by

(16.10) ∫_A f dμ = ∫ I_A f dμ.

The definition applies if f is defined only on A in the first place (set f = 0 outside A). Notice that ∫_A f dμ = 0 if μ(A) = 0. All the concepts and theorems above carry over in an obvious way to integrals over A. Theorems 16.6 and 16.7 yield this result:
Theorem 16.9. If A₁, A₂, … are disjoint, and if f is either nonnegative or integrable, then ∫_{∪_n A_n} f dμ = Σ_n ∫_{A_n} f dμ.

† Letting h go to 0 through a sequence shows that each f′(·, t) is measurable 𝓕 on A; take it to be 0, say, elsewhere.
SECTION 16. PROPERTIES OF THE INTEGRAL
The integrals (16.10) usually suffice to determine f:

Theorem 16.10. (i) If f and g are nonnegative and ∫_A f dμ = ∫_A g dμ for all A in 𝓕, and if μ is σ-finite, then f = g almost everywhere.
(ii) If f and g are integrable and ∫_A f dμ = ∫_A g dμ for all A in 𝓕, then f = g almost everywhere.
(iii) If f and g are integrable and ∫_A f dμ = ∫_A g dμ for all A in 𝓟, where 𝓟 is a π-system generating 𝓕 and Ω is a finite or countable union of 𝓟-sets, then f = g almost everywhere.

PROOF. Suppose that f and g are nonnegative and that ∫_A f dμ ≤ ∫_A g dμ for all A in 𝓕. If μ is σ-finite, there are 𝓕-sets A_n such that A_n ↑ Ω and μ(A_n) < ∞. If B_n = [0 ≤ g …]

PROBLEMS

16.1. (a) … f_n → f uniformly, and deduce ∫f_n dμ → ∫f dμ from (16.5). (b) Use part (a) and Egoroff's theorem to give another proof of Theorem 16.5.
16.2. Prove that if 0 ≤ f_n → f almost everywhere and ∫f_n dμ ≤ A < ∞, then f is integrable and ∫f dμ ≤ A. (This is essentially the same as Fatou's lemma and is sometimes called by that name.)

16.3. Suppose that the f_n are integrable and sup_n ∫f_n dμ < ∞. Show that, if f_n ↑ f, then f is integrable and ∫f_n dμ → ∫f dμ. This is Beppo Levi's theorem.

16.4. (a) Suppose that functions a_n, b_n, f_n converge almost everywhere to functions a, b, f, respectively. Suppose that the first two sequences may be integrated to the limit, that is, the functions are all integrable and ∫a_n dμ → ∫a dμ, ∫b_n dμ → ∫b dμ. Suppose, finally, that the first two sequences enclose the third: a_n ≤ f_n ≤ b_n almost everywhere. Show that the third may be integrated to the limit.
(b) Deduce Lebesgue's dominated convergence theorem from part (a).

16.5. About Theorem 16.8:
(a) Part (i) is local: there can be a different set A for each t₀. Part (ii) can be recast as a local theorem. Suppose that for ω ∈ A, where A satisfies (16.9),
f(ω, t) has derivative f′(ω, t₀) at t₀; suppose further that

(16.32) |h⁻¹(f(ω, t₀ + h) − f(ω, t₀))| ≤ g₁(ω)

for ω ∈ A and 0 < |h| < δ, where δ is independent of ω and g₁ is integrable. Then φ′(t₀) = ∫f′(ω, t₀) μ(dω). The natural way to check (16.32), however, is by the mean-value theorem, and this requires (for ω ∈ A) a derivative throughout a neighborhood of t₀.
(b) If μ is Lebesgue measure on the unit interval Ω, (a, b) = (0, 1), and f(ω, t) = I…
For positive h,

|h⁻¹ ∫_x^{x+h} f(y) dy − f(x)| ≤ h⁻¹ ∫_x^{x+h} |f(y) − f(x)| dy ≤ sup[|f(y) − f(x)|: x ≤ y ≤ x + h],
and the right side goes to 0 with h if f is continuous at x. The same thing holds for negative h, and therefore ∫_a^x f(y) dy has derivative f(x):

(17.5) d/dx ∫_a^x f(y) dy = f(x)

if f is continuous at x. Suppose that F is a function with continuous derivative F′ = f; suppose, that is, that F is a primitive of the continuous function f. Then

(17.6) ∫_a^b f(x) dx = ∫_a^b F′(x) dx = F(b) − F(a),

as follows from the fact that F(x) − F(a) and ∫_a^x f(y) dy agree at x = a and by (17.5) have identical derivatives. For continuous f, (17.5) and (17.6) are two ways of stating the fundamental theorem of calculus. To the calculation of Lebesgue integrals the methods of elementary calculus thus apply.

As will follow from the general theory of derivatives in Section 31, (17.5) holds outside a set of Lebesgue measure 0 if f is integrable; it need not be continuous. As the following example shows, however, (17.6) can fail for discontinuous f.
Example 17.4. Define F(x) = x² sin x⁻² for 0 < x ≤ ½ and F(x) = 0 for x ≤ 0 and for x ≥ 1; now for ½ < x < 1 define F(x) in such a way that F is continuously differentiable over (0, ∞). Then F is everywhere differentiable, but F′(0) = 0 and F′(x) = 2x sin x⁻² − 2x⁻¹ cos x⁻² for 0 < x < ½. Thus F′ is discontinuous at 0; F′ is, in fact, not even integrable over (0, 1], which makes (17.6) impossible for a = 0. For a more extreme example, decompose (0, 1] into countably many subintervals (a_n, b_n]. Define G(x) = 0 for x ≤ 0 and x > 1, and on (a_n, b_n] define G(x) = F((x − a_n)/(b_n − a_n)). Then G is everywhere differentiable, but (17.6) is impossible for G if (a, b] contains any of the (a_n, b_n], because G′ is not integrable over any of them. ∎

Change of Variable
SECTION 17. THE INTEGRAL WITH RESPECT TO LEBESGUE MEASURE

For a map

(17.7) T: [a, b] → R¹,

the change-of-variable formula is

(17.8) ∫_a^b f(Tx) T′(x) dx = ∫_{Ta}^{Tb} f(y) dy.
If T′ exists and is continuous, and if f is continuous, the two integrals are finite because the integrands are bounded, and to prove (17.8) it is enough to let b be a variable and differentiate with respect to it.† With the obvious limiting arguments, this applies to unbounded intervals and to open ones:

Example 17.5. Put T(x) = tan x on (−π/2, π/2). Then T′(x) = 1 + T²(x), and (17.8) applied to f(y) = (1 + y²)⁻¹ gives

(17.9) ∫_{−∞}^{∞} (1 + y²)⁻¹ dy = π. ∎
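The substitution in Example 17.5 can be checked numerically. The check below is my own sketch: with T(x) = tan x and f(y) = (1 + y²)⁻¹, the substituted integrand f(tan x)(1 + tan² x) is identically 1 on (−π/2, π/2), so the left side is exactly π, and a truncation of the integral over the line approaches the same value.

```python
import math

def midpoint(h, a, b, n=100000):
    # simple midpoint rule for the integral of h over (a, b)
    step = (b - a) / n
    return sum(h(a + (i + 0.5) * step) for i in range(n)) * step

f = lambda y: 1.0 / (1.0 + y * y)
lhs = midpoint(lambda x: f(math.tan(x)) * (1 + math.tan(x) ** 2),
               -math.pi / 2, math.pi / 2)      # substituted integral
rhs = midpoint(f, -1000.0, 1000.0)             # truncation of integral over R

assert abs(lhs - math.pi) < 1e-6
assert abs(rhs - math.pi) < 0.01               # tail beyond |y| = 1000 is ~ 0.002
```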
The Lebesgue Integral in R^k

The k-dimensional Lebesgue integral, the integral in (R^k, 𝓡^k, λ_k), is denoted ∫f(x) dx, x = (x₁, …, x_k). In low-dimensional cases it is also denoted ∫∫f(x₁, x₂) dx₁ dx₂, and so on.

As for the rule for changing variables, suppose that T: U → R^k, where U is an open set in R^k. The map has the form Tx = (t₁(x), …, t_k(x)); it is by definition continuously differentiable if the partial derivatives t_{ij}(x) = ∂t_i(x)/∂x_j exist and are continuous in U. Let D_x = [t_{ij}(x)] be the Jacobian matrix, let J(x) = det D_x be the Jacobian determinant, and let V = TU.

Theorem 17.2. Let T be a continuously differentiable map of the open set U onto V. Suppose that T is one-to-one and that J(x) ≠ 0 for all x. If f is nonnegative, then

(17.10) ∫_U f(Tx) |J(x)| dx = ∫_V f(y) dy.
By the inverse-function theorem [A35], V is open and the inverse mapping T⁻¹ is continuously differentiable. It is assumed in (17.10) that f: V → R¹ is a Borel function. As usual, for the general f, (17.10) holds with |f| in place of f, and if the two sides are finite, the absolute-value bars can be removed; and of course f can be replaced by f I_{TA}.

† See Problem 17.11 for extensions.
Example 17.6. Suppose that T is a nonsingular linear transformation on U = V = R^k. Then D_x is for each x the matrix of the transformation. If T is identified with this matrix, then (17.10) becomes

(17.11) |det T| ∫_U f(Tx) dx = ∫_V f(y) dy.

If f = I_{TA}, this holds because of (12.2), and then it follows in the usual sequence for simple f and for the general nonnegative f: Theorem 17.2 is easy in the linear case. ∎
Example 17.7. In R², take U = [(ρ, θ): ρ > 0, 0 < θ < 2π] and T(ρ, θ) = (ρ cos θ, ρ sin θ). The Jacobian is J(ρ, θ) = ρ, and (17.10) gives the formula for integrating in polar coordinates:

(17.12) ∫∫_{ρ>0, 0<θ<2π} f(ρ cos θ, ρ sin θ) ρ dρ dθ = ∫∫ f(x, y) dx dy. ∎
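Formula (17.12) can be spot-checked numerically. A sketch, with f(x, y) = e^{−(x²+y²)} as my own choice of integrand: in polar coordinates the integral is 2π ∫₀^∞ e^{−ρ²} ρ dρ = π, while the Cartesian integral factors into the square of a one-dimensional Gaussian integral.

```python
import math

def midpoint(h, a, b, n=200000):
    # midpoint rule for the integral of h over (a, b)
    step = (b - a) / n
    return sum(h(a + (i + 0.5) * step) for i in range(n)) * step

# polar: 2*pi * integral of exp(-r^2) * r dr  (tail beyond r = 10 is negligible)
polar = 2 * math.pi * midpoint(lambda r: math.exp(-r * r) * r, 0.0, 10.0)

# Cartesian: (integral of exp(-x^2) dx)^2
one_d = midpoint(lambda x: math.exp(-x * x), -10.0, 10.0)
cartesian = one_d ** 2

assert abs(polar - math.pi) < 1e-6
assert abs(cartesian - math.pi) < 1e-6
```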
(17.14) ∫_A |J(x)| dx = λ_k(TA).

† SPIVAK, p. 72.
Each side of (17.14) is a measure on 𝓡_U = U ∩ 𝓡^k. If 𝒜 consists of the rectangles A satisfying A⁻ ⊂ U, then 𝒜 is a semiring generating 𝓡_U, U is a countable union of 𝒜-sets, and the left side of (17.14) is finite for A in 𝒜 (sup_{A⁻} |J| < ∞). It follows by Corollary 2 to Theorem 11.4 that if (17.14) holds for A in 𝒜, then it holds for A in 𝓡_U. But then (linearity and monotone convergence) (17.13) will follow.

Proof of (17.14) for A in 𝒜. Split the given rectangle A into finitely many subrectangles Q_i satisfying

(17.15) diam Q_i < δ,

δ to be determined. Let x_i be some point of Q_i. Given ε, choose δ in the first place so that |J(x) − J(x′)| < ε if x, x′ ∈ A⁻ and |x − x′| < δ. Then (17.15) implies

(17.16) |Σ_i |J(x_i)| λ_k(Q_i) − ∫_A |J(x)| dx| ≤ ε λ_k(A).

Let Q_i^ε be a rectangle that is concentric with Q_i and similar to it and whose edge lengths are those of Q_i multiplied by 1 + ε. For x in U consider the affine transformation

(17.17) ψ_x z = Tx + D_x(z − x);

ψ_x z will [A34] be a good approximation to Tz for z near x. Suppose, as will be proved in a moment, that for each ε there is a δ such that, if (17.15) holds, then, for each i, ψ_{x_i} approximates T so well on Q_i that
(17.18) TQ_i ⊂ ψ_{x_i} Q_i^ε.

By Theorem 12.2, which shows in the nonsingular case how an affine transformation changes the Lebesgue measures of sets, λ_k(ψ_{x_i} Q_i^ε) = |J(x_i)| λ_k(Q_i^ε). If (17.18) holds, then

(17.19) λ_k(TA) = Σ_i λ_k(TQ_i) ≤ Σ_i λ_k(ψ_{x_i} Q_i^ε) = Σ_i |J(x_i)| λ_k(Q_i^ε) = (1 + ε)^k Σ_i |J(x_i)| λ_k(Q_i).
(This, the central step in the proof, shows where the Jacobian in (17.10) comes from.) If for each ε there is a δ such that (17.15) implies both (17.16) and (17.19), then (17.14) will follow. Thus everything depends on (17.18), and the remaining problem is to show that for each ε there is a δ such that (17.18) holds if (17.15) does.

Proof of (17.18). As (x, z) varies over the compact set A⁻ × [z: |z| = 1], |D_x⁻¹ z| is continuous, and therefore, for some c,

(17.20) |D_x⁻¹ z| ≤ c |z|, x ∈ A⁻.

Since the t_{jl} are uniformly continuous on A⁻, δ can be chosen so that |t_{jl}(z) − t_{jl}(x)| < ε/kc for all j, l if z, x ∈ A⁻ and |z − x| < δ. But then, by linear approximation [A34: (16)], |Tz − Tx − D_x(z − x)| ≤ εc⁻¹ |z − x| < εc⁻¹ δ. If (17.15) holds and δ ≤ 1, then by the definition (17.17),

(17.21) |Tz − ψ_{x_i} z| < ε/c for z ∈ Q_i.
To prove (17.18), note that z ∈ Q_i implies

|ψ_{x_i}⁻¹ Tz − z| = |D_{x_i}⁻¹(Tz − ψ_{x_i} z)| ≤ c |Tz − ψ_{x_i} z| < ε,

where the first inequality follows by (17.20) and the second by (17.21). Since ψ_{x_i}⁻¹ Tz is within ε of the point z of Q_i, it lies in Q_i^ε: ψ_{x_i}⁻¹ Tz ∈ Q_i^ε, or Tz ∈ ψ_{x_i} Q_i^ε. Hence (17.18) holds, which completes the proof. ∎
Stieltjes Integrals

Suppose that F is a function on R^k satisfying the hypotheses of Theorem 12.5, so that there exists a measure μ such that μ(A) = Δ_A F for bounded rectangles A. In integrals with respect to μ, μ(dx) is often replaced by dF(x):

(17.22) ∫_A f(x) dF(x) = ∫_A f(x) μ(dx).

The left side of this equation is the Stieltjes integral of f with respect to F; since it is defined by the right side of the equation, nothing new is involved. Suppose that f is uniformly continuous on a rectangle A, and suppose that A is decomposed into rectangles A_m small enough that |f(x) − f(y)| < ε/μ(A) for x, y ∈ A_m. Then

|∫_A f(x) dF(x) − Σ_m f(x_m) Δ_{A_m} F| < ε

for x_m ∈ A_m. In this case the left side of (17.22) can be defined as the limit of the approximating sums Σ_m f(x_m) Δ_{A_m} F. […]

17.6. … λ[x: |x| ≥ a, |f(x)| ≥ ε] → 0 as a → ∞. Show by example that f(x) need not go to 0 as x → ∞ (even if f is continuous).
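The approximating sums below (17.22) are easy to compute. A minimal sketch under my own choice of data: F(x) = x² on [0, 1], whose associated measure has density 2x, so the Stieltjes integral of f(x) = x is ∫₀¹ x · 2x dx = 2/3; the Riemann-Stieltjes sums Σ_m f(x_m)(F(b_m) − F(a_m)) converge to that value.

```python
def stieltjes_sum(f, F, a, b, n):
    # sum over a decomposition of (a, b] into n equal subintervals of
    # f(x_m) * (F(b_m) - F(a_m)), with x_m the midpoint of each piece
    total = 0.0
    for m in range(n):
        am = a + (b - a) * m / n
        bm = a + (b - a) * (m + 1) / n
        xm = 0.5 * (am + bm)          # any point of [am, bm] works in the limit
        total += f(xm) * (F(bm) - F(am))
    return total

approx = stieltjes_sum(lambda x: x, lambda x: x * x, 0.0, 1.0, 20000)
assert abs(approx - 2.0 / 3.0) < 1e-6
```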
17.7. Let f_n(x) = x^{n−1} − 2x^{2n−1}. Calculate and compare ∫₀¹ Σ_n f_n(x) dx and Σ_n ∫₀¹ f_n(x) dx. Relate this to Theorem 16.6 and to the corollary to Theorem 16.7.

17.8. Show that (1 + y²)⁻¹ has equal integrals over (−∞, −1), (−1, 0), (0, 1), (1, ∞). Conclude from (17.9) that ∫₀¹ (1 + y²)⁻¹ dy = π/4. Expand the integrand in a geometric series and deduce Leibniz's formula

π/4 = 1 − 1/3 + 1/5 − 1/7 + ⋯

by Theorem 16.7 (note that its corollary does not apply).

17.9. Show that if f is integrable, there exist continuous, integrable functions g_n such that g_n(x) → f(x) except on a set of Lebesgue measure 0. (Use Theorem 17.1(ii) with ε = n⁻².)

17.10. 13.9 17.9 ↑ Let f be a finite-valued Borel function over [0, 1]. By the following steps, prove Lusin's theorem: For each ε there exists a continuous function g such that λ[x ∈ (0, 1): f(x) ≠ g(x)] < ε.
(a) Show that f may be assumed integrable, or even bounded.
(b) Let g_n be continuous functions converging to f almost everywhere. Combine Egoroff's theorem and Theorem 12.3 to show that convergence is uniform on a compact set K such that λ((0, 1) − K) < ε. The limit lim_n g_n(x) = f(x) must be continuous when restricted to K.
(c) Exhibit (0, 1) − K as a disjoint union of open intervals I_k [A12], define g as f on K, and define it by linear interpolation on each I_k.
17.11. Suppose in (17.7) that T′ exists and is continuous and f is a Borel function, and suppose that ∫_a^b |f(Tx) T′(x)| dx < ∞. Show in steps that ∫_{T(a,b)} |f(y)| dy < ∞ and (17.8) holds. Prove this for (a) f continuous, (b) f the indicator of an interval, (c) f = I_B, (d) f simple, (e) f ≥ 0, (f) f general.

17.12. 16.12 ↑ Let 𝓛 consist of the continuous functions on R¹ with compact support. Show that 𝓛 is a vector lattice in the sense of Problem 11.4 and has the property that f ∈ 𝓛 implies f ∧ 1 ∈ 𝓛 (note that 1 ∉ 𝓛). Show that the σ-field 𝓕 generated by 𝓛 is 𝓡¹. Suppose Λ is a positive linear functional on 𝓛; show that Λ has the required continuity property if and only if f_n(x) ↓ 0 uniformly in x implies Λ(f_n) → 0. Show under this assumption on Λ that there is a measure μ on 𝓡¹ such that

(17.25) Λ(f) = ∫f dμ, f ∈ 𝓛.

Show that μ is σ-finite and unique. This is a version of the Riesz representation theorem.

17.13. ↑ Let Λ(f) be the Riemann integral of f, which does exist for f in 𝓛. Using the most elementary facts about Riemann integration, show that the μ determined by (17.25) is Lebesgue measure. This gives still another way of constructing Lebesgue measure.

17.14. ↑ Extend the ideas in the preceding two problems to R^k.
SECTION 18. PRODUCT MEASURE AND FUBINI'S THEOREM
Let (X, 𝓧) and (Y, 𝓨) be measurable spaces. For given measures μ and ν on these spaces, the problem is to construct on the Cartesian product X × Y a product measure π such that π(A × B) = μ(A)ν(B) for A ∈ 𝓧 and B ∈ 𝓨. In the case where μ and ν are Lebesgue measure on the line, π will be Lebesgue measure in the plane. The main result is Fubini's theorem, according to which double integrals can be calculated as iterated integrals.
Product Spaces

It is notationally convenient in this section to change from (Ω, 𝓕) to (X, 𝓧) and (Y, 𝓨). In the product space X × Y, a measurable rectangle is a product A × B for which A ∈ 𝓧 and B ∈ 𝓨. The natural class of sets in X × Y to consider is the σ-field 𝓧 × 𝓨 generated by the measurable rectangles. (Of course, 𝓧 × 𝓨 is not a Cartesian product in the usual sense.)

Example 18.1. Suppose that X = Y = R¹ and 𝓧 = 𝓨 = 𝓡¹. Then a measurable rectangle is a Cartesian product A × B in which A and B are linear Borel sets. The term rectangle has up to this point been reserved for Cartesian products of intervals, and so a measurable rectangle is more general. As the measurable rectangles do include the ordinary ones and the latter generate 𝓡², it follows that 𝓡² ⊂ 𝓡¹ × 𝓡¹. On the other hand, if A is an interval, [B: A × B ∈ 𝓡²] contains R¹ (A × R¹ = ∪_n (A × (−n, n]) ∈ 𝓡²) and is closed under the formation of proper differences and countable unions; thus it is a σ-field containing the intervals and hence the Borel sets. Therefore, if B is a Borel set, [A: A × B ∈ 𝓡²] contains the intervals and hence, being a σ-field, contains the Borel sets. Thus all the measurable rectangles are in 𝓡², and so 𝓡¹ × 𝓡¹ = 𝓡² consists exactly of the two-dimensional Borel sets. ∎

As this example shows, 𝓧 × 𝓨 is in general much larger than the class of measurable rectangles.
Theorem 18.1. (i) If E ∈ 𝓧 × 𝓨, then for each x the set [y: (x, y) ∈ E] lies in 𝓨 and for each y the set [x: (x, y) ∈ E] lies in 𝓧.
(ii) If f is measurable 𝓧 × 𝓨, then for each fixed x the function f(x, ·) is measurable 𝓨, and for each fixed y the function f(·, y) is measurable 𝓧.

The set [y: (x, y) ∈ E] is the section of E determined by x, and f(x, ·) is the section of f determined by x.

PROOF. Fix x, and consider the mapping T_x: Y → X × Y defined by T_x y = (x, y). If E = A × B is a measurable rectangle, T_x⁻¹ E is B or ∅ according as A contains x or not, and in either case T_x⁻¹ E ∈ 𝓨. By Theorem 13.1(i), T_x is measurable 𝓨/𝓧 × 𝓨. Hence [y: (x, y) ∈ E] = T_x⁻¹ E ∈ 𝓨 for E ∈ 𝓧 × 𝓨. By Theorem 13.1(ii), if f is measurable 𝓧 × 𝓨/𝓡¹, then f T_x is measurable 𝓨/𝓡¹. Hence f(x, ·) = f T_x(·) is measurable 𝓨. The symmetric statements for fixed y are proved the same way. ∎
Product Measure

Now suppose that (X, 𝓧, μ) and (Y, 𝓨, ν) are measure spaces, and suppose for the moment that μ and ν are finite. By the theorem just proved, ν[y: (x, y) ∈ E] is a well-defined function of x. If 𝓛 is the class of E in 𝓧 × 𝓨 for which this function is measurable 𝓧, it is not hard to show that 𝓛 is a λ-system. Since the function is I_A(x)ν(B) for E = A × B, 𝓛 contains the π-system consisting of the measurable rectangles. Hence 𝓛 coincides with 𝓧 × 𝓨 by the π-λ theorem. It follows without difficulty that

(18.1) π′(E) = ∫_X ν[y: (x, y) ∈ E] μ(dx), E ∈ 𝓧 × 𝓨,

is a finite measure on 𝓧 × 𝓨, and similarly for

(18.2) π″(E) = ∫_Y μ[x: (x, y) ∈ E] ν(dy), E ∈ 𝓧 × 𝓨.

For measurable rectangles,

(18.3) π′(A × B) = π″(A × B) = μ(A) · ν(B).

The class of E in 𝓧 × 𝓨 for which π′(E) = π″(E) thus contains the measurable rectangles; since this class is a λ-system, it contains 𝓧 × 𝓨. The common value π′(E) = π″(E) is the product measure sought.

To show that (18.1) and (18.2) also agree for σ-finite μ and ν, let {A_m} and {B_n} be decompositions of X and Y into sets of finite measure, and put μ_m(A) = μ(A ∩ A_m) and ν_n(B) = ν(B ∩ B_n). Since ν(B) = Σ_n ν_n(B), the integrand in (18.1) is measurable 𝓧 in the σ-finite as well as in the finite case; hence π′ is a well-defined measure on 𝓧 × 𝓨, and so is π″. If π′_{mn} and π″_{mn} are (18.1) and (18.2) for μ_m and ν_n, then by the finite case already treated, π′(E) = Σ_{mn} π′_{mn}(E) = Σ_{mn} π″_{mn}(E) = π″(E) (see Example 16.5). Thus (18.1) and (18.2) coincide in the σ-finite case as well. Moreover, π′(A × B) = Σ_{mn} μ_m(A) ν_n(B) = μ(A) ν(B).

Theorem 18.2. If (X, 𝓧, μ) and (Y, 𝓨, ν) are σ-finite measure spaces, π(E) = π′(E) = π″(E) defines a σ-finite measure on 𝓧 × 𝓨; it is the only measure such that π(A × B) = μ(A) · ν(B) for measurable rectangles.
PROOF. Only σ-finiteness and uniqueness remain to be proved. The products A_m × B_n for {A_m} and {B_n} as above decompose X × Y into measurable rectangles of finite π-measure. This proves both σ-finiteness and uniqueness, since the measurable rectangles form a π-system generating 𝓧 × 𝓨 (Theorem 10.3). ∎

The π thus defined is called product measure; it is usually denoted μ × ν. Note that the integrands in (18.1) and (18.2) may be infinite for certain x and y, which is one reason for introducing functions with infinite values. Note also that (18.3) in some cases requires the conventions (15.2).
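For discrete μ and ν the construction of Theorem 18.2 can be carried out directly. A small sanity check (the weights are arbitrary choices of mine): the set function π(E) = Σ_{(x,y)∈E} μ{x}ν{y} satisfies π(A × B) = μ(A)ν(B), and the two iterated formulas (18.1) and (18.2) agree on non-rectangles as well.

```python
from itertools import product

X, Y = [0, 1, 2], [0, 1]
mu = {0: 0.5, 1: 1.5, 2: 2.0}
nu = {0: 0.25, 1: 0.75}

def pi(E):
    # E is a set of pairs (x, y); pi is the product measure on X x Y
    return sum(mu[x] * nu[y] for (x, y) in E)

A, B = {0, 2}, {1}
rect = set(product(A, B))
assert abs(pi(rect) - sum(mu[x] for x in A) * sum(nu[y] for y in B)) < 1e-12

E = {(0, 0), (1, 1), (2, 0)}  # not a measurable rectangle
pi1 = sum(mu[x] * sum(nu[y] for y in Y if (x, y) in E) for x in X)  # like (18.1)
pi2 = sum(nu[y] * sum(mu[x] for x in X if (x, y) in E) for y in Y)  # like (18.2)
assert abs(pi1 - pi2) < 1e-12
```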
Fubini's Theorem

Integrals with respect to π are usually computed via the formulas

(18.4) ∫_{X×Y} f(x, y) π(d(x, y)) = ∫_X [∫_Y f(x, y) ν(dy)] μ(dx)

and

(18.5) ∫_{X×Y} f(x, y) π(d(x, y)) = ∫_Y [∫_X f(x, y) μ(dx)] ν(dy).

The left sides here are double integrals, and the right sides are iterated integrals. The formulas hold very generally, as the following argument shows.
Consider (18.4). The inner integral on the right is

(18.6) ∫_Y f(x, y) ν(dy).

Because of Theorem 18.1(ii), for f measurable 𝓧 × 𝓨 the integrand here is measurable 𝓨; the question is whether the integral exists, whether (18.6) is measurable 𝓧 as a function of x, and whether it integrates to the left side of (18.4).

First consider nonnegative f. If f = I_E, everything follows from Theorem 18.2: (18.6) is ν[y: (x, y) ∈ E], and (18.4) reduces to π(E) = π′(E). Because of linearity (Theorem 15.1(iv)), if f is a nonnegative simple function, then (18.6) is a linear combination of functions measurable 𝓧 and hence is itself measurable 𝓧; further application of linearity to the two sides of (18.4) shows that (18.4) again holds. The general nonnegative f is the monotone limit of nonnegative simple functions; applying the monotone convergence theorem to (18.6) and then to each side of (18.4) shows that again f has the properties required. Thus for nonnegative f, (18.6) is a well-defined function of x (the value ∞ is not excluded), measurable 𝓧, whose integral satisfies (18.4). If one side of
(18.4) is infinite, so is the other; if both are finite, they have the same finite value.

Now suppose that f, not necessarily nonnegative, is integrable with respect to π. Then the two sides of (18.4) are finite if f is replaced by |f|. Now make the further assumption that

(18.7) ∫_Y |f(x, y)| ν(dy) < ∞

for all x. Then

(18.8) ∫_Y f(x, y) ν(dy) = ∫_Y f⁺(x, y) ν(dy) − ∫_Y f⁻(x, y) ν(dy).

The functions on the right here are measurable 𝓧 and (since f⁺, f⁻ ≤ |f|) integrable with respect to μ, and so the same is true of the function on the left. Integrating out the x and applying (18.4) to f⁺ and to f⁻ gives (18.4) for f itself.

The set A₀ of x satisfying (18.7) need not coincide with X, but μ(X − A₀) = 0 if f is integrable with respect to π, because the function in (18.7) integrates to ∫|f| dπ (Theorem 15.2(iii)). Now (18.8) holds on A₀, (18.6) is measurable 𝓧 on A₀, and (18.4) again follows if the inner integral on the right is given some arbitrary constant value on X − A₀. The same analysis applies to (18.5):

Theorem 18.3. Under the hypotheses of Theorem 18.2, for nonnegative f the functions

(18.9) ∫_Y f(x, y) ν(dy), ∫_X f(x, y) μ(dx)

are measurable 𝓧 and 𝓨, respectively, and (18.4) and (18.5) hold. If f (not necessarily nonnegative) is integrable with respect to π, then the two functions (18.9) are finite and measurable on A₀ and on B₀, respectively, where μ(X − A₀) = ν(Y − B₀) = 0, and again (18.4) and (18.5) hold.
It is understood here that the inner integrals on the right in (18.4) and (18.5) are set equal to 0 (say) outside A₀ and B₀.† This is Fubini's theorem; the part concerning nonnegative f is sometimes called Tonelli's theorem. Application of the theorem usually follows a two-step procedure that parallels its proof. First, one of the iterated integrals is computed (or estimated above) with |f| in place of f. If the result is finite,

† Since two functions that are equal almost everywhere have the same integral, the theory of integration could be extended to functions that are only defined almost everywhere; then A₀ and B₀ would disappear from Theorem 18.3.
then the double integral (the integral with respect to π) of |f| must be finite, so that f is integrable with respect to π; then the value of the double integral of f is found by computing one of the iterated integrals of f. If the iterated integral of |f| is infinite, f is not integrable π.

Example 18.2. Let D_r be the closed disk in the plane with center at the origin and radius r. By (17.12),
λ₂(D_r) = ∫∫_{D_r} dx dy = ∫∫_{0<ρ≤r, 0<θ<2π} ρ dρ dθ = πr².
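The two-step Fubini procedure can be rehearsed numerically. A sketch under my own choice of integrand and grid: f(x, y) = x e^{−xy} on the finite-measure rectangle [0, 1] × [0, 2] is bounded, so the |f| check is immediate, and the two iterated integrals must agree (here the exact value is ∫₀¹ (1 − e^{−2x}) dx = 1 − (1 − e^{−2})/2).

```python
import math

def midpoint(h, a, b, n=1000):
    # midpoint rule for the integral of h over (a, b)
    step = (b - a) / n
    return sum(h(a + (i + 0.5) * step) for i in range(n)) * step

f = lambda x, y: x * math.exp(-x * y)

# iterate y first, then x (as in (18.4)), and in the other order (as in (18.5))
xy_order = midpoint(lambda x: midpoint(lambda y: f(x, y), 0.0, 2.0), 0.0, 1.0)
yx_order = midpoint(lambda y: midpoint(lambda x: f(x, y), 0.0, 1.0), 0.0, 2.0)

exact = 1.0 - (1.0 - math.exp(-2.0)) / 2.0
assert abs(xy_order - yx_order) < 1e-5
assert abs(xy_order - exact) < 1e-5
```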
PROOF. Suppose that p < ∞, and choose {n_k} so that 2^{pk} ‖f_m − f_n‖_p^p < 2^{−k} for m, n ≥ n_k. Since ∫|f_m − f_n|^p dμ ≥ α^p μ[|f_m − f_n| ≥ α] (this is just a general version of Markov's inequality (5.31)), μ[|f_n − f_m| ≥ 2^{−k}] ≤ 2^{pk} ‖f_m − f_n‖_p^p < 2^{−k} for m, n ≥ n_k. Therefore, Σ_k μ[|f_{n_{k+1}} − f_{n_k}| ≥ 2^{−k}] < ∞, and it follows by the first Borel-Cantelli lemma (which works for arbitrary measures) that, outside a set of μ-measure 0, Σ_k |f_{n_{k+1}} − f_{n_k}| converges. But then f_{n_k} converges to some f almost everywhere, and by Fatou's lemma, ∫|f − f_{n_k}|^p dμ ≤ lim inf_j ∫|f_{n_j} − f_{n_k}|^p dμ ≤ 2^{−k}. Therefore, f ∈ L^p and ‖f − f_{n_k}‖_p → 0, as required.

If p = ∞, choose {n_k} so that ‖f_m − f_n‖_∞ < 2^{−k} for m, n ≥ n_k. Since |f_{n_{k+1}} − f_{n_k}| < 2^{−k} almost everywhere, f_{n_k} converges to some f, and |f − f_{n_k}| ≤ 2^{−k+1} almost everywhere. Again, ‖f − f_{n_k}‖_p → 0. ∎
The next theorem has to do with separability.
Theorem 19.2. (i) Let U be the set of simple functions Σ_{i=1}^k a_i I_{B_i} for a_i finite and μ(B_i) finite. For 1 ≤ p < ∞, U is dense in L^p.
(ii) If μ is σ-finite and 𝓕 is countably generated, and if p < ∞, then L^p is separable.
PROOF of (i). Suppose first that p < ∞. For f ∈ L^p, choose (Theorem 13.5) simple functions f_n such that f_n → f and |f_n| ≤ |f|. Then f_n ∈ L^p, and by the dominated convergence theorem, ∫|f − f_n|^p dμ → 0. Therefore, ‖f − f_n‖_p is small for some n; but each f_n is in U. As for the case p = ∞, […]

For f ∈ L^p and g ∈ L^q, the functional γ(f) = ∫fg dμ satisfies |γ(f)| ≤ ‖f‖_p ‖g‖_q and is obviously linear. According to the Riesz representation theorem, this is the most general bounded linear functional in the case p < ∞:

Theorem 19.3. Suppose that μ is σ-finite, that 1 ≤ p < ∞, and that q is conjugate to p. Every bounded linear functional γ on L^p has the form (19.8),

γ(f) = ∫fg dμ,

for some g ∈ L^q; further, (19.9) holds, in the sense that the smallest M satisfying (19.7) is ‖g‖_q, and g is unique up to a set of μ-measure 0.
PROOF. Suppose first that μ is finite, and for A ∈ 𝓕 put φ(A) = γ(I_A). For disjoint A_n, linearity gives φ(∪_{n≤N} A_n) = Σ_{n≤N} φ(A_n), and since |φ(∪_{n>N} A_n)| ≤ M μ^{1/p}(∪_{n>N} A_n) → 0, it follows that φ is an additive set function in the sense of (32.1). The Jordan decomposition (32.2) represents φ as the difference of two finite measures φ⁺ and φ⁻ with disjoint supports A⁺ and A⁻. If μ(A) = 0, then φ⁺(A) = φ(A ∩ A⁺) ≤ M μ^{1/p}(A) = 0. Thus φ⁺ is absolutely continuous with respect to μ and by the Radon-Nikodym theorem (p. 422) has an integrable density g⁺: φ⁺(A) = ∫_A g⁺ dμ. Together with the same result for φ⁻, this shows that there is an integrable g such that γ(I_A) = φ(A) = ∫_A g dμ = ∫ I_A g dμ. Thus γ(f) = ∫fg dμ for simple functions f in L^p.

Assume for the moment that this g lies in L^q, and define γ_g by the equation (19.8). Then γ and γ_g are bounded linear functionals that agree for simple functions; since the latter are dense (Theorem 19.2(i)), it follows by the continuity of γ and γ_g that they agree on all of L^p. It is therefore enough (in the case of finite μ) to prove g ∈ L^q. It will also be shown that ‖g‖_q is at most the M of (19.7); since ‖g‖_q does work as a bound in (19.7), (19.9) will follow. If γ(f) ≡ 0, (19.9) will imply that g = 0 almost everywhere, and for the general γ it will follow further that two functions g satisfying (19.8) must agree almost everywhere.

Suppose that 1 < p < ∞. Let g_n be simple functions such that 0 ≤ g_n ↑ |g|, and take h_n = g_n^{q/p} sgn g. Then h_n g ≥ g_n^{q/p} g_n = g_n^q, and since h_n is simple, it follows that ∫ g_n^q dμ ≤ ∫ h_n g dμ = γ(h_n) ≤ M ‖h_n‖_p = M (∫ g_n^q dμ)^{1/p}; this gives (∫ g_n^q dμ)^{1/q} ≤ M. Now the monotone convergence theorem gives g ∈ L^q and even ‖g‖_q ≤ M.

Suppose that p = 1, so that q = ∞. In this case |γ(f)| = |∫fg dμ| ≤ M ‖f‖₁ for simple functions f in L¹. Take f = I_{[|g| ≥ a]} sgn g. Then a μ[|g| ≥ a] ≤ ∫_{[|g| ≥ a]} |g| dμ = γ(f) ≤ M ‖f‖₁ = M μ[|g| ≥ a]. If a > M, this inequality gives μ[|g| ≥ a] = 0; therefore ‖g‖_∞ ≤ M and g ∈ L^∞ = L^q.

Case μ σ-finite. Let A_n be sets such that A_n ↑ Ω and μ(A_n) < ∞. If μ_n(A) = μ(A ∩ A_n), then |γ(f I_{A_n})| ≤ M ‖f I_{A_n}‖_p ≤ M ‖f‖_p for f ∈ L^p(μ). By the finite case, A_n supports a g_n in L^q such that γ(f I_{A_n}) = ∫ f I_{A_n} g_n dμ for f ∈ L^p and ‖g_n‖_q ≤ M. Because of uniqueness, g_{n+1} can be taken to agree with g_n on A_n. There is therefore a function g on Ω such that g = g_n on A_n and ‖g‖_q ≤ M; it follows that g ∈ L^q. By the dominated convergence theorem, ‖f − f I_{A_n}‖_p → 0 for f ∈ L^p, and the continuity of γ implies γ(f) = lim_n γ(f I_{A_n}) = lim_n ∫ f I_{A_n} g dμ = ∫ fg dμ. Uniqueness follows as before. ∎

† Problem 19.3.
Weak Compactness

For f ∈ L^p and g ∈ L^q, where p and q are conjugate, write

(19.10) (f, g) = ∫fg dμ.

For fixed f in L^p, this defines a bounded linear functional on L^q; for fixed g in L^q, it defines a bounded linear functional on L^p. By Hölder's inequality,

(19.11) |(f, g)| ≤ ‖f‖_p ‖g‖_q.
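Inequality (19.11) is easy to spot-check numerically. A minimal sketch with μ approximated by the uniform measure on a midpoint grid of [0, 1]; the conjugate pair p = 3, q = 3/2 and the particular functions are arbitrary choices of mine. (For the discretized measure the inequality holds exactly, since it is just the finite-dimensional Hölder inequality.)

```python
import math

n = 100000
pts = [(i + 0.5) / n for i in range(n)]
f = [math.exp(x) for x in pts]
g = [1.0 / math.sqrt(x + 0.1) for x in pts]

p, q = 3.0, 1.5
assert abs(1 / p + 1 / q - 1.0) < 1e-12                    # conjugate exponents

pairing = abs(sum(a * b for a, b in zip(f, g)) / n)        # |(f, g)|
norm_p = (sum(abs(a) ** p for a in f) / n) ** (1 / p)      # ||f||_p
norm_q = (sum(abs(b) ** q for b in g) / n) ** (1 / q)      # ||g||_q

assert pairing <= norm_p * norm_q
```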
LP. (f, g) = g) f fn weakl g fn converges --> fnl i I ! y P f, 1, = [fELP : 1 / I P < ] Suppose that is uf i n ite and :Y is countably generated. If oo, then eve1y sequence in B f contains a subsequence converging weakly 1to in S, choose rules o) = x ( · ) is an element of L"", in fact of 8 �, and so by Theorem 19.4 there is a subsequence along which, for each j = 1, . . . , k, oJ">
Then ∫_A δ_j dμ ≥ 0, and ∫(1 − Σ_j δ_j) I_A dμ = lim_n ∫(1 − Σ_j δ_j^{(n)}) I_A dμ = 0, so that δ_j ≥ 0 and Σ_j δ_j = 1 almost everywhere. Since μ is σ-finite, the δ_j can be altered on a set of μ-measure 0 in such a way as to ensure that δ = (δ₁, …, δ_k) is an element of D. But, along the subsequence, the risk points R(δ^{(n)}) converge to R(δ). Therefore:
The set i s compact and vex. conThe rest is geometry. For x in Rk , let Qx be the set of x' such that 0 <x; <x1 for all If x R(o) and x' = R(o'), then o' is better than o if and only if x ' E Qx and is admissible if there exists no better t han o; it makes no sense to x'use a ruleA rule that is not admissible. Geometrically, admissibility means that. for x = R(o), Q x consists of x alone. i. * X.
=
•
-->
0
•
•
01
Sn
>
x
Let = R(o) be given, and suppose that o is not admissible. Since S n Qx is compact, it contains a po int nearest the origin unique, since S n Qx is convex as well as compact); let o' be a conesponding rule: = R(o'). Since o is not admissible, it would x' and o' is better than o. If S n Q x · contained a point distinct from ' be a point of S n Q, nearer the origin than x , which is impossible. This means that Q x · contains no point of S other than x ' itself, which means in turn that 8' is admissible. Therefore, if o is not itself admissible, there is a o ' that is admissible and is bette1 than o . This is expressed by saying that the class of admissible rules is
x'
=l=x,
(x'x'
x',
complete.

Let p = (p_1, ..., p_k) be a probability vector, and view p_i as an a priori probability that f_i is the correct density. A rule δ has Bayes risk R(p, δ) = Σ_i p_i R_i(δ) with respect to p. This is a kind of compound risk: f_i is correct with probability p_i, and the statistician chooses f_j with probability δ_j(ω). A Bayes rule is one that minimizes the Bayes risk for a given p. In this case, take a = R(p, δ) and consider the hyperplane

(19.12) H = [z: Σ_i p_i z_i = a]

and the half space

(19.13) H⁺ = [z: Σ_i p_i z_i ≥ a].

Then x = R(δ) lies on H, and S is contained in H⁺: x is on the boundary of S, and
SECTION 19. THE L^p SPACES 249
H is a supporting hyperplane. If p_i > 0 for all i, then Q_x meets S only at x, and so δ is admissible.

Suppose now that δ is admissible, so that x = R(δ) is the only point in S ∩ Q_x and x lies on the boundary of S. The problem is to show that δ is a Bayes rule, which means finding a supporting hyperplane (19.12) corresponding to a probability vector p. Let T consist of those y for which Q_y meets S. Then T is convex: given a convex combination y'' = Ay + A'y' of points in T, choose in S points z and z' southwest of y and y', respectively, and note that z'' = Az + A'z' lies in S and is southwest of y''. Since S meets Q_x only in the point x, the same is true of T, so that x is a boundary point of T as well as of S. Let (19.12) (p ≠ 0) be a supporting hyperplane through x: x ∈ H and T ⊂ H⁺. If p_i < 0, take z_i = x_i + 1 and take z_j = x_j for the other j; then z lies in T but not in H⁺, a contradiction. (The right-hand figure shows the role of T: the planes H_1 and H_2 both support S, but only H_2 supports T, and only H_2 corresponds to a probability vector.) Thus p_i ≥ 0 for all i, and since Σ_i p_i = 1 can be arranged by normalization, δ is indeed a Bayes rule. Therefore: The admissible rules are Bayes rules, and they form a complete class.

The Space L²
The space L² is special because p = 2 is its own conjugate index. If f, g ∈ L², the inner product (f, g) = ∫ fg dμ is well defined, and by (19.11), |(f, g)| ≤ ||f|| ||g|| (write ||f|| in place of ||f||_2). This is the Schwarz (or Cauchy-Schwarz) inequality. If one of f and g is fixed, (f, g) is a bounded (hence continuous) linear functional in the other. Further, the norm is given by ||f|| = (f, f)^{1/2}, and L² is complete under the metric ||f − g||. A Hilbert space is a vector space on which is defined an inner product having all these properties. The Hilbert space L² is quite like Euclidean space. If (f, g) = 0, then f and g are orthogonal, and orthogonality is like perpendicularity. If f_1, ..., f_n are orthogonal (in pairs), then by linearity, (Σ_i f_i, Σ_j f_j) = Σ_i Σ_j (f_i, f_j) = Σ_i (f_i, f_i): ||Σ_i f_i||² = Σ_i ||f_i||². This is a version of the Pythagorean theorem. If f and g are orthogonal, write f ⊥ g. For every f, f ⊥ 0.

Suppose now that μ is σ-finite and ℱ is countably generated, so that L² is separable as a metric space. The construction that follows gives a (finite or infinite) sequence φ_1, φ_2, ... that is orthonormal in the sense that ||φ_n|| = 1 for all n and (φ_m, φ_n) = 0 for m ≠ n, and is complete in the sense that (f, φ_n) = 0 for all n implies f = 0, so that the orthonormal system cannot be enlarged. Start with a sequence f_1, f_2, ... that is dense in L². Define g_n inductively: Let g_1 = f_1. Suppose that g_1, ..., g_n have been defined and are orthogonal. Define g_{n+1} = f_{n+1} − Σ_{i=1}^n a_{ni} g_i, where a_{ni} = (f_{n+1}, g_i)/||g_i||² if g_i ≠ 0 and a_{ni} is arbitrary if g_i = 0. Then g_{n+1} is orthogonal to g_1, ..., g_n, and g_{n+1} = 0 if and only if f_{n+1} is a linear combination of g_1, ..., g_n. This, the Gram-Schmidt method, gives an orthogonal sequence g_1, g_2, ... with the property that the finite linear combinations of the g_n include all the f_n and are therefore dense in L². If g_n ≠ 0, take φ_n = g_n/||g_n||; if g_n = 0, discard it from the sequence. Then φ_1, φ_2, ... is orthonormal, and the finite linear combinations of the φ_n are still dense. It can happen that all but finitely many of the g_n are 0, in which
case there are only finitely many of the φ_n. In what follows it is assumed that φ_1, φ_2, ... is an infinite sequence; the finite case is analogous and somewhat simpler.

Suppose that f is orthogonal to all the φ_n. If a_1, ..., a_n are arbitrary scalars, then f, a_1 φ_1, ..., a_n φ_n is an orthogonal set, and by the Pythagorean property, ||f − Σ_{i=1}^n a_i φ_i||² = ||f||² + Σ_{i=1}^n a_i² ≥ ||f||². If ||f|| > 0, then f cannot be approximated by finite linear combinations of the φ_n, a contradiction: φ_1, φ_2, ... is a complete orthonormal system.

Consider now a sequence a_1, a_2, ... of scalars for which Σ_{i=1}^∞ a_i² converges. If s_n = Σ_{i=1}^n a_i φ_i, then the Pythagorean theorem gives ||s_n − s_m||² = Σ_{i=m+1}^n a_i². Since the scalar series converges, {s_n} is fundamental and therefore by Theorem 19.1 converges to some g in L². Thus g = lim_n Σ_{i=1}^n a_i φ_i, which it is natural to express as g = Σ_{i=1}^∞ a_i φ_i. The series (that is to say, the sequence of partial sums) converges to g in the mean of order 2 (not almost everywhere). By the following argument, every element of L² has a unique representation in this form.

The Fourier coefficients of f with respect to {φ_n} are the inner products a_i = (f, φ_i). For each n, 0 ≤ ||f − Σ_{i=1}^n a_i φ_i||² = ||f||² − 2 Σ_i a_i (f, φ_i) + Σ_i a_i² = ||f||² − Σ_{i=1}^n a_i², and hence, n being arbitrary, Σ_{i=1}^∞ a_i² ≤ ||f||². By the argument above, the series Σ_{i=1}^∞ a_i φ_i therefore converges. By linearity, (f − Σ_{i=1}^n a_i φ_i, φ_j) = 0 for n ≥ j, and by continuity, (f − Σ_{i=1}^∞ a_i φ_i, φ_j) = 0. Therefore, f − Σ_{i=1}^∞ a_i φ_i is orthogonal to each φ_j and by completeness must
by linearity, so that f − P_M f ⊥ φ_j by continuity. But if f − P_M f is orthogonal to each φ_j, then, again by linearity and continuity, it is orthogonal to the general element Σ_j b_j φ_j of M. Therefore, P_M f ∈ M and f − P_M f ⊥ M. The map f → P_M f is the orthogonal projection on M.
The fundamental properties of P_M are these:

(i) g ∈ M and f − g ⊥ M together imply g = P_M f;
(ii) f ∈ M implies P_M f = f;
(iii) g ∈ M implies ||f − g|| ≥ ||f − P_M f||;
(iv) P_M(af + a'f') = a P_M f + a' P_M f'.
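The projection and its basic properties can be sketched numerically. The following is an illustrative Python sketch, not part of the text's apparatus: vectors in R^5 stand in for elements of L², M is the span of two of them, and the Gram-Schmidt recipe above builds an orthonormal basis; all names here are invented for the illustration.

```python
import random

# Finite-dimensional sketch of P_M: Gram-Schmidt, then
# P_M f = sum over j of the Fourier coefficients (f, phi_j) times phi_j.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def gram_schmidt(vectors):
    basis = []
    for v in vectors:
        g = v
        for phi in basis:
            c = dot(g, phi)
            g = [a - c * b for a, b in zip(g, phi)]  # subtract projection on phi
        norm = dot(g, g) ** 0.5
        if norm > 1e-12:          # g = 0 means v was a linear combination
            basis.append([a / norm for a in g])
    return basis

def project(f, basis):
    pf = [0.0] * len(f)
    for phi in basis:
        c = dot(f, phi)
        pf = [a + c * b for a, b in zip(pf, phi)]
    return pf

random.seed(0)
basis = gram_schmidt([[random.gauss(0, 1) for _ in range(5)] for _ in range(2)])
f = [random.gauss(0, 1) for _ in range(5)]
pf = project(f, basis)
residual = [a - b for a, b in zip(f, pf)]

# f - P_M f is orthogonal to M, and P_M is idempotent (cf. (i) and (ii)):
assert all(abs(dot(residual, phi)) < 1e-10 for phi in basis)
assert all(abs(a - b) < 1e-10 for a, b in zip(project(pf, basis), pf))
```

The two assertions check exactly the conditions that, by property (i), determine P_M f uniquely.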
Property (i) says that P_M f is uniquely determined by the two conditions P_M f ∈ M and f − P_M f ⊥ M. To prove it, suppose that g, g' ∈ M, f − g ⊥ M, and f − g' ⊥ M. Then g − g' ∈ M and g − g' ⊥ M, so that g − g' is orthogonal to itself and hence ||g − g'||² = 0: g = g'. Thus the mapping P_M is independent of the particular basis {φ_i}; it is determined by M alone. Clearly, (ii) follows from (i); it implies that P_M is idempotent in the sense that P_M² f = P_M f. As for (iii), if g lies in M, so does P_M f − g, so that, by the Pythagorean relation, ||f − g||² = ||f − P_M f||² + ||P_M f − g||² ≥ ||f − P_M f||²; the inequality is strict if g ≠ P_M f. Thus P_M f is the unique point of M lying nearest to f. Property (iv), linearity, follows from (i).

An Estimation Problem
First, the technical setting: Let (Ω, ℱ, μ) and (Θ, 𝒢, π) be a σ-finite space and a probability space, and assume that ℱ and 𝒢 are countably generated. Let f_θ(ω) be a nonnegative function on Θ × Ω, measurable 𝒢 × ℱ, and assume that ∫_Ω f_θ(ω) μ(dω) = 1 for each θ ∈ Θ. For some unknown value of θ, ω is drawn from Ω according to the probabilities P_θ(A) = ∫_A f_θ(ω) μ(dω), and the statistical problem is to estimate the value of g(θ), where g is a real function on Θ. The statistician knows the functions f and g as well as the value of ω; it is the value of θ that is unknown.

For an example, take Ω to be the line, f(ω) a function known to the statistician, and f_θ(ω) = αf(αω + β), where θ = (α, β) specifies unknown scale and location parameters; the problem is to estimate g(θ) = α, say. Or, more simply, as in the exponential case (14.7), take f_θ(ω) = αf(αω), where θ = g(θ) = α.

An estimator of g(θ) is a function t(ω). It is unbiased if

(19.15) ∫_Ω t(ω) f_θ(ω) μ(dω) = g(θ)

for all θ in Θ (assume the integral exists); this condition means that the estimate is on target in an average sense.

A natural loss function is (t(ω) − g(θ))², and if f_θ is the correct density, the risk is taken to be ∫_Ω (t(ω) − g(θ))² f_θ(ω) μ(dω). If the probability measure π is regarded as an a priori distribution for the unknown θ, the Bayes risk of t is

(19.16) R(π, t) = ∫_Θ ∫_Ω (t(ω) − g(θ))² f_θ(ω) μ(dω) π(dθ);

this integral, assumed finite, can be viewed as a joint integral or as an iterated integral (Fubini's theorem). And now t_0 is a Bayes estimator of g with respect to π if it minimizes R(π, t) over t. This is analogous to the Bayes rules discussed earlier. The
following simple projection argument shows that, except in trivial cases, no Bayes estimator is unbiased. Let Q be the probability measure on 𝒢 × ℱ having density f_θ(ω) with respect to π × μ, and let L² be the space of square-integrable functions on (Θ × Ω, 𝒢 × ℱ, Q). Then Q is finite and 𝒢 × ℱ is countably generated. Recall that an element of L² is an equivalence class of functions that are equal almost everywhere with respect to Q. Let G be the class of elements of L² containing a function of the form g̃(θ, ω) = g(θ), functions of θ alone. Then G is a subspace. (That G is algebraically closed is clear; if t_i ∈ G and ||t_i − t|| → 0, then, see the proof of Theorem 19.1, some subsequence converges to t outside a set of Q-measure 0,
The Neyman-Pearson lemma. Suppose f_1 and f_2 are rival densities and L(j|i) is 0 or 1 as j = i or j ≠ i, so that R_i(δ) is the probability of choosing the opposite density when f_i is the right one. Suppose of δ that δ_2(ω) = 1 if f_2(ω) > t f_1(ω) and δ_2(ω) = 0 if f_2(ω) < t f_1(ω), where t > 0. Show that δ is admissible: For any rule δ', ∫ δ'_2 f_1 dμ ≤ ∫ δ_2 f_1 dμ implies ∫ δ'_2 f_2 dμ ≤ ∫ δ_2 f_2 dμ. Hint: ∫ (δ_2 − δ'_2)(f_2 − t f_1) dμ ≥ 0, since the integrand is nonnegative.
19.8. The classical orthonormal basis for L²[0, 2π] with Lebesgue measure is the trigonometric system

(19.17) (2π)^{−1/2}, π^{−1/2} cos nx, π^{−1/2} sin nx, n = 1, 2, ....

Prove orthonormality. Hint: Express the sines and cosines in terms of e^{inx}, multiply out the products, and use the fact that ∫_0^{2π} e^{imx} dx is 2π or 0 as m = 0 or m ≠ 0. (For the completeness of the trigonometric system, see Problem 26.26.)
19.9. Drop the assumption that L² is separable. Order by inclusion the orthonormal systems in L², and let (Zorn's lemma) Φ = [φ_γ: γ ∈ Γ] be maximal.
(a) Show that Γ_f = [γ: (f, φ_γ) ≠ 0] is countable. Hint: Use Σ_{j=1}^n (f, φ_{γ_j})² ≤ ||f||² and the argument for Theorem 10.2(iv).
(b) Let Pf = Σ_{γ ∈ Γ_f} (f, φ_γ) φ_γ. Show that f − Pf ⊥ Φ and hence (maximality) f = Pf. Thus Φ is an orthonormal basis.
(c) Show that Γ is countable if and only if L² is separable.
(d) Now take Φ to be a maximal orthonormal system in a subspace M, and define P_M f = Σ_{γ ∈ Γ_f} (f, φ_γ) φ_γ. Show that P_M f ∈ M and f − P_M f ⊥ Φ, that g = P_M g if g ∈ M, and that f − P_M f ⊥ M. This defines the general orthogonal projection.
CHAPTER 4
Random Variables and Expected Values
SECTION 20. RANDOM VARIABLES AND DISTRIBUTIONS
This section and the next cover random variables and the machinery for dealing with them: expected values, distributions, moment generating functions, independence, convolution.

Random Variables and Vectors
A random variable on a probability space (Ω, ℱ, P) is a real-valued function X = X(ω) measurable ℱ.

Sections 5 through 9 dealt with random variables of a special kind, namely simple random variables, those with finite range. All concepts and facts concerning real measurable functions carry over to random variables; any changes are matters of viewpoint, notation, and terminology only. The positive and negative parts X⁺ and X⁻ of X are defined as in (15.4) and (15.5). Theorem 13.5 also applies: Define

(20.1) ψ_n(x) = (k − 1)2^{−n} if (k − 1)2^{−n} ≤ x < k 2^{−n}, 1 ≤ k ≤ n 2^n; ψ_n(x) = n if x ≥ n.

If X is nonnegative and X_n = ψ_n(X), then 0 ≤ X_n ↑ X. If X is not necessarily nonnegative, define

(20.2) X_n = ψ_n(X) if X ≥ 0; X_n = −ψ_n(−X) if X < 0.

(This is the same as (13.6).) Then 0 ≤ X_n(ω) ↑ X(ω) if X(ω) ≥ 0 and 0 ≥ X_n(ω) ↓ X(ω) if X(ω) ≤ 0; and |X_n(ω)| ↑ |X(ω)| for every ω. The random variable X_n is in each case simple.

A random vector is a mapping from Ω to R^k that is measurable ℱ. Any mapping from Ω to R^k must have the form ω → X(ω) = (X_1(ω), ..., X_k(ω)), where each X_i(ω) is real; as shown in Section 13 (see (13.2)), X is measurable if and only if each X_i is. Thus a random vector is simply a k-tuple X = (X_1, ..., X_k) of random variables.
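The staircase approximation (20.1)-(20.2) can be sketched directly. The following Python fragment is an illustration, not part of the text; the function names are invented.

```python
# The simple function psi_n of (20.1): round a nonnegative x down to the
# nearest (k-1)2^{-n}, capping at n; then X_n = psi_n(X) increases to X.

def psi(x, n):
    """psi_n(x) for x >= 0, as in (20.1)."""
    if x >= n:
        return float(n)
    return int(x * 2 ** n) / 2 ** n   # (k-1)2^{-n} with (k-1)2^{-n} <= x < k2^{-n}

def x_n(x, n):
    """The approximation (20.2), defined for x of either sign."""
    return psi(x, n) if x >= 0 else -psi(-x, n)

x = 3.714
approximations = [x_n(x, n) for n in range(1, 30)]
# Monotone increase toward x, never exceeding it:
assert all(a <= b for a, b in zip(approximations, approximations[1:]))
assert all(a <= x for a in approximations)
assert x - approximations[-1] < 2 ** -20
```

Since multiplying a float by a power of 2 is exact, the truncation really does stay at or below x, mirroring 0 ≤ X_n ↑ X.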
Subfields
If 𝒢 is a σ-field for which 𝒢 ⊂ ℱ, a k-dimensional random vector X is of course measurable 𝒢 if [ω: X(ω) ∈ H] ∈ 𝒢 for every H in ℛ^k. The σ-field σ(X) generated by X is the smallest σ-field with respect to which it is measurable. The σ-field generated by a collection of random vectors is the smallest σ-field with respect to which each one is measurable. As explained in Sections 4 and 5, a sub-σ-field corresponds to partial information about ω. The information contained in σ(X) = σ(X_1, ..., X_k) consists of the k numbers X_1(ω), ..., X_k(ω).†

The following theorem is the analogue of Theorem 5.1, but there are technical complications in its proof.
Theorem 20.1. Let X = (X_1, ..., X_k) be a random vector.
(i) The σ-field σ(X) = σ(X_1, ..., X_k) consists exactly of the sets [X ∈ H] for H ∈ ℛ^k.
(ii) In order that a random variable Y be measurable σ(X) = σ(X_1, ..., X_k), it is necessary and sufficient that there exist a measurable map f: R^k → R¹ such that Y(ω) = f(X_1(ω), ..., X_k(ω)) for all ω.

PROOF. The class 𝒢 of sets of the form [X ∈ H] for H ∈ ℛ^k is a σ-field. Since X is measurable σ(X), 𝒢 ⊂ σ(X). Since X is measurable 𝒢, σ(X) ⊂ 𝒢. Hence part (i).

Measurability of f in part (ii) refers of course to measurability ℛ^k/ℛ¹. The sufficiency is easy: if such an f exists, Theorem 13.1(ii) implies that Y is measurable σ(X). To prove necessity,‡ suppose at first that Y is a simple random variable, and let y_1, ..., y_m be its different possible values. Since A_i = [ω: Y(ω) = y_i] lies in σ(X), it must by part (i) have the form [ω: X(ω) ∈ H_i] for some H_i in ℛ^k. Put f = Σ_i y_i I_{H_i}; certainly f is measurable. Since the A_i are disjoint, no X(ω) can lie in more than one H_i (even though the latter need not be disjoint), and hence f(X(ω)) = Y(ω).
PROOF.
f.
,
=
t The partition defined by (4. 16) consists of the sets [w. X(w) t For a general version of this argument, see Problem 13.3.
=
x ] for x E
R k.
RANDOM VARIABLES AND EXPECTED VALUES
256
To treat the general case, consider simple random variables Y,. such that Y,.( w ) � Y(w) for each w. For each there is a measurable function fn : 1 R k � R such that Y,.(w) = fn( X(w)) for all w. Let M be the set of x in R k for which {fn( x )} converges; by Theorem 13.4{iii), M lies in � k. Let f( x ) = limn fJx) for x in M, and let f( x ) = O for x in R k - M. Since f= lim n fn lM and fn lM is measurable, f is measurable by Theorem 13.4{ii). For each w, Y( w ) = l im n JJ X(w)); this implies in the first place that X( w ) lies in M and in the second place that Y(w) = lim n JJX(w )) = f( X(w)). •
n,
Distributions
The distribution or law of a random variable X was in Section 14 defined as the probability measure on the line given by JL = px - 1 (see (13.7)), or ( 20.3 )
JL( A ) = P[ X t:: A ] ,
A
E �1 •
The distribution function of X was defined by ( 20.4) for real x. The left-hand limit of F satisfies ( 20.5)
F( x - ) = JL ( - oo , x )
= P[ X < x ) ,
F( x ) - F( x - ) = JL { x} = P [ X = x ] ,
and F has at most countably many discontinuities. Further, F is nondecreas ing and right-continuous, and limx --+ - oo F( x ) = 0, lim x --+oo F( x ) 1. By Theo rem 14. 1, for each F with these properties there exists on some probability space a random variable having F as its distribution function. A support for JL is a Borel set S for which JL{S ) = 1. A random variable, its distribution, and its distribution function are discrete if JL has a countable support S = { x 1 , x 2 , . . . }. In this case JL is completely determined by the values JL{ X 1 }, jL{X 2 }, • • • • A familiar discrete distribution is the =
binomial:
(20.6) P[X = r] = μ{r} = C(n, r) p^r (1 − p)^{n−r}, r = 0, 1, ..., n.

There are many random variables, on many spaces, with this distribution: If {X_k} is an independent sequence such that P[X_k = 1] = p and P[X_k = 0] = 1 − p (see Theorem 5.3), then X could be Σ_{i=1}^n X_i, or Σ_{i=n+1}^{2n} X_i, or the sum of any n of the X_i. Or Ω could be {0, 1, ..., n} if ℱ consists of all subsets, P{r} = μ{r}, r = 0, 1, ..., n, and X(r) = r. Or again the space and random variable could be those given by the construction in either of the two proofs of Theorem 14.1. These examples show that, although the distribution of a
random variable X contains all the information about the probabilistic behavior of X itself, it contains beyond this no further information about the underlying probability space (Ω, ℱ, P) or about the interaction of X with other random variables on the space.

Another common discrete distribution is the Poisson distribution with parameter λ > 0:
(20.7) P[X = r] = μ{r} = e^{−λ} λ^r / r!, r = 0, 1, ....
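A small numerical sketch, not part of the text, relates (20.6) and (20.7): for large n and small p with np = λ, the binomial probabilities are close to the Poisson ones. All names below are invented for the illustration.

```python
from math import comb, exp, factorial

def binomial_pmf(n, p, r):
    # (20.6): C(n, r) p^r (1-p)^{n-r}
    return comb(n, r) * p ** r * (1 - p) ** (n - r)

def poisson_pmf(lam, r):
    # (20.7): e^{-lambda} lambda^r / r!
    return exp(-lam) * lam ** r / factorial(r)

lam, n = 2.0, 1000
p = lam / n

# The binomial probabilities add to 1 ...
assert abs(sum(binomial_pmf(n, p, r) for r in range(n + 1)) - 1.0) < 1e-9
# ... and agree with the Poisson probabilities to a few decimal places.
for r in range(10):
    assert abs(binomial_pmf(n, p, r) - poisson_pmf(lam, r)) < 5e-3
```

The closeness of the two mass functions here is the usual Poisson approximation to the binomial; the text itself does not make this comparison at this point.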
A constant c can be regarded as a discrete random variable with X(ω) ≡ c. In this case P[X = c] = μ{c} = 1. For an artificial discrete example, let x_1, x_2, ... be an enumeration of the rationals, and put

(20.8) μ{x_k} = 2^{−k};

the point of the example is that the support need not be contained in a lattice.

A random variable X and its distribution have density f with respect to Lebesgue measure if f is a nonnegative Borel function on R¹ and

(20.9) P[X ∈ A] = μ(A) = ∫_A f(x) dx, A ∈ ℛ¹.

In other words, the requirement is that μ have density f with respect to Lebesgue measure in the sense of (16.11). The density is assumed to be with respect to Lebesgue measure if no other measure is specified. Taking A = R¹ in (20.9) shows that f must integrate to 1. Note that f is determined only to within a set of Lebesgue measure 0: if f = g except on a set of Lebesgue measure 0, then g can also serve as a density for X and μ. It follows by Theorem 3.3 that (20.9) holds for every Borel set A if it holds for every interval, that is, if
F(b) − F(a) = ∫_a^b f(x) dx

holds for every a and b. Note that F need not differentiate to f everywhere (see (20.13), for example); all that is required is that f integrate properly, that is, that (20.9) hold. On the other hand, if F does differentiate to f and f is continuous, it follows by the fundamental theorem of calculus that f is indeed a density for F.†
† The general question of the relation between differentiation and integration is taken up in Section 31.
For the exponential distribution with parameter α > 0, the density is

(20.10) f(x) = 0 if x < 0; f(x) = α e^{−αx} if x ≥ 0.

The corresponding distribution function

(20.11) F(x) = 0 if x < 0; F(x) = 1 − e^{−αx} if x ≥ 0

was studied in Section 14. For the normal distribution with parameters m and σ, σ > 0,

(20.12) f(x) = (σ √(2π))^{−1} exp[−(x − m)²/(2σ²)], −∞ < x < ∞.
For a > 0, (20.16) shows that aX + b has the normal density with parameters am + b and aσ. Finding the density of g(X) from first principles, as in the argument leading to (20.16), often works even if g is many-to-one:
Example 20.1. If X has the standard normal distribution, then

P[X² ≤ x] = P[−√x ≤ X ≤ √x] = 2(2π)^{−1/2} ∫_0^{√x} e^{−t²/2} dt

for x > 0. Hence X² has density

f(x) = 0 if x ≤ 0; f(x) = (2π)^{−1/2} x^{−1/2} e^{−x/2} if x > 0. •
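The density found in Example 20.1 can be checked by simulation. The following Python sketch is an illustration only (names invented): it compares the empirical distribution of X² for normal samples with the integral of the density above.

```python
import math, random

# Monte Carlo check of Example 20.1: the fraction of samples with X^2 <= x
# should match the integral of (2*pi)^{-1/2} t^{-1/2} e^{-t/2} over (0, x].

random.seed(7)
n = 200_000
samples = [random.gauss(0.0, 1.0) ** 2 for _ in range(n)]

def cdf_from_density(x, steps=50_000):
    h = x / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * h          # midpoint rule sidesteps the singularity at 0
        total += h * math.exp(-t / 2) / math.sqrt(2 * math.pi * t)
    return total

for x in (0.5, 1.0, 2.0):
    empirical = sum(s <= x for s in samples) / n
    assert abs(empirical - cdf_from_density(x)) < 0.01
```

At x = 1 both quantities are near P[|X| ≤ 1] ≈ 0.6827, as the example's first display predicts.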
Multidimensional Distributions
For a k-dimensional random vector X = (X_1, ..., X_k), the distribution μ (a probability measure on ℛ^k) and the distribution function F (a real function on R^k) are defined by

(20.17) μ(A) = P[X ∈ A], A ∈ ℛ^k; F(x) = μ(S_x), x ∈ R^k,

where S_x = [y: y_i ≤ x_i, i = 1, ..., k] consists of the points "southwest" of x.
Often μ and F are called the joint distribution and joint distribution function of X_1, ..., X_k. Now F is nondecreasing in each variable, and Δ_A F ≥ 0 for bounded rectangles A (see (12.12)). As h decreases to 0, the set S_x^h = [y: y_i ≤ x_i + h, i = 1, ..., k] decreases to S_x, and therefore (Theorem 2.1(ii)) F is continuous from above in the sense that lim_{h↓0} F(x_1 + h, ..., x_k + h) = F(x_1, ..., x_k). Further, F(x_1, ..., x_k) → 0 if x_i → −∞ for some i (the other coordinates held fixed), and F(x_1, ..., x_k) → 1 if x_i → ∞ for each i. For any F with these properties there is by Theorem 12.5 a unique probability measure μ on ℛ^k such that μ(A) = Δ_A F for bounded rectangles A, and μ(S_x) = F(x) for all x.

As h decreases to 0, S_x^{−h} increases to the interior S_x° = [y: y_i < x_i, i = 1, ..., k] of S_x, and so

(20.18) lim_{h↓0} F(x_1 − h, ..., x_k − h) = μ(S_x°).
Since F is nondecreasing in each variable, it is continuous at x if and only if it is continuous from below there in the sense that this last limit coincides with F(x). Thus F is continuous at x if and only if μ(S_x) = μ(S_x°), which holds if and only if the boundary ∂S_x = S_x − S_x° (the y-set where y_i ≤ x_i for all i and y_i = x_i for some i) satisfies μ(∂S_x) = 0.

If k > 1, F can have discontinuity points even if μ has no point masses: if μ corresponds to a uniform distribution of mass over the segment B = [(x, 0): 0 ≤ x ≤ 1] in the plane (μ(A) = λ[x: 0 ≤ x ≤ 1, (x, 0) ∈ A]), then F is discontinuous at each point of B. This also shows that F can be discontinuous at uncountably many points. On the other hand, for fixed x the boundaries ∂S_{(x_1 + h, ..., x_k + h)} are disjoint for different values of h, and so (Theorem 10.2(iv)) only countably many of them can have positive μ-measure. Thus x is the limit of points (x_1 + h, ..., x_k + h) at which F is continuous: the continuity points of F are dense.

There is always a random vector having a given distribution and distribution function: Take (Ω, ℱ, P) = (R^k, ℛ^k, μ) and X(ω) = ω. This is the obvious extension of the construction in the first proof of Theorem 14.1.

The distribution may as for the line be discrete in the sense of having countable support. It may have density f with respect to k-dimensional Lebesgue measure: μ(A) = ∫_A f(x) dx. As in the case k = 1, the distribution μ is more fundamental than the distribution function F, and usually μ is described not by F but by a density or by discrete probabilities.

If X is a k-dimensional random vector and g: R^k → R^i is measurable, then g(X) is an i-dimensional random vector; if the distribution of X is μ, the distribution of g(X) is μg^{−1}, just as in the case k = 1 (see (20.14)). If g_j: R^k → R¹ is defined by g_j(x_1, ..., x_k) = x_j, then g_j(X) is X_j, and its distribution μ_j = μg_j^{−1} is given by μ_j(A) = μ[(x_1, ..., x_k): x_j ∈ A] = P[X_j ∈ A]
•
SECTION 20.
RANDOM VARIABLES AND DISTRIBUTIONS
261
.9i' 1 . The are the marginal distributions of J..L . If J..L has a density f in AR \E then J..Lj has over the line the density J..L i
(20.19)
f;(x) = j f(x p .. . ,x,_ 1, x , x; + �> · · · · xk ) dx1 • • • dx,_ 1 dx; + 1 • • • dxk , since by Fubini ' s theorem the right side integrated over A comes to J..L[(xpNow·· suppose · · x�) XjthatEA].g is a one-to-one, continuously differentiable map of V onto U, where U and V are open sets in R k . Let T be the inverse, and suppose its Jacobian J(x) never vanishes. If X has a density f supported b)" V, then for A c U, P[g(X) E A) = P[ X E TA] = fTA f( y ) dy , and by (17. 10), this equals fA f(Tx)I J(x)l dx. Therefore, g{X) has density x I J I for x E U, ( �( Tx ( ) ) (20 .20) d( x) = for x $. U. Rk - 1
This is the analogue of (20.16).

Example 20.2. Suppose that (X_1, X_2) has density

f(x_1, x_2) = (2π)^{−1} exp[−(x_1² + x_2²)/2],

and let g be the transformation to polar coordinates. Then U, V, and T are as in Example 17.7. If R and Θ are the polar coordinates of (X_1, X_2), then (R, Θ) has density (2π)^{−1} ρ e^{−ρ²/2} in V. By (20.19), R has density ρ e^{−ρ²/2} on (0, ∞), and Θ is uniformly distributed over (0, 2π). •
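Example 20.2 is easy to confirm by simulation. The following Python sketch is illustrative only (names invented): it draws independent standard normals, passes to polar coordinates, and checks the two marginal laws.

```python
import math, random

# Monte Carlo check of Example 20.2: Theta is uniform on (0, 2*pi), and R
# has density r*exp(-r^2/2), i.e. P[R <= r] = 1 - exp(-r^2/2).

random.seed(1)
n = 100_000
angles, radii = [], []
for _ in range(n):
    x1, x2 = random.gauss(0, 1), random.gauss(0, 1)
    angles.append(math.atan2(x2, x1) % (2 * math.pi))
    radii.append(math.hypot(x1, x2))

# Each quarter of (0, 2*pi) receives about a quarter of the angle mass:
for k in range(4):
    frac = sum(k * math.pi / 2 < a <= (k + 1) * math.pi / 2 for a in angles) / n
    assert abs(frac - 0.25) < 0.01

# The distribution function of R matches 1 - exp(-r^2/2):
for r in (0.5, 1.0, 2.0):
    frac = sum(s <= r for s in radii) / n
    assert abs(frac - (1 - math.exp(-r * r / 2))) < 0.01
```

Read in reverse, this is the familiar recipe for generating normal pairs from a uniform angle and a Rayleigh radius.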
For the normal distribution in R^k, see Section 29.

Independence

Random variables X_1, ..., X_k are defined to be independent if the σ-fields σ(X_1), ..., σ(X_k) they generate are independent in the sense of Section 4. This concept for simple random variables was studied extensively in Chapter 1; the general case was touched on in Section 14. Since σ(X_i) consists of the sets [X_i ∈ H] for H ∈ ℛ¹, X_1, ..., X_k are independent if and only if

(20.21) P[X_1 ∈ H_1, ..., X_k ∈ H_k] = P[X_1 ∈ H_1] ⋯ P[X_k ∈ H_k]

for all linear Borel sets H_1, ..., H_k. The definition (4.10) of independence requires that (20.21) hold also if some of the events are suppressed on each side, but this only means taking H_i = R¹.
Suppose that

(20.22) P[X_1 ≤ x_1, ..., X_k ≤ x_k] = P[X_1 ≤ x_1] ⋯ P[X_k ≤ x_k]

for all real x_1, ..., x_k; it then also holds if some of the events [X_i ≤ x_i] are suppressed on each side (let x_i → ∞). Since the intervals (−∞, x] form a π-system generating ℛ¹, the sets [X_i ≤ x] form a π-system generating σ(X_i). Therefore, by Theorem 4.2, (20.22) implies that X_1, ..., X_k are independent. If, for example, the X_i are integer-valued, it is enough that

P[X_1 = n_1, ..., X_k = n_k] = P[X_1 = n_1] ⋯ P[X_k = n_k]

for integral n_1, ..., n_k (see (5.9)).

Let (X_1, ..., X_k) have distribution μ and distribution function F, and let the X_i have distributions μ_i and distribution functions F_i (the marginals). By (20.21), X_1, ..., X_k are independent if and only if μ is product measure in the sense of Section 18:
• • •
,
Xk are independent if and only if
(20.24) Suppose that each integrated over J.L has density
J.L,
fl f ( y1) ) yk k F1( x1) Fk(xk), so that
has density [;; by Fubini' s theorem, is just
(-oo, xd x · · · x { -oo, xd
•
•
•
•
•
•
( 20 . 25) in the case of independence. If are independent IT-fields and X; is measurable .#;, i 1, . . . ' k, then certainly XI , . are independent. are If X; is a d;-dimensional random vector, i 1, . . . , k, then X1 , , u(Xk ) are independent. by definition independent if the u-fields u(X The theory is just as for random variables: X1 , are independent if and E � d1 , Xk ) can be only if (20.21 ) holds for E � dk. Now (X re garded as a random vector of dimension d f.7_ d; ; if J.L is its distribution X R dk and J.L ; is the distribution of X; in R d;, then, just as in R d R d1 before, , X are independent if and only if J.L J.L X X J.L In none of this need the d; components of a single X; be themselves independent random variables. An infinite collection of random variables or random vectors is by defini tion independent if each finite subcollection is. The argument following (5.10)
RANDOM VARIABLES AND DISTRIBUTIONS
263
extends from collections of simple random variables to collections of random vectors: Theorem 20.2.
Suppose that ... ...
( 20.26) •
is an independent collection of random vectors. If .9; is the u-field generated by the i th row, then � .9;, . . . are independent. •
Let J:1f; consist of the finite intersections of sets of the form [ H] with H a Borel set in a space of the appropriate dimension, and apply Theorem 4.2. The u-fields .9; u{J:If;), i 1, . . . , n, are independent for each n, and the result follows. • PROOF.
=
X;j E
=
Each row of (20.26) may be finite or infinite, and there may be finitely or infinitely many rows. As a matter of fact, rows may be uncountable and there may be uncountably many of them. Suppose that X and are independent random vectors with distributions Then ( X, has distribution /-L and in and in and over Let range over By Fubini's theorem,
Y k k k + Rj = Rj . X . Rj R X R Y) J.L v v k ·x Rj y R . · ( 20 .27 ) (J.L x v)(B ) = J v[y: (x, y) E B] J.L(dx ) , Replace B by (A X R k ) n B, where E �j and B E �j + k . Then (20.27) reduces to ( 20.28) (J.L X v) (( A x R k ) n B ) = f v[y : (x, y) EB ]J.L( dx ) , RJ
A
A
) E B] is the x-section of B, so that Bx E� k Theorem = [ (x, B y y x 1 8.1), then P[(x, Y) E B] = P[w : (x, Y(w)) E B] P[w: Y(w) E B ] v(Bx). :
If
=
(
x
=
Expressing the formulas in terms of the random vectors themselves gives this result: Theorem 20.3.
If X and Y are independent random vectors with distribu and then
J.L and v in Rj R\ B E�j + k ( 20 .29 ) P [( X , Y ) EB ] = f P [(x , Y ) EB ] J.L(dx ) ,
tions
Rl
'
RANDOM VARIABLES AND EXPECTED VALUES
264
and
P [ XEA,( X,Y ) E B] f P [(x ,Y ) E B] J.L( dx ) , A E �j , B E �j + k . Suppose that X and Y are independent exponentially distributed random variables. By (20.29), P[Y/X > z] f�P[Yjx x zae-ax dx = (l +z)- 1 • Thus YjX has density x (l +z)-2 = J�e-a z]ae-a dx for z > O. Since P[X> z1, Y/X>z2] = fz� P[Yjx > z2]ae-a x dx by (20.30), • the joint distribution of X and YIX n be calculated as well. (20 .30)
=
A
Example 20.3.
>
=
ca
The formulas (20.29) and (20.30) are constantly applied as in thi example. There is no virtue in making an issue of each case, however, and the appeal to Theorem 20.3 is usua lly silent. Here is a more complicated argument of the same sort. Let , Xn be independent random variables, earh unifo rmly distributed over [0, t ]. X < Yn < t. The X; Let Yk be the k th smallest among the X;, so that 0 < Y < Yn - Yn t - Yn; let M divide [0, t ] into + 1 subintervals of lengths Y , Y2 P[ M < ] . The problem is to show be the maximum of these lengths. Define 1/Jn( t, that
1
Example
, • • •
20.4.
1 1a)= - Y1, •••a ,
n
·
·
·
_
1,
( 20 .31 )
=
where x + ( x + I xi)/2 denotes positive part. Separate consideration of the possibilities 0 < a < t /2, t /2 < a < t , and t 1 . Suppose it is shown that the probability rfii t, satisf.es disposes of the case the recursion ( 20 .32)
s this same recursion, and so it will follow by induction that (20.31) holds for all In intuitive form, the argument for (20.32) is this: If [ M < a] is to hold, the smallest If X1 is the smallest of the X1, then of the X; mus t have some value in [0, , Xn must all lie in [ t ] and divide it into subintervals of length at most the X n probability this is (1 - xjt) - J r/J, _ /t because X , , Xn have probability n of (1 - xjt ) - I of all lying in [ t ], and if they do, they are independe nt and uniformly distributed there. Now (20 .32) results from integrating with respect to the density for xl and multiplying by to allow for the fact that any of XI , . . . , xn may be the smallest. 1. Let be To make this argument rigorous, apply (20.30) for j 1 and k = the interval [0, a], and let consist of the points (x n) for which 0 < X; < t, is the minimum of xn divide [ x 1 , t] into subintervals of length , xn, and at most a. Then P[ X1 = min X;, M < a] = P[ X1 (X1 , , Xn ) 8 ]. Take X1 for
2
,
• . .
n
x, x,
B x1, • • •
x
n.
a]. - x, a),
2
1, x
a;
• • •
n-
=
x2,
• • • ,
• • • ,
EA,
• • •
E
A x1
S ECTION 20.
RANDOM VARIABLES AND DISTR IBUTIONS
X and ( X2, • • • , X,) for
265
Y in (20.30). Since X1 has density ljt,
(20.33) If C is the event that x < X; < t for 2 < i < n, then P(C) = 0 - xjt)" - 1 • A simple calculation shows that P[ X; - X < s" 2 < i < n!C] = n;'= z(sJ(t - X )); in other words, given C, the random variables X2 - x, . . . , X, - x are conditionally independent and uniformly distributed over [0, t - x ]. Now X2 , • • • , X, are random variables on some 5', P); replacing P by P( ·!C) shows that the integrand in probability space (20.33) is the same as that in (20.32). The same argument holds wi th the index 1 replaced by any k (1 < k < n), which gives (20.32). (The events [ Xk = mi n X;, < ] a re not disjoi nt, but any two intersect in a set of probability 11
(D.,
O.)
Y a
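The closed form for the maximum spacing can be compared with direct simulation. The sketch below is a hypothetical check (function names and parameters are illustrative): `psi` implements the alternating-sum formula, and `sim` estimates $P[M \le \alpha]$ by sampling.

```python
import math
import random

def psi(n, t, a):
    """Closed form (20.31): P[M <= a] for n uniform points on [0, t]."""
    return sum((-1) ** i * math.comb(n + 1, i) * max(1 - i * a / t, 0.0) ** n
               for i in range(n + 2))

def sim(n, t, a, trials, seed=0):
    """Direct Monte Carlo estimate of P[M <= a]."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        xs = sorted(rng.uniform(0, t) for _ in range(n))
        gaps = [xs[0]] + [v - u for u, v in zip(xs, xs[1:])] + [t - xs[-1]]
        hits += max(gaps) <= a
    return hits / trials
```

For $n = 2$, $t = 1$, $\alpha = 1/2$ the formula gives $1 - 3 \cdot (1/2)^2 = 1/4$, and the simulation agrees to within sampling error; for $\alpha \ge t$ the formula correctly returns 1.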
Sequences of Random Variables
Theorem 5.3 extends to general distributions.

Theorem 20.4. If $\{\mu_n\}$ is a finite or infinite sequence of probability measures on $\mathscr{R}^1$, there exists on some probability space $(\Omega, \mathscr{F}, P)$ an independent sequence $\{X_n\}$ of random variables such that $X_n$ has distribution $\mu_n$.

PROOF. By Theorem 5.3 there exists on some probability space an independent sequence $Z_1, Z_2, \ldots$ of random variables assuming the values 0 and 1 with probabilities $P[Z_n = 0] = P[Z_n = 1] = \frac{1}{2}$. As a matter of fact, Theorem 5.3 is not needed: take the space to be the unit interval and the $Z_n(\omega)$ to be the digits of the dyadic expansion of $\omega$ (the functions $d_n(\omega)$ of Sections 1 and 4). Relabel the countably many random variables $Z_n$ so that they form a double array $Z_{nk}$, $n, k = 1, 2, \ldots$; all the $Z_{nk}$ are independent. Put $U_n = \sum_{k=1}^{\infty} Z_{nk} 2^{-k}$. The series certainly converges, and $U_n$ is a random variable by Theorem 13.4. Further, $U_1, U_2, \ldots$ is, by Theorem 20.2, an independent sequence. Now $P[Z_{ni} = z_i,\ 1 \le i \le k] = 2^{-k}$ for each sequence $z_1, \ldots, z_k$ of 0's and 1's; hence the $2^k$ possible values $j 2^{-k}$, $0 \le j < 2^k$, of $S_{nk} = \sum_{i=1}^k Z_{ni} 2^{-i}$ all have probability $2^{-k}$. If $0 \le x \le 1$, the number of the $j 2^{-k}$ that lie in $[0, x]$ is $\lfloor 2^k x \rfloor + 1$, and therefore $P[S_{nk} \le x] = (\lfloor 2^k x \rfloor + 1)/2^k$. Since $S_{nk}(\omega) \uparrow U_n(\omega)$ as $k \uparrow \infty$, it follows that $[S_{nk} \le x] \downarrow [U_n \le x]$ as $k \uparrow \infty$, and so $P[U_n \le x] = \lim_k P[S_{nk} \le x] = \lim_k (\lfloor 2^k x \rfloor + 1)/2^k = x$ for $0 \le x \le 1$. Thus $U_n$ is uniformly distributed over the unit interval.

The construction thus far establishes the existence of an independent sequence $U_1, U_2, \ldots$ of random variables each uniformly distributed over $[0, 1]$. Let $F_n$ be the distribution function corresponding to $\mu_n$, and put $\varphi_n(u) = \inf[x\colon u \le F_n(x)]$ for $0 < u < 1$. This is the inverse used in Section 14; see (14.5). Set $\varphi_n(u) = 0$, say, for $u$ outside $(0, 1)$, and put $X_n(\omega) = \varphi_n(U_n(\omega))$. Since $\varphi_n(u) \le x$ if and only if $u \le F_n(x)$ (see the argument following (14.5)), $P[X_n \le x] = P[U_n \le F_n(x)] = F_n(x)$. Thus $X_n$ has distribution function $F_n$. And by Theorem 20.2, $X_1, X_2, \ldots$ are independent. ∎
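The quantile transformation $\varphi_n(U_n)$ in the proof is exactly the inverse-transform method used in simulation. The sketch below is an illustrative assumption, not part of the text: it computes $\varphi(u) = \inf[x\colon u \le F(x)]$ by bisection for a continuous, strictly increasing $F$, and samples from an exponential distribution that way.

```python
import math
import random

def phi(F, u, lo, hi, tol=1e-9):
    """phi(u) = inf[x: u <= F(x)], by bisection; F is assumed continuous
    and strictly increasing on [lo, hi] with F(lo) < u <= F(hi)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if F(mid) >= u:
            hi = mid
        else:
            lo = mid
    return hi

# Exponential distribution function with alpha = 2 (an assumed example).
F_exp = lambda x: 1.0 - math.exp(-2.0 * x) if x > 0 else 0.0

rng = random.Random(42)
samples = [phi(F_exp, rng.random(), 0.0, 100.0) for _ in range(10_000)]
mean = sum(samples) / len(samples)  # should approach 1/alpha = 0.5
```

Because $\varphi(U) \le x$ exactly when $U \le F(x)$, the samples have distribution function $F$, and their average is near $1/\alpha = 0.5$.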
This theorem of course includes Theorem 5.3 as a special case, and its proof does not depend on the earlier result. Theorem 20.4 is a special case of Kolmogorov's existence theorem in Section 36.

Convolution

Let $X$ and $Y$ be independent random variables with distributions $\mu$ and $\nu$. Apply (20.27) and (20.29) to the planar set $B = [(x, y)\colon x + y \in H]$ with $H \in \mathscr{R}^1$:

(20.34)    $P[X + Y \in H] = \int_{-\infty}^{\infty} \nu(H - x)\, \mu(dx) = \int_{-\infty}^{\infty} P[Y \in H - x]\, \mu(dx).$

The convolution of $\mu$ and $\nu$ is the measure $\mu * \nu$ defined by

(20.35)    $(\mu * \nu)(H) = \int_{-\infty}^{\infty} \nu(H - x)\, \mu(dx), \qquad H \in \mathscr{R}^1.$

If $X$ and $Y$ are independent and have distributions $\mu$ and $\nu$, (20.34) shows that $X + Y$ has distribution $\mu * \nu$. Since addition of random variables is commutative and associative, the same is true of convolution: $\mu * \nu = \nu * \mu$ and $\mu * (\nu * \eta) = (\mu * \nu) * \eta$.

If $F$ and $G$ are the distribution functions corresponding to $\mu$ and $\nu$, the distribution function corresponding to $\mu * \nu$ is denoted $F * G$. Taking $H = (-\infty, y]$ in (20.35) shows that

(20.36)    $(F * G)(y) = \int_{-\infty}^{\infty} G(y - x)\, dF(x).$

(See (17.22) for the notation $dF(x)$.) If $G$ has density $g$, then $G(y - x) = \int_{-\infty}^{y - x} g(s)\, ds = \int_{-\infty}^{y} g(t - x)\, dt$, and so the right side of (20.36) is $\int_{-\infty}^{y} \big[\int_{-\infty}^{\infty} g(t - x)\, dF(x)\big]\, dt$ by Fubini's theorem. Thus $F * G$ has density $F * g$, where

(20.37)    $(F * g)(y) = \int_{-\infty}^{\infty} g(y - x)\, dF(x);$
this holds if $G$ has density $g$. If, in addition, $F$ has density $f$, (20.37) is denoted $f * g$ and reduces by (16.12) to

(20.38)    $(f * g)(y) = \int_{-\infty}^{\infty} g(y - x) f(x)\, dx.$

This defines convolution for densities, and $\mu * \nu$ has density $f * g$ if $\mu$ and $\nu$ have densities $f$ and $g$. The formula (20.38) can be used for many explicit calculations.
Example 20.5. Let $X_1, \ldots, X_k$ be independent random variables, each with the exponential density (20.10). Define $g_k$ by

(20.39)    $g_k(x) = \frac{\alpha^k x^{k-1}}{(k-1)!}\, e^{-\alpha x}, \qquad x > 0,\ k = 1, 2, \ldots;$

put $g_k(x) = 0$ for $x \le 0$. Now $g_1$ coincides with (20.10), and a direct computation of the integral (20.38) shows that $g_{k-1} * g_1 = g_k$. Since $X_1 + \cdots + X_k$ has density $g_{k-1} * g_1$ if $X_1 + \cdots + X_{k-1}$ has density $g_{k-1}$, it follows by induction that the sum $X_1 + \cdots + X_k$ has density $g_k$. The corresponding distribution function is

(20.40)    $G_k(x) = 1 - \sum_{i=0}^{k-1} e^{-\alpha x}\, \frac{(\alpha x)^i}{i!}, \qquad x > 0,$

as follows by differentiation. ∎
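The convolution (20.38) can also be evaluated numerically by discretizing the integral. The sketch below is an illustrative Riemann-sum approximation (grid parameters are assumptions); it reproduces the $k = 2$ case of (20.39) from two copies of the exponential density.

```python
import math

def convolve(f, g, step, n_points):
    """Riemann-sum approximation to (20.38): (f*g)(y) = integral of
    g(y - x) f(x) dx over a grid of nonnegative x values."""
    xs = [i * step for i in range(n_points)]
    return lambda y: step * sum(g(y - x) * f(x) for x in xs)

alpha = 1.0
g1 = lambda x: alpha * math.exp(-alpha * x) if x > 0 else 0.0            # (20.10)
g2 = lambda x: alpha ** 2 * x * math.exp(-alpha * x) if x > 0 else 0.0   # (20.39), k = 2

g2_num = convolve(g1, g1, 0.001, 10_000)
```

At, say, $y = 1$ and $y = 3$, the numerical convolution agrees with the closed form $g_2(y) = \alpha^2 y e^{-\alpha y}$ to the accuracy of the grid.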
Example 20.6. Suppose that $X$ has the normal density (20.12) with $m = 0$ and $\sigma$, and that $Y$ has the same density with $\tau$ in place of $\sigma$. If $X$ and $Y$ are independent, then $X + Y$ has density

$(f * g)(y) = \frac{1}{2\pi\sigma\tau} \int_{-\infty}^{\infty} \exp\Big[-\frac{x^2}{2\sigma^2} - \frac{(y - x)^2}{2\tau^2}\Big]\, dx.$

Completing the square in the exponent and making the change of variable $u = \dfrac{(\sigma^2 + \tau^2)^{1/2}}{\sigma\tau}\Big(x - \dfrac{\sigma^2 y}{\sigma^2 + \tau^2}\Big)$ reduces this to

$(f * g)(y) = \frac{1}{\sqrt{2\pi(\sigma^2 + \tau^2)}} \exp\Big[-\frac{y^2}{2(\sigma^2 + \tau^2)}\Big].$

Thus $X + Y$ has the normal density with $m = 0$ and with $\sigma^2 + \tau^2$ in place of $\sigma^2$. ∎
If $\mu$ and $\nu$ are arbitrary finite measures on the line, their convolution $\mu * \nu$ is defined by (20.35) even if they are not probability measures.
Convergence in Probability

Random variables $X_n$ converge in probability to $X$, written $X_n \to_P X$, if

(20.41)    $\lim_n P[\,|X_n - X| \ge \epsilon\,] = 0$

for each positive $\epsilon$.† This is the same as (5.7), and the proof of Theorem 5.2 carries over without change (see also Example 5.4).

Theorem 20.5. (i) If $X_n \to X$ with probability 1, then $X_n \to_P X$.
(ii) A necessary and sufficient condition for $X_n \to_P X$ is that each subsequence $\{X_{n_k}\}$ contain a further subsequence $\{X_{n_{k(i)}}\}$ such that $X_{n_{k(i)}} \to X$ with probability 1 as $i \to \infty$.

PROOF. Only part (ii) needs proof. If $X_n \to_P X$, then given $\{n_k\}$, choose a subsequence $\{n_{k(i)}\}$ so that $k \ge k(i)$ implies that $P[\,|X_{n_k} - X| \ge i^{-1}\,] < 2^{-i}$. By the first Borel–Cantelli lemma there is probability 1 that $|X_{n_{k(i)}} - X| < i^{-1}$ for all but finitely many $i$. Therefore, $\lim_i X_{n_{k(i)}}(\omega) = X(\omega)$ with probability 1.

If $X_n$ does not converge to $X$ in probability, there is some positive $\epsilon$ for which $P[\,|X_{n_k} - X| \ge \epsilon\,] \ge \epsilon$ holds along some sequence $\{n_k\}$; then no subsequence of $\{X_{n_k}\}$ can converge to $X$ with probability 1, since by part (i) that would force convergence in probability along it. ∎

The definition makes sense on an arbitrary measure space: if $\mu[\,|f_n - f| \ge \epsilon\,] \to 0$ for each positive $\epsilon$, then $f_n$ converges in measure to $f$.

†This is often expressed $\operatorname{p\,lim}_n X_n = X$.
The Glivenko–Cantelli Theorem*

The empirical distribution function for random variables $X_1, \ldots, X_n$ is the distribution function $F_n(x, \omega)$ with a jump of $n^{-1}$ at each $X_k(\omega)$:

(20.42)    $F_n(x, \omega) = \frac{1}{n} \sum_{k=1}^{n} I_{(-\infty, x]}(X_k(\omega)).$

*This topic may be omitted.
If the $X_k$ have a common unknown distribution function $F(x)$, then $F_n(x, \omega)$ is its natural estimate. The estimate has the right limiting behavior, according to the Glivenko–Cantelli theorem:

Theorem 20.6. Suppose that $X_1, X_2, \ldots$ are independent and have a common distribution function $F$; put $D_n(\omega) = \sup_x |F_n(x, \omega) - F(x)|$. Then $D_n \to 0$ with probability 1.

For each $x$, $F_n(x, \omega)$ as a function of $\omega$ is a random variable. By right continuity, the supremum above is unchanged if $x$ is restricted to the rationals, and therefore $D_n$ is a random variable. The summands in (20.42) are independent, identically distributed simple random variables, and so by the strong law of large numbers (Theorem 6.1), for each $x$ there is a set $A_x$ of probability 0 such that

(20.43)    $\lim_n F_n(x, \omega) = F(x)$

for $\omega \notin A_x$. But Theorem 20.6 says more, namely that (20.43) holds for $\omega$ outside some set $A$ of probability 0, where $A$ does not depend on $x$; as there are uncountably many of the sets $A_x$, it is conceivable a priori that their union might necessarily have positive measure. Further, the convergence in (20.43) is uniform in $x$. Of course, the theorem implies that with probability 1 there is weak convergence $F_n(x, \omega) \Rightarrow F(x)$ in the sense of Section 14.

PROOF OF THE THEOREM. As already observed, the set $A_x$ where (20.43) fails has probability 0. Another application of the strong law of large numbers, with $I_{(-\infty, x)}$ in place of $I_{(-\infty, x]}$ in (20.42), shows that (see (20.5)) $\lim_n F_n(x-, \omega) = F(x-)$ except on a set $B_x$ of probability 0.

Let $\varphi(u) = \inf[x\colon u \le F(x)]$ for $0 < u < 1$ (see (14.5)), and put $x_{m,k} = \varphi(k/m)$ for $m \ge 1$, $1 \le k \le m$. It is not hard to see that $F(\varphi(u)-) \le u \le F(\varphi(u))$; hence $F(x_{m,k}-) - F(x_{m,k-1}) \le m^{-1}$, $F(x_{m,1}-) \le m^{-1}$, and $F(x_{m,m}) \ge 1 - m^{-1}$. Let $D_{m,n}(\omega)$ be the maximum of the quantities $|F_n(x_{m,k}, \omega) - F(x_{m,k})|$ and $|F_n(x_{m,k}-, \omega) - F(x_{m,k}-)|$ for $k = 1, \ldots, m$. If $x_{m,k-1} \le x < x_{m,k}$, then

$F_n(x, \omega) \le F_n(x_{m,k}-, \omega) \le F(x_{m,k}-) + D_{m,n}(\omega) \le F(x) + m^{-1} + D_{m,n}(\omega),$

$F_n(x, \omega) \ge F_n(x_{m,k-1}, \omega) \ge F(x_{m,k-1}) - D_{m,n}(\omega) \ge F(x) - m^{-1} - D_{m,n}(\omega).$

Together with similar arguments for the cases $x < x_{m,1}$ and $x \ge x_{m,m}$, this shows that

(20.44)    $D_n(\omega) \le D_{m,n}(\omega) + m^{-1}.$

If $\omega$ lies outside the union $A$ of all the $A_{x_{m,k}}$ and $B_{x_{m,k}}$, then $\lim_n D_{m,n}(\omega) = 0$ for each $m$ and hence $\lim_n D_n(\omega) = 0$ by (20.44). But $A$ has probability 0. ∎
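For continuous $F$ the supremum defining $D_n$ is attained at a jump of $F_n$, so $D_n$ can be computed exactly from the order statistics. The sketch below is an illustrative check with a uniform $F$ (an assumed example); the decrease of $D_n$ with $n$ is visible.

```python
import random

def sup_distance(sample, F):
    """D_n = sup_x |F_n(x) - F(x)|: for continuous F it suffices to
    compare F with F_n(x) and F_n(x-) at each order statistic."""
    xs = sorted(sample)
    n = len(xs)
    return max(max(abs((i + 1) / n - F(x)), abs(i / n - F(x)))
               for i, x in enumerate(xs))

F_unif = lambda x: min(max(x, 0.0), 1.0)  # uniform [0, 1] distribution function
rng = random.Random(7)
d100 = sup_distance([rng.random() for _ in range(100)], F_unif)
d10000 = sup_distance([rng.random() for _ in range(10_000)], F_unif)
```

With the seed fixed, `d10000` is an order of magnitude smaller than `d100`, consistent with Theorem 20.6.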
PROBLEMS

20.1. 2.11 ↑ A necessary and sufficient condition for a $\sigma$-field $\mathscr{F}$ to be countably generated is that $\mathscr{F} = \sigma(X)$ for some random variable $X$. Hint: If $\mathscr{F} = \sigma(A_1, A_2, \ldots)$, consider $X = \sum_{k=1}^{\infty} f(I_{A_k})/10^k$, where $f(x)$ is 4 for $x = 0$ and 5 for $x \ne 0$.

20.2. If $X$ is a positive random variable with density $f$, then $X^{-1}$ has density $f(1/x)/x^2$. Prove this by (20.16) and by a direct argument.
20.3.
Suppose that a two-dimensional distribution function $F$ has a continuous density $f$. Show that $f(x, y) = \partial^2 F(x, y)/\partial x\, \partial y$.
20.4. The construction in Theorem 20.4 requires only Lebesgue measure on the unit interval. Use the theorem to prove the existence of Lebesgue measure on $R^k$. First construct $\lambda_k$ restricted to $(-n, n] \times \cdots \times (-n, n]$, and then pass to the limit ($n \to \infty$). The idea is to argue from first principles, and not to use previous constructions, such as those in Theorems 12.5 and 18.2.

20.5. Suppose that $A$, $B$, and $C$ are positive, independent random variables with distribution function $F$. Show that the quadratic $Az^2 + Bz + C$ has real zeros with probability $\int_0^{\infty}\!\int_0^{\infty} F(x^2/4y)\, dF(x)\, dF(y)$.

20.6. Show that $X_1, X_2, \ldots$ are independent if $\sigma(X_1, \ldots, X_{n-1})$ and $\sigma(X_n)$ are independent for each $n$.

20.7. Let $X_0, X_1, \ldots$ be a persistent, irreducible Markov chain, and for a fixed state $j$ let $T_1, T_2, \ldots$ be the times of the successive passages through $j$. Let $Z_1 = T_1$ and $Z_n = T_n - T_{n-1}$, $n \ge 2$. Show that $Z_1, Z_2, \ldots$ are independent and that $P[Z_n = k] = f_{jj}^{(k)}$ for $n \ge 2$.

20.8. Ranks and records. Let $X_1, X_2, \ldots$ be independent random variables with a common continuous distribution function. Let $B$ be the $\omega$-set where $X_m(\omega) = X_n(\omega)$ for some pair $m, n$ of distinct integers, and show that $P(B) = 0$. Remove $B$ from the space $\Omega$ on which the $X_n$ are defined. This leaves the joint distributions of the $X_n$ unchanged and makes ties impossible.

Let $T^{(n)}(\omega) = (T_1^{(n)}(\omega), \ldots, T_n^{(n)}(\omega))$ be that permutation $(t_1, \ldots, t_n)$ of $(1, \ldots, n)$ for which $X_{t_1}(\omega) < X_{t_2}(\omega) < \cdots < X_{t_n}(\omega)$. Let $Y_n$ be the rank of $X_n$ among $X_1, \ldots, X_n$: $Y_n = r$ if and only if $X_i < X_n$ for exactly $r - 1$ values of $i$ preceding $n$.
(a) Show that $T^{(n)}$ is uniformly distributed over the $n!$ permutations.
(b) Show that $P[Y_n = r] = 1/n$, $1 \le r \le n$.
(c) Show that $Y_n$ is measurable $\sigma(T^{(n)})$.

20.14. The Cauchy distribution has density

(20.45)    $c_u(x) = \frac{1}{\pi} \cdot \frac{u}{u^2 + x^2}, \qquad -\infty < x < \infty,$

for $u > 0$. (By (17.9), the density integrates to 1.)
(a) Show that $c_u * c_v = c_{u+v}$. Hint: Expand the convolution integrand in partial fractions.
(b) Show that, if $X_1, \ldots, X_n$ are independent and have density $c_u$, then $(X_1 + \cdots + X_n)/n$ has density $c_u$ as well.

20.15. (a) Show that, if $X$ and $Y$ are independent and have the standard normal density, then $X/Y$ has the Cauchy density with $u = 1$.
(b) Show that, if $X$ has the uniform distribution over $(-\pi/2, \pi/2)$, then $\tan X$ has the Cauchy distribution with $u = 1$.
20.16. 18.18 ↑ Let $X_1, \ldots, X_n$ be independent, each having the standard normal distribution. Show that $\chi_n^2 = X_1^2 + \cdots + X_n^2$ has density

(20.46)    $\frac{1}{2^{n/2}\, \Gamma(n/2)}\, x^{n/2 - 1} e^{-x/2}$

over $(0, \infty)$. This is called the chi-squared distribution with $n$ degrees of freedom.
20.17. ↑ The gamma distribution has density

(20.47)    $f(x; \alpha, u) = \frac{\alpha^u}{\Gamma(u)}\, x^{u-1} e^{-\alpha x}$

over $(0, \infty)$ for positive parameters $\alpha$ and $u$. Check that (20.47) integrates to 1; show that

(20.48)    $f(\cdot\,; \alpha, u) * f(\cdot\,; \alpha, v) = f(\cdot\,; \alpha, u + v).$

Note that (20.46) is $f(x; \frac{1}{2}, n/2)$, and from (20.48) deduce again that (20.46) is the density of $\chi_n^2$. Note that the exponential density (20.10) is $f(x; \alpha, 1)$, and from (20.48) deduce (20.39) once again.

20.18. ↑ Let $N, X_1, X_2, \ldots$ be independent, where $P[N = n] = q^{n-1} p$, $n \ge 1$, and each $X_k$ has the exponential density $f(x; \alpha, 1)$. Show that $X_1 + \cdots + X_N$ has density $f(x; \alpha p, 1)$.

20.19. Let $A_{nm}(\epsilon) = [\,|Z_k - Z| \le \epsilon,\ n \le k \le m\,]$. Show that $Z_n \to Z$ with probability 1 if and only if $\lim_n \lim_m P(A_{nm}(\epsilon)) = 1$ for all positive $\epsilon$, whereas $Z_n \to_P Z$ if and only if $\lim_n P(A_{nn}(\epsilon)) = 1$ for all positive $\epsilon$.

20.20. (a) Suppose that $f\colon R^2 \to R^1$ is continuous. Show that $X_n \to_P X$ and $Y_n \to_P Y$ imply $f(X_n, Y_n) \to_P f(X, Y)$.
(b) Show that addition and multiplication preserve convergence in probability.

20.21. Suppose that the sequence $\{X_n\}$ is fundamental in probability in the sense that for positive $\epsilon$ there exists an $N_\epsilon$ such that $P[\,|X_m - X_n| > \epsilon\,] < \epsilon$ for $m, n \ge N_\epsilon$.
(a) Prove that there is a subsequence $\{X_{n_k}\}$ and a random variable $X$ such that $\lim_k X_{n_k} = X$ with probability 1. Hint: Choose increasing $n_k$ such that $P[\,|X_m - X_n| > 2^{-k}\,] < 2^{-k}$ for $m, n \ge n_k$. Analyze $P[\,|X_{n_{k+1}} - X_{n_k}| > 2^{-k}\,]$.
(b) Show that $X_n \to_P X$.

20.22. (a) Suppose that $X_1 \le X_2 \le \cdots$ and that $X_n \to_P X$. Show that $X_n \to X$ with probability 1.
(b) Show by example that in an infinite measure space functions can converge almost everywhere without converging in measure.

20.23. If $X_n \to 0$ with probability 1, then $n^{-1} \sum_{k=1}^n X_k \to 0$ with probability 1 by the standard theorem on Cesàro means [A30]. Show by example that this is not so if convergence with probability 1 is replaced by convergence in probability.

20.24. 2.19 ↑ (a) Show that in a discrete probability space convergence in probability is equivalent to convergence with probability 1.
(b) Show that discrete spaces are essentially the only ones where this equivalence holds: Suppose that $P$ has a nonatomic part in the sense that there is a set $A$ such that $P(A) > 0$ and $P(A \cap \cdot\,)$ is nonatomic. Construct random variables $X_n$ such that $X_n \to_P 0$ but $X_n$ does not converge to 0 with probability 1.
20.25. 20.21 20.24 ↑ Let $d(X, Y)$ be the infimum of those positive $\epsilon$ for which $P[\,|X - Y| \ge \epsilon\,] \le \epsilon$.
(a) Show that $d(X, Y) = 0$ if and only if $X = Y$ with probability 1. Identify random variables that are equal with probability 1, and show that $d$ is a metric on the resulting space.
(b) Show that $X_n \to_P X$ if and only if $d(X_n, X) \to 0$.
(c) Show that the space is complete.
(d) Show that in general there is no metric $d_0$ on this space such that $X_n \to X$ with probability 1 if and only if $d_0(X_n, X) \to 0$.

20.26. Construct in $R^k$ a random variable $X$ that is uniformly distributed over the surface of the unit sphere in the sense that $|X| = 1$ and $UX$ has the same distribution as $X$ for orthogonal transformations $U$. Hint: Let $Z$ be uniformly distributed in the unit ball in $R^k$, define $\psi(x) = x/|x|$ ($\psi(0) = (1, 0, \ldots, 0)$, say), and take $X = \psi(Z)$.

20.27. ↑ Let $\Theta$ and $\Phi$ be the longitude and latitude of a random point on the surface of the unit sphere in $R^3$. Show that $\Theta$ and $\Phi$ are independent, $\Theta$ is uniformly distributed over $[0, 2\pi)$, and $\Phi$ is distributed over $[-\pi/2, +\pi/2]$ with density $\frac{1}{2} \cos \varphi$.
SECTION 21. EXPECTED VALUES

Expected Value as Integral

The expected value of a random variable $X$ on $(\Omega, \mathscr{F}, P)$ is the integral of $X$ with respect to the measure $P$:

$E[X] = \int X\, dP = \int_{\Omega} X(\omega)\, P(d\omega).$

All the definitions, conventions, and theorems of Chapter 3 apply. For nonnegative $X$, $E[X]$ is always defined (it may be infinite); for the general $X$, $E[X]$ is defined, or $X$ has an expected value, if at least one of $E[X^+]$ and $E[X^-]$ is finite, in which case $E[X] = E[X^+] - E[X^-]$; and $X$ is integrable if and only if $E[\,|X|\,] < \infty$. The integral $\int_A X\, dP$ over a set $A$ is defined, as before, as $E[I_A X]$. In the case of simple random variables, the definition reduces to that used in Sections 5 through 9.
Expected Values and Limits

The theorems on integration to the limit in Section 16 apply. A useful fact: If the random variables $X_n$ are dominated by an integrable random variable, or if they are uniformly integrable, then $E[X_n] \to E[X]$ follows if $X_n$ converges to $X$ in probability; convergence with probability 1 is not necessary. This follows easily from Theorem 20.5.
Expected Values and Distributions

Suppose that $X$ has distribution $\mu$. If $g$ is a real function of a real variable, then by the change-of-variable formula (16.17),

(21.1)    $E[g(X)] = \int_{-\infty}^{\infty} g(x)\, \mu(dx).$

(In applying (16.17), replace $T\colon \Omega \to \Omega'$ by $X\colon \Omega \to R^1$, $\mu$ by $P$, $\mu T^{-1}$ by $\mu$, and $f$ by $g$.) This formula holds in the sense explained in Theorem 16.13: It holds in the nonnegative case, so that

(21.2)    $E[\,|g(X)|\,] = \int_{-\infty}^{\infty} |g(x)|\, \mu(dx);$

if one side is infinite, then so is the other. And if the two sides of (21.2) are finite, then (21.1) holds.

If $\mu$ is discrete and $\mu\{x_1, x_2, \ldots\} = 1$, then (21.1) becomes (use Theorem 16.9)

(21.3)    $E[g(X)] = \sum_r g(x_r)\, \mu\{x_r\}.$

If $X$ has density $f$, then (21.1) becomes (use Theorem 16.11)

(21.4)    $E[g(X)] = \int_{-\infty}^{\infty} g(x) f(x)\, dx.$

If $F$ is the distribution function of $X$ and $\mu$, (21.1) can be written $E[g(X)] = \int_{-\infty}^{\infty} g(x)\, dF(x)$ in the notation (17.22).
Moments

By (21.2), $\mu$ and $F$ determine all the absolute moments of $X$:

(21.5)    $E[\,|X|^k\,] = \int_{-\infty}^{\infty} |x|^k\, \mu(dx) = \int_{-\infty}^{\infty} |x|^k\, dF(x), \qquad k = 1, 2, \ldots.$

Since $|x|^j \le 1 + |x|^k$ for $j \le k$, if $X$ has a finite absolute moment of order $k$, then it has finite absolute moments of orders $1, 2, \ldots, k - 1$ as well. For each $k$ for which (21.5) is finite, $X$ has $k$th moment

(21.6)    $E[X^k] = \int_{-\infty}^{\infty} x^k\, \mu(dx) = \int_{-\infty}^{\infty} x^k\, dF(x).$

These quantities are also referred to as the moments of $\mu$ and of $F$. They can be computed by (21.3) and (21.4) in the appropriate circumstances.
Example 21.1. Consider the normal density (20.12) with $m = 0$ and $\sigma = 1$. For each $k$, $x^k e^{-x^2/2}$ goes to 0 exponentially as $x \to \pm\infty$, and so finite moments of all orders exist. Integration by parts shows that

$\frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} x^k e^{-x^2/2}\, dx = (k - 1) \cdot \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} x^{k-2} e^{-x^2/2}\, dx, \qquad k = 2, 3, \ldots.$

(Apply (18.16) to $g(x) = x^{k-2}$ and $f(x) = x e^{-x^2/2}$, and let $a \to -\infty$, $b \to \infty$.) Of course, $E[X] = 0$ by symmetry and $E[X^0] = 1$. It follows by induction that

(21.7)    $E[X^{2k}] = 1 \times 3 \times 5 \times \cdots \times (2k - 1), \qquad k = 1, 2, \ldots,$
and that the odd moments all vanish. ∎

If the first two moments of $X$ are finite and $E[X] = m$, then, just as in Section 5, the variance is

(21.8)    $\operatorname{Var}[X] = E[(X - m)^2] = \int_{-\infty}^{\infty} (x - m)^2\, \mu(dx).$

From Example 21.1 and a change of variable, it follows that a random variable with the normal density (20.12) has mean $m$ and variance $\sigma^2$.

Consider for nonnegative $X$ the relation

(21.9)    $E[X] = \int_0^{\infty} P[X > t]\, dt = \int_0^{\infty} P[X \ge t]\, dt.$

Since $P[X = t]$ can be positive for at most countably many values of $t$, the two integrands differ only on a set of Lebesgue measure 0, and hence the integrals are the same. For $X$ simple and nonnegative, (21.9) was proved in Section 5; see (5.29). For the general nonnegative $X$, let $X_n$ be simple random variables for which $0 \le X_n \uparrow X$ (see (20.1)). By the monotone convergence theorem, $E[X_n] \uparrow E[X]$; moreover, $P[X_n > t] \uparrow P[X > t]$, and therefore $\int_0^{\infty} P[X_n > t]\, dt \uparrow \int_0^{\infty} P[X > t]\, dt$, again by the monotone convergence theorem. Since (21.9) holds for each $X_n$, a passage to the limit establishes (21.9) for $X$ itself. Note that both sides of (21.9) may be infinite; if the integral on the right is finite, then $X$ is integrable.

Replacing $X$ by $X I_{[X > a]}$ leads from (21.9) to

(21.10)    $\int_{[X > a]} X\, dP = a P[X > a] + \int_a^{\infty} P[X > t]\, dt, \qquad a \ge 0.$

As long as $a \ge 0$, this holds even if $X$ is not nonnegative.
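For a simple nonnegative $X$, both sides of (21.9) can be computed exactly, since $P[X > t]$ is then a step function. A sketch with an assumed three-point distribution:

```python
# Check E[X] = integral of P[X > t] dt for a simple nonnegative X;
# X takes value v with probability p.
dist = [(0.5, 1.0), (0.3, 2.5), (0.2, 4.0)]  # (probability, value) pairs

mean = sum(p * v for p, v in dist)

# P[X > t] is constant between consecutive values of X, so the
# integral is a finite sum of rectangle areas.
cuts = [0.0] + sorted(v for _, v in dist)
tail_integral = sum((b - a) * sum(p for p, v in dist if v > a)
                    for a, b in zip(cuts, cuts[1:]))
```

Both `mean` and `tail_integral` come out to $2.05$, illustrating the identity before any limit passage is needed.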
Inequalities

Since the final term in (21.10) is nonnegative,

$a P[X \ge a] \le \int_{[X \ge a]} X\, dP \le E[X]$

for nonnegative $X$. Thus

(21.12)    $P[X \ge a] \le \frac{1}{a} \int_{[X \ge a]} X\, dP \le \frac{1}{a}\, E[X], \qquad a > 0,$

for nonnegative $X$. It is the inequality between the two extreme terms here that usually goes under the name of Markov; but the left-hand inequality is often useful, too. As a special case (apply (21.12) to $(X - m)^2$) there is Chebyshev's inequality,

(21.13)    $P[\,|X - m| \ge a\,] \le \frac{1}{a^2} \operatorname{Var}[X]$

($m = E[X]$).

Jensen's inequality

(21.14)    $\varphi(E[X]) \le E[\varphi(X)]$

holds if $\varphi$ is convex on an interval containing the range of $X$ and if $X$ and $\varphi(X)$ both have expected values. To prove it, let $L(x) = ax + b$ be a supporting line through $(E[X], \varphi(E[X]))$, a line lying entirely under the graph of $\varphi$ [A33]. Then $aX(\omega) + b \le \varphi(X(\omega))$, so that $aE[X] + b \le E[\varphi(X)]$. But the left side of this inequality is $\varphi(E[X])$.

Hölder's inequality is

(21.15)    $E[\,|XY|\,] \le E^{1/p}[\,|X|^p\,]\, E^{1/q}[\,|Y|^q\,], \qquad \frac{1}{p} + \frac{1}{q} = 1,\quad p, q > 1.$

For discrete random variables, this was proved in Section 5; see (5.35). For the general case, choose simple random variables $X_n$ and $Y_n$ satisfying $0 \le |X_n| \uparrow |X|$ and $0 \le |Y_n| \uparrow |Y|$; see (20.2). Then (5.35) and the monotone convergence theorem give (21.15). Notice that (21.15) implies that if $|X|^p$ and $|Y|^q$ are integrable, then so is $XY$.

Schwarz's inequality is the case $p = q = 2$:

(21.16)    $E[\,|XY|\,] \le E^{1/2}[X^2]\, E^{1/2}[Y^2].$

If $X$ and $Y$ have second moments, then $XY$ must have a first moment.
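Markov's inequality (21.12) and Jensen's inequality (21.14) hold in particular for the empirical measure of any finite sample, so they can be verified exactly (not just approximately) on simulated data. A sketch, with an exponential sample as an assumed example:

```python
import random

rng = random.Random(1)
xs = [rng.expovariate(1.0) for _ in range(50_000)]  # nonnegative sample
n = len(xs)

a = 2.0
mean = sum(xs) / n
second_moment = sum(x * x for x in xs) / n
tail = sum(1 for x in xs if x >= a) / n

# Markov (21.12) for the empirical distribution: P[X >= a] <= E[X]/a.
# Jensen (21.14) with phi(x) = x^2: (E[X])^2 <= E[X^2].
```

Here `tail <= mean / a` and `mean ** 2 <= second_moment` hold for every sample, because the inequalities are valid for any probability measure, the empirical one included.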
The same reasoning shows that Lyapounov's inequality (5.37) carries over from the simple to the general case.

Joint Integrals

The relation (21.1) extends to random vectors. Suppose that $(X_1, \ldots, X_k)$ has distribution $\mu$ in $k$-space and $g\colon R^k \to R^1$ is measurable. By Theorem 16.13,

(21.17)    $E[g(X_1, \ldots, X_k)] = \int_{R^k} g(x)\, \mu(dx),$

with the usual provisos about infinite values. For example, $E[X_i X_j] = \int_{R^k} x_i x_j\, \mu(dx)$. If $E[X_i] = m_i$, the covariance of $X_i$ and $X_j$ is $\operatorname{Cov}[X_i, X_j] = E[(X_i - m_i)(X_j - m_j)] = E[X_i X_j] - m_i m_j$. Random variables are uncorrelated if they have covariance 0.
Independence and Expected Value

Suppose that $X$ and $Y$ are independent. If they are also simple, then $E[XY] = E[X]E[Y]$, as proved in Section 5; see (5.25). Define $X_n$ by (20.2), and similarly define $Y_n = \psi_n(Y^+) - \psi_n(Y^-)$. Then $X_n$ and $Y_n$ are independent and simple, so that $E[\,|X_n Y_n|\,] = E[\,|X_n|\,]\, E[\,|Y_n|\,]$, and $0 \le |X_n| \uparrow |X|$, $0 \le |Y_n| \uparrow |Y|$. If $X$ and $Y$ are integrable, then $E[\,|X_n|\,]\, E[\,|Y_n|\,] \uparrow E[\,|X|\,]\, E[\,|Y|\,]$, and it follows by the monotone convergence theorem that $E[\,|XY|\,] = E[\,|X|\,]\, E[\,|Y|\,] < \infty$; since $X_n Y_n \to XY$ and $|X_n Y_n| \le |XY|$, it follows further by the dominated convergence theorem that $E[XY] = \lim_n E[X_n Y_n] = \lim_n E[X_n]\, E[Y_n] = E[X]\, E[Y]$. Therefore, $XY$ is integrable if $X$ and $Y$ are (which is by no means true for dependent random variables), and $E[XY] = E[X]E[Y]$. This argument obviously extends inductively: If $X_1, \ldots, X_k$ are independent and integrable, then the product $X_1 \cdots X_k$ is also integrable and

(21.18)    $E[X_1 X_2 \cdots X_k] = E[X_1]\, E[X_2] \cdots E[X_k].$

Moment Generating Functions

The moment generating function of $X$ (equivalently, of its distribution $\mu$) is

(21.21)    $M(s) = E[e^{sX}] = \int_{-\infty}^{\infty} e^{sx}\, \mu(dx),$

defined for all $s$ for which the integral is finite; in nonprobabilistic contexts it is the Laplace transform of $\mu$. (The integral may be finite only at $s = 0$; this happens, for example, if $\mu\{n\} = \mu\{-n\} = C/n^2$ for $n = 1, 2, \ldots$.) Suppose that $M(s)$ is defined in an interval $(-s_0, s_0)$ with $s_0 > 0$. Since $e^{|sx|} \le e^{sx} + e^{-sx}$, $E[e^{|sX|}] < \infty$ for $|s| < s_0$; and since $\sum_{k=0}^{\infty} |sx|^k/k! = e^{|sx|}$, the series for $e^{sX}$ can be integrated term by term: $X$ has finite moments of all orders, and

(21.22)    $M(s) = \sum_{k=0}^{\infty} \frac{s^k}{k!}\, E[X^k], \qquad |s| < s_0.$

Thus $M(s)$ has a Taylor expansion about 0 with positive radius of convergence if it is defined in some $(-s_0, s_0)$, $s_0 > 0$. If $M(s)$ can somehow be calculated and expanded in a series $\sum_k a_k s^k$, and if the coefficients $a_k$ can be identified, then, since by (21.22) $a_k = E[X^k]/k!$, the moments of $X$ can be computed: $E[X^k] = k!\, a_k$. It also follows from the theory of Taylor expansions [A29] that $E[X^k]$ is the $k$th derivative $M^{(k)}(s)$ evaluated at $s = 0$:

(21.23)    $M^{(k)}(0) = E[X^k] = \int_{-\infty}^{\infty} x^k\, \mu(dx).$

This holds if $M(s)$ exists in some neighborhood of 0.

Suppose now that $M$ is defined in some neighborhood of $s$. If $\nu$ has density $e^{sx}/M(s)$ with respect to $\mu$ (see (16.11)), then $\nu$ has moment generating function $N(u) = M(s + u)/M(s)$ for $u$ in some neighborhood of 0. By (21.23), $N^{(k)}(0) = \int_{-\infty}^{\infty} x^k\, \nu(dx) = \int_{-\infty}^{\infty} x^k e^{sx}\, \mu(dx)/M(s)$, and since $N^{(k)}(0) = M^{(k)}(s)/M(s)$,

(21.24)    $M^{(k)}(s) = \int_{-\infty}^{\infty} x^k e^{sx}\, \mu(dx).$

This holds as long as the moment generating function exists in some neighborhood of $s$; if $s = 0$, it gives (21.23) again. Taking $k = 2$ shows that $M(s)$ is convex in its interval of definition.
Example 21.2. For the standard normal density,

$M(s) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{sx} e^{-x^2/2}\, dx,$

and a change of variable (complete the square and put $x = u + s$) gives

(21.25)    $M(s) = e^{s^2/2} \cdot \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-u^2/2}\, du = e^{s^2/2}.$

The moment generating function is in this case defined for all $s$. Since

$e^{s^2/2} = \sum_{k=0}^{\infty} \frac{1}{k!} \Big(\frac{s^2}{2}\Big)^k = \sum_{k=0}^{\infty} \frac{1 \times 3 \times \cdots \times (2k - 1)}{(2k)!}\, s^{2k},$

the moments can be read off from (21.22), which proves (21.7) once more. ∎

Example 21.3. In the exponential case (20.10), the moment generating function is

(21.26)    $M(s) = \int_0^{\infty} e^{sx}\, \alpha e^{-\alpha x}\, dx = \frac{\alpha}{\alpha - s},$

defined for $s < \alpha$. By (21.22) the $k$th moment is $k!\, \alpha^{-k}$. The mean and variance are thus $\alpha^{-1}$ and $\alpha^{-2}$. ∎

Example 21.4. For the Poisson distribution (20.7),

(21.27)    $M(s) = \sum_{k=0}^{\infty} e^{sk}\, \frac{e^{-\lambda} \lambda^k}{k!} = e^{-\lambda} e^{\lambda e^s} = \exp[\lambda(e^s - 1)].$

Since $M'(s) = \lambda e^s M(s)$ and $M''(s) = (\lambda^2 e^{2s} + \lambda e^s) M(s)$, the first two moments are $M'(0) = \lambda$ and $M''(0) = \lambda^2 + \lambda$; the mean and variance are both $\lambda$. ∎
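The relations $M'(0) = E[X]$ and $M''(0) = E[X^2]$ from (21.23) can be checked by numerical differentiation of (21.27). The sketch below is illustrative, with $\lambda = 3$ an assumed value:

```python
import math

lam = 3.0
M = lambda s: math.exp(lam * (math.exp(s) - 1.0))  # Poisson mgf (21.27)

h = 1e-5
M1 = (M(h) - M(-h)) / (2.0 * h)            # central difference ~ M'(0) = lam
M2 = (M(h) - 2.0 * M(0.0) + M(-h)) / h**2  # second difference ~ M''(0) = lam^2 + lam
```

Here `M1` is close to 3 and `M2` to 12, so the variance `M2 - M1**2` is close to 3; mean and variance both equal $\lambda$, as in Example 21.4.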
Let $X_1, \ldots, X_k$ be independent random variables, and suppose that each $X_i$ has a moment generating function $M_i(s) = E[e^{sX_i}]$ in $(-s_0, s_0)$. For $|s| < s_0$, each $\exp(sX_i)$ is integrable, and, since they are independent, their product $\exp\{s \sum_{i=1}^k X_i\}$ is also integrable (see (21.18)). The moment generating function of $X_1 + \cdots + X_k$ is therefore

(21.28)    $M(s) = M_1(s) \cdots M_k(s)$

in $(-s_0, s_0)$. This relation for simple random variables was essential to the arguments in Section 9.

For simple random variables it was shown in Section 9 that the moment generating function determines the distribution. This will later be proved for general random variables; see Theorem 22.2 for the nonnegative case and Section 30 for the general case.

PROBLEMS
21.1. Prove that

$\frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-t x^2/2}\, dx = t^{-1/2},$

differentiate $k$ times with respect to $t$ inside the integral (justify), and derive (21.7) again.

21.2. Show that, if $X$ has the standard normal distribution, then $E[\,|X|^{2n+1}\,] = 2^n n! \sqrt{2/\pi}$.

21.3. 20.9 ↑ Records. Consider the sequence of records in the sense of Problem 20.9. Show that the expected waiting time to the next record is infinite.

21.4. 20.14 ↑ Show that the Cauchy distribution has no mean.

21.5. Prove the first Borel–Cantelli lemma by applying Theorem 16.6 to indicator random variables. Why is Theorem 16.6 not enough for the second Borel–Cantelli lemma?

21.6. Prove (21.9) by Fubini's theorem.

21.7. Prove for integrable $X$ that

$E[X] = \int_0^{\infty} P[X > t]\, dt - \int_{-\infty}^0 P[X < t]\, dt.$
21.8. (a) Suppose that $X$ and $Y$ have first moments, and prove

$E[Y] - E[X] = \int_{-\infty}^{\infty} \big(P[X \le t < Y] - P[Y \le t < X]\big)\, dt.$

(b) Let $(X, Y]$ be a nondegenerate random interval. Show that its expected length is the integral with respect to $t$ of the probability that it covers $t$.

21.9. Suppose that $X$ and $Y$ are random variables with distribution functions $F$ and $G$.
(a) Show that if $F$ and $G$ have no common jumps, then $E[F(Y)] + E[G(X)] = 1$.
(b) If $F$ is continuous, then $E[F(X)] = \frac{1}{2}$.
(c) Even if $F$ and $G$ have common jumps, if $X$ and $Y$ are taken to be independent, then $E[F(Y)] + E[G(X)] = 1 + P[X = Y]$.
(d) Even if $F$ has jumps, $E[F(X)] = \frac{1}{2} + \frac{1}{2} \sum_x P^2[X = x]$.

21.10. (a) Show that uncorrelated random variables need not be independent.
(b) Show that $\operatorname{Var}\big[\sum_{i=1}^n X_i\big] = \sum_{i,j=1}^n \operatorname{Cov}[X_i, X_j] = \sum_{i=1}^n \operatorname{Var}[X_i] + 2 \sum_{1 \le i < j \le n} \operatorname{Cov}[X_i, X_j]$. The cross terms drop out if the $X_i$ are uncorrelated, and hence drop out if they are independent.

21.11. ↑ Let $X$, $Y$, and $Z$ be independent random variables such that $X$ and $Y$ assume the values 0, 1, 2 with probability $\frac{1}{3}$ each and $Z$ assumes the values 0 and 1 with probabilities $\frac{1}{3}$ and $\frac{2}{3}$. Let $X' = X$ and $Y' = X + Z$ (mod 3).
(a) Show that $X'$, $Y'$, and $X' + Y'$ have the same one-dimensional distributions as $X$, $Y$, and $X + Y$, respectively, even though $(X', Y')$ and $(X, Y)$ have different distributions.
(b) Show that $X'$ and $Y'$ are dependent but uncorrelated.
(c) Show that, despite dependence, the moment generating function of $X' + Y'$ is the product of the moment generating functions of $X'$ and $Y'$.

21.12. Suppose that $X$ and $Y$ are independent, nonnegative random variables and that $E[X] = \infty$ and $E[Y] = 0$. What is the value common to $E[XY]$ and $E[X]E[Y]$? Use the conventions (15.2) for both the product of the random variables and the product of their expected values. What if $E[X] = \infty$ and $0 < E[Y] < \infty$?

21.13. Suppose that $X$ and $Y$ are independent and that $f(x, y)$ is nonnegative. Put $g(x) = E[f(x, Y)]$ and show that $E[g(X)] = E[f(X, Y)]$. Show more generally that $\int_{[X \in A]} g(X)\, dP = \int_{[X \in A]} f(X, Y)\, dP$. Extend to $f$ that may be negative.

21.14. ↑ The integrability of $X + Y$ does not imply that of $X$ and $Y$ separately. Show that it does if $X$ and $Y$ are independent.

21.15. 20.25 ↑ Write $d_1(X, Y) = E[\,|X - Y|/(1 + |X - Y|)\,]$. Show that this is a metric equivalent to the one in Problem 20.25.

21.16. For the density $C e^{-|x|^{1/2}}$, $-\infty < x < \infty$, show that moments of all orders exist but that the moment generating function exists only at $s = 0$.

21.17. 16.6 ↑ Show that a moment generating function $M(s)$ defined in $(-s_0, s_0)$, $s_0 > 0$, can be extended to a function analytic in the strip $[z\colon -s_0 < \operatorname{Re} z < s_0]$. If $M(s)$ is defined in $[0, s_0)$, $s_0 > 0$, show that it can be extended to a function continuous in $[z\colon 0 \le \operatorname{Re} z < s_0]$ and analytic in $[z\colon 0 < \operatorname{Re} z < s_0]$.

21.18. Use (21.28) to find the moment generating function of (20.39).

21.19. For independent random variables having moment generating functions, show by (21.28) that the variances add.

21.20. 20.17 ↑ Show that the gamma density (20.47) has moment generating function $(1 - s/\alpha)^{-u}$ for $s < \alpha$. Show that the $k$th moment is $u(u + 1) \cdots (u + k - 1)/\alpha^k$. Show that the chi-squared distribution with $n$ degrees of freedom has mean $n$ and variance $2n$.

21.21. Let $X_1, X_2, \ldots$ be identically distributed random variables with finite second moment. Show that $n P[\,|X_1| \ge \epsilon \sqrt{n}\,] \to 0$ and $n^{-1/2} \max_{k \le n} |X_k| \to_P 0$.
SECTION 22. SUMS OF INDEPENDENT RANDOM VARIABLES
Let $X_1, X_2, \ldots$ be a sequence of independent random variables on some probability space. It is natural to ask whether the infinite series $\sum_{k=1}^{\infty} X_k$ converges with probability 1, or, as in Section 6, whether $n^{-1} \sum_{k=1}^{n} X_k$ converges to some limit with probability 1. It is to questions of this sort that the present section is devoted. Throughout the section, $S_n$ will denote the partial sum $\sum_{k=1}^{n} X_k$ ($S_0 = 0$).
The central result is a general version of Theorem 6. 1.
If X 1 , X2 ,
are independent and identically distributed and have finite mean, then S,/n £[ X1] with probability 1. Theorem 22.1.
•
• •
-7
Formerly this theorem stood at the end of a chain of results. The following argument, due to Etemadi, proceeds from first principles. If the theorem holds for nonnegative random variables, then 1 n - L� = 1X:- n - 1 L:� = 1 X; E[XtJ - E[X;l £[X1] with proba bility 1. Assume then that Xk > 0. Consider the truncated random variables Yk Xk l[xk k l and their partial sums S,� L� = 1Yk . For a > 1, temporarily fixed, let u, l a " J. The first step is to prove S, � E [ s: J (22.1 ) u,
PROOF. n - • s, =
=
-7
=
=
11
n
s
=
Since the X_n are independent and identically distributed,

Var[S_n*] = Σ_{k=1}^n Var[Y_k] ≤ Σ_{k=1}^n E[Y_k²] = Σ_{k=1}^n E[X_1² I_{[X_1 ≤ k]}] ≤ n E[X_1² I_{[X_1 ≤ n]}].

It follows by Chebyshev's inequality that the sum in (22.1) is at most

Σ_n (ε u_n)^{−2} Var[S*_{u_n}] ≤ ε^{−2} E[ X_1² Σ_n u_n^{−1} I_{[X_1 ≤ u_n]} ].

Let K = 2α/(α − 1), and suppose x > 0. If N is the smallest n such that u_N ≥ x, then α^N ≥ x, and since ⌊y⌋ ≥ y/2 for y ≥ 1,

Σ_{n: u_n ≥ x} u_n^{−1} ≤ 2 Σ_{n≥N} α^{−n} = 2α^{−N} · α/(α − 1) ≤ Kx^{−1}.

Therefore, Σ_{n=1}^∞ u_n^{−1} I_{[X_1 ≤ u_n]} ≤ K X_1^{−1} for X_1 > 0, and the sum in (22.1) is at most Kε^{−2} E[X_1] < ∞.

From (22.1) it follows by the first Borel–Cantelli lemma (take a union over positive, rational ε) that (S*_{u_n} − E[S*_{u_n}])/u_n → 0 with probability 1. But by the consistency of Cesàro summation [A30], n^{−1} E[S_n*] = n^{−1} Σ_{k=1}^n E[Y_k] has the same limit as E[Y_n], namely E[X_1]. Therefore S*_{u_n}/u_n → E[X_1] with probability 1.
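The limit asserted in Theorem 22.1 is easy to watch numerically. The following is a small illustrative sketch, not part of the text: the X_k are taken exponential with parameter 1 (so E[X_1] = 1, a choice made only for concreteness), and S_n/n is computed for several n.

```python
import random

# Illustrative sketch of Theorem 22.1 (strong law of large numbers):
# for i.i.d. X_k with finite mean, S_n / n should approach E[X_1].
# Here the X_k are exponential with parameter 1, so E[X_1] = 1.
rng = random.Random(7)

def sample_average(n, rng):
    # One realization of S_n / n.
    return sum(rng.expovariate(1.0) for _ in range(n)) / n

averages = {n: sample_average(n, rng) for n in (10, 1000, 100000)}
error_large_n = abs(averages[100000] - 1.0)
```

With 100000 summands the average is typically within a few hundredths of the mean, while for small n it can wander widely; the almost-sure convergence itself is, of course, what the theorem asserts.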
(22.4) M(s) = ∫_{[0,∞)} e^{−sx} μ(dx), s ≥ 0;

here 0 is included in the range of integration. This is the moment generating function (21.21), but the argument has been reflected through the origin. It is a one-sided Laplace transform, defined for all nonnegative s. For positive s, (21.24) gives

(22.5) M^{(k)}(s) = (−1)^k ∫_{[0,∞)} y^k e^{−sy} μ(dy).

Therefore, for positive x and s,†

(22.6) Σ_{k=0}^{⌊sx⌋} (−1)^k (s^k/k!) M^{(k)}(s) = ∫_{[0,∞)} ( Σ_{k=0}^{⌊sx⌋} e^{−sy} (sy)^k/k! ) μ(dy).

Fix x > 0. If 0 ≤ y < x, then the integrand on the right converges to 1 as s → ∞ by (22.3); if y > x, the limit is 0. If μ{x} = 0, the integrand on the right in (22.6) thus converges as s → ∞ to I_{[0,x]}(y) except on a set of μ-measure 0. The bounded convergence theorem then gives

(22.7) lim_{s→∞} Σ_{k=0}^{⌊sx⌋} (−1)^k (s^k/k!) M^{(k)}(s) = μ[0, x].

† If y = 0, the integrand in (22.5) is 1 for k = 0 and 0 for k ≥ 1; hence for y = 0 the integrand on the right in (22.6) is 1.
RANDOM VARIABLES AND EXPECTED VALUES
Thus M(s) determines the value of F at x if x > 0 and μ{x} = 0, which covers all but countably many values of x in [0, ∞). Since F is right-continuous, F itself and hence μ are determined through (22.7) by M(s). In fact μ is by (22.7) determined by the values of M(s) for s beyond an arbitrary s_0:

Theorem 22.2. Let μ and ν be probability measures on [0, ∞). If

∫_{[0,∞)} e^{−sx} μ(dx) = ∫_{[0,∞)} e^{−sx} ν(dx), s ≥ s_0,

where s_0 ≥ 0, then μ = ν.

Corollary. Let f_1 and f_2 be real functions on [0, ∞). If

∫_0^∞ e^{−sx} f_1(x) dx = ∫_0^∞ e^{−sx} f_2(x) dx, s ≥ s_0,

where s_0 ≥ 0, then f_1 = f_2 outside a set of Lebesgue measure 0. The f_i need not be nonnegative, and they need not be integrable, but e^{−sx} f_i(x) must be integrable over [0, ∞) for s ≥ s_0.

PROOF. For the nonnegative case, apply the theorem to the probability densities g_i(x) = e^{−s_0 x} f_i(x)/m, where m = ∫_0^∞ e^{−s_0 x} f_i(x) dx, i = 1, 2. For the general case, prove that f_1^+ + f_2^− = f_2^+ + f_1^− almost everywhere. •

Example 22.1. If μ_1 * μ_2 = μ_3, then the corresponding transforms (22.4) satisfy M_1(s)M_2(s) = M_3(s) for s ≥ 0. If μ_i is the Poisson distribution with mean λ_i, then (see (21.27)) M_i(s) = exp[λ_i(e^{−s} − 1)]. It follows by Theorem 22.2 that if two of the μ_i are Poisson, so is the third, and λ_1 + λ_2 = λ_3. •
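Example 22.1 can be checked numerically. The sketch below is illustrative (λ₁ = 1.5, λ₂ = 2.5, and s = 0.7 are arbitrary choices): it computes the transform M(s) = Σ_n e^{−sn} μ{n} of Poisson laws by truncating the series, and verifies both the closed form exp[λ(e^{−s} − 1)] and the product rule M_1(s)M_2(s) = M_3(s) for the convolution.

```python
import math

def poisson_pmf(lam, n):
    return math.exp(-lam) * lam**n / math.factorial(n)

def transform(lam, s, terms=60):
    # M(s) = sum_n e^{-s n} P[X = n] for X ~ Poisson(lam); the tail
    # beyond 60 terms is negligible for the small lam used here.
    return sum(math.exp(-s * n) * poisson_pmf(lam, n) for n in range(terms))

lam1, lam2, s = 1.5, 2.5, 0.7
m1, m2 = transform(lam1, s), transform(lam2, s)
m3 = transform(lam1 + lam2, s)          # transform of the convolution
closed = math.exp(lam1 * (math.exp(-s) - 1))
product_error = abs(m1 * m2 - m3)
closed_form_error = abs(m1 - closed)
```

The product rule holds exactly in theory; numerically the two errors are at the level of series truncation and rounding.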
Kolmogorov's Zero-One Law

Consider the set A of ω for which n^{−1} Σ_{k=1}^n X_k(ω) → 0 as n → ∞. For each m, the values of X_1(ω), ..., X_{m−1}(ω) are irrelevant to the question of whether or not ω lies in A, and so A ought to lie in the σ-field σ(X_m, X_{m+1}, ...). In fact, lim_n n^{−1} Σ_{k=1}^{m−1} X_k(ω) = 0 for fixed m, and hence ω lies in A if and only if lim_n n^{−1} Σ_{k=m}^n X_k(ω) = 0. Therefore,

(22.8) A = ⋂_ε ⋃_{N≥m} ⋂_{n≥N} [ω: |n^{−1} Σ_{k=m}^n X_k(ω)| < ε],

the first intersection extending over positive rational ε. The set on the inside
lies in σ(X_m, X_{m+1}, ...), and hence so does A. Similarly, the ω-set where the series Σ_n X_n(ω) converges lies in each σ(X_m, X_{m+1}, ...).

The intersection 𝒯 = ⋂_{n=1}^∞ σ(X_n, X_{n+1}, ...) is the tail σ-field associated with the sequence X_1, X_2, ...; its elements are tail events. In the case X_n = I_{A_n}, this is the σ-field (4.29) studied in Section 4. The following general form of Kolmogorov's zero-one law extends Theorem 4.5.

Theorem 22.3. Suppose that {X_n} is independent and that A ∈ 𝒯 = ⋂_{n=1}^∞ σ(X_n, X_{n+1}, ...). Then either P(A) = 0 or P(A) = 1.
PROOF. Let ℱ_0 = ⋃_{k=1}^∞ σ(X_1, ..., X_k). The first thing to establish is that ℱ_0 is a field generating the σ-field σ(X_1, X_2, ...). If B and C lie in ℱ_0, then B ∈ σ(X_1, ..., X_j) and C ∈ σ(X_1, ..., X_k) for some j and k; if m = max{j, k}, then B and C both lie in σ(X_1, ..., X_m), so that B ∪ C ∈ σ(X_1, ..., X_m) ⊂ ℱ_0. Thus ℱ_0 is closed under the formation of finite unions; since it is similarly closed under complementation, ℱ_0 is a field. For H ∈ ℛ^1, [X_n ∈ H] ∈ ℱ_0 ⊂ σ(ℱ_0), and hence X_n is measurable σ(ℱ_0); thus ℱ_0 generates σ(X_1, X_2, ...) (which in general is much larger than ℱ_0).

Suppose that A lies in 𝒯. Then A lies in σ(X_{k+1}, X_{k+2}, ...) for each k. Therefore, if B ∈ σ(X_1, ..., X_k), then A and B are independent by Theorem 20.2. Therefore, A is independent of ℱ_0 and hence by Theorem 4.2 is also independent of σ(X_1, X_2, ...). But then A is independent of itself: P(A ∩ A) = P(A)P(A). Therefore, P(A) = P²(A), which implies that P(A) is either 0 or 1. •
As noted above, the set where Σ_n X_n(ω) converges satisfies the hypothesis of Theorem 22.3, and so does the set where n^{−1} Σ_{k=1}^n X_k(ω) → 0. In many similar cases it is very easy to prove by this theorem that a set at hand must have probability either 0 or 1. But to determine which of 0 and 1 is, in fact, the probability of the set may be extremely difficult.

Maximal Inequalities

Essential to the study of random series are maximal inequalities, inequalities concerning the maxima of partial sums. The best known is that of Kolmogorov.
Theorem 22.4. Suppose that X_1, ..., X_n are independent with mean 0 and finite variances. For α > 0,

(22.9) P[ max_{1≤k≤n} |S_k| ≥ α ] ≤ (1/α²) Var[S_n].
PROOF. Let A_k be the set where |S_k| ≥ α but |S_j| < α for j < k. Since the A_k are disjoint,

E[S_n²] ≥ Σ_{k=1}^n ∫_{A_k} S_n² dP = Σ_{k=1}^n ∫_{A_k} [S_k² + 2S_k(S_n − S_k) + (S_n − S_k)²] dP ≥ Σ_{k=1}^n ∫_{A_k} [S_k² + 2S_k(S_n − S_k)] dP.

Since A_k and S_k are measurable σ(X_1, ..., X_k) and S_n − S_k is measurable σ(X_{k+1}, ..., X_n), and since the means are all 0, it follows by (21.19) and independence that ∫_{A_k} S_k(S_n − S_k) dP = 0. Therefore,

E[S_n²] ≥ Σ_{k=1}^n ∫_{A_k} S_k² dP ≥ α² Σ_{k=1}^n P(A_k) = α² P[ max_{1≤k≤n} |S_k| ≥ α ]. •

By Chebyshev's inequality, P[|S_n| ≥ α] ≤ α^{−2} Var[S_n]. That this can be strengthened to (22.9) is an instance of a general phenomenon: For sums of independent variables, if max_{k≤n} |S_k| is large, then |S_n| is probably large as well. Theorem 9.6 is an instance of this, and so is the following result, due to Etemadi.

Theorem 22.5. Suppose that X_1, ..., X_n are independent. For α ≥ 0,

(22.10) P[ max_{1≤k≤n} |S_k| ≥ 3α ] ≤ 3 max_{1≤k≤n} P[ |S_k| ≥ α ].

PROOF. Let B_k be the set where |S_k| ≥ 3α but |S_j| < 3α for j < k. Since the B_k are disjoint,

P[ max_{1≤k≤n} |S_k| ≥ 3α ] ≤ P[|S_n| ≥ α] + Σ_{k=1}^{n−1} P(B_k ∩ [|S_n| < α])
≤ P[|S_n| ≥ α] + Σ_{k=1}^{n−1} P(B_k ∩ [|S_n − S_k| > 2α])
= P[|S_n| ≥ α] + Σ_{k=1}^{n−1} P(B_k) P[|S_n − S_k| > 2α]
≤ P[|S_n| ≥ α] + max_{1≤k≤n} P[|S_n − S_k| > 2α]
≤ P[|S_n| ≥ α] + max_{1≤k≤n} (P[|S_n| ≥ α] + P[|S_k| ≥ α])
≤ 3 max_{1≤k≤n} P[ |S_k| ≥ α ]. •
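Both inequalities are easy to probe by simulation. The following illustrative sketch (a ±1 random walk with n = 50 steps and α = 10, values chosen arbitrarily) estimates the left side of Kolmogorov's inequality (22.9) by Monte Carlo and compares it with the bound Var[S_n]/α².

```python
import random

rng = random.Random(12)

# Monte Carlo check of Kolmogorov's inequality (22.9) for a +/-1 walk:
# P[max_k |S_k| >= alpha] <= Var[S_n] / alpha^2, and Var[S_n] = n here.
n, trials, alpha = 50, 20000, 10.0
hits = 0
for _ in range(trials):
    s, peak = 0, 0
    for _ in range(n):
        s += rng.choice((-1, 1))
        peak = max(peak, abs(s))
    if peak >= alpha:
        hits += 1

p_max = hits / trials
bound = n / alpha**2   # = 0.5
```

The empirical probability comes out well under the bound, and noticeably larger than P[|S_n| ≥ α] alone, which is the point of working with the maximum.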
If the X_k have mean 0 and Chebyshev's inequality is applied to the right side of (22.10), and if α is replaced by α/3, the result is Kolmogorov's inequality (22.9) with an extra factor of 27 on the right side. For this reason, the two inequalities are equally useful for the applications in this section.

Convergence of Random Series
For independent X_n, the probability that ΣX_n converges is either 0 or 1. It is natural to try to characterize the two cases in terms of the distributions of the individual X_n.

Theorem 22.6. Suppose that {X_n} is an independent sequence and E[X_n] = 0. If Σ Var[X_n] < ∞, then ΣX_n converges with probability 1.
PROOF. By (22.9),

P[ max_{1≤k≤r} |S_{n+k} − S_n| ≥ ε ] ≤ ε^{−2} Σ_{k=n+1}^{n+r} Var[X_k].

Since the sets on the left are nondecreasing in r, letting r → ∞ gives

P[ sup_{k≥1} |S_{n+k} − S_n| ≥ ε ] ≤ ε^{−2} Σ_{k=n+1}^∞ Var[X_k].

Since Σ Var[X_n] converges,

(22.11) lim_n P[ sup_{k≥1} |S_{n+k} − S_n| ≥ ε ] = 0

for each ε. Let E(n, ε) be the set where sup_{j,k≥n} |S_j − S_k| ≥ 2ε, and put E(ε) = ⋂_n E(n, ε). Then E(n, ε) ↓ E(ε), and (22.11) implies P(E(ε)) = 0. Now ⋃_ε E(ε), where the union extends over positive rational ε, contains the set where the sequence {S_n} is not fundamental (does not have the Cauchy property), and this set therefore has probability 0. •

Example 22.2. Let X_n(ω) = r_n(ω)a_n, where the r_n are the Rademacher functions on the unit interval (see (1.13)). Then X_n has variance a_n², and so Σ a_n² < ∞ implies that Σ r_n(ω)a_n converges with probability 1. An interesting special case is a_n = n^{−1}. If the signs in Σ ± n^{−1} are chosen on the toss of a coin, then the series converges with probability 1. The alternating harmonic series 1 − 2^{−1} + 3^{−1} − ··· is thus typical in this respect. •
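The random harmonic series of Example 22.2 can be watched settling down. The sketch below is illustrative (one sample path, fixed seed): since the tail of the series past N has variance Σ_{n>N} 1/n² < 1/N, partial sums recorded at 10000 and 20000 terms should differ very little.

```python
import random

rng = random.Random(3)

# One sample path of the random signed harmonic series sum_n r_n / n
# (Example 22.2).  The variances are summable, so by Theorem 22.6 the
# partial sums converge with probability 1.
s = 0.0
s_at_10000 = None
for n in range(1, 20001):
    s += rng.choice((-1.0, 1.0)) / n
    if n == 10000:
        s_at_10000 = s
s_at_20000 = s
late_fluctuation = abs(s_at_20000 - s_at_10000)
```

The standard deviation of the increment between terms 10000 and 20000 is about 0.007, so the late partial sums agree to roughly two decimal places on a typical path.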
If ΣX_n converges with probability 1, then S_n converges with probability 1 to some finite random variable S. By Theorem 20.5, this implies that S_n →_P S. The reverse implication of course does not hold in general, but it does if the summands are independent.
Theorem 22.7. For an independent sequence {X_n}, the S_n converge with probability 1 if and only if they converge in probability.
PROOF. It is enough to show that if S_n →_P S, then {S_n} is fundamental with probability 1. Since

|S_{n+j} − S_n| ≤ |S_{n+j} − S| + |S_n − S|,

S_n →_P S implies

(22.12) lim_n sup_{j≥1} P[ |S_{n+j} − S_n| ≥ ε ] = 0.

But by (22.10),

P[ max_{1≤k≤j} |S_{n+k} − S_n| ≥ ε ] ≤ 3 max_{1≤k≤j} P[ |S_{n+k} − S_n| ≥ ε/3 ],

and therefore

P[ sup_{k≥1} |S_{n+k} − S_n| ≥ ε ] ≤ 3 sup_{k≥1} P[ |S_{n+k} − S_n| ≥ ε/3 ].

It now follows by (22.12) that (22.11) holds, and the proof is completed as before. •

The final result in this direction, the three-series theorem, provides necessary and sufficient conditions for the convergence of ΣX_n in terms of the individual distributions of the X_n. Let X_n^{(c)} be X_n truncated at c: X_n^{(c)} = X_n I_{[|X_n| ≤ c]}.
Theorem 22.8. Suppose that {X_n} is independent, and consider the three series

(22.13) Σ_n P[|X_n| > c], Σ_n E[X_n^{(c)}], Σ_n Var[X_n^{(c)}].

In order that ΣX_n converge with probability 1 it is necessary that the three series converge for all positive c and sufficient that they converge for some positive c.
PROOF OF SUFFICIENCY. Suppose that the series (22.13) converge, and put m_n^{(c)} = E[X_n^{(c)}]. By Theorem 22.6, Σ(X_n^{(c)} − m_n^{(c)}) converges with probability 1, and since Σ m_n^{(c)} converges, so does Σ X_n^{(c)}. Since P[X_n ≠ X_n^{(c)} i.o.] = 0 by the first Borel–Cantelli lemma, it follows finally that ΣX_n converges with probability 1. •
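A concrete instance of the three series (22.13), evaluated numerically below as an illustrative sketch (not from the text): X_n = Z_n/n with Z_n independent standard normal and truncation level c = 1. All three series converge, so ΣX_n converges with probability 1.

```python
import math

# The three series (22.13) for X_n = Z_n / n, Z_n standard normal, c = 1:
# P[|X_n| > 1] = P[|Z| > n] = erfc(n / sqrt(2)); E[X_n^{(c)}] = 0 by
# symmetry; and Var[X_n^{(c)}] <= E[X_n^2] = 1 / n^2.
def tail_prob(n):
    return math.erfc(n / math.sqrt(2.0))

series1 = sum(tail_prob(n) for n in range(1, 200))       # converges fast
series2 = 0.0                                            # identically zero
series3_bound = sum(1.0 / n**2 for n in range(1, 200))   # < pi^2 / 6
```

The first series is dominated by its first term (about 0.32); the normal tails decay so quickly that the remaining terms contribute almost nothing.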
Although it is possible to prove necessity in the three-series theorem by the methods of the present section, the simplest and clearest argument uses the central limit theorem as treated in Section 27. This involves no circularity of reasoning, since the three-series theorem is nowhere used in what follows.

PROOF OF NECESSITY. Suppose that ΣX_n converges with probability 1, and fix c > 0. Since X_n → 0 with probability 1, it follows that ΣX_n^{(c)} converges with probability 1 and, by the second Borel–Cantelli lemma, that

Σ P[|X_n| > c] < ∞.

Let m_n^{(c)} and s_n^{(c)} be the mean and standard deviation of S_n^{(c)} = Σ_{k=1}^n X_k^{(c)}. If s_n^{(c)} → ∞, then since the X_n^{(c)} − m_n^{(c)} are uniformly bounded, it follows by the central limit theorem (see Example 27.4) that

(22.14) lim_n P[ (S_n^{(c)} − m_n^{(c)})/s_n^{(c)} ≤ x ] = (2π)^{−1/2} ∫_{−∞}^x e^{−u²/2} du.

Consider an open disk D = [z: |z − ζ| < r]. The function (22.17) coincides in D_0 with a function analytic in D_0 ∪ D if and only if its expansion about ζ converges at least for |z − ζ| < r. Let A_D be the set of ω for which this holds. The coefficient a_m(ω) in the expansion Σ_m a_m(ω)(z − ζ)^m is a complex-valued random variable measurable σ(X_m, X_{m+1}, ...). By the root test, ω ∈ A_D if and only if lim sup_m |a_m(ω)|^{1/m} ≤ r^{−1}. For each m_0, the condition for ω ∈ A_D can thus be expressed in terms of a_{m_0}(ω), a_{m_0+1}(ω), ... alone, and so A_D lies in σ(X_{m_0}, X_{m_0+1}, ...). Thus A_D has a probability, and in fact P(A_D) is 0 or 1 by the zero-one law.

Of course, P(A_D) = 1 if D ⊂ D_0. The central step in the proof is to show that P(A_D) = 0 if D contains points not in D_0. Assume on the contrary that P(A_D) = 1 for such a D. Consider that part of the circumference of the unit circle that lies in D, and let k be an integer large enough that this arc has length exceeding 2π/k. Define

Y_n = X_n if n ≢ 0 (mod k), Y_n = −X_n if n ≡ 0 (mod k).

Let B_D be the ω-set where the function

(22.19) G(ω, z) = Σ_{n=0}^∞ Y_n(ω) z^n

coincides in D_0 with a function analytic in D_0 ∪ D. The sequence {Y_0, Y_1, ...} has the same structure as the original sequence: the Y_n are independent and assume the values ±1 with probability ½ each. Since B_D is defined in terms of the Y_n in the same way as A_D is defined in terms of the X_n, it is intuitively clear that P(B_D) and P(A_D) must be the same. Assume for the moment the truth of this statement, which is somewhat more obvious than its proof.

If for a particular ω each of (22.17) and (22.19) coincides in D_0 with a function analytic in D_0 ∪ D, the same must be true of

(22.20) F(ω, z) − G(ω, z) = 2 Σ_{m=0}^∞ X_{mk}(ω) z^{mk}.

Let D_l = [z e^{2πil/k}: z ∈ D]. Since replacing z by z e^{2πi/k} leaves the function (22.20) unchanged, it can be extended analytically to each D_0 ∪ D_l, l = 1, 2, .... Because of the choice of k, it can therefore be extended analytically to [z: |z| < 1 + ε] for some positive ε; but this is impossible if (22.18) holds, since the radius of convergence must then be 1. Therefore, A_D ∩ B_D cannot contain a point ω satisfying (22.18). Since (22.18) holds with probability 1, this rules out the possibility P(A_D) = P(B_D) = 1 and by the zero-one law leaves only the possibility P(A_D) = P(B_D) = 0.

Let A be the ω-set where (22.17) extends to a function analytic in some open set larger than D_0. Then ω ∈ A if and only if (22.17) extends to D_0 ∪ D for some D = [z: |z − ζ| < r] for which D − D_0 ≠ ∅, r is rational, and ζ has rational real and imaginary parts; in other words, A is the countable union of A_D for such D. Therefore, A lies in σ(X_0, X_1, ...) and has probability 0.

It remains only to show that P(A_D) = P(B_D), and this is most easily done by comparing {X_n} and {Y_n} with a canonical sequence having the same structure. Put Z_n(ω) = (X_n(ω) + 1)/2, and let Tω be Σ_{n=0}^∞ Z_n(ω) 2^{−n−1} on the ω-set A* where this sum lies in (0, 1]; on Ω − A* let Tω be 1, say. Because of (22.16), P(A*) = 1. Let ℱ = σ(X_0, X_1, ...) and let ℬ be the σ-field of Borel subsets of (0, 1]; then T: Ω → (0, 1] is measurable ℱ/ℬ. Let r_n(x) be the nth Rademacher function. If M = [x: r_i(x) = u_i, i = 1, ..., n], where u_i = ±1 for each i, then P(T^{−1}M) = P[ω: X_{i−1}(ω) = u_i, i = 1, ..., n] = 2^{−n}, which is the Lebesgue measure λ(M) of M. Since these sets form a π-system generating ℬ, P(T^{−1}M) = λ(M) for all M in ℬ (Theorem 3.3).

Let M_D be the set of x for which Σ_{n=0}^∞ r_{n+1}(x) z^n extends analytically to D_0 ∪ D. Then M_D lies in ℬ, this being a special case of the fact that A_D lies in ℱ. Moreover, if ω ∈ A*, then ω ∈ A_D if and only if Tω ∈ M_D: A* ∩ A_D = A* ∩ T^{−1}M_D. Since P(A*) = 1, it follows that P(A_D) = λ(M_D). This argument uses only (22.16), and therefore it applies to {Y_n} and B_D as well. Therefore, P(B_D) = λ(M_D) = P(A_D). •
PROBLEMS

22.1. Suppose that X_1, X_2, ... is an independent sequence and Y is measurable σ(X_n, X_{n+1}, ...) for each n. Show that there exists a constant a such that P[Y = a] = 1.

22.2. Assume {X_n} independent, and define X_n^{(c)} as in Theorem 22.8. Prove that for Σ|X_n| to converge with probability 1 it is necessary that ΣP[|X_n| > c] and ΣE[|X_n^{(c)}|] converge for all positive c and sufficient that they converge for some positive c. If the three series (22.13) converge but ΣE[|X_n^{(c)}|] = ∞, then there is probability 1 that ΣX_n converges conditionally but not absolutely.

22.3. ↑ Generalize the Borel–Cantelli lemmas: Suppose X_n are nonnegative. (a) If ΣE[X_n] < ∞, then ΣX_n converges with probability 1. If the X_n are independent and uniformly bounded, and if ΣE[X_n] = ∞, then ΣX_n diverges with probability 1. (b) Construct independent, nonnegative X_n such that ΣX_n converges with probability 1 but ΣE[X_n] diverges. For an extreme example, arrange that P[X_n > 0 i.o.] = 0 but E[X_n] → ∞.

22.4. Show under the hypothesis of Theorem 22.6 that ΣX_n has finite variance, and extend Theorem 22.4 to infinite sequences.

22.5. 20.14 22.1 ↑ Suppose that X_1, X_2, ... are independent, each with the Cauchy distribution (20.45) for a common value of u. (a) Show that n^{−1} Σ_{k=1}^n X_k does not converge with probability 1. Contrast with Theorem 22.1. (b) Show that P[n^{−1} max_{k≤n} X_k ≤ x] → e^{−u/πx} for x > 0. Relate to Theorem 14.3.

22.6. If X_1, X_2, ... are independent and identically distributed, and if P[X_1 ≥ 0] = 1 and P[X_1 > 0] > 0, then ΣX_n = ∞ with probability 1. Deduce this from Theorem 22.1 and its corollary and also directly: find a positive ε such that X_n > ε infinitely often with probability 1.

22.7. Suppose that X_1, X_2, ... are independent and identically distributed and E[|X_1|] = ∞. Use (21.9) to show that Σ_n P[|X_n| ≥ an] = ∞ for each a, and conclude that sup_n n^{−1}|X_n| = ∞ with probability 1. Now show that sup_n n^{−1}|S_n| = ∞ with probability 1. Compare with the corollary to Theorem 22.1.

22.8. Wald's equation. Let X_1, X_2, ... be independent and identically distributed with finite mean, and put S_n = X_1 + ··· + X_n. Suppose that τ is a stopping time: τ has positive integers as values and [τ = n] ∈ σ(X_1, ..., X_n); see Section 7 for examples. Suppose also that E[τ] < ∞. (a) Prove that

(22.21) E[S_τ] = E[X_1]E[τ].

(b) Suppose that X_n is ±1 with probabilities p and q, p ≠ q, let τ be the first n for which S_n is −a or b (a and b positive integers), and calculate E[τ]. This gives the expected duration of the game in the gambler's ruin problem for unequal p and q.

22.9. 20.9 ↑ Let Z_n be 1 or 0 according as at time n there is or is not a record in the sense of Problem 20.9. Let R_n = Z_1 + ··· + Z_n be the number of records up to time n. Show that R_n/log n →_P 1.

22.10. 22.1 ↑ (a) Show that for an independent sequence {X_n} the radius of convergence of the random Taylor series Σ_n X_n z^n is r with probability 1 for some nonrandom r. (b) Suppose that the X_n have the same distribution and P[X_1 ≠ 0] > 0. Show that r is 1 or 0 according as log⁺|X_1| has finite mean or not.

22.11. Suppose that X_0, X_1, ... are independent and each is uniformly distributed over [0, 2π]. Show that with probability 1 the series Σ_n e^{iX_n} z^n has the unit circle as its natural boundary.

22.12. Prove (what is essentially Kolmogorov's zero-one law) that if A is independent of a π-system 𝒫 and A ∈ σ(𝒫), then P(A) is either 0 or 1.
22.13. Suppose that 𝒜 is a semiring containing Ω. (a) Show that if P(A ∩ B) ≤ bP(B) for all B ∈ 𝒜, and if b < 1 and A ∈ σ(𝒜), then P(A) = 0. (b) Show that if P(A ∩ B) ≤ P(A)P(B) for all B ∈ 𝒜, and if A ∈ σ(𝒜), then P(A) is 0 or 1. (c) Show that if aP(B) ≤ P(A ∩ B) for all B ∈ 𝒜, and if a > 0 and A ∈ σ(𝒜), then P(A) = 1. (d) Show that if P(A)P(B) ≤ P(A ∩ B) for all B ∈ 𝒜, and if A ∈ σ(𝒜), then P(A) is 0 or 1. (e) Reconsider Problem 3.20.

22.14. Burstin's theorem. Let f be a Borel function on [0, 1] with arbitrarily small periods: for each ε there is a p such that 0 < p < ε and f(x) = f(x + p) for 0 ≤ x ≤ 1 − p. Show that such an f is constant almost everywhere.

SECTION 23. THE POISSON PROCESS

... if S_n(ω) = inf[t: N_t(ω) ≥ n] and the variables are defined by X_n(ω) = S_n(ω) − S_{n−1}(ω), then (23.3) and (23.4) hold, and the definition (23.5) gives back the original N_t. Therefore, anything that can be said about the X_n can be stated in terms of the N_t, and conversely. The points S_1(ω), S_2(ω), ... of (0, ∞) are exactly the discontinuities of N_t(ω) as a function of t; because of the queueing example, it is natural to call them arrival times. The program is to study the joint distributions of the N_t under conditions on the waiting times X_n and vice versa. The most common model specifies the independence of the waiting times and the absence of aftereffect:
Condition 1°. The X_n are independent, and each is exponentially distributed with parameter α.
In this case P[X_n > 0] = 1 for each n and n^{−1}S_n → α^{−1} by the strong law of large numbers (Theorem 22.1), and so (23.3) and (23.4) hold with probability 1; to assume they hold everywhere (Condition 0°) is simply a convenient normalization. Under Condition 1°, S_n has the distribution function specified by (20.40), so that P[N_t ≥ n] = Σ_{i≥n} e^{−αt}(αt)^i/i! by (23.6), and

(23.8) P[N_t = n] = e^{−αt} (αt)^n/n!, n = 0, 1, ....

Thus N_t has the Poisson distribution with mean αt. More will be proved in a moment.
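(23.8) can be illustrated by simulating the waiting-time model directly. The sketch below is illustrative (α = 2, t = 3, and the trial count are arbitrary choices): generate exponential waiting times, count the arrivals in [0, t], and compare with the Poisson probabilities.

```python
import random
import math

rng = random.Random(1)

# Simulate N_t under Condition 1: exponential(alpha) waiting times,
# N_t = max[n : X_1 + ... + X_n <= t].  (23.8) says N_t ~ Poisson(alpha t).
alpha, t, trials = 2.0, 3.0, 20000

def arrivals_by(t, rng):
    total, n = 0.0, 0
    while True:
        total += rng.expovariate(alpha)
        if total > t:
            return n
        n += 1

counts = [arrivals_by(t, rng) for _ in range(trials)]
mean_count = sum(counts) / trials          # should be near alpha * t = 6
empirical_p0 = counts.count(0) / trials    # should be near e^{-alpha t}
poisson_p0 = math.exp(-alpha * t)
```

Both the sample mean and the frequency of "no arrivals by time t" land close to the Poisson(αt) values.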
Condition 2°. (i) For 0 ≤ t_1 ≤ ··· ≤ t_k the increments N_{t_1}, N_{t_2} − N_{t_1}, ..., N_{t_k} − N_{t_{k−1}} are independent. (ii) The individual increments have the Poisson distribution:

(23.9) P[N_t − N_s = n] = e^{−α(t−s)} (α(t − s))^n/n!, n = 0, 1, ..., 0 ≤ s < t.
Since N_0 = 0, (23.8) is a special case of (23.9). A collection [N_t: t ≥ 0] of random variables satisfying Condition 2° is called a Poisson process, and α is the rate of the process. As the increments are independent by (i), if r < s < t, then the distributions of N_s − N_r and N_t − N_s must convolve to that of N_t − N_r. But the requirement is consistent with (ii) because Poisson distributions with parameters u and v convolve to a Poisson distribution with parameter u + v.

Theorem 23.1. Conditions 1° and 2° are equivalent in the presence of Condition 0°.
PROOF OF 1° → 2°. Fix t, and consider the events that happen after time t. By (23.5), S_{N_t} ≤ t < S_{N_t+1}, and the waiting time from t to the first event following t is S_{N_t+1} − t; the waiting time between the first and second events following t is X_{N_t+2}; and so on. Thus

(23.10) X_1^{(t)} = S_{N_t+1} − t, X_2^{(t)} = X_{N_t+2}, X_3^{(t)} = X_{N_t+3}, ...

define the waiting times following t. By (23.6), N_{t+s} − N_t ≥ m, or N_{t+s} ≥ N_t + m, if and only if S_{N_t+m} ≤ t + s, which is the same thing as X_1^{(t)} + ··· + X_m^{(t)} ≤ s. Thus

(23.11) N_{t+s} − N_t = max[m: X_1^{(t)} + ··· + X_m^{(t)} ≤ s].

Hence [N_{t+s} − N_t = m] = [X_1^{(t)} + ··· + X_m^{(t)} ≤ s < X_1^{(t)} + ··· + X_{m+1}^{(t)}]. A comparison of (23.11) and (23.5) shows that for fixed t the random variables N_{t+s} − N_t for s ≥ 0 are defined in terms of the sequence (23.10) in exactly the same way as the N_s are defined in terms of the original sequence of waiting times.

The idea now is to show that conditionally on the event [N_t = n] the random variables (23.10) are independent and exponentially distributed. Because of the independence of the X_k and the basic property (23.2) of the exponential distribution, this seems intuitively clear. For a proof, apply (20.30). Suppose y ≥ 0; if G_n is the distribution function of S_n, then since X_{n+1} has the exponential distribution,

P[N_t = n, X_1^{(t)} > y] = P[S_n ≤ t, S_{n+1} > t + y]
= ∫_{x≤t} P[X_{n+1} > t + y − x] dG_n(x)
= e^{−αy} ∫_{x≤t} P[X_{n+1} > t − x] dG_n(x)
= P[S_n ≤ t < S_{n+1}] e^{−αy} = P[N_t = n] e^{−αy}.

By the assumed independence of the X_n, a similar calculation gives

P[N_t = n, X_1^{(t)} > y_1, ..., X_j^{(t)} > y_j] = P[N_t = n] e^{−αy_1} ··· e^{−αy_j},

and therefore

(23.12) P[N_t = n, (X_1^{(t)}, ..., X_j^{(t)}) ∈ H] = P[N_t = n] P[(X_1, ..., X_j) ∈ H].
By Theorem 10.4, the equation extends from H of the special form above to all H in ℛ^j. Now the event [N_{s_i} = m_i, 1 ≤ i ≤ u] can be put in the form [(X_1, ..., X_j) ∈ H], where j = m_u + 1 and H is the set of x in R^j for which x_1 + ··· + x_{m_i} ≤ s_i < x_1 + ··· + x_{m_i+1}, 1 ≤ i ≤ u. But then [(X_1^{(t)}, ..., X_j^{(t)}) ∈ H] is by (23.11) the same as the event [N_{t+s_i} − N_t = m_i, 1 ≤ i ≤ u]. Thus (23.12) gives

P[N_t = n, N_{t+s_i} − N_t = m_i, 1 ≤ i ≤ u] = P[N_t = n] P[N_{s_i} = m_i, 1 ≤ i ≤ u].

From this it follows by induction on k that if 0 = t_0 ≤ t_1 ≤ ··· ≤ t_k, then the increments N_{t_i} − N_{t_{i−1}} are independent and each satisfies (23.9).

PROOF OF 2° → 1°. First, P[X_1 > t] = P[N_t = 0] = e^{−αt}, so that X_1 is exponentially distributed. To find the joint distribution of X_1 and X_2, suppose that 0 < s_1 < t_1 < s_2 < t_2 and perform the calculation
P[s_1 < S_1 ≤ t_1, s_2 < S_2 ≤ t_2]
= P[N_{s_1} = 0, N_{t_1} − N_{s_1} = 1, N_{s_2} − N_{t_1} = 0, N_{t_2} − N_{s_2} ≥ 1]
= e^{−αs_1} × α(t_1 − s_1)e^{−α(t_1−s_1)} × e^{−α(s_2−t_1)} × (1 − e^{−α(t_2−s_2)}).

In the case where r_n = n and p_{nk} = λ/n, (23.15) is the Poisson approximation to the binomial. Note that if λ > 0, then (23.14) implies r_n → ∞.
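The Poisson approximation to the binomial mentioned here can be quantified numerically. The sketch below is illustrative (λ = 2 is an arbitrary choice): it computes the total variation distance between Binomial(n, λ/n) and Poisson(λ) and checks that it shrinks as n grows.

```python
import math

def binom_pmf(n, p, k):
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(lam, k):
    # Log form avoids overflow in lam**k / k! for large k.
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

lam = 2.0

def total_variation(n):
    # (1/2) sum_k |Binomial(n, lam/n){k} - Poisson(lam){k}|; the Poisson
    # mass above n (where the binomial puts none) is added separately.
    d = 0.5 * sum(abs(binom_pmf(n, lam / n, k) - poisson_pmf(lam, k))
                  for k in range(n + 1))
    d += 0.5 * sum(poisson_pmf(lam, k) for k in range(n + 1, n + 100))
    return d

tv_small = total_variation(10)
tv_large = total_variation(1000)
```

For n = 1000 the distance is below the classical bound λ²/n = 0.004, consistent with the approximation becoming exact in the limit.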
PROOF. The argument depends on a construction like that in the proof of Theorem 20.4. Let U_1, U_2, ... be independent random variables, each uniformly distributed over [0, 1). For each p, 0 ≤ p ≤ 1, split [0, 1) into the two intervals I_0(p) = [0, 1 − p) and I_1(p) = [1 − p, 1), as well as into the sequence of intervals J_i(p) = [Σ_{j<i} e^{−p}p^j/j!, Σ_{j≤i} e^{−p}p^j/j!), i = 0, 1, .... Define V_{nk} = 1 if U_k ∈ I_1(p_{nk}) and V_{nk} = 0 if U_k ∈ I_0(p_{nk}) ...

(23.17) P[S_n = t] = 0, t > 0, n ≥ 1;
that is, each of S_1, S_2, ... has a continuous distribution function. Of course there is probability 1 (under Condition 0°) that N_t(ω) has a discontinuity somewhere (and indeed has infinitely many of them). But (23.17) ensures that a t specified in advance has probability 0 of being a discontinuity, or time of an arrival. The Poisson process satisfies this natural condition.

Theorem 23.3. If Condition 0° holds and [N_t: t ≥ 0] has independent increments and no fixed discontinuities, then each increment has a Poisson distribution.

This is Prekopa's theorem. The conclusion is not that [N_t: t ≥ 0] is a Poisson process, because the mean of N_t − N_s need not be proportional to t − s. If φ is an arbitrary nondecreasing, continuous function on [0, ∞) and φ(0) = 0, and if [N_t: t ≥ 0] is a Poisson process, then [N_{φ(t)}: t ≥ 0] satisfies the hypotheses of the theorem.

(23.19) Σ_{k=1}^{r_n} P[N_{t_{nk}} − N_{t_{n,k−1}} ≥ 1] → λ

for some finite λ, and, second,

(23.20) max_{1≤k≤r_n} P[N_{t_{nk}} − N_{t_{n,k−1}} ≥ 1] → 0.

Third,

(23.21) P[ max_{1≤k≤r_n} (N_{t_{nk}} − N_{t_{n,k−1}}) ≥ 2 ] → 0.
Once the partitions have been constructed, the rest of the proof is easy: Let Z_{nk} be 1 or 0 according as N_{t_{nk}} − N_{t_{n,k−1}} is positive or not. Since [N_t: t ≥ 0] has independent increments, the Z_{nk} are independent for each n. By Theorem 23.2, therefore, (23.19) and (23.20) imply that Z_n = Σ_{k=1}^{r_n} Z_{nk} satisfies P[Z_n = i] → e^{−λ}λ^i/i!. Now N_{t''} − N_{t'} ≥ Z_n, and there is strict inequality if and only if N_{t_{nk}} − N_{t_{n,k−1}} ≥ 2 for some k. Thus (23.21) implies P[N_{t''} − N_{t'} ≠ Z_n] → 0, and therefore P[N_{t''} − N_{t'} = i] = e^{−λ}λ^i/i!.

To construct the partitions, consider for each t the distance D_t = inf_m |t − S_m| from t to the nearest arrival time. Since S_m → ∞, the infimum is achieved. Further, D_t = 0 if and only if S_m = t for some m, and since by hypothesis there are no fixed discontinuities, the probability of this is 0: P[D_t = 0] = 0. Choose δ_t so that 0 < δ_t < n^{−1} and P[D_t ≤ δ_t] < n^{−1}. The intervals (t − δ_t, t + δ_t) for t' ≤ t ≤ t'' cover [t', t'']. Choose a finite subcover, and in (23.18) take the t_{nk} for 0 < k < r_n to be the endpoints (of intervals in the subcover) that are contained in (t', t''). By the construction,

(23.22) max_{1≤k≤r_n} (t_{nk} − t_{n,k−1}) < 2n^{−1},

and the probability that (t_{n,k−1}, t_{nk}] contains some S_m is less than n^{−1}. This gives a sequence of partitions satisfying (23.20). Inserting more points in a partition cannot increase the maxima in (23.20) and (23.22), and so it can be arranged that each partition refines the preceding one.

To prove (23.21) it is enough (Theorem 4.1) to show that the limit superior of the sets involved has probability 0. It is in fact empty: If for infinitely many n, N_{t_{nk}}(ω) − N_{t_{n,k−1}}(ω) ≥ 2 holds for some k ≤ r_n, then by (23.22), N_t(ω) as a function of t has in [t', t''] discontinuity points (arrival times) arbitrarily close together, which requires S_m(ω) ∈ [t', t''] for infinitely many m, in violation of Condition 0°.

It remains to prove (23.19). If Z_{nk} and Z_n are defined as above and p_{nk} = P[Z_{nk} = 1], then the sum in (23.19) is Σ_k p_{nk} = E[Z_n]. Since Z_{n+1} ≥ Z_n, Σ_k p_{nk} is nondecreasing in n. Now

P[N_{t''} − N_{t'} = 0] = P[Z_{nk} = 0, k ≤ r_n] = Π_{k=1}^{r_n} (1 − p_{nk}) ≤ exp[ −Σ_{k=1}^{r_n} p_{nk} ].

If the left-hand side here is positive, this puts an upper bound on Σ_k p_{nk}, and (23.19) follows. But suppose P[N_{t''} − N_{t'} = 0] = 0. If s is the midpoint of t' and t'', then since the increments are independent, one of P[N_s − N_{t'} = 0] and P[N_{t''} − N_s = 0] must vanish. It is therefore possible to find a nested sequence of intervals [u_m, v_m] such that v_m − u_m → 0 and the event A_m = [N_{v_m} − N_{u_m} ≥ 1] has probability 1. But then P(⋂_m A_m) = 1, and if t is the point common to the [u_m, v_m], there is an arrival at t with probability 1, contrary to the assumption that there are no fixed discontinuities. •
Theorem 23.3 in some cases makes the Poisson model quite plausible. The increments will be essentially independent if the arrivals to time s cannot seriously deplete the population of potential arrivals, so that N_s has for t > s negligible effect on N_t − N_s. And the condition that there are no fixed discontinuities is entirely natural. These conditions hold for arrivals of calls at a telephone exchange if the rate of calls is small in comparison with the population of subscribers and calls are not placed at fixed, predetermined times. If the arrival rate is essentially constant, this leads to the following condition.
Condition 3°. (i) For 0 ≤ t_1 ≤ ··· ≤ t_k the increments N_{t_1}, N_{t_2} − N_{t_1}, ..., N_{t_k} − N_{t_{k−1}} are independent. (ii) The distribution of N_t − N_s depends only on the difference t − s.

Theorem 23.4. Conditions 1°, 2°, and 3° are equivalent in the presence of Condition 0°.
PROOF. Obviously Condition 2° implies 3°. Suppose that Condition 3° holds. If J_t is the saltus at t (J_t = N_t − sup_{s<t} N_s), then [N_t − N_{t−n^{−1}} ≥ 1] ↓ [J_t ≥ 1], and it follows by (ii) of Condition 3° that P[J_t ≥ 1] is the same for all t. But if the value common to the P[J_t ≥ 1] is positive, then by the independence of the increments and the second Borel–Cantelli lemma there is probability 1 that J_t ≥ 1 for infinitely many rational t in (0, 1), for example, which contradicts Condition 0°.
By Theorem 23.3, then, the increments have Poisson distributions. If f(t) is the mean of N_t, then N_t − N_s for s < t must have mean f(t) − f(s) and must by (ii) have mean f(t − s); thus f(t) = f(s) + f(t − s). Therefore, f satisfies Cauchy's functional equation [A20] and, being nondecreasing, must have the form f(t) = αt for α ≥ 0. Condition 0° makes α = 0 impossible. •
Condition 4°. If G < t 1 < then
· · ·
< tk and if n 1 , . . . , n k are nonnegative integers,
( 23.23) and (23.24) as h � 0. Moreover, [ N, : t > 0] has no fixed discontinuities. The occurrences of o(h) in (23.23) and (23.24) denote functions, say 0 and It - sl < n- 1 , then
As n oo, the right side here decreases to the probability of a discontinuity at t, which is 0 by hypothesis. Thus P[N, = n] is continuous at t. The same kind of argument works for conditional probabilities and for t = 0, and so Pn(t) is continuous for t > 0. To simplify the notation, put D, = N, k + r - N,k . If D, + h = n, then D, = m for some m < n. If t > 0, then by the rules for conditional probabilities, �
Pn( t + h) = Pn( t )P [ Dr + h - D, = OIA n [ D, = n ]) + Pn - 1 ( t ) P [ D, +11 - D, -= 1 I A n [ D, = n - 1 ] ] n- 2 + L Pm( t )P[ Dr + h - D, = n - m i A n [ D, = m l ] . m=O
For n < 1, the final sum is absent, and for n = 0, the middle term is absent as well. This holds in the case Pn (t) = P[N, = n] if D, = N, and A = fl. (If t = 0, some of the conditioning events here are empty; hence the assumption t > 0.) By (23.24), the final sum is o(h) for each fixed n. Applying {23.23) and (23.24) now leads to
Pn( t + h ) = pn( t)(1 - ah) +pn_ 1 ( t)ah + o( h ) , and letting h � 0 gives
( 23 .26) In the case n = 0, take p _ 1(t) to be identically 0. In (23.26), t > 0 and p�(t) is a right-hand derivative. But since Pn(t) and the right side of the equation are continuous on [0, oo), (23.26) holds also for t = 0 and p�(t) can be taken as a two-sided derivative for t > 0 [A22]. Now (23.26) gives [A23]
Since p_n(0) is 1 or 0 as n = 0 or n > 0, (23.25) follows by induction on n.
•
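As a quick sanity check, not part of the text, the solution p_n(t) = e^{−at}(at)^n/n! can be verified against the system (23.26) by finite differences; the rate a = 2 and the evaluation point are arbitrary choices:

```python
import math

def p(n, t, a=2.0):
    # Claimed solution (23.25): p_n(t) = e^{-at} (at)^n / n!
    return math.exp(-a * t) * (a * t) ** n / math.factorial(n)

a, t, h = 2.0, 1.3, 1e-6
for n in range(6):
    deriv = (p(n, t + h, a) - p(n, t - h, a)) / (2 * h)  # numerical p_n'(t)
    rhs = -a * p(n, t, a) + (a * p(n - 1, t, a) if n >= 1 else 0.0)  # (23.26)
    assert abs(deriv - rhs) < 1e-5, (n, deriv, rhs)
print("ok")
```

The initial conditions p_0(0) = 1 and p_n(0) = 0 for n ≥ 1 hold by inspection of the formula.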
RANDOM VARIABLES AND EXPECTED VALUES
Stochastic Processes
The Poisson process [N_t: t ≥ 0] is one example of a stochastic process, that is, a collection of random variables (on some probability space (Ω, 𝓕, P)) indexed by a parameter regarded as representing time. In the Poisson case, time is continuous. In some cases the time is discrete: Section 7 concerns the sequence {F_n} of a gambler's fortunes; there n represents time, but time that increases in jumps.

Part of the structure of a stochastic process is specified by its finite-dimensional distributions. For any finite sequence t_1, …, t_k of time points, the k-dimensional random vector (N_{t_1}, …, N_{t_k}) has a distribution μ_{t_1⋯t_k} over R^k. These measures μ_{t_1⋯t_k} are the finite-dimensional distributions of the process. Condition 2° of this section in effect specifies them for the Poisson case:
(23.27)  P[N_{t_j} = n_j, j ≤ k] = Π_{j=1}^k e^{−a(t_j − t_{j−1})} (a(t_j − t_{j−1}))^{n_j − n_{j−1}} / (n_j − n_{j−1})!

if 0 ≤ n_1 ≤ ⋯ ≤ n_k and 0 ≤ t_1 < ⋯ < t_k (take n_0 = t_0 = 0). The finite-dimensional distributions do not, however, contain all the mathematically interesting information about the process in the case of continuous time. Because of (23.3), (23.4), and the definition (23.5), for each fixed ω, N_t(ω) as a function of t has the regularity properties given in the second version of Condition 0°. These properties are used in an essential way in the proofs. Suppose that f(t) is t or 0 according as t is rational or irrational. Let N_t be defined as before, and let
(23.28)  M_t(ω) = N_t(ω) + f(t + X_1(ω)).

If R is the set of rationals, then P[ω: f(t + X_1(ω)) ≠ 0] = P[ω: X_1(ω) ∈ R − t] = 0 for each t because R − t is countable and X_1 has a density. Thus P[M_t = N_t] = 1 for each t, and so the stochastic process [M_t: t ≥ 0] has the same finite-dimensional distributions as [N_t: t ≥ 0]. For ω fixed, however, M_t(ω) as a function of t is everywhere discontinuous and is neither monotone nor exclusively integer-valued.

The functions obtained by fixing ω and letting t vary are called the path functions or sample paths of the process. The example above shows that the finite-dimensional distributions do not suffice to determine the character of the path functions. In specifying a stochastic process as a model for some phenomenon, it is natural to place conditions on the character of the sample paths as well as on the finite-dimensional distributions. Condition 0° was imposed throughout this section to ensure that the sample paths are nondecreasing, right-continuous, integer-valued step functions, a natural condition if N_t is to represent the number of events in [0, t].

Stochastic processes in continuous time are studied further in Chapter 7.
SECTION 23. THE POISSON PROCESS
PROBLEMS
Assume the Poisson processes here satisfy Condition 0° as well as Condition 1°.

23.1. Show that the minimum of independent exponential waiting times is again exponential and that the parameters add.
23.2. 20.17 ↑ Show that the time S_n of the nth event in a Poisson stream has the gamma density f(x; a, n) as defined by (20.47). This is sometimes called the Erlang density.

23.3. Let A_t = t − S_{N_t} be the time back to the most recent event in the Poisson stream (or to 0), and let B_t = S_{N_t + 1} − t be the time forward to the next event. Show that A_t and B_t are independent, that B_t is distributed as X_1 (exponentially with parameter a), and that A_t is distributed as min{X_1, t}: P[A_t ≤ x] is 0, 1 − e^{−ax}, or 1 as x < 0, 0 ≤ x < t, or x ≥ t.
23.4. ↑ Let L_t = A_t + B_t = S_{N_t + 1} − S_{N_t} be the length of the interarrival interval covering t.
(a) Show that L_t has density

f(x) = a²x e^{−ax} if 0 < x ≤ t,  f(x) = a(1 + at) e^{−ax} if x > t.

(b) Show that E[L_t] converges to 2E[X_1] as t → ∞. This seems paradoxical because L_t is one of the X_n. Give an intuitive resolution of the apparent paradox.
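The "paradox" in this problem is easy to see by simulation. The following sketch (not from the text; the rate, horizon, and seed are arbitrary) estimates E[L_t] and finds it near 2E[X_1] = 2/a rather than E[X_1] = 1/a, because the interval covering t is sampled with length bias:

```python
import random

# Simulate the interarrival interval L_t = S_{N_t + 1} - S_{N_t} covering a
# fixed time t, for a Poisson stream with rate a, and estimate E[L_t].
random.seed(7)
a, t, trials = 1.0, 50.0, 20000
total = 0.0
for _ in range(trials):
    s_prev, s = 0.0, 0.0
    while s <= t:
        s_prev = s
        s += random.expovariate(a)   # the X_n are independent Exp(a)
    total += s - s_prev              # length of the interval covering t
mean_L = total / trials
assert abs(mean_L - 2 / a) < 0.1, mean_L
print(round(mean_L, 2))
```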
23.5. Merging Poisson streams. Define a process {N_t} by (23.5) for a sequence {X_n} of random variables satisfying (23.4). Let {X_n′} be a second sequence of random variables, on the same probability space, satisfying (23.4), and define {N_t′} by N_t′ = max[n: X_1′ + ⋯ + X_n′ ≤ t]. Define {N_t″} by N_t″ = N_t + N_t′. Show that, if σ(X_1, X_2, …) and σ(X_1′, X_2′, …) are independent and {N_t} and {N_t′} are Poisson processes with respective rates a and β, then {N_t″} is a Poisson process with rate a + β.
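An empirical check of the superposition in Problem 23.5 (a sketch with arbitrary rates and seed, not part of the text): merge two independent streams and compare the merged count at time t with a Poisson variable of mean (a + β)t via its mean and variance, which coincide for a Poisson distribution.

```python
import random

random.seed(1)
a, b, t, trials = 0.7, 1.5, 10.0, 20000

def count(rate):
    # Number of events of a rate-`rate` Poisson stream in [0, t]
    n, s = 0, random.expovariate(rate)
    while s <= t:
        n += 1
        s += random.expovariate(rate)
    return n

merged = [count(a) + count(b) for _ in range(trials)]
mean = sum(merged) / trials
var = sum((m - mean) ** 2 for m in merged) / trials
# For a Poisson count, mean and variance both equal (a + b) t = 22.
assert abs(mean - (a + b) * t) < 0.5
assert abs(var - (a + b) * t) < 2.0
print(round(mean, 1), round(var, 1))
```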
23.6. ↑ The nth and (n + 1)st events in the process {N_t} occur at times S_n and S_{n+1}.
(a) Find the distribution of the number N′_{S_{n+1}} − N′_{S_n} of events in the other process during this time interval.
(b) Generalize to N′_{S_{n+k}} − N′_{S_n}.

23.7. Suppose that X_1, X_2, … are independent and exponentially distributed with parameter a, so that (23.5) defines a Poisson process {N_t}. Suppose that Y_1, Y_2, … are independent and identically distributed and that σ(X_1, X_2, …) and σ(Y_1, Y_2, …) are independent. Put Z_t = Σ_{k ≤ N_t} Y_k. This is the compound Poisson process. If, for example, the event at time S_k in the original process
represents an insurance claim, and if Y_n represents the amount of the claim, then Z_t represents the total claims to time t.
(a) If Y_k = 1 with probability 1, then {Z_t} is an ordinary Poisson process.
(b) Show that {Z_t} has independent increments and that Z_{s+t} − Z_s has the same distribution as Z_t.
(c) Show that, if Y_k assumes the values 1 and 0 with probabilities p and 1 − p (0 < p ≤ 1), then {Z_t} is a Poisson process with rate pa.
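Part (c) is the "thinning" of a Poisson stream, and it too can be checked empirically; in this sketch (parameters and seed arbitrary, not from the text) each event of a rate-a stream is kept independently with probability p, and the kept count up to time t is compared with a Poisson variable of mean pat:

```python
import random

random.seed(3)
a, p, t, trials = 2.0, 0.3, 10.0, 20000

def thinned_count():
    # Count events of a rate-a stream in [0, t] kept with probability p
    n, s = 0, random.expovariate(a)
    while s <= t:
        if random.random() < p:   # Y_k = 1 with probability p
            n += 1
        s += random.expovariate(a)
    return n

zs = [thinned_count() for _ in range(trials)]
mean = sum(zs) / trials
var = sum((z - mean) ** 2 for z in zs) / trials
# For a Poisson count, mean and variance both equal p * a * t = 6.
assert abs(mean - p * a * t) < 0.2
assert abs(var - p * a * t) < 0.8
print(round(mean, 1), round(var, 1))
```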
23.8. Suppose a process satisfies Condition 0° and has independent, Poisson-distributed increments and no fixed discontinuities. Show that it has the form {N_{f(t)}} for a continuous, nondecreasing f.

P(A ∩ T^n I) ≥ (1 − ε)P(T^n I). Suppose the arc I runs from a to b and has length α ≤ ε. Let n_1 be arbitrary and, using the fact that {T^n a} is dense, choose n_2 so that T^{n_1}I and T^{n_2}I are disjoint and the distance from T^{n_1}b to T^{n_2}a is less than εα. Then choose n_3 so that T^{n_1}I, T^{n_2}I, T^{n_3}I are disjoint and the distance from T^{n_2}b to T^{n_3}a is less than εα. Continue until T^{n_k}b is within α of T^{n_1}a and a further step is impossible. Since the T^{n_i}I are disjoint, kα ≤ 1; and by the construction, the T^{n_i}I cover the circle to within a set of measure kεα + α, which is at most 2ε. And now by disjointness,

P(A) ≥ Σ_{i=1}^k P(A ∩ T^{n_i}I) ≥ (1 − ε) Σ_{i=1}^k P(T^{n_i}I) ≥ (1 − ε)(1 − 2ε).
Since ε was arbitrary, P(A) must be 1: T is ergodic if c is not a root of unity.†

† For a simple Fourier-series proof, see Problem 26.30.
SECTION 24. THE ERGODIC THEOREM
Proof of the Ergodic Theorem
The argument depends on a preliminary result the statement and proof of which are most clearly expressed in terms of functional operators. For a real function f on Ω, let Uf be the real function with value (Uf)(ω) = f(Tω) at ω. If f is integrable, then by change of variable (Theorem 16.13),

(24.12)  E[Uf] = ∫_Ω f(Tω) P(dω) = ∫_Ω f(ω) PT⁻¹(dω) = E[f].
And the operator U is nonnegative in the sense that it carries nonnegative functions to nonnegative functions; hence f ≤ g (pointwise) implies Uf ≤ Ug. Make these pointwise definitions: S_0 f = 0, S_n f = f + Uf + ⋯ + U^{n−1}f, M_n f = max_{0≤k≤n} S_k f, and M_∞ f = sup_{n≥0} S_n f = sup_{n≥0} M_n f. The maximal ergodic theorem:

Theorem 24.2. If f is integrable, then

(24.13)  ∫_{[M_∞f > 0]} f dP ≥ 0.

PROOF. Since B_n = [M_n f > 0] ↑ [M_∞ f > 0], it is enough, by the dominated convergence theorem, to show that ∫_{B_n} f dP ≥ 0. On B_n, M_n f = max_{1≤k≤n} S_k f. Since the operator U is nonnegative, S_k f = f + US_{k−1} f ≤ f + UM_n f for 1 ≤ k ≤ n, and therefore M_n f ≤ f + UM_n f on B_n. Since M_n f vanishes outside B_n and UM_n f is nonnegative everywhere, (24.12) gives

∫_{B_n} f dP ≥ ∫_{B_n} M_n f dP − ∫_{B_n} UM_n f dP ≥ E[M_n f] − E[UM_n f] = 0. ∎
Replace f by fI_A. If A is invariant, then S_n(fI_A) = (S_n f)I_A and M_n(fI_A) = (M_n f)I_A, and therefore (24.13) gives

(24.14)  ∫_{A ∩ [M_∞f > 0]} f dP ≥ 0.

Now replace f here by f − λ, λ a constant. Clearly [M_∞(f − λ) > 0] is the set
where for some n ≥ 1, S_n(f − λ) > 0, or n⁻¹S_n f > λ. Let

(24.15)  F_λ = [ω: sup_{n≥1} n⁻¹ Σ_{k=1}^n f(T^{k−1}ω) > λ];

it follows by (24.14) that ∫_{A ∩ F_λ} (f − λ) dP ≥ 0, or

(24.16)  λP(A ∩ F_λ) ≤ ∫_{A ∩ F_λ} f dP.
Now let a_n(ω) = n⁻¹ Σ_{k=1}^n f(T^{k−1}ω), and let G_λ = [ω: sup_n |a_n(ω)| > λ]; then λP(G_λ) ≤ 2E[|f|] (trivial if λ ≤ 0). Therefore, for positive a and λ,

∫_{[|a_n| > λ]} |a_n| dP ≤ (1/n) Σ_{k=1}^n ∫_{G_λ} |f(T^{k−1}ω)| P(dω)
  ≤ (1/n) Σ_{k=1}^n ∫_{G_λ ∩ [|f(T^{k−1}ω)| > a]} |f(T^{k−1}ω)| P(dω) + aP(G_λ)
  ≤ ∫_{[|f| > a]} |f(ω)| P(dω) + aP(G_λ)
  ≤ ∫_{[|f| > a]} |f(ω)| P(dω) + (2a/λ) E[|f|].

Take a = λ^{1/2}; since f is integrable, the final expression here goes to 0 as
λ → ∞. The a_n(ω) are therefore uniformly integrable, and E[f̂] = E[f]. The uniform integrability also implies E[|a_n − f̂|] → 0.

Set f̂(ω) = 0 outside the set where the a_n(ω) have a finite limit. Then (24.17) still holds with probability 1, and f̂(Tω) = f̂(ω). Since [ω: f̂(ω) ≤ x] is invariant, in the ergodic case its measure is either 0 or 1; if x_0 is the infimum of the x for which it is 1, then f̂(ω) = x_0 with probability 1, and from x_0 = E[f̂] = E[f] it follows that f̂(ω) = E[f] with probability 1. ∎
The Continued-Fraction Transformation
Let Ω consist of the irrationals in the unit interval, and for x in Ω let

(24.18)  Tx = {1/x},  a_1(x) = ⌊1/x⌋

be the fractional and integral parts of 1/x. This defines a mapping of Ω into itself, a mapping associated with the continued-fraction expansion of x [A36]. Concentrating on irrational x avoids some trivial details connected with the rational case, where the expansion is finite; it is an inessential restriction because the interest here centers on results of the almost-everywhere kind.

For x ∈ Ω and n ≥ 1 let a_n(x) = a_1(T^{n−1}x) be the nth partial quotient, and define integer-valued functions p_n(x) and q_n(x) by the recursions
(24.19)  p_{−1}(x) = 1, q_{−1}(x) = 0, p_0(x) = 0, q_0(x) = 1,
  p_n(x) = a_n(x) p_{n−1}(x) + p_{n−2}(x), n ≥ 1,
  q_n(x) = a_n(x) q_{n−1}(x) + q_{n−2}(x), n ≥ 1.
Simple induction arguments show that

(24.20)  q_n(x) p_{n−1}(x) − p_n(x) q_{n−1}(x) = (−1)^n, n ≥ 0,

and [A37: (27)]

(24.21)  x = 1/(a_1(x) + 1/(a_2(x) + ⋯ + 1/(a_n(x) + T^n x))), n ≥ 1.

It also follows inductively [A36: (26)] that

(24.22)  1/(a_1(x) + 1/(a_2(x) + ⋯ + 1/(a_n(x) + t))) = (p_n(x) + t p_{n−1}(x)) / (q_n(x) + t q_{n−1}(x)), n ≥ 1, 0 ≤ t < 1.
Taking t = 0 here gives the formula for the nth convergent:

(24.23)  p_n(x)/q_n(x) = 1/(a_1(x) + 1/(a_2(x) + ⋯ + 1/a_n(x))), n ≥ 1,

where, as follows from (24.20), p_n(x) and q_n(x) are relatively prime. By (24.21) and (24.22),

(24.24)  x = (p_n(x) + (T^n x) p_{n−1}(x)) / (q_n(x) + (T^n x) q_{n−1}(x)), n ≥ 0,

which, together with (24.20), implies†

(24.25)  x − p_n(x)/q_n(x) = (−1)^n T^n x / (q_n(x)(q_n(x) + (T^n x) q_{n−1}(x))), n ≥ 0.
Thus the convergents for even n fall to the left of x, and those for odd n fall to the right. And since (24.19) obviously implies that q_n(x) goes to infinity with n, the convergents p_n(x)/q_n(x) do converge to x: Each irrational x in (0, 1) has the infinite simple continued-fraction representation

(24.26)  x = 1/(a_1(x) + 1/(a_2(x) + ⋯)).

The representation is unique [A36: (35)], and Tx = 1/(a_2(x) + 1/(a_3(x) + ⋯)): T shifts the partial quotients in the same way the dyadic transformation (Example 24.3) shifts the digits of the dyadic expansion. Since the continued-fraction transformation turns out to be ergodic, it can be used to study the continued-fraction algorithm.

Suppose now that a_1, a_2, … are positive integers and define p_n and q_n by the recursions (24.19) without the argument x. Then (24.20) again holds (without the x), and so p_n/q_n − p_{n−1}/q_{n−1} = (−1)^{n+1}/q_{n−1}q_n, n ≥ 1. Since q_n increases to infinity, the right side here is the nth term of a convergent alternating series. And since p_0/q_0 = 0, the nth partial sum is p_n/q_n, which therefore converges to some limit: Every simple infinite continued fraction converges, and [A36: (36)] the limit is an irrational in (0, 1).

Let Δ_{a_1⋯a_n} be the set of x in Ω such that a_k(x) = a_k for 1 ≤ k ≤ n; call it a fundamental set of rank n. These sets are analogous to the dyadic intervals and the thin cylinders. For an explicit description of Δ_{a_1⋯a_n}, necessary for the proof of ergodicity, consider the function

(24.27)  ψ_{a_1⋯a_n}(t) = 1/(a_1 + 1/(a_2 + ⋯ + 1/(a_n + t))).

† Theorem 1.4 follows from this.
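The recursions (24.19) are easy to run. The following sketch (illustrative, not from the text; the starting point x = √2 − 1 and the number of steps are arbitrary) checks (24.20) at each step and verifies that the convergents p_n/q_n approximate x to within 1/q_n²:

```python
import math

x0 = math.sqrt(2) - 1
x = x0
p_prev, p = 1, 0            # p_{-1}, p_0
q_prev, q = 0, 1            # q_{-1}, q_0
for n in range(1, 16):
    a = int(1 / x)          # a_n(x) = a_1(T^{n-1} x)
    p_prev, p = p, a * p + p_prev   # recursion (24.19)
    q_prev, q = q, a * q + q_prev
    assert q * p_prev - p * q_prev == (-1) ** n   # identity (24.20)
    x = 1 / x - a           # apply T
assert abs(p / q - x0) < 1 / q ** 2               # convergents approximate x
print(p, q)
```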
If x ∈ Δ_{a_1⋯a_n}, then x = ψ_{a_1⋯a_n}(T^n x) by (24.21); on the other hand, because of the uniqueness of the partial quotients [A36: (33)], if t is an irrational in the unit interval, then (24.27) lies in Δ_{a_1⋯a_n}. Thus Δ_{a_1⋯a_n} is the image under (24.27) of Ω itself. Just as (24.22) follows by induction, so does

(24.28)  ψ_{a_1⋯a_n}(t) = (p_n + t p_{n−1}) / (q_n + t q_{n−1}).
And ψ_{a_1⋯a_n}(t) is increasing or decreasing in t according as n is even or odd, as is clear from the form of (24.27) (or differentiate in (24.28) and use (24.20)). It follows that Δ_{a_1⋯a_n} consists of the irrationals lying between p_n/q_n and (p_n + p_{n−1})/(q_n + q_{n−1}). By (24.20), this set has Lebesgue measure

(24.29)  λ(Δ_{a_1⋯a_n}) = 1/(q_n(q_n + q_{n−1})).

The fundamental sets of rank n form a partition 𝓟 of Ω, and their unions form a field 𝓕_0. Since q_n ≥ 2q_{n−2} by (24.19), induction gives q_n ≥ 2^{(n−1)/2} for n ≥ 0. And now (24.29) implies that 𝓕_0 generates the σ-field 𝓕 of linear Borel sets that are subsets of Ω (use Theorem 10.1(ii)). Thus 𝓟, 𝓕_0, 𝓕 are related as in the hypothesis of Lemma 2. Clearly T is measurable 𝓕/𝓕.

Although T does not preserve λ, it does preserve Gauss's measure, defined by

(24.30)  P(A) = (1/log 2) ∫_A dx/(1 + x), A ∈ 𝓕.

In fact, since T⁻¹(0, u) = ∪_{k=1}^∞ (1/(k + u), 1/k) for 0 < u < 1,
it is enough to verify

∫_0^u dx/(1 + x) = Σ_{k=1}^∞ ∫_{1/(k+u)}^{1/k} dx/(1 + x);

each term on the right is log((k + 1)/k) − log((k + u + 1)/(k + u)), and the sum telescopes to log(1 + u), the value of the left side.
Gauss's measure is useful because it is preserved by T and has the same sets of measure 0 as Lebesgue measure does.

Proof that T is ergodic. Fix a_1, …, a_n, and write ψ_n for ψ_{a_1⋯a_n} and Δ_n for Δ_{a_1⋯a_n}. Suppose that n is even, so that ψ_n is increasing. If x ∈ Δ_n, then (since x = ψ_n(T^n x)) s ≤ T^n x < t if and only if ψ_n(s) ≤ x < ψ_n(t); and this last condition implies x ∈ Δ_n. Combined with (24.28) and (24.20), this shows that

λ(Δ_n ∩ [x: s ≤ T^n x < t]) = ψ_n(t) − ψ_n(s) = (t − s)/((q_n + t q_{n−1})(q_n + s q_{n−1})).

If B is an interval with endpoints s and t, then by (24.29),

λ(Δ_n ∩ T^{−n}B) / (λ(Δ_n) λ(B)) = q_n(q_n + q_{n−1}) / ((q_n + t q_{n−1})(q_n + s q_{n−1})).

A similar argument establishes this for n odd. Since the ratio on the right lies between ½ and 2,

(24.31)  ½ λ(Δ_n) λ(B) ≤ λ(Δ_n ∩ T^{−n}B) ≤ 2 λ(Δ_n) λ(B).

Therefore, (24.10) holds for 𝓟, 𝓕_0, 𝓕 as defined above, A = Δ_n, n_A = n, c = 4, and λ in the role of P. Thus T⁻¹C = C implies that λ(C) is 0 or 1, and since Gauss's measure (24.30) comes from a density, P(C) is 0 or 1 as well. Therefore, T is an ergodic measure-preserving transformation on (Ω, 𝓕, P). It follows by the ergodic theorem that if f is integrable, then

(24.32)  lim_n (1/n) Σ_{k=1}^n f(T^{k−1}x) = (1/log 2) ∫_0^1 f(x)/(1 + x) dx
holds almost everywhere. Since the density in (24.30) is bounded away from 0 and ∞, the "integrable" and "almost everywhere" here can refer to P or to λ indifferently. Taking f to be the indicator of the x-set where a_1(x) = k shows that the asymptotic relative frequency of k among the partial quotients is almost everywhere equal to
(1/log 2) ∫_{1/(k+1)}^{1/k} dx/(1 + x) = (1/log 2) log((k + 1)²/(k(k + 2))).
In particular, the partial quotients are unbounded almost everywhere.
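This frequency law can be observed numerically. The following Monte Carlo sketch (not from the text; sample sizes and seed are arbitrary) pools the early partial quotients of many uniformly drawn points, an approximation to the almost-everywhere orbit frequency, and compares with the predicted value (1/log 2) log((k+1)²/(k(k+2))):

```python
import math
import random

random.seed(5)
quotients = []
for _ in range(4000):
    x = random.random()
    for _ in range(15):          # 15 steps keeps double-precision roundoff harmless
        if x <= 1e-12:
            break
        a = int(1 / x)           # a_n(x) = a_1(T^{n-1} x)
        quotients.append(a)
        x = 1 / x - a            # apply T

for k in (1, 2, 3, 4):
    freq = quotients.count(k) / len(quotients)
    predicted = math.log((k + 1) ** 2 / (k * (k + 2)), 2)
    assert abs(freq - predicted) < 0.03, (k, freq, predicted)
print(len(quotients))
```

For k = 1 the predicted frequency is log₂(4/3) ≈ 0.415: about 41.5 percent of all partial quotients equal 1.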
For understanding the accuracy of the continued-fraction algorithm, the magnitude of a_n(x) is less important than that of q_n(x). The key relationship is

(24.33)  1/(q_n(x)(q_{n+1}(x) + q_n(x))) ≤ |x − p_n(x)/q_n(x)| ≤ 1/(q_n(x) q_{n+1}(x)),

which follows from (24.25) and (24.19). These facts yield

(24.34)  lim_n (1/n) log q_n(x) = π²/(12 log 2)

almost everywhere. Since q_n(x) ≥ 2^{(n−1)/2}, (24.33) gives

|x − p_n(x)/q_n(x)| ≤ 1/q_{n+1}(x) ≤ 2^{−n/2}, n ≥ 1.

Since |log(1 + s)| ≤ 4|s| if |s| ≤ 1/√2, it follows that log x and log(p_n(x)/q_n(x)) differ by at most 4·2^{−n/2}·q_n(x)/p_n(x) for large n. Moreover, by (24.24), |q_{n−1}(x)x − p_{n−1}(x)| = Π_{k=1}^n T^{k−1}x, and this product lies between 1/(q_n(x) + q_{n−1}(x)) and 1/q_n(x); hence

(1/n) | Σ_{k=1}^n log T^{k−1}x + log q_n(x) | ≤ (log 2)/n → 0.
By the ergodic theorem, then,

lim_n (1/n) log q_n(x) = −(1/log 2) ∫_0^1 (log x)/(1 + x) dx
  = (1/log 2) ∫_0^1 (log(1 + x))/x dx
  = (1/log 2) Σ_{k=0}^∞ (−1)^k/(k + 1)²
  = π²/(12 log 2)

almost everywhere.
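The limit π²/(12 log 2) ≈ 1.1866 can be checked numerically. A sketch (not from the text; random 64-bit rationals stand in for typical reals, and exact Fraction arithmetic avoids roundoff in the iteration):

```python
import math
import random
from fractions import Fraction

random.seed(11)
levy = math.pi ** 2 / (12 * math.log(2))
n, samples, acc = 14, 1500, 0.0
for _ in range(samples):
    x = Fraction(random.getrandbits(64) | 1, 2 ** 64)
    q_prev, q = 0, 1                    # q_{-1} = 0, q_0 = 1
    for _ in range(n):
        if x == 0:
            break                       # rational expansion ended early (rare)
        a_k = int(1 / x)                # partial quotient
        q_prev, q = q, a_k * q + q_prev # recursion (24.19)
        x = 1 / x - a_k                 # apply T
    acc += math.log(q) / n
estimate = acc / samples
assert abs(estimate - levy) < 0.15, estimate
print(round(estimate, 2))
```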
Hence (24.34).

Diophantine Approximation
The fundamental theorem of the measure theory of Diophantine approximation, due to Khinchine, is Theorem 1.5 together with Theorem 1.6. As in Section 1, let φ(q) be a positive function of integers and let A_φ be the set of x in (0, 1) such that

(24.37)  |x − p/q| < φ(q)/q
If x_n > x for infinitely many n, then (25.1) fails at the discontinuity point of F. ∎

Example 25.4. Uniform Distribution Modulo 1.*
For a sequence x_1, x_2, … of real numbers, consider the corresponding sequence of their fractional parts {x_n} = x_n − ⌊x_n⌋. For each n, define a probability measure μ_n by

(25.3)  μ_n(A) = (1/n) #[k: 1 ≤ k ≤ n, {x_k} ∈ A];
† There is (see Section 28) a related notion of vague convergence in which μ may be defective in the sense that μ(R¹) < 1. Weak convergence is in this context sometimes called complete convergence.
* This topic, which requires ergodic theory, may be omitted.
SECTION 25. WEAK CONVERGENCE
has mass n⁻¹ at the points {x_1}, …, {x_n}, and if several of these points coincide, the masses add. The problem is to find the weak limit of {μ_n} in number-theoretically interesting cases. If the μ_n defined by (25.3) converge weakly to Lebesgue measure restricted to the unit interval, the sequence x_1, x_2, … is said to be uniformly distributed modulo 1. In this case every subinterval has asymptotically its proportional share of the points {x_n}; by Theorem 25.8 below, the same is then true of every subset whose boundary has Lebesgue measure 0.
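The definition is easy to probe empirically. The following sketch (illustrative, not from the text) takes x_n = nθ with θ = √2, anticipating the theorem below, and checks that each initial subinterval [0, u) receives asymptotically proportional mass:

```python
import math

theta = math.sqrt(2)        # an irrational multiplier
n = 100000
fracs = [(k * theta) % 1.0 for k in range(1, n + 1)]  # {k * theta}
for u in (0.1, 0.25, 0.5, 0.9):
    share = sum(f < u for f in fracs) / n             # mu_n([0, u))
    assert abs(share - u) < 0.01, (u, share)
print("ok")
```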
Theorem 25.1. For θ irrational, θ, 2θ, 3θ, … is uniformly distributed modulo 1.
PROOF. Since {nθ} = {n{θ}}, θ can be assumed to lie in [0, 1]. As in Example 24.4, map [0, 1) to the unit circle in the complex plane by φ(x) = e^{2πix}. If θ is irrational, then c = φ(θ) is not a root of unity, and so, as shown there, Tω = cω defines an ergodic transformation with respect to circular Lebesgue measure P. Let 𝓘 be the class of open arcs with endpoints in some fixed countable, dense set. By the ergodic theorem, the orbit {T^n ω} of almost every ω enters every I in 𝓘 with asymptotic relative frequency P(I). Fix such an ω. If I_1 ⊂ J ⊂ I_2, where J is a closed arc and I_1, I_2 are in 𝓘, then the upper and lower limits of n⁻¹ Σ_{k=1}^n I_J(T^{k−1}ω) are between P(I_1) and P(I_2), and therefore the limit exists and equals P(J). Since the orbits and the arcs are rotations of one another, every orbit enters every closed arc J with frequency P(J). This is true in particular of the orbit {c^n} of 1. Now carry all this back to [0, 1) by φ⁻¹: For every x in [0, 1), {nθ} = φ⁻¹(c^n) lies in [0, x] with asymptotic relative frequency x. ∎

For a simple proof by Fourier series, see Example 26.3.

Convergence in Distribution
Let X_n and X be random variables with respective distribution functions F_n and F. If F_n ⇒ F, then X_n is said to converge in distribution or in law to X, written X_n ⇒ X. This dual use of the double arrow will cause no confusion. Because of the defining conditions (25.1) and (25.2), X_n ⇒ X if and only if

(25.4)  lim_n P[X_n ≤ x] = P[X ≤ x]

for every x such that P[X = x] = 0.

Example 25.5. Let X_1, X_2, … be independent random variables, each with the exponential distribution: P[X_n > x] = e^{−ax}, x ≥ 0. Put M_n = max{X_1, …, X_n} and b_n = a⁻¹ log n. The relation (14.9), established in Example 14.1, can be restated as P[M_n − b_n ≤ x] → e^{−e^{−ax}}. If X is any random variable with distribution function e^{−e^{−ax}}, this can be written M_n − b_n ⇒ X. ∎
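A simulation of Example 25.5 (illustrative parameters and seed, not part of the text) compares the empirical distribution function of M_n − b_n with the limit e^{−e^{−ax}} at a few points:

```python
import math
import random

random.seed(2)
a, n, trials = 1.0, 500, 4000
b_n = math.log(n) / a
# M_n - b_n for `trials` independent replications
vals = [max(random.expovariate(a) for _ in range(n)) - b_n
        for _ in range(trials)]
for x in (-1.0, 0.0, 1.0, 2.0):
    emp = sum(v <= x for v in vals) / trials
    limit = math.exp(-math.exp(-a * x))   # the Gumbel-type limit law
    assert abs(emp - limit) < 0.03, (x, emp, limit)
print("ok")
```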
One is usually interested in proving weak convergence of the distributions of some given sequence of random variables, such as the Mn - bn in this example, and the result is often most clearly expressed in terms of the random variables themselves rather than in terms of their distributions or
CONVERGENCE OF DISTRIBUTIONS
distribution functions. Although the M_n − b_n here arise naturally from the problem at hand, the random variable X is simply constructed to make it possible to express the asymptotic relation compactly by M_n − b_n ⇒ X. Recall that by Theorem 14.1 there does exist a random variable for any prescribed distribution.

Example 25.6. For each n, let Ω_n be the space of n-tuples of 0's and 1's, let 𝓕_n consist of all subsets of Ω_n, and let P_n assign probability (λ/n)^k (1 − λ/n)^{n−k} to each ω consisting of k 1's and n − k 0's. Let X_n(ω) be the number of 1's in ω; then X_n, a random variable on (Ω_n, 𝓕_n, P_n), represents the number of successes in n Bernoulli trials having probability λ/n of success at each. Let X be a random variable, on some (Ω, 𝓕, P), having the Poisson distribution with parameter λ. According to Example 25.2, X_n ⇒ X. ∎
As this example shows, the random variables X_n may be defined on entirely different probability spaces. To allow for this possibility, the P on the left in (25.4) really should be written P_n. Suppressing the n causes no confusion if it is understood that P refers to whatever probability space it is that X_n is defined on; the underlying probability space enters into the definition only via the distribution μ_n it induces on the line. Any instance of F_n ⇒ F or of μ_n ⇒ μ can be rewritten in terms of convergence in distribution: There exist random variables X_n and X (on some probability spaces) with distribution functions F_n and F, and F_n ⇒ F and X_n ⇒ X express the same fact.

Convergence in Probability
Suppose that X, X_1, X_2, … are random variables all defined on the same probability space (Ω, 𝓕, P). If X_n → X with probability 1, then P[|X_n − X| ≥ ε i.o.] = 0 for ε > 0, and hence

(25.5)  lim_n P[|X_n − X| ≥ ε] = 0

by Theorem 4.1. Thus there is convergence in probability X_n →_P X; see Theorems 5.2 and 20.5.

Suppose that (25.5) holds for each positive ε. Now P[X ≤ x − ε] − P[|X_n − X| ≥ ε] ≤ P[X_n ≤ x] ≤ P[X ≤ x + ε] + P[|X_n − X| ≥ ε]; letting n tend to ∞ and then letting ε tend to 0 shows that P[X < x] ≤ lim inf_n P[X_n ≤ x] ≤ lim sup_n P[X_n ≤ x] ≤ P[X ≤ x]. Thus P[X_n ≤ x] → P[X ≤ x] if P[X = x] = 0, and so X_n ⇒ X:

Theorem 25.2. Suppose that X_n and X are random variables on the same probability space. If X_n → X with probability 1, then X_n →_P X. If X_n →_P X, then X_n ⇒ X.
Of the two implications in this theorem, neither converse holds. Because of Example 5.4, convergence in probability does not imply convergence with probability 1. Neither does convergence in distribution imply convergence in probability: if X and Y are independent and assume the values 0 and 1 with probability ½ each, and if X_n ≡ Y, then X_n ⇒ X, but X_n →_P X cannot hold because P[|X − Y| = 1] = ½. What is more, (25.5) is impossible if X and the X_n are defined on different probability spaces, as may happen in the case of convergence in distribution.

Although (25.5) in general makes no sense unless X and the X_n are defined on the same probability space, suppose that X is replaced by a constant real number a, that is, suppose that X(ω) ≡ a. Then (25.5) becomes

(25.6)  lim_n P[|X_n − a| ≥ ε] = 0,

and this condition makes sense even if the space of X_n does vary with n. Now a can be regarded as a random variable (on any probability space at all), and it is easy to show that (25.6) implies that X_n ⇒ a: Put ε = |x − a|; if x > a, then P[X_n ≤ x] ≥ P[|X_n − a| ≤ ε] → 1, and if x < a, then P[X_n ≤ x] ≤ P[|X_n − a| ≥ ε] → 0. If a is regarded as a random variable, its distribution function is 0 for x < a and 1 for x ≥ a. Thus (25.6) implies that the distribution function of X_n converges weakly to that of a. Suppose, on the other hand, that X_n ⇒ a. Then P[|X_n − a| ≥ ε] ≤ P[X_n ≤ a − ε] + 1 − P[X_n < a + ε] → 0, so that (25.6) holds:

Theorem 25.3. The condition (25.6) holds for all positive ε if and only if X_n ⇒ a, that is, if and only if

lim_n F_n(x) = 0 if x < a, 1 if x > a.

If (25.6) holds for all positive ε, X_n may be said to converge to a in probability. As this does not require that the X_n be defined on the same space, it is not really a special case of convergence in probability as defined by (25.5). Convergence in probability in this new sense will be denoted X_n ⇒ a, in accordance with the theorem just proved.

Example 14.4 restates the weak law of large numbers in terms of this concept. Indeed, if X_1, X_2, … are independent, identically distributed random variables with finite mean m, and if S_n = X_1 + ⋯ + X_n, the weak law of large numbers is the assertion n⁻¹S_n ⇒ m. Example 6.3 provides another illustration: If S_n is the number of cycles in a random permutation on n letters, then S_n/log n ⇒ 1.
Example 25.7. Suppose that X_n ⇒ X and δ_n → 0. Given ε and η, choose x so that P[|X| ≥ x] < η and P[X = ±x] = 0, and then choose n_0 so that n ≥ n_0 implies that |δ_n| ≤ ε/x and |P[X_n ≤ y] − P[X ≤ y]| < η for y = ±x. Then P[|δ_n X_n| ≥ ε] ≤ 3η for n ≥ n_0. Thus X_n ⇒ X and δ_n → 0 imply that δ_n X_n ⇒ 0, a restatement of Lemma 2 of Section 14 (p. 193). ∎
The asymptotic properties of a random variable should remain unaffected if it is altered by the addition of a random variable that goes to 0 in probability. Let (X_n, Y_n) be a two-dimensional random vector.

Theorem 25.4. If X_n ⇒ X and X_n − Y_n ⇒ 0, then Y_n ⇒ X.
PROOF. Suppose that y′ < x < y″ and P[X = y′] = P[X = y″] = 0. If y′ < x − ε < x < x + ε < y″, then

(25.7)  P[X_n ≤ y′] − P[|X_n − Y_n| ≥ ε] ≤ P[Y_n ≤ x] ≤ P[X_n ≤ y″] + P[|X_n − Y_n| ≥ ε].

Since X_n ⇒ X, letting n → ∞ gives

(25.8)  P[X ≤ y′] ≤ lim inf_n P[Y_n ≤ x] ≤ lim sup_n P[Y_n ≤ x] ≤ P[X ≤ y″].

Since P[X = y] = 0 for all but countably many y, if P[X = x] = 0, then y′ and y″ can further be chosen so that P[X ≤ y′] and P[X ≤ y″] are arbitrarily near P[X ≤ x]; hence P[Y_n ≤ x] → P[X ≤ x]. ∎

Theorem 25.4 has a useful extension. Suppose that (X_n^{(u)}, Y_n) is a two-dimensional random vector.

Theorem 25.5.
If, for each u, X_n^{(u)} ⇒ X^{(u)} as n → ∞, if X^{(u)} ⇒ X as u → ∞, and if

(25.9)  lim_u lim sup_n P[|X_n^{(u)} − Y_n| ≥ ε] = 0

for positive ε, then Y_n ⇒ X.

PROOF. Replace X_n by X_n^{(u)} in (25.7). If P[X = y′] = 0 = P[X^{(u)} = y′] and P[X = y″] = 0 = P[X^{(u)} = y″], letting n → ∞ and then u → ∞ gives (25.8) once again. Since P[X = y] = 0 = P[X^{(u)} = y] for all but countably many y, the proof can be completed as before. ∎
Fundamental Theorems
Some of the fundamental properties of weak convergence were established in Section 14. It was shown there that a sequence cannot have two distinct weak limits: If F_n ⇒ F and F_n ⇒ G, then F = G. The proof is simple: The hypothesis implies that F and G agree at their common points of continuity, hence at all but countably many points, and hence by right continuity at all points. Another simple fact is this: If lim_n F_n(d) = F(d) for d in a set D dense in R¹, then F_n ⇒ F. Indeed, if F is continuous at x, there are in D points d′ and d″ such that d′ < x < d″ and F(d″) − F(d′) < ε, and it follows that the limits superior and inferior of F_n(x) are within ε of F(x).

For any probability measure on (R¹, 𝓡¹) there is on some probability space a random variable having that measure as its distribution. Therefore, for probability measures satisfying μ_n ⇒ μ, there exist random variables Y_n and Y having these measures as distributions and satisfying Y_n ⇒ Y. According to the following theorem, the Y_n and Y can be constructed on the same probability space, and even in such a way that Y_n(ω) → Y(ω) for every ω, a condition much stronger than Y_n ⇒ Y. This result, Skorohod's theorem, makes possible very simple and transparent proofs of many important facts.
Theorem 25.6. Suppose that μ_n and μ are probability measures on (R¹, 𝓡¹) and μ_n ⇒ μ. There exist random variables Y_n and Y on a common probability space (Ω, 𝓕, P) such that Y_n has distribution μ_n, Y has distribution μ, and Y_n(ω) → Y(ω) for each ω.

PROOF.
For the probability space (Ω, 𝓕, P), take Ω = (0, 1), let 𝓕 consist of the Borel subsets of (0, 1), and for P(A) take the Lebesgue measure of A. The construction is related to that in the proofs of Theorems 14.1 and 20.4. Consider the distribution functions F_n and F corresponding to μ_n and μ. For 0 < ω < 1, put Y_n(ω) = inf[x: ω ≤ F_n(x)] and Y(ω) = inf[x: ω ≤ F(x)]. Since ω ≤ F_n(x) if and only if Y_n(ω) ≤ x (see the argument following (14.5)), P[ω: Y_n(ω) ≤ x] = P[ω: ω ≤ F_n(x)] = F_n(x). Thus Y_n has distribution function F_n; similarly, Y has distribution function F.

It remains to show that Y_n(ω) → Y(ω). The idea is that Y_n and Y are essentially inverse functions to F_n and F; if the direct functions converge, so must the inverses.

Suppose that 0 < ω < 1. Given ε, choose x so that Y(ω) − ε < x < Y(ω) and μ{x} = 0. Then F(x) < ω; F_n(x) → F(x) now implies that, for n large enough, F_n(x) < ω and hence Y(ω) − ε < x < Y_n(ω). Thus lim inf_n Y_n(ω) ≥ Y(ω). If ω < ω′ and ε is positive, choose a y for which Y(ω′) < y < Y(ω′) + ε and μ{y} = 0. Now ω < ω′ ≤ F(Y(ω′)) ≤ F(y), and so, for n large enough, ω < F_n(y) and hence Y_n(ω) ≤ y < Y(ω′) + ε. Thus lim sup_n Y_n(ω) ≤ Y(ω′) if ω < ω′. Therefore, Y_n(ω) → Y(ω) if Y is continuous at ω.
Since Y is nondecreasing on (0, 1), it has at most countably many discontinuities. At discontinuity points ω of Y, redefine Y_n(ω) = Y(ω) = 0. With this change, Y_n(ω) → Y(ω) for every ω. Since Y and the Y_n have been altered only on a set of Lebesgue measure 0, their distributions are still μ_n and μ. ∎

Note that this proof uses the order structure of the real line in an essential way. The proof of the corresponding result in R^k is more complicated.

The following mapping theorem is of very frequent use.
Theorem 25.7. Suppose that h: R¹ → R¹ is measurable and that the set D_h of its discontinuities is measurable.† If μ_n ⇒ μ and μ(D_h) = 0, then μ_n h⁻¹ ⇒ μh⁻¹. Recall (see (13.7)) that μh⁻¹ has value μ(h⁻¹A) at A.

PROOF. Consider the random variables Y_n and Y of Theorem 25.6. Since Y_n(ω) → Y(ω), if Y(ω) ∉ D_h, then h(Y_n(ω)) → h(Y(ω)). Since P[ω: Y(ω) ∈ D_h] = μ(D_h) = 0, it follows that h(Y_n(ω)) → h(Y(ω)) with probability 1. Hence h(Y_n) ⇒ h(Y) by Theorem 25.2. Since P[h(Y) ∈ A] = P[Y ∈ h⁻¹A] = μ(h⁻¹A), h(Y) has distribution μh⁻¹; similarly, h(Y_n) has distribution μ_n h⁻¹. Thus h(Y_n) ⇒ h(Y) is the same thing as μ_n h⁻¹ ⇒ μh⁻¹. ∎
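The quantile construction used in the proof of Theorem 25.6 is easy to carry out concretely. A sketch (illustrative, not from the text): for the standard exponential F, the map Y(ω) = inf[x: ω ≤ F(x)] is −log(1 − ω), and applying it to a uniform ω does produce the distribution F:

```python
import math
import random

random.seed(4)

def quantile_exp(w):
    # Y(w) = inf[x: w <= 1 - e^{-x}] for the standard exponential F
    return -math.log(1.0 - w)

trials = 20000
ys = [quantile_exp(random.random()) for _ in range(trials)]
for x in (0.5, 1.0, 2.0):
    emp = sum(y <= x for y in ys) / trials
    assert abs(emp - (1 - math.exp(-x))) < 0.02, (x, emp)
print("ok")
```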
Because of the definition of convergence in distribution, this result has an equivalent statement in terms of random variables:

Corollary 1. If X_n ⇒ X and P[X ∈ D_h] = 0, then h(X_n) ⇒ h(X).

Take X ≡ a:

Corollary 2. If X_n ⇒ a and h is continuous at a, then h(X_n) ⇒ h(a).
Example 25.8. From X_n ⇒ X it follows directly by the theorem that aX_n + b ⇒ aX + b. Suppose also that a_n → a and b_n → b. Then (a_n − a)X_n ⇒ 0 by Example 25.7, and so (a_n X_n + b_n) − (aX_n + b) ⇒ 0. And now a_n X_n + b_n ⇒ aX + b follows by Theorem 25.4: If X_n ⇒ X, a_n → a, and b_n → b, then a_n X_n + b_n ⇒ aX + b. This fact was stated and proved differently in Section 14; see Lemma 1 on p. 193. ∎
By definition, μ_n ⇒ μ means that the corresponding distribution functions converge weakly. The following theorem characterizes weak convergence

† That D_h lies in 𝓡¹ is generally obvious in applications. In point of fact, it always holds (even if h is not measurable): Let A(ε, δ) be the set of x for which there exist y and z such that |x − y| < δ, |x − z| < δ, and |h(y) − h(z)| ≥ ε. Then A(ε, δ) is open and D_h = ∪_ε ∩_δ A(ε, δ), where ε and δ range over the positive rationals.
without reference to distribution functions. The boundary ∂A of A consists of the points that are limits of sequences in A and are also limits of sequences in A^c; alternatively, ∂A is the closure of A minus its interior. A set A is a μ-continuity set if it is a Borel set and μ(∂A) = 0.
The following three conditions are equivalent.
{i) 1-Ln = /-L; {ii) ffdi-Ln � ffdJ.L for every bounded, continuous real function f; (iii) 1-L n( A ) � J.L( A) for every wcontinuity set A.
PROOF.
Suppose that 1-Ln 1-L· and consider the random variables Yn and Y of Theorem 25.6. Suppose that f is a bounded function such that J.L( D ) = 0, where D1 is the set of points of discontinuity of F. From P[ Y E D1 = J.L(D1) = 0 it follows that fCY,,) � f(Y) with probabiiity 1, and so by change of variable (see (21. 1)) and the bounded convergence theorem, ffdi-Ln = E[f(Yn)] � E[ f(Y)] = ffdJ.L. Thus 1-Ln = J.L and J.L(D1) = 0 together imply that ffdi-L n � JfdJ.L if f is bounded. In particular, (i) implies (ii). Further, if f = /A , then D1 = a A, and from J.L(aA) = 0 and 1-Ln = 1-L follows 1-Ln(A) -= ffdi-L n � Jfdi-L = J.L(A). Thus (i) also implies {iii). Since a( - oo, x ] = {x}, obviously (iii) implies {i). It therefore remains only to deduce 1-Ln = 1-L from {ii). Consider the corresponding distribution func tions. Suppose that x < y, and let f(t) be 1 for t < x, 0 for t > y, and interpolate linearly on [x, y ]: f(t) = { y - t)j(y - x) for x < t < y. Since Fn(x) < Jfdi-Ln and Jfdi-L < F(y), it follows from (ii) that lim supn Fn(x) < F(y); letting y ! x shows that lim supn Fn(x) < F(x). Similarly, F(u) < lim infn Fn(x) for u < x and hence F(x - ) < lim infn Fn(x). This implies • convergence at continuity points.
f
=
The function f in this last part of the proof is uniformly continuous. Hence 1-Ln = 1-L follows if ffdi-L n � ffdJ.L for every bounded and uniformly continuous f. The distributions in Example 25.3 satisfy 1-Ln = J.L, but 1-Ln( A ) does not converge to J.L( A) if A is the set of rationals. Hence this A cannot be a wcontinuity set; in fact, of course, aA = R 1 • • Example 25.9.
The concept of weak convergence would be nearly useless if (25.2) were not allowed to fail when μ(∂A) > 0. Since F(x) − F(x−) = μ{x} = μ(∂(−∞, x]), it is therefore natural in the original definition to allow (25.1) to fail when x is not a continuity point of F.
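The coexistence of genuine weak convergence with failure of setwise convergence is easy to see numerically. The following sketch is not from the text (the test function and tolerances are chosen for illustration): it checks condition (ii) of Theorem 25.8 for the measures of Example 25.3, which put mass 1/n at each point k/n.

```python
import math

# Illustrative sketch (not from the text): mu_n puts mass 1/n at each point
# k/n, k = 1, ..., n.  For a bounded continuous f, the integral under mu_n
# is a Riemann sum converging to the integral under Lebesgue measure on
# [0, 1], which is condition (ii) of Theorem 25.8.
def integral_mu_n(f, n):
    return sum(f(k / n) for k in range(1, n + 1)) / n

limit = math.sin(1.0)          # integral of cos over [0, 1]
errors = [abs(integral_mu_n(math.cos, n) - limit) for n in (10, 100, 1000)]
assert errors[0] > errors[1] > errors[2]   # the Riemann sums improve
assert errors[2] < 1e-3
# Yet mu_n(A) -> mu(A) fails for A = rationals: every atom k/n is rational,
# so mu_n(A) = 1 for all n, while the Lebesgue measure of A is 0.
```

This is exactly why (iii) of Theorem 25.8 is restricted to μ-continuity sets.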
CONVERGENCE OF DISTRIBUTIONS

Helly's Theorem

One of the most frequently used results in analysis is the Helly selection theorem:
Theorem 25.9. For every sequence {F_n} of distribution functions there exists a subsequence {F_{n_k}} and a nondecreasing, right-continuous function F such that lim_k F_{n_k}(x) = F(x) at continuity points x of F.
PROOF. An application of the diagonal method [A14] gives a sequence {n_k} of integers along which the limit G(r) = lim_k F_{n_k}(r) exists for every rational r. Define F(x) = inf[G(r): x < r]. Clearly F is nondecreasing. To each x and ε there is an r for which x < r and G(r) < F(x) + ε. If x < y < r, then F(y) ≤ G(r) < F(x) + ε. Hence F is continuous from the right. If F is continuous at x, choose y < x so that F(x) − ε < F(y); now choose rational r and s so that y < r < x < s and G(s) < F(x) + ε. From F(x) − ε < G(r) ≤ G(s) < F(x) + ε and F_{n_k}(r) ≤ F_{n_k}(x) ≤ F_{n_k}(s) it follows that as k goes to infinity F_{n_k}(x) has limits superior and inferior within ε of F(x). ∎
The F in this theorem necessarily satisfies 0 ≤ F(x) ≤ 1. But F need not be a distribution function: if F_n has a unit jump at n, for example, F(x) ≡ 0 is the only possibility. It is important to have a condition which ensures that for some subsequence the limit F is a distribution function.

A sequence of probability measures μ_n on (R¹, 𝓡¹) is said to be tight if for each ε there exists a finite interval (a, b] such that μ_n(a, b] > 1 − ε for all n. In terms of the corresponding distribution functions F_n, the condition is that for each ε there exist x and y such that F_n(x) < ε and F_n(y) > 1 − ε for all n. If μ_n is a unit mass at n, {μ_n} is not tight in this sense: the mass of μ_n "escapes to infinity." Tightness is a condition preventing this escape of mass.
Theorem 25.10. Tightness is a necessary and sufficient condition that for every subsequence {μ_{n_k}} there exist a further subsequence {μ_{n_{k(j)}}} and a probability measure μ such that μ_{n_{k(j)}} ⇒ μ as j → ∞.
Only the sufficiency of the condition in this theorem is used in what follows.

PROOF. Sufficiency. Apply Helly's theorem to the subsequence {F_{n_k}} of corresponding distribution functions. There exists a further subsequence {F_{n_{k(j)}}} such that lim_j F_{n_{k(j)}}(x) = F(x) at continuity points of F, where F is nondecreasing and right-continuous. There exists by Theorem 12.4 a measure μ on (R¹, 𝓡¹) such that μ(a, b] = F(b) − F(a). Given ε, choose a and b so that μ_n(a, b] > 1 − ε for all n, which is possible by tightness. By decreasing a and increasing b, one can ensure that they are continuity points of F. But then μ(a, b] ≥ 1 − ε. Therefore, μ is a probability measure, and of course μ_{n_{k(j)}} ⇒ μ.

Necessity. If {μ_n} is not tight, there exists a positive ε such that for each finite interval (a, b], μ_n(a, b] ≤ 1 − ε for some n. Choose n_k so that μ_{n_k}(−k, k] ≤ 1 − ε. Suppose that some subsequence {μ_{n_{k(j)}}} of {μ_{n_k}} were to converge weakly to some probability measure μ. Choose (a, b] so that μ{a} = μ{b} = 0 and μ(a, b] > 1 − ε. For large enough j, (a, b] ⊂ (−k(j), k(j)], and so 1 − ε ≥ μ_{n_{k(j)}}(−k(j), k(j)] ≥ μ_{n_{k(j)}}(a, b] → μ(a, b]. Thus μ(a, b] ≤ 1 − ε, a contradiction. ∎
Corollary. If {μ_n} is a tight sequence of probability measures, and if each subsequence that converges weakly at all converges weakly to the probability measure μ, then μ_n ⇒ μ.

PROOF. By the theorem, each subsequence {μ_{n_k}} contains a further subsequence {μ_{n_{k(j)}}} converging weakly (j → ∞) to some limit, and that limit must by hypothesis be μ. Thus every subsequence {μ_{n_k}} contains a further subsequence {μ_{n_{k(j)}}} converging weakly to μ. Suppose that μ_n ⇒ μ is false. Then there exists some x such that μ{x} = 0 but μ_n(−∞, x] does not converge to μ(−∞, x]. But then there exists a positive ε such that |μ_{n_k}(−∞, x] − μ(−∞, x]| ≥ ε for an infinite sequence {n_k} of integers, and no subsequence of {μ_{n_k}} can converge weakly to μ. This contradiction shows that μ_n ⇒ μ. ∎
Example 25.10. If μ_n is a unit mass at x_n, then {μ_n} is tight if and only if {x_n} is bounded. The theorem above and its corollary reduce in this case to standard facts about the real line; see Example 25.4 and [A10]: tightness of sequences of probability measures is analogous to boundedness of sequences of real numbers.

Let μ_n be the normal distribution with mean m_n and variance σ_n². If m_n and σ_n² are bounded, then the second moment of μ_n is bounded, and it follows by Markov's inequality (21.12) that {μ_n} is tight. The conclusion of Theorem 25.10 can also be checked directly: If {n_{k(j)}} is chosen so that lim_j m_{n_{k(j)}} = m and lim_j σ²_{n_{k(j)}} = σ², then μ_{n_{k(j)}} ⇒ μ, where μ is normal with mean m and variance σ² (a unit mass at m if σ² = 0).

If m_n ≥ b, then μ_n(b, ∞) ≥ ½; if m_n ≤ a, then μ_n(−∞, a] ≥ ½. Hence {μ_n} cannot be tight if m_n is unbounded. If m_n is bounded, say by K, then μ_n(−∞, a] ≥ ν(−∞, (a − K)σ_n^{−1}], where ν is the standard normal distribution. If σ_n is unbounded, then ν(−∞, (a − K)σ_n^{−1}] → ½ along some subsequence, and {μ_n} cannot be tight. Thus a sequence of normal distributions is tight if and only if the means and variances are bounded. ∎
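The normal case in Example 25.10 can be spot-checked numerically. A hedged sketch (the interval and parameter values are assumptions chosen for illustration, not from the text):

```python
import math

# Illustrative check of Example 25.10: a family of normal laws is tight
# if and only if the means and variances are bounded.
# Phi is the standard normal cdf.
def Phi(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def mass_in(a, b, m, s):
    # mu(a, b] for the normal law with mean m and standard deviation s
    return Phi((b - m) / s) - Phi((a - m) / s)

# Bounded means and variances: one interval works uniformly (tightness).
params = [(m, s) for m in (-2, 0, 2) for s in (0.5, 1, 3)]
assert all(mass_in(-25, 25, m, s) > 1 - 1e-6 for m, s in params)

# Unbounded means: the mass escapes every fixed interval.
assert mass_in(-25, 25, 100, 1) < 1e-6

# Unbounded variances: a fixed interval eventually holds almost no mass.
assert mass_in(-25, 25, 0, 1e4) < 0.01
```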
Integration to the Limit

Theorem 25.11. If X_n ⇒ X, then E[|X|] ≤ lim inf_n E[|X_n|].
PROOF. Apply Skorohod's Theorem 25.6 to the distributions of X_n and X: There exist on a common probability space random variables Y_n and Y such that Y = lim_n Y_n with probability 1, Y_n has the distribution of X_n, and Y has the distribution of X. By Fatou's lemma, E[|Y|] ≤ lim inf_n E[|Y_n|]. Since |X| and |Y| have the same distribution, they have the same expected value (see (21.6)), and similarly for |X_n| and |Y_n|. ∎

The random variables X_n are said to be uniformly integrable if
(25.10)  lim_{α→∞} sup_n ∫_{[|X_n| ≥ α]} |X_n| dP = 0;

see (16.21). This implies (see (16.22)) that
(25.11)  sup_n E[|X_n|] < ∞.

Theorem 25.12. If X_n ⇒ X and the X_n are uniformly integrable, then X is integrable and

(25.12)  E[X_n] → E[X].
PROOF. Construct random variables Y_n and Y as in the preceding proof. Since Y_n → Y with probability 1 and the Y_n are uniformly integrable in the sense of (16.21), E[X_n] = E[Y_n] → E[Y] = E[X] by Theorem 16.14. ∎

If sup_n E[|X_n|^{1+ε}] < ∞ for some positive ε, then the X_n are uniformly integrable, because

(25.13)  ∫_{[|X_n| ≥ α]} |X_n| dP ≤ α^{−ε} ∫_{[|X_n| ≥ α]} |X_n|^{1+ε} dP ≤ α^{−ε} E[|X_n|^{1+ε}].

Since X_n ⇒ X implies that X_n^r ⇒ X^r by Theorem 25.7, there is the following consequence of the theorem.
Corollary. Let r be a positive integer. If X_n ⇒ X and sup_n E[|X_n|^{r+ε}] < ∞, where ε > 0, then E[|X^r|] < ∞ and E[X_n^r] → E[X^r].
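Uniform integrability cannot be dropped from Theorem 25.12. A hedged numerical illustration (the example is a standard one, assumed here for illustration rather than taken from the text): let X_n = n with probability 1/n and 0 otherwise.

```python
# Illustrative sketch (assumed example, not from the text): X_n = n with
# probability 1/n, else 0.  Then X_n => 0, since P[|X_n| > eps] = 1/n -> 0,
# but E[X_n] = n * (1/n) = 1 for every n, so E[X_n] does not converge to
# E[0] = 0.  The family is not uniformly integrable: for any alpha, the
# integral of |X_n| over [|X_n| >= alpha] is still 1 once n >= alpha.
def e_xn(n):
    return n * (1.0 / n)                  # exact expectation of X_n

def tail_integral(n, alpha):
    # integral of |X_n| over the event [|X_n| >= alpha]
    return n * (1.0 / n) if n >= alpha else 0.0

assert all(e_xn(n) == 1.0 for n in (10, 100, 1000))
assert all(tail_integral(n, 50.0) == 1.0 for n in (100, 1000))

# Truncation restores uniform integrability: Y_n = min(X_n, c) is bounded
# by c, and E[Y_n] = c/n -> 0 = E[0], consistent with Theorem 25.12.
c = 5.0
assert abs(c / 1000.0) < 0.01
```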
The X_n are also uniformly integrable if there is an integrable random variable Z such that P[|X_n| ≥ t] ≤ P[|Z| ≥ t] for t ≥ 0, because then (21.10) gives

∫_{[|X_n| ≥ α]} |X_n| dP = αP[|X_n| ≥ α] + ∫_α^∞ P[|X_n| ≥ t] dt ≤ αP[|Z| ≥ α] + ∫_α^∞ P[|Z| ≥ t] dt = ∫_{[|Z| ≥ α]} |Z| dP.

From this the dominated convergence theorem follows again.

PROBLEMS
25.1. (a) Show by example that distribution functions having densities can converge weakly even if the densities do not converge. Hint: Consider f_n(x) = 1 + cos 2πnx on [0, 1].
(b) Let f_n be 2^n times the indicator of the set of x in the unit interval for which d_{n+1}(x) = ··· = d_{2n}(x) = 0, where d_k(x) is the kth dyadic digit. Show that f_n(x) → 0 except on a set of Lebesgue measure 0; on this exceptional set, redefine f_n(x) = 0 for all n, so that f_n(x) → 0 everywhere. Show that the distributions corresponding to these densities converge weakly to Lebesgue measure confined to the unit interval.
(c) Show that distributions with densities can converge weakly to a limit that has no density (even to a unit mass).
(d) Show that discrete distributions can converge weakly to a distribution that has a density.
(e) Construct an example, like that of Example 25.3, in which μ_n(A) → μ(A) fails but in which all the measures come from continuous densities on [0, 1].
25.2. 14.8 ↑ Give a simple proof of the Glivenko-Cantelli theorem (Theorem 20.6) under the extra hypothesis that F is continuous.
25.3. Initial digits. (a) Show that the first significant digit of a positive number x is d (in the scale of 10) if and only if {log₁₀ x} lies between log₁₀ d and log₁₀(d + 1), d = 1, ..., 9, where the braces denote fractional part.
(b) For positive numbers x₁, x₂, ..., let N_n(d) be the number among the first n that have initial digit d. Show that

(25.14)  lim_n (1/n) N_n(d) = log₁₀(d + 1) − log₁₀ d,  d = 1, ..., 9,

if the sequence log₁₀ x_n, n = 1, 2, ..., is uniformly distributed modulo 1. This is true, for example, of x_n = θⁿ if log₁₀ θ is irrational.
(c) Let D_n be the first significant digit of a positive random variable X_n. Show that

(25.15)  lim_n P[D_n = d] = log₁₀(d + 1) − log₁₀ d,  d = 1, ..., 9,
Show that for each probability measure p, on the line there exist probability measures 1-tn with finite support such that p, 11 p,. Show further that ttn{x} can =
CONVERGENCE OF DISTRIBUTIONS
340
be taken rational and that each point in the support can be taken rational. Thus there exists a countable set of probability measures such that every p., is the weak limit of some sequence from the set. The space of distribution fu nctions is thus separable in the Levy metric (see Problem 14.5). 25.5. Show that
(25.5) implies that
P([ X < X ] ll [ xn
< X ]) --+ 0 if
P[ X = X 1 =
0.
25.6. For arbitrary random variables X_n there exist positive constants a_n such that a_n X_n ⇒ 0.
25.7. Generalize Example 25.8 by showing for three-dimensional random vectors (A_n, B_n, X_n) and constants a and b, a > 0, that, if A_n ⇒ a, B_n ⇒ b, and X_n ⇒ X, then A_n X_n + B_n ⇒ aX + b. Hint: First show that if Y_n ⇒ Y and D_n ⇒ 0, then D_n Y_n ⇒ 0.
25.8. Suppose that X_n ⇒ X and that h_n and h are Borel functions. Let E be the set of x for which h_n x_n → hx fails for some sequence x_n → x. Suppose that E ∈ 𝓡¹ and P[X ∈ E] = 0. Show that h_n X_n ⇒ hX.
25.9. Suppose that the distributions of random variables X_n and X have densities f_n and f. Show that if f_n(x) → f(x) for x outside a set of Lebesgue measure 0, then X_n ⇒ X.
25.10. ↑ Suppose that X_n assumes as values γ_n + kδ_n, k = 0, ±1, ..., where δ_n > 0. Suppose that δ_n → 0 and that, if k_n is an integer varying with n in such a way that γ_n + k_n δ_n → x, then P[X_n = γ_n + k_n δ_n] δ_n^{−1} → f(x), where f is the density of a random variable X. Show that X_n ⇒ X.
25.11. ↑ Let S_n have the binomial distribution with parameters n and p. Assume as known that

(25.16)  (np(1 − p))^{1/2} P[S_n = k_n] → (2π)^{−1/2} e^{−x²/2}

if (k_n − np)(np(1 − p))^{−1/2} → x. Deduce the De Moivre-Laplace theorem: (S_n − np)(np(1 − p))^{−1/2} ⇒ N, where N has the standard normal distribution. This is a special case of the central limit theorem; see Section 27.
25.12. Prove weak convergence in Example 25.3 by using Theorem 25.8 and the theory of the Riemann integral.
25.13. (a) Show that probability measures satisfy μ_n ⇒ μ if μ_n(a, b] → μ(a, b] whenever μ{a} = μ{b} = 0.
(b) Show that, if ∫f dμ_n → ∫f dμ for all continuous f with bounded support, then μ_n ⇒ μ.
25.14. ↑ Let μ be Lebesgue measure confined to the unit interval; let μ_n correspond to a mass of x_{n,i} − x_{n,i−1} at some point in (x_{n,i−1}, x_{n,i}], where 0 = x_{n0} < x_{n1} < ··· < x_{nn} = 1. Show by considering the distribution functions that μ_n ⇒ μ if max_{i≤n}(x_{n,i} − x_{n,i−1}) → 0. Deduce that a bounded Borel function continuous almost everywhere on the unit interval is Riemann integrable. See Problem 17.1.
25.15. 2.18 5.19 ↑ A function f of positive integers has distribution function F if F is the weak limit of the distribution function P_n[m: f(m) ≤ x] of f under the measure having probability 1/n at each of 1, ..., n (see (2.34)). In this case D[m: f(m) ≤ x] = F(x) (see (2.35)) for continuity points x of F. Show that φ(m)/m (see (2.37)) has a distribution:
(a) Show by the mapping theorem that it suffices to prove that f(m) = log(φ(m)/m) = Σ_{p|m} log(1 − 1/p) has a distribution.
(b) Let f_u(m) = Σ_{p≤u} δ_p(m) log(1 − 1/p), and show by (5.45) that f_u has distribution function F_u(x) = P[Σ_{p≤u} X_p log(1 − 1/p) ≤ x], where the X_p are independent random variables (one for each prime p) such that P[X_p = 1] = 1/p and P[X_p = 0] = 1 − 1/p.
(c) Show that Σ_p X_p log(1 − 1/p) converges with probability 1. Hint: Use Theorem 22.6.
(d) Show that lim_u sup_n E_n[|f − f_u|] = 0 (see (5.46) for the notation).
(e) Conclude by Markov's inequality and Theorem 25.5 that f has the distribution of the sum in (c).
25.16. For A ∈ 𝓡¹ and T > 0, put λ_T(A) = λ([−T, T] ∩ A)/2T, where λ is Lebesgue measure. The relative measure of A is

(25.17)  ρ(A) = lim_{T→∞} λ_T(A),

provided that this limit exists. This is a continuous analogue of density (see (2.35)) for sets of integers. A Borel function f has a distribution under λ_T; if this converges weakly to F, then

(25.18)  ρ[x: f(x) ≤ u] = F(u)

for continuity points u of F, and F is called the distribution function of f. Show that all periodic functions have distributions.

25.17. Suppose that sup_n ∫f dμ_n < ∞ for a nonnegative f such that f(x) → ∞ as x → ±∞. Show that {μ_n} is tight.

25.18. 23.4 ↑ Show that the random variables A and L in Problems 23.3 and 23.4 converge in distribution. Show that the moments converge.
25.19. In the applications of Theorem 9.2, only a weaker result is actually needed: For each K there exists a positive α = α(K) such that if E[X] = 0, E[X²] = 1, and E[X⁴] ≤ K, then P[X > 0] ≥ α. Prove this by using tightness and the corollary to Theorem 25.12.
25.20. Find uniformly integrable random variables X_n for which there is no integrable Z satisfying P[|X_n| ≥ t] ≤ P[|Z| ≥ t] for t ≥ 0.
SECTION 26. CHARACTERISTIC FUNCTIONS

Definition
The characteristic function of a probability measure μ on the line is defined for real t by

φ(t) = ∫_{−∞}^∞ e^{itx} μ(dx) = ∫_{−∞}^∞ cos tx μ(dx) + i ∫_{−∞}^∞ sin tx μ(dx);

see the end of Section 16 for integrals of complex-valued functions.† A random variable X with distribution μ has characteristic function

φ(t) = E[e^{itX}] = ∫_{−∞}^∞ e^{itx} μ(dx).

The characteristic function is thus defined as the moment generating function but with the real argument s replaced by it; it has the advantage that it always exists because e^{itx} is bounded. The characteristic function in nonprobabilistic contexts is called the Fourier transform.

The characteristic function has three fundamental properties to be established here:

(i) If μ₁ and μ₂ have respective characteristic functions φ₁(t) and φ₂(t), then μ₁ * μ₂ has characteristic function φ₁(t)φ₂(t). Although convolution is essential to the study of sums of independent random variables, it is a complicated operation, and it is often simpler to study the products of the corresponding characteristic functions.

(ii) The characteristic function uniquely determines the distribution. This shows that in studying the products in (i), no information is lost.

(iii) From the pointwise convergence of characteristic functions follows the weak convergence of the corresponding distributions. This makes it possible, for example, to investigate the asymptotic distributions of sums of independent random variables by means of their characteristic functions.

Moments and Derivatives

It is convenient first to study the relation between a characteristic function and the moments of the distribution it comes from.

†From complex variable theory only De Moivre's formula and the simplest properties of the exponential function are needed here.
Of course, φ(0) = 1, and by (16.30), |φ(t)| ≤ 1 for all t. By Theorem 16.8(i), φ(t) is continuous in t. In fact, |φ(t + h) − φ(t)| ≤ ∫|e^{ihx} − 1| μ(dx), and so it follows by the bounded convergence theorem that φ(t) is uniformly continuous.
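These basic facts (φ(0) = 1, |φ(t)| ≤ 1) are easy to verify numerically for a concrete law. A hedged sketch (quadrature step and tolerances are assumptions for illustration): the exponential distribution's characteristic function is approximated by a Riemann sum and compared with the closed form 1/(1 − it), which appears in the table later in this section.

```python
import cmath
import math

# Illustrative sketch: approximate phi(t) = E[e^{itX}] for X exponential
# by a midpoint Riemann sum of e^{itx} e^{-x} over [0, xmax], and compare
# with the closed form 1/(1 - it).  Step h and cutoff xmax are assumptions.
def phi_exponential(t, h=1e-3, xmax=40.0):
    s = 0.0 + 0.0j
    x = 0.5 * h
    while x < xmax:
        s += cmath.exp(1j * t * x) * math.exp(-x) * h
        x += h
    return s

for t in (0.0, 0.7, -2.0, 5.0):
    exact = 1.0 / (1.0 - 1j * t)
    assert abs(phi_exponential(t) - exact) < 1e-3   # quadrature agrees
    assert abs(phi_exponential(t)) <= 1.0 + 1e-9    # |phi(t)| <= 1
```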
In the following relations, versions of Taylor's formula with remainder, x is assumed real. Integration by parts shows that

(26.1)  ∫₀^x (x − s)^n e^{is} ds = x^{n+1}/(n + 1) + (i/(n + 1)) ∫₀^x (x − s)^{n+1} e^{is} ds,

and it follows by induction that

(26.2)  e^{ix} = Σ_{k=0}^n (ix)^k/k! + (i^{n+1}/n!) ∫₀^x (x − s)^n e^{is} ds

for n ≥ 0. Replace n by n − 1 in (26.1), solve for the integral on the right, and substitute this for the integral in (26.2); this gives

(26.3)  e^{ix} = Σ_{k=0}^n (ix)^k/k! + (i^n/(n − 1)!) ∫₀^x (x − s)^{n−1} (e^{is} − 1) ds.

Estimating the integrals in (26.2) and (26.3) (consider separately the cases x ≥ 0 and x ≤ 0) now leads to

(26.4)  |e^{ix} − Σ_{k=0}^n (ix)^k/k!| ≤ min{|x|^{n+1}/(n + 1)!, 2|x|^n/n!}

for n ≥ 0. The first term on the right gives a sharp estimate for |x| small, the second a sharp estimate for |x| large. For n = 0, 1, 2, the inequality specializes to

(26.4₀)  |e^{ix} − 1| ≤ min{|x|, 2},
(26.4₁)  |e^{ix} − (1 + ix)| ≤ min{½x², 2|x|},
(26.4₂)  |e^{ix} − (1 + ix − ½x²)| ≤ min{⅙|x|³, x²}.

If X has a moment of order n, it follows that

(26.5)  |φ(t) − Σ_{k=0}^n ((it)^k/k!) E[X^k]| ≤ E[min{|tX|^{n+1}/(n + 1)!, 2|tX|^n/n!}].
For any t satisfying

(26.6)  Σ_{k=0}^∞ (|t|^k/k!) E[|X|^k] = E[e^{|tX|}] < ∞,

φ(t) must therefore have the expansion

(26.7)  φ(t) = Σ_{k=0}^∞ ((it)^k/k!) E[X^k];

compare (21.22). If E[e^{|tX|}] < ∞, then (see (16.31)) (26.7) must hold. Thus (26.7) holds if X has a moment generating function over the whole line.

Example 26.1. Since E[e^{|tX|}] < ∞ if X has the standard normal distribution, by (26.7) and (21.7) its characteristic function is

(26.8)  φ(t) = Σ_{k=0}^∞ ((it)^{2k}/(2k)!) · 1 × 3 × ··· × (2k − 1) = Σ_{k=0}^∞ (1/k!) (−t²/2)^k = e^{−t²/2}.

This and (21.25) formally coincide if s = it. ∎
If the power-series expansion (26.7) holds, the moments of X can be read off from it:

(26.9)  E[X^k] = i^{−k} φ^{(k)}(0).

This is the analogue of (21.23). It holds, however, under the weakest possible assumption, namely that E[|X^k|] < ∞. Indeed,

(φ(t + h) − φ(t))/h − E[iXe^{itX}] = E[e^{itX} ((e^{ihX} − 1)/h − iX)].

By (26.4₁), the integrand on the right is dominated by 2|X| and goes to 0 with h; hence the expected value goes to 0 by the dominated convergence theorem. Thus φ'(t) = E[iXe^{itX}]. Repeating this argument inductively gives

(26.10)  φ^{(k)}(t) = E[(iX)^k e^{itX}] = i^k E[X^k e^{itX}]
if E[|X^k|] < ∞. Hence (26.9) holds if E[|X^k|] < ∞. The proof of uniform continuity for φ(t) works for φ^{(k)}(t) as well.

If E[X²] is finite, then

(26.11)  φ(t) = 1 + itE[X] − ½t²E[X²] + o(t²),  t → 0.

Indeed, by (26.4₂), the error is at most t²E[min{|t||X|³, X²}], and as t → 0 the integrand goes to 0 and is dominated by X². Estimates of this kind are essential for proving limit theorems.

The more moments μ has, the more derivatives φ has. This is one sense in which lightness of the tails of μ is reflected by smoothness of φ. There are results which connect the behavior of φ(t) as |t| → ∞ with smoothness properties of μ. The Riemann-Lebesgue theorem is the most important of these:

Theorem 26.1. If μ has a density, then φ(t) → 0 as |t| → ∞.
PROOF. The problem is to prove for integrable f that ∫f(x)e^{itx} dx → 0 as |t| → ∞. There exists by Theorem 17.1 a step function g = Σ_k a_k I_{A_k}, a finite linear combination of indicators of intervals A_k = (a_k, b_k], for which ∫|f − g| dx < ε. Now ∫f(x)e^{itx} dx differs by at most ε from ∫g(x)e^{itx} dx = Σ_k a_k (e^{itb_k} − e^{ita_k})/(it), and this goes to 0 as |t| → ∞. ∎
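The contrast in Theorem 26.1 between measures with and without densities can be seen in closed form. A hedged illustration (the chosen values of t are arbitrary): the uniform density on (0, 1) has characteristic function (e^{it} − 1)/(it), which decays, while a unit point mass at 0 has characteristic function identically 1.

```python
import cmath

# Illustrative check of the Riemann-Lebesgue contrast: a density forces
# phi(t) -> 0 as |t| -> infinity; a point mass does not.
def cf_uniform(t):
    # characteristic function of the uniform distribution on (0, 1)
    return (cmath.exp(1j * t) - 1.0) / (1j * t)

def cf_point_mass_at_zero(t):
    return 1.0 + 0.0j              # phi(t) = E[e^{it*0}] = 1 for all t

mags = [abs(cf_uniform(t)) for t in (10.0, 100.0, 1000.0)]
assert mags[0] > mags[1] > mags[2]          # decays (at worst like 2/|t|)
assert mags[2] < 0.01
assert all(abs(cf_point_mass_at_zero(t)) == 1.0 for t in (10.0, 1000.0))
```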
Independence
The multiplicative property (21.28) of moment generating functions extends to characteristic functions. Suppose that X₁ and X₂ are independent random variables with characteristic functions φ₁ and φ₂. If Y_j = cos tX_j and Z_j = sin tX_j, then (Y₁, Z₁) and (Y₂, Z₂) are independent; by the rules for integrating complex-valued functions,

φ₁(t)φ₂(t) = (E[Y₁] + iE[Z₁])(E[Y₂] + iE[Z₂])
  = E[Y₁]E[Y₂] − E[Z₁]E[Z₂] + i(E[Y₁]E[Z₂] + E[Z₁]E[Y₂])
  = E[Y₁Y₂ − Z₁Z₂ + i(Y₁Z₂ + Z₁Y₂)] = E[e^{it(X₁+X₂)}].

This extends to sums of three or more: If X₁, ..., X_n are independent, then

(26.12)  E[e^{it Σ_{k=1}^n X_k}] = Π_{k=1}^n E[e^{itX_k}].
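The product rule (26.12) can be spot-checked numerically. A hedged sketch (the distributions, grid size, and tolerance are assumptions for illustration): the sum of two independent U(0, 1) variables has the triangular density x on [0, 1] and 2 − x on [1, 2], so its characteristic function should equal the square of (e^{it} − 1)/(it).

```python
import cmath

# Illustrative check of (26.12): characteristic function of a sum of two
# independent U(0,1) variables, computed from the triangular density of
# the sum, versus the product (here: square) of the uniform cf.
def cf_triangular(t, n=20000):
    h = 2.0 / n
    s = 0.0 + 0.0j
    for k in range(n):
        x = (k + 0.5) * h
        f = x if x <= 1.0 else 2.0 - x     # density of U(0,1) + U(0,1)
        s += cmath.exp(1j * t * x) * f * h
    return s

def cf_uniform(t):
    return (cmath.exp(1j * t) - 1.0) / (1j * t)

for t in (0.3, 1.0, -4.0):
    assert abs(cf_triangular(t) - cf_uniform(t) ** 2) < 1e-5
```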
If X has characteristic function φ(t), then aX + b has characteristic function

(26.13)  E[e^{it(aX+b)}] = e^{itb} φ(at).

In particular, −X has characteristic function φ(−t), which is the complex conjugate of φ(t).
Inversion and the Uniqueness Theorem

A characteristic function φ uniquely determines the measure μ it comes from. This fundamental fact will be derived by means of an inversion formula through which μ can in principle be recovered from φ. Define

S(T) = ∫₀^T (sin t/t) dt,  T > 0.

In Example 18.4 it is shown that

(26.14)  lim_{T→∞} S(T) = π/2;

S(T) is therefore bounded. If sgn θ is +1, 0, or −1 as θ is positive, 0, or negative, then

(26.15)  ∫₀^T (sin tθ/t) dt = sgn θ · S(T|θ|).

Theorem 26.2. If the probability measure μ has characteristic function φ, and if μ{a} = μ{b} = 0, then

(26.16)  μ(a, b] = lim_{T→∞} (1/2π) ∫_{−T}^T ((e^{−ita} − e^{−itb})/(it)) φ(t) dt.

Distinct measures cannot have the same characteristic function.

Note: By (26.4₁) the integrand here converges as t → 0 to b − a, which is to be taken as its value at t = 0. For fixed a and b the integrand is thus continuous in t, and by (26.4₀) it is bounded. If μ is a unit mass at 0, then φ(t) ≡ 1, and the integral in (26.16) cannot be extended over the whole line.
PROOF. The inversion formula will imply uniqueness: It will imply that if μ and ν have the same characteristic function, then μ(a, b] = ν(a, b] if μ{a} = ν{a} = μ{b} = ν{b} = 0; but such intervals (a, b] form a π-system generating 𝓡¹.
Denote by I_T the quantity inside the limit in (26.16). By Fubini's theorem,

(26.17)  I_T = ∫_{−∞}^∞ [ (1/2π) ∫_{−T}^T ((e^{it(x−a)} − e^{it(x−b)})/(it)) dt ] μ(dx).

This interchange is legitimate because the double integral extends over a set of finite product measure and by (26.4₀) the integrand is bounded by |b − a|. Rewrite the integrand by De Moivre's formula. Since sin s and cos s are odd and even, respectively, (26.15) gives

I_T = ∫_{−∞}^∞ [ (sgn(x − a)/π) S(T|x − a|) − (sgn(x − b)/π) S(T|x − b|) ] μ(dx).

The integrand here is bounded and converges as T → ∞ to the function

(26.18)  ψ_{a,b}(x) = 0 if x < a;  ½ if x = a;  1 if a < x < b;  ½ if x = b;  0 if b < x.

Thus I_T → ∫ψ_{a,b} dμ, which implies that (26.16) holds if μ{a} = μ{b} = 0. ∎

The inversion formula contains further information. Suppose that

(26.19)  ∫_{−∞}^∞ |φ(t)| dt < ∞.
In this case the integral in (26.16) can be extended over R¹. By (26.4₀),

|(e^{−itb} − e^{−ita})/(it)| = |e^{it(b−a)} − 1|/|t| ≤ |b − a|;

therefore, μ(a, b] ≤ ((b − a)/2π) ∫_{−∞}^∞ |φ(t)| dt, and there can be no point masses. By (26.16), the corresponding distribution function satisfies

(F(x + h) − F(x))/h = (1/2π) ∫_{−∞}^∞ ((e^{−itx} − e^{−it(x+h)})/(ith)) φ(t) dt

(whether h is positive or negative). The integrand is by (26.4₀) dominated by |φ(t)| and goes to e^{−itx}φ(t) as h → 0. Therefore, F has derivative

(26.20)  f(x) = (1/2π) ∫_{−∞}^∞ e^{−itx} φ(t) dt.
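The inversion formula (26.20) lends itself to a direct numerical check. A hedged sketch (grid, cutoff, and tolerance are assumptions for illustration): inserting the standard normal characteristic function e^{−t²/2} from Example 26.1 should return the standard normal density.

```python
import cmath
import math

# Illustrative check of (26.20): numerically invert phi(t) = exp(-t^2/2)
# and compare with the standard normal density.  Step h and cutoff tmax
# are chosen so the truncation and quadrature errors are negligible.
def density_from_cf(x, h=0.01, tmax=12.0):
    s = 0.0 + 0.0j
    t = -tmax + 0.5 * h
    while t < tmax:
        s += cmath.exp(-1j * t * x) * math.exp(-t * t / 2.0) * h
        t += h
    return s.real / (2.0 * math.pi)

for x in (0.0, 0.5, -1.3):
    target = math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)
    assert abs(density_from_cf(x) - target) < 1e-6
```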
Since f is continuous for the same reason φ is, it integrates to F by the fundamental theorem of the calculus (see (17.6)). Thus (26.19) implies that μ has the continuous density (26.20). Moreover, this is the only continuous density. In this result, as in the Riemann-Lebesgue theorem, conditions on the size of φ(t) for large |t| are connected with smoothness properties of μ.

The inversion formula (26.20) has many applications. In the first place, it can be used for a new derivation of (26.14). As pointed out in Example 17.3, the existence of the limit in (26.14) is easy to prove. Denote this limit temporarily by π₀/2, without assuming that π₀ = π. Then (26.16) and (26.20) follow as before if π is replaced by π₀. Applying the latter to the standard normal density (see (26.8)) gives

(26.21)  (1/√(2π)) e^{−x²/2} = (1/2π₀) ∫_{−∞}^∞ e^{−itx} e^{−t²/2} dt,

where the π on the left is that of analysis and geometry: it comes ultimately from the quadrature (18.10). An application of (26.8) with x and t interchanged reduces the right side of (26.21) to (√(2π)/2π₀) e^{−x²/2}, and therefore π₀ does equal π.

Consider the densities in the table. The characteristic function for the normal distribution has already been calculated. For the uniform distribution over (0, 1), the computation is of course straightforward; note that in this case the density cannot be recovered from (26.20), because φ(t) is not integrable; this is reflected in the fact that the density has discontinuities at 0 and 1.
Distribution              Density                      Interval            Characteristic function

1. Normal                 (2π)^{−1/2} e^{−x²/2}        −∞ < x < ∞          e^{−t²/2}
2. Uniform                1                            0 < x < 1           (e^{it} − 1)/(it)
3. Exponential            e^{−x}                       0 < x < ∞           1/(1 − it)
4. Double exponential
   or Laplace             ½ e^{−|x|}                   −∞ < x < ∞          1/(1 + t²)
5. Cauchy                 (1/π) · 1/(1 + x²)           −∞ < x < ∞          e^{−|t|}
6. Triangular             1 − |x|                      −1 < x < 1          2(1 − cos t)/t²
7.                        (1/π)(1 − cos x)/x²          −∞ < x < ∞          (1 − |t|) I_{(−1,1)}(t)
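Two rows of the table can be verified by direct quadrature. A hedged sketch (grid sizes and tolerances are assumptions for illustration): the double exponential density should give 1/(1 + t²), and the triangular density should give 2(1 − cos t)/t².

```python
import cmath
import math

# Illustrative spot-check of two table rows by midpoint quadrature.
def cf(density, a, b, t, n=40000):
    h = (b - a) / n
    s = sum(cmath.exp(1j * t * (a + (k + 0.5) * h)) * density(a + (k + 0.5) * h)
            for k in range(n)) * h
    return s

for t in (0.5, 2.0, -3.0):
    # Row 4: double exponential (Laplace) density (1/2) e^{-|x|}
    lap = cf(lambda x: 0.5 * math.exp(-abs(x)), -40.0, 40.0, t)
    assert abs(lap - 1.0 / (1.0 + t * t)) < 1e-4
    # Row 6: triangular density 1 - |x| on (-1, 1)
    tri = cf(lambda x: 1.0 - abs(x), -1.0, 1.0, t)
    assert abs(tri - 2.0 * (1.0 - math.cos(t)) / (t * t)) < 1e-6
```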
The characteristic function for the exponential distribution is easily calculated; compare Example 21.3. As for the double exponential or Laplace distribution, e^{−|x|} e^{itx} integrates over (0, ∞) to (1 − it)^{−1} and over (−∞, 0) to (1 + it)^{−1}, which gives the result. By (26.20), then,

½ e^{−|x|} = (1/2π) ∫_{−∞}^∞ e^{−itx} (1/(1 + t²)) dt.

For x = 0 this gives the standard integral ∫_{−∞}^∞ dt/(1 + t²) = π; see Example 17.5. Thus the Cauchy density in the table integrates to 1 and has characteristic function e^{−|t|}. This distribution has no first moment, and the characteristic function is not differentiable at the origin.

A straightforward integration shows that the triangular density has the characteristic function given in the table, and by (26.20),

(1 − |x|) I_{(−1,1)}(x) = (1/π) ∫_{−∞}^∞ e^{−itx} ((1 − cos t)/t²) dt.

For x = 0 this is ∫_{−∞}^∞ (1 − cos t) t^{−2} dt = π; hence the last line of the table. Each density and characteristic function in the table can be transformed by (26.13), which gives a family of distributions.
The Continuity Theorem
Because of (26.12), the characteristic function provides a powerful means of studying the distributions of sums of independent random variables. It is often easier to work with products of characteristic functions than with convolutions, and knowing the characteristic function of the sum is by Theorem 26.2 in principle the same thing as knowing the distribution itself. Because of the following continuity theorem, characteristic functions can be used to study limit distributions.

Theorem 26.3. Let μ_n, μ be probability measures with characteristic functions φ_n, φ. A necessary and sufficient condition for μ_n ⇒ μ is that φ_n(t) → φ(t) for each t.

PROOF. Necessity. For each t, e^{itx} has bounded modulus and is continuous in x. The necessity therefore follows by an application of Theorem 25.8 (to the real and imaginary parts of e^{itx}).
Sufficiency. By Fubini's theorem,

(26.22)  (1/u) ∫_{−u}^u (1 − φ_n(t)) dt = 2 ∫_{−∞}^∞ (1 − sin ux/(ux)) μ_n(dx) ≥ 2 ∫_{[|x| ≥ 2/u]} (1 − 1/(u|x|)) μ_n(dx) ≥ μ_n[x: |x| ≥ 2/u].

(Note that the first integral is real.) Since φ is continuous at the origin and φ(0) = 1, there is for positive ε a u for which u^{−1} ∫_{−u}^u (1 − φ(t)) dt < ε. Since φ_n converges to φ, the bounded convergence theorem implies that there exists an n₀ such that u^{−1} ∫_{−u}^u (1 − φ_n(t)) dt < 2ε for n ≥ n₀. If a = 2/u in (26.22), then μ_n[x: |x| ≥ a] ≤ 2ε for n ≥ n₀. Increasing a if necessary will ensure that this inequality also holds for the finitely many n preceding n₀. Therefore, {μ_n} is tight.

By the corollary to Theorem 25.10, μ_n ⇒ μ will follow if it is shown that each subsequence {μ_{n_k}} that converges weakly at all converges weakly to μ. But if μ_{n_k} ⇒ ν as k → ∞, then by the necessity half of the theorem, already proved, ν has characteristic function lim_k φ_{n_k}(t) = φ(t). By Theorem 26.2, ν and μ must coincide. ∎
Two corollaries, interesting in themselves, will make clearer the structure of the proof of sufficiency given above. In each, let μ_n be probability measures on the line with characteristic functions φ_n.

Corollary 1. Suppose that lim_n φ_n(t) = g(t) for each t, where the limit function g is continuous at 0. Then there exists a μ such that μ_n ⇒ μ, and μ has characteristic function g.

PROOF. The point of the corollary is that g is not assumed at the outset to be a characteristic function. But in the argument following (26.22), only φ(0) = 1 and the continuity of φ at 0 were used; hence {μ_n} is tight under the present hypothesis. If μ_{n_k} ⇒ ν as k → ∞, then ν must have characteristic function lim_k φ_{n_k}(t) = g(t). Thus g is, in fact, a characteristic function, and the proof goes through as before. ∎
In this proof the continuity of g was used to establish tightness. Hence if {μ_n} is assumed tight in the first place, the hypothesis of continuity can be suppressed:

Corollary 2. Suppose that lim_n φ_n(t) = g(t) exists for each t and that {μ_n} is tight. Then there exists a μ such that μ_n ⇒ μ, and μ has characteristic function g.

Example 26.2. This second corollary applies, for example, if the μ_n have a common bounded support. If μ_n is the uniform distribution over (−n, n), its characteristic function is (nt)^{−1} sin tn for t ≠ 0, and hence it converges to I_{{0}}(t). In this case {μ_n} is not tight, the limit function is not continuous at 0, and μ_n does not converge weakly. ∎
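Theorem 26.3 is how the central limit theorem is usually approached, and the convergence of characteristic functions can be watched numerically. A hedged sketch (parameter p, the value of t, and the tolerances are assumptions for illustration): the characteristic functions of the standardized binomial converge pointwise to e^{−t²/2}, so by the continuity theorem the distributions converge weakly (the De Moivre-Laplace theorem of Problem 25.11).

```python
import cmath
import math

# Illustrative check of Theorem 26.3 for the standardized binomial (n, p):
# phi_{S_n}(t) = (1 - p + p e^{it})^n, so the standardized variable
# (S_n - np)/sigma, sigma = sqrt(np(1-p)), has the cf below.
def cf_standardized_binomial(t, n, p=0.3):
    sigma = math.sqrt(n * p * (1.0 - p))
    u = t / sigma
    return cmath.exp(-1j * u * n * p) * (1.0 - p + p * cmath.exp(1j * u)) ** n

t = 1.7
target = math.exp(-t * t / 2.0)     # standard normal cf from Example 26.1
errs = [abs(cf_standardized_binomial(t, n) - target) for n in (10, 100, 10000)]
assert errs[0] > errs[2]            # pointwise convergence sets in
assert errs[2] < 1e-2
```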
Fourier Series*

Let μ be a probability measure on 𝓡¹ that is supported by [0, 2π]. Its Fourier coefficients are defined by

(26.23)  c_m = ∫₀^{2π} e^{imx} μ(dx),  m = 0, ±1, ±2, ....

These coefficients, the values of the characteristic function for integer arguments, suffice to determine μ except for the weights it may put at 0 and 2π. The relation between μ and its Fourier coefficients can be expressed formally by

(26.24)  μ(dx) ~ (1/2π) Σ_{l=−∞}^∞ c_l e^{−ilx} dx:

if the μ(dx) in (26.23) is replaced by the right side of (26.24), and if the sum over l is interchanged with the integral, the result is a formal identity.

To see how to recover μ from its Fourier coefficients, consider the symmetric partial sums s_m(t) = (2π)^{−1} Σ_{l=−m}^m c_l e^{−ilt} and their Cesàro averages σ_m(t) = m^{−1} Σ_{l=0}^{m−1} s_l(t). From the trigonometric identity [A24]

(26.25)  Σ_{l=0}^{m−1} Σ_{k=−l}^l e^{ikx} = sin²(½mx)/sin²(½x)

it follows that

(26.26)  σ_m(t) = (1/2πm) ∫₀^{2π} (sin²(½m(t − x))/sin²(½(t − x))) μ(dx).

*This topic may be omitted.
If μ is (2π)^{−1} times Lebesgue measure confined to [0, 2π], then c₀ = 1 and c_m = 0 for m ≠ 0, so that σ_m(t) = s_m(t) = (2π)^{−1}; this gives the identity

(26.27)  (1/2πm) ∫_{−π}^π (sin²(½ms)/sin²(½s)) ds = 1.
Suppose that 0 < a < b < 2π, and integrate (26.26) over (a, b). Fubini's theorem (the integrand is nonnegative) and a change of variable lead to

(26.28)  ∫_a^b σ_m(t) dt = ∫₀^{2π} [ (1/2πm) ∫_{a−x}^{b−x} (sin²(½ms)/sin²(½s)) ds ] μ(dx).

The denominator in (26.27) is bounded away from 0 outside (−δ, δ), and so as m goes to ∞ with δ fixed (0 < δ < π),

(1/2πm) ∫_{δ ≤ |s| ≤ π} (sin²(½ms)/sin²(½s)) ds → 0.

By (26.27) and (26.28), the inner integral in (26.28) therefore goes to 1 for a < x < b, to ½ for x = a and for x = b, and to 0 for the remaining x, so that ∫_a^b σ_m(t) dt → μ(a, b) + ½μ{a} + ½μ{b}. The Fourier coefficients thus determine μ apart from the way it splits its mass between 0 and 2π.

Now let μ_n be probability measures supported by [0, 2π] with Fourier coefficients c_m^{(n)}, and suppose that lim_n c_m^{(n)} = c_m for all m, where the c_m are the Fourier coefficients of μ. Since {μ_n} is tight, μ_n ⇒ μ will hold if μ_{n_k} ⇒ ν (k → ∞) implies ν = μ. But in this case ν and μ have the same coefficients c_m, and hence they are identical except perhaps in the way they split the mass ν{0, 2π} = μ{0, 2π} between the points 0 and 2π. But this poses no problem if μ{0, 2π} = 0: If lim_n c_m^{(n)} = c_m for all m and μ{0} = μ{2π} = 0, then μ_n ⇒ μ.
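The recovery of interval masses from Fourier coefficients can be sketched numerically. A hedged illustration (the example measure and all numerical parameters are assumptions, not from the text): take μ with density (1 + cos x)/(2π) on [0, 2π], whose Fourier coefficients (26.23) are c₀ = 1, c₁ = c₋₁ = ½, and c_m = 0 otherwise; integrating the Cesàro averages σ_m over (a, b) should approach μ(a, b).

```python
import math

# Illustrative sketch of the recovery argument behind (26.26)-(26.28)
# for mu(dx) = (1 + cos x)/(2 pi) dx on [0, 2 pi].
def c(m):
    return {0: 1.0, 1: 0.5, -1: 0.5}.get(m, 0.0)

def sigma(t, m):
    # Cesaro average of the symmetric partial sums s_l(t), l = 0, ..., m-1;
    # the coefficients here satisfy c_l = c_{-l}, so the paired terms
    # c_l e^{-ilt} + c_{-l} e^{ilt} reduce to 2 c_l cos(lt).
    s = c(0)                     # running partial sum (up to the 1/2pi)
    total = s
    for l in range(1, m):
        s += 2.0 * c(l) * math.cos(l * t)
        total += s
    return total / (m * 2.0 * math.pi)

a, b, n, m = 1.0, 4.0, 2000, 200
h = (b - a) / n
approx = sum(sigma(a + (k + 0.5) * h, m) * h for k in range(n))
exact = (b - a + math.sin(b) - math.sin(a)) / (2.0 * math.pi)   # mu(a, b)
assert abs(approx - exact) < 0.01
```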
Example 26.3. If μ is (2π)⁻¹ times Lebesgue measure confined to the interval [0, 2π], the condition is that lim_n c_m^{(n)} = 0 for m ≠ 0. Let x₁, x₂, ... be a sequence of reals, and let μ_n put mass n⁻¹ at each point 2π{x_k}, 1 ≤ k ≤ n, where {x_k} = x_k − [x_k] denotes fractional part. This is the probability measure (25.3) rescaled to [0, 2π]. The sequence x₁, x₂, ... is uniformly distributed modulo 1 if and only if

(1/n) Σ_{k=1}^{n} e^{2πimx_k} → 0

for m ≠ 0. This is Weyl's criterion.
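Weyl's criterion is easy to probe numerically. The following sketch is an illustration of mine, not part of the text; the choice θ = √2 and the sample size are arbitrary.

```python
import cmath

# Numeric sketch of Weyl's criterion (illustration, not from the text):
# for irrational theta and x_k = k*theta, the averages
# n^{-1} sum_{k<=n} e^{2*pi*i*m*k*theta} should be near 0 for integer m != 0.
theta = 2 ** 0.5          # an irrational rotation number (my choice)
n = 100_000
weyl_avgs = {}
for m in (1, 2, 5):
    s = sum(cmath.exp(2j * cmath.pi * m * k * theta) for k in range(1, n + 1))
    weyl_avgs[m] = abs(s / n)
    print(f"m = {m}: |average| = {weyl_avgs[m]:.2e}")
```

The geometric-series computation in the x_k = kθ discussion below shows these averages are in fact O(1/n), which is why the printed moduli are tiny.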
If x_k = kθ, where θ is irrational, then exp(2πiθm) ≠ 1 for m ≠ 0 and hence

(1/n) Σ_{k=1}^{n} e^{2πikθm} = (e^{2πiθm}/n) · (1 − e^{2πinθm}) / (1 − e^{2πiθm}) → 0.

Thus θ, 2θ, 3θ, ... is uniformly distributed modulo 1 if θ is irrational, which gives another proof of Theorem 25.1. ∎

PROBLEMS
26.1. A random variable X has a lattice distribution if for some a and b, b > 0, the lattice [a + nb: n = 0, ±1, ...] supports the distribution of X. Let X have characteristic function φ.
(a) Show that a necessary condition for X to have a lattice distribution is that |φ(t)| = 1 for some t ≠ 0.
(b) Show that the condition is sufficient as well.
(c) Suppose that |φ(t)| = |φ(t′)| = 1 for incommensurable t and t′ (t ≠ 0, t′ ≠ 0, t/t′ irrational). Show that P[X = c] = 1 for some constant c.
26.2. If μ(−∞, x] = μ[−x, ∞) for all x (which implies that μ(A) = μ(−A) for all A ∈ ℛ¹), then μ is symmetric. Show that this holds if and only if the characteristic function is real.
26.3. Consider functions φ that are real and nonnegative and satisfy φ(−t) = φ(t) and φ(0) = 1.
(a) Suppose that d₁, d₂, ... are positive and Σ_k d_k = ∞, that s₁ > s₂ > ··· > 0 and lim_k s_k = 0, and that Σ_k s_k d_k = 1. Let φ be the convex polygon whose successive sides have slopes −s₁, −s₂, ... and lengths d₁, d₂, ... when projected on the horizontal axis: φ has value 1 − Σ_{j=1}^{k} s_j d_j at t_k = d₁ + ··· + d_k. If s_n = 0, there are in effect only n sides. Let φ₀(t) = (1 − |t|)I_{(−1,1)}(t) be the characteristic function in the last line in the table on p. 348, and show that φ(t) is a convex combination of the characteristic functions φ₀(t/t_k) and hence is itself a characteristic function.
(b) Pólya's criterion. Show that φ is a characteristic function if it is even and continuous and, on [0, ∞), nonincreasing and convex (φ(0) = 1).
26.4. ↑ Let φ₁ and φ₂ be characteristic functions, and show that the set A = [t: φ₁(t) = φ₂(t)] is closed, contains 0, and is symmetric about 0. Show that every set with these three properties can be such an A. What does this say about the uniqueness theorem?
26.5. Show by Theorem 26.1 and integration by parts that if μ has a density f with integrable derivative f′, then φ(t) = o(t⁻¹) as |t| → ∞. Extend to higher derivatives.
26.6. Show for independent random variables uniformly distributed over (−1, +1) that X₁ + ··· + X_n has density π⁻¹ ∫₀^∞ ((sin t)/t)ⁿ cos tx dt for n ≥ 2.
26.7. 21.17 ↑ Uniqueness theorem for moment generating functions. Suppose that F has a moment generating function in (−s₀, s₀), s₀ > 0. From the fact that ∫_{−∞}^{∞} e^{zx} dF(x) is analytic in the strip −s₀ < Re z < s₀, prove that the moment generating function determines F. Show that it is enough that the moment generating function exist in [0, s₀), s₀ > 0.
26.8. 21.20 26.7 ↑ Show that the gamma density (20.47) has characteristic function

(1 − it/α)^{−u} = exp[−u log(1 − it/α)],

where the logarithm is the principal part. Show that ∫₀^∞ e^{zx} f(x; α, u) dx is analytic for Re z < α.

26.9. Use characteristic functions for a simple proof that the family of Cauchy distributions defined by (20.45) is closed under convolution; compare the argument in Problem 20.14(a). Do the same for the normal distribution (compare Example 20.6) and for the Poisson and gamma distributions.
26.10. Suppose that F_n ⇒ F and that the characteristic functions are dominated by an integrable function. Show that F has a density that is the limit of the densities of the F_n.

26.11. Show for all a and b that the right side of (26.16) is μ(a, b) + ½μ{a} + ½μ{b}.
26.12. By the kind of argument leading to (26.16), show that

(26.30)   μ{a} = lim_{T→∞} (1/2T) ∫_{−T}^{T} e^{−ita} φ(t) dt.

26.13. ↑ Let x₁, x₂, ... be the points of positive μ-measure. By the following steps, prove that

(26.31)   lim_{T→∞} (1/2T) ∫_{−T}^{T} |φ(t)|² dt = Σ_k (μ{x_k})².
Let X and Y be independent and have characteristic function φ.
(a) Show by (26.30) that the left side of (26.31) is P[X − Y = 0].
(b) Show (Theorem 20.3) that P[X − Y = 0] = ∫_{−∞}^{∞} P[X = y] μ(dy) = Σ_k (μ{x_k})².

26.14. ↑ Show that μ has no point masses if |φ(t)|² is integrable.
26.15. (a) Show that if {μ_n} is tight, then the characteristic functions φ_n(t) are uniformly equicontinuous (for each ε there is a δ such that |s − t| < δ implies that |φ_n(s) − φ_n(t)| < ε for all n).
(b) Show that μ_n ⇒ μ implies that φ_n(t) → φ(t) uniformly on bounded sets.
(c) Show that the convergence in part (b) need not be uniform over the entire line.
26.16. 14.5 26.15 ↑ For distribution functions F and G, define d′(F, G) = sup_t |φ(t) − ψ(t)|/(1 + |t|), where φ and ψ are the corresponding characteristic functions. Show that this is a metric and equivalent to the Lévy metric.
26.17. 25.16 ↑ A real function f has mean value

(26.32)   M[f(x)] = lim_{T→∞} (1/2T) ∫_{−T}^{T} f(x) dx,

provided that f is integrable over each [−T, T] and the limit exists.
(a) Show that, if f is bounded and e^{itf(x)} has a mean value for each t, then f has a distribution in the sense of (25.18).
(b) Show that

(26.33)   M[e^{itx}] = 1 if t = 0, and 0 if t ≠ 0.

Of course, f(x) = x has no distribution.
26.18. Suppose that X is irrational with probability 1. Let μ_n be the distribution of the fractional part {nX}. Use the continuity theorem and Theorem 25.1 to show that n⁻¹ Σ_{k=1}^{n} μ_k converges weakly to the uniform distribution on [0, 1].

26.19. 25.13 ↑ The uniqueness theorem for characteristic functions can be derived from the Weierstrass approximation theorem. Fill in the details of the following argument. Let μ and ν be probability measures on the line. For continuous f with bounded support choose a so that μ(−a, a) and ν(−a, a) are nearly 1 and f vanishes outside (−a, a). Let g be periodic and agree with f in (−a, a), and by the Weierstrass theorem uniformly approximate g(x) by a trigonometric sum p(x) = Σ_{k=1}^{N} a_k e^{it_k x}. If μ and ν have the same characteristic function, then ∫f dμ ≈ ∫g dμ ≈ ∫p dμ = ∫p dν ≈ ∫g dν ≈ ∫f dν.
26.20. Use the continuity theorem to prove the result in Example 25.2 concerning the convergence of the binomial distribution to the Poisson.
26.21. According to Example 25.8, if X_n ⇒ X, a_n → a, and b_n → b, then a_n X_n + b_n ⇒ aX + b. Prove this by means of characteristic functions.

26.22. 26.1 26.15 ↑ According to Theorem 14.2, if X_n ⇒ X and a_n X_n + b_n ⇒ Y, where a_n > 0 and the distributions of X and Y are nondegenerate, then a_n → a > 0, b_n → b, and aX + b and Y have the same distribution. Prove this by characteristic functions. Let φ_n, φ, ψ be the characteristic functions of X_n, X, Y.
(a) Show that |φ_n(a_n t)| → |ψ(t)| uniformly on bounded sets and hence that a_n cannot converge to 0 along a subsequence.
(b) Interchange the roles of φ and ψ and show that a_n cannot converge to infinity along a subsequence.
(c) Show that a_n converges to some a > 0.
(d) Show that e^{itb_n} → ψ(t)/φ(at) in a neighborhood of 0 and hence that ∫₀^t e^{isb_n} ds → ∫₀^t (ψ(s)/φ(as)) ds. Conclude that b_n converges.
26.23. Prove a continuity theorem for moment generating functions as defined by (22.4) for probability measures on [0, ∞). For uniqueness, see Theorem 22.2; the analogue of (26.22) is

(2/u) ∫₀^u (1 − M(−s)) ds ≥ μ(2/u, ∞).

26.24. 26.4 ↑ Show by example that the values φ(m) of the characteristic function at integer arguments m may not determine the distribution if it is not supported by [0, 2π].
26.25. If f is integrable over [0, 2π], define its Fourier coefficients as c_m = ∫₀^{2π} e^{imx} f(x) dx. Show that these coefficients uniquely determine f up to sets of measure 0.
26.26. 19.8 26.25 ↑ Show that the trigonometric system (19.17) is complete.
26.27. The Fourier-series analogue of the condition (26.19) is Σ_m |c_m| < ∞. Show that it implies that μ has density f(x) = (2π)⁻¹ Σ_m c_m e^{−imx} on [0, 2π], where f is continuous and f(0) = f(2π). This is the analogue of the inversion formula (26.20).
26.28. ↑ Show that

Σ_{m=1}^{∞} (sin mx)/m = (π − x)/2,   0 < x < 2π.
26.29. (a) Suppose X′ and X″ are independent random variables with values in [0, 2π], and let X be X′ + X″ reduced modulo 2π. Show that the corresponding Fourier coefficients satisfy c_m = c′_m c″_m.
(b) Show that if one or the other of X′ and X″ is uniformly distributed, so is X.
26.30. 26.25 ↑ The theory of Fourier series can be carried over from [0, 2π] to the unit circle in the complex plane with normalized circular Lebesgue measure P. The circular functions e^{imx} become the powers wᵐ, and an integrable f is determined to within sets of measure 0 by its Fourier coefficients c_m = ∫_Ω wᵐ f(w) P(dw). Suppose that A is invariant under the rotation through the angle arg c (Example 24.4). Find a relation on the Fourier coefficients of I_A, and conclude that the rotation is ergodic if c is not a root of unity. Compare the proof on p. 316.
SECTION 27. THE CENTRAL LIMIT THEOREM

Identically Distributed Summands

The central limit theorem says roughly that the sum of many independent random variables will be approximately normally distributed if each summand has high probability of being small. Theorem 27.1, the Lindeberg-Lévy theorem, will give an idea of the techniques and hypotheses needed for the more general results that follow. Throughout, N will denote a random variable with the standard normal distribution:

(27.1)   P[N ≤ x] = (1/√(2π)) ∫_{−∞}^{x} e^{−u²/2} du.

Theorem 27.1. Suppose that {X_n} is an independent sequence of random variables having the same distribution with mean c and finite positive variance σ². If S_n = X₁ + ··· + X_n, then

(27.2)   (S_n − nc)/(σ√n) ⇒ N.

By the argument in Example 25.7, (27.2) implies that n⁻¹S_n ⇒ c. The central limit theorem and the strong law of large numbers thus refine the weak law of large numbers in different directions.

Since Theorem 27.1 is a special case of Theorem 27.2, no proof is really necessary. To understand the methods of this section, however, consider the special case in which X_k takes the values ±1 with probability ½ each. Each X_k then has characteristic function φ(t) = ½e^{it} + ½e^{−it} = cos t. By (26.12) and (26.13), S_n/√n has characteristic function φⁿ(t/√n), and so, by the continuity theorem, the problem is to show that cosⁿ(t/√n) → E[e^{itN}] = e^{−t²/2}, or that n log cos(t/√n) (well defined for large n) goes to −½t². But this follows by l'Hôpital's rule: let t/√n = x go continuously to 0.
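The convergence cosⁿ(t/√n) → e^{−t²/2} asserted here is easy to see numerically; the following small check is mine, with an arbitrary value of t.

```python
import math

# Check (illustration only, not from the text) that cos^n(t/sqrt(n))
# approaches exp(-t^2/2) as n grows, for a fixed t.
t = 1.5
vals = []
for n in (10, 100, 10_000):
    vals.append(math.cos(t / math.sqrt(n)) ** n)
    print(n, vals[-1])
gap = abs(vals[-1] - math.exp(-t * t / 2))
print(f"distance to exp(-t^2/2): {gap:.1e}")
```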
For a proof closer in spirit to those that follow, note that (26.5) for n = 3 gives |φ(t) − (1 − ½t²)| ≤ |t|³ (since |X_k| ≤ 1). Therefore,

(27.3)   |φ(t/√n) − (1 − t²/2n)| ≤ |t|³/n^{3/2}.

Rather than take logarithms, use (27.5) below, which gives (n large)

(27.4)   |φⁿ(t/√n) − (1 − t²/2n)ⁿ| ≤ n · |t|³/n^{3/2} = |t|³/√n.

But of course (1 − t²/2n)ⁿ → e^{−t²/2}, which completes the proof for this special case. Logarithms for complex arguments can be avoided by use of the following simple lemma.
Lemma 1. Let z₁, ..., z_m and w₁, ..., w_m be complex numbers of modulus at most 1; then

(27.5)   |z₁ ⋯ z_m − w₁ ⋯ w_m| ≤ Σ_{k=1}^{m} |z_k − w_k|.

PROOF. Since

z₁ ⋯ z_m − w₁ ⋯ w_m = (z₁ − w₁) z₂ ⋯ z_m + w₁ (z₂ ⋯ z_m − w₂ ⋯ w_m),

and since z₂ ⋯ z_m and w₁ have modulus at most 1, the result follows by induction on m. ∎
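Lemma 1 can be exercised on random points of the unit disk; the check below is mine, not the book's.

```python
import random

# Sanity check (mine) of Lemma 1: for complex numbers of modulus at most 1,
# |z_1...z_m - w_1...w_m| <= sum_k |z_k - w_k|.
rng = random.Random(0)

def disk_point():
    # rejection-sample a point of the closed unit disk
    while True:
        z = complex(rng.uniform(-1, 1), rng.uniform(-1, 1))
        if abs(z) <= 1:
            return z

checked = 0
for _ in range(1000):
    m = rng.randint(1, 8)
    zs = [disk_point() for _ in range(m)]
    ws = [disk_point() for _ in range(m)]
    pz = pw = complex(1, 0)
    for z in zs:
        pz *= z
    for w in ws:
        pw *= w
    assert abs(pz - pw) <= sum(abs(z - w) for z, w in zip(zs, ws)) + 1e-12
    checked += 1
print(checked, "random cases verified")
```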
The sequence X₁, X₂, ... is said to be α-mixing if there exist nonnegative numbers α_n such that

(27.19)   |P(A ∩ B) − P(A)P(B)| ≤ α_n

for A ∈ σ(X₁, ..., X_k), B ∈ σ(X_{k+n}, X_{k+n+1}, ...), k ≥ 1, n ≥ 1. Suppose that α_n → 0, the idea being that X_k and X_{k+n} are then approximately independent for large n. If the distribution of the random vector (X_n, X_{n+1}, ..., X_{n+j}) does not depend on n, the sequence is said to be stationary.
Example 27.6. Let {Y_n} be a Markov chain with finite state space and positive transition probabilities p_ij, and suppose that X_n = f(Y_n), where f is some real function on the state space. If the initial probabilities p_i are the stationary ones (see Theorem 8.9), then clearly {X_n} is stationary. Moreover, by (8.42), |p_ij^{(n)} − p_j| ≤ ρⁿ, where ρ < 1. By (8.11),

P[Y₁ = i₁, ..., Y_k = i_k, Y_{k+n} = j₀, ..., Y_{k+n+l} = j_l] = p_{i₁} p_{i₁i₂} ⋯ p_{i_{k−1}i_k} p_{i_kj₀}^{(n)} p_{j₀j₁} ⋯ p_{j_{l−1}j_l},

which differs from P[Y₁ = i₁, ..., Y_k = i_k] · P[Y_{k+n} = j₀, ..., Y_{k+n+l} = j_l] by at most p_{i₁} p_{i₁i₂} ⋯ p_{i_{k−1}i_k} ρⁿ p_{j₀j₁} ⋯ p_{j_{l−1}j_l}. It follows by addition that, if s is the number of states, then for sets of the form A = [(Y₁, ..., Y_k) ∈ H] and B = [(Y_{k+n}, ..., Y_{k+n+l}) ∈ H′], (27.19) holds with α_n = sρⁿ. These sets (for k and n fixed) form fields generating σ-fields which contain σ(X₁, ..., X_k) and σ(X_{k+n}, X_{k+n+1}, ...), respectively. For fixed A the set of B satisfying (27.19) is a monotone class, and similarly if A and B are interchanged. It follows by the monotone class theorem (Theorem 3.4) that {X_n} is α-mixing with α_n = sρⁿ. ∎

The sequence is m-dependent if (X₁, ..., X_k) and (X_{k+n}, ..., X_{k+n+l}) are independent whenever n > m. In this case the sequence is α-mixing with α_n = 0 for n > m. In this terminology an independent sequence is 0-dependent.

*This topic may be omitted.
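The geometric decay |p_ij^{(n)} − p_j| ≤ ρⁿ that drives Example 27.6 can be watched in a toy chain; the transition matrix below is my own choice, not from the text.

```python
# Toy illustration (chain is hypothetical): for a finite chain with positive
# transition probabilities, the n-step probabilities approach the stationary
# ones geometrically, which is what yields alpha_n = s * rho^n in Example 27.6.
P = [[0.7, 0.3],
     [0.4, 0.6]]          # hypothetical 2-state transition matrix
pi = [4 / 7, 3 / 7]       # its stationary distribution: pi P = pi

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

Pn, errs = P, []
for _ in range(10):
    errs.append(max(abs(Pn[i][j] - pi[j]) for i in range(2) for j in range(2)))
    Pn = matmul(Pn, P)
print([round(e, 6) for e in errs[:4]])
# successive errors shrink by the factor rho = 0.3, the second eigenvalue of P
```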
Example 27.7. Let Y₁, Y₂, ... be independent and identically distributed, and put X_n = f(Y_n, ..., Y_{n+m}) for a real function f on R^{m+1}. Then {X_n} is stationary and m-dependent. ∎

Theorem 27.4. Suppose that X₁, X₂, ... is stationary and α-mixing with α_n = O(n⁻⁵) and that E[X_n] = 0 and E[X_n^{12}] < ∞. If S_n = X₁ + ··· + X_n, then

(27.20)   n⁻¹ Var[S_n] → σ² = E[X₁²] + 2 Σ_{k=1}^{∞} E[X₁ X_{1+k}],

where the series converges absolutely. If σ > 0, then S_n/(σ√n) ⇒ N.
The conditions α_n = O(n⁻⁵) and E[X_n^{12}] < ∞ are stronger than necessary; they are imposed to avoid technical complications in the proof. The idea of the proof, which goes back to Markov, is this: Split the sum X₁ + ··· + X_n into alternating blocks of length b_n (the big blocks) and l_n (the little blocks). Namely, let

(27.21)   U_{ni} = X_{(i−1)(b_n+l_n)+1} + ··· + X_{(i−1)(b_n+l_n)+b_n},   1 ≤ i ≤ r_n,

where r_n is the largest integer i for which (i − 1)(b_n + l_n) + b_n ≤ n. Further, let

(27.22)   V_{ni} = X_{(i−1)(b_n+l_n)+b_n+1} + ··· + X_{i(b_n+l_n)},   1 ≤ i < r_n,
          V_{nr_n} = X_{(r_n−1)(b_n+l_n)+b_n+1} + ··· + X_n.

Then S_n = Σ_{i=1}^{r_n} U_{ni} + Σ_{i=1}^{r_n} V_{ni}, and the technique will be to choose the l_n small enough that Σ_i V_{ni} is small in comparison with Σ_i U_{ni} but large enough
that the U_{ni} are nearly independent, so that Lyapounov's theorem can be adapted to prove Σ_i U_{ni} asymptotically normal.
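The bookkeeping in (27.21) and (27.22) can be checked mechanically; the block sizes below are arbitrary choices of mine.

```python
# Index check (mine) of the block decomposition (27.21)-(27.22): the big
# blocks U_{ni} and little blocks V_{ni} together use each index 1..n once.
def blocks(n, b, l):
    r = max(i for i in range(1, n + 1) if (i - 1) * (b + l) + b <= n)
    U = [list(range((i - 1) * (b + l) + 1, (i - 1) * (b + l) + b + 1))
         for i in range(1, r + 1)]
    V = [list(range((i - 1) * (b + l) + b + 1, i * (b + l) + 1))
         for i in range(1, r)]
    V.append(list(range((r - 1) * (b + l) + b + 1, n + 1)))  # last, ragged block
    return U, V

U, V = blocks(n=100, b=20, l=5)
flat = sorted(sum(U, []) + sum(V, []))
assert flat == list(range(1, 101))
print(len(U), "big blocks;", "little block lengths:", [len(v) for v in V])
```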
Lemma 2. If Y is measurable σ(X₁, ..., X_k) and bounded by C, and if Z is measurable σ(X_{k+n}, X_{k+n+1}, ...) and bounded by D, then

(27.23)   |E[YZ] − E[Y]E[Z]| ≤ 4CDα_n.
PROOF. It is no restriction to take C = D = 1 and (by the usual approximation method) to take Y = Σ_i y_i I_{A_i} and Z = Σ_j z_j I_{B_j} simple (|y_i|, |z_j| ≤ 1). If d_{ij} = P(A_i ∩ B_j) − P(A_i)P(B_j), the left side of (27.23) is |Σ_{ij} y_i z_j d_{ij}|. Take ξ_i to be +1 or −1 as Σ_j z_j d_{ij} is positive or not; now take η_j to be +1 or −1 as Σ_i ξ_i d_{ij} is positive or not. Then

|Σ_{ij} y_i z_j d_{ij}| ≤ Σ_i |Σ_j z_j d_{ij}| = Σ_{ij} ξ_i z_j d_{ij} ≤ Σ_j |Σ_i ξ_i d_{ij}| = Σ_{ij} ξ_i η_j d_{ij},

and the last sum is E[ΞH] − E[Ξ]E[H] for the simple random variables Ξ = Σ_i ξ_i I_{A_i} and H = Σ_j η_j I_{B_j}. Splitting each of Ξ and H according to the sign of its coefficients expresses this as a sum of four terms of the form ±(P(A ∩ B) − P(A)P(B)) with A ∈ σ(X₁, ..., X_k) and B ∈ σ(X_{k+n}, X_{k+n+1}, ...), each at most α_n in modulus by (27.19). ∎
PROBLEMS

27.4. Suppose that the X_k are independent and uniformly bounded, with mean 0, and that s_n → ∞. Verify Lyapounov's condition.
27.5. Suppose that the random variables in any single row of the triangular array are identically distributed. To what do Lindeberg's and Lyapounov's conditions reduce?
27.6. Suppose that Z₁, Z₂, ... are independent and identically distributed with mean 0 and variance 1, and suppose that X_{nk} = σ_{nk} Z_k. Write down the Lindeberg condition and show that it holds if max_{k≤r_n} σ²_{nk} = o(Σ_{k=1}^{r_n} σ²_{nk}).

27.7. Construct an example where Lindeberg's condition holds but Lyapounov's does not.
27.8. 22.9 ↑ Prove a central limit theorem for the number R_n of records up to time n.

27.9. 6.3 ↑ Let S_n be the number of inversions in a random permutation on n letters. Prove a central limit theorem for S_n.

27.10. The δ-method. Suppose that Theorem 27.1 applies to {X_n}, so that √n σ⁻¹(X̄_n − c) ⇒ N, where X̄_n = n⁻¹ Σ_{k=1}^{n} X_k. Use Theorem 25.6 as in Example 27.2 to show that, if f(x) has a nonzero derivative at c, then √n (f(X̄_n) − f(c)) ⇒ σ f′(c) N.
27.17. ↑ Suppose that X₁, X₂, ... are independent and identically distributed with mean 0 and variance 1, and suppose that a_n → ∞. Formally combine the central limit theorem and (27.28) to obtain

(27.29)   P[S_n ≥ a_n √n] = e^{−a_n²(1+ζ_n)/2},

where ζ_n → 0 if a_n → ∞. For a case in which this does hold, see Theorem 9.4.

27.18. 21.2 ↑ Stirling's formula. Let S_n = X₁ + ··· + X_n, where the X_n are independent and each has the Poisson distribution with parameter 1. Prove successively:

(a)   E[((S_n − n)/√n)⁻] = e^{−n} Σ_{k=0}^{n} ((n − k)/√n) nᵏ/k! = n^{n+1/2} e^{−n}/n!.

(b)   ((S_n − n)/√n)⁻ ⇒ N⁻.

(c)   E[((S_n − n)/√n)⁻] → E[N⁻] = (2π)^{−1/2}.

(d)   n! ∼ √(2π) n^{n+1/2} e^{−n}.
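Step (a) and the limit in (c)-(d) can be verified numerically; this script is my own check, not part of the problem.

```python
import math

# Check (mine) of 27.18(a): e^{-n} * sum_{k=0}^{n} ((n-k)/sqrt(n)) * n^k/k!
# equals n^{n+1/2} e^{-n}/n!, which by (b)-(c) tends to E[N^-] = 1/sqrt(2 pi).
vals = []
for n in (5, 20, 100):
    lhs = math.exp(-n) * sum((n - k) / math.sqrt(n) * n ** k / math.factorial(k)
                             for k in range(n + 1))
    rhs = n ** (n + 0.5) * math.exp(-n) / math.factorial(n)
    assert abs(lhs - rhs) < 1e-9 * rhs      # the identity in (a)
    vals.append(rhs)
    print(n, rhs)
print("limit 1/sqrt(2 pi) =", 1 / math.sqrt(2 * math.pi))
```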
27.19. Let l_n(ω) be the length of the run of 0's starting at the nth place in the dyadic expansion of a point ω drawn at random from the unit interval; see Example 4.1.
(a) Show that l₁, l₂, ... is an α-mixing sequence, where α_n = 4/2ⁿ.
(b) Show that Σ_{k=1}^{n} l_k is approximately normally distributed with mean n and variance 6n.

27.20. Prove under the hypotheses of Theorem 27.4 that S_n/n → 0 with probability 1. Hint: Use (27.25).

27.21. 26.1 26.29 ↑ Let X₁, X₂, ... be independent and identically distributed, and suppose that the distribution common to the X_n is supported by [0, 2π] and is not a lattice distribution. Let S_n = X₁ + ··· + X_n, where the sum is reduced modulo 2π. Show that S_n ⇒ U, where U is uniformly distributed over [0, 2π].
SECTION 28. INFINITELY DIVISIBLE DISTRIBUTIONS*
Suppose that Z_λ has the Poisson distribution with parameter λ and that X_{n1}, ..., X_{nn} are independent and P[X_{nk} = 1] = λ/n, P[X_{nk} = 0] = 1 − λ/n. According to Example 25.2, X_{n1} + ··· + X_{nn} ⇒ Z_λ. This contrasts with the central limit theorem, in which the limit law is normal. What is the class of all possible limit laws for independent triangular arrays? A suitably restricted form of this question will be answered here.
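The opening Poisson example can be watched through characteristic functions; this comparison is my illustration, with arbitrary λ and t.

```python
import cmath

# Illustration (mine) of the opening example: the characteristic function of
# X_n1 + ... + X_nn, with P[X_nk = 1] = lam/n, is (1 - p + p e^{it})^n, and it
# approaches exp(lam (e^{it} - 1)), the Poisson(lam) characteristic function.
lam, t = 3.0, 1.2
poisson_cf = cmath.exp(lam * (cmath.exp(1j * t) - 1))
gaps = []
for n in (10, 100, 10_000):
    p = lam / n
    bernoulli_sum_cf = (1 - p + p * cmath.exp(1j * t)) ** n
    gaps.append(abs(bernoulli_sum_cf - poisson_cf))
    print(n, gaps[-1])
```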
Vague Convergence
The theory requires two preliminary facts about convergence of measures. Let μ_n and μ be finite measures on (R¹, ℛ¹). If μ_n(a, b] → μ(a, b] for every finite interval for which μ{a} = μ{b} = 0, then μ_n converges vaguely to μ, written μ_n →_v μ. If μ_n and μ are probability measures, it is not hard to see that this is equivalent to weak convergence μ_n ⇒ μ. On the other hand, if μ_n is a unit mass at n and μ(R¹) = 0, then μ_n →_v μ, but μ_n ⇒ μ makes no sense, because μ is not a probability measure.

The first fact needed is this: Suppose that μ_n →_v μ and

(28.1)   sup_n μ_n(R¹) < ∞;

then

(28.2)   ∫f dμ_n → ∫f dμ

for every continuous real f that vanishes at ±∞ in the sense that lim_{|x|→∞} f(x) = 0. Indeed, choose M so that μ(R¹) ≤ M and μ_n(R¹) ≤ M for all n. Given ε, choose a and b so that μ{a} = μ{b} = 0 and |f(x)| < ε/M if x ∉ A = (a, b]. Then |∫_{A^c} f dμ_n| < ε and |∫_{A^c} f dμ| < ε. If μ(A) > 0, define ν(B) = μ(B ∩ A)/μ(A) and ν_n(B) = μ_n(B ∩ A)/μ_n(A). It is easy to see that ν_n ⇒ ν, so that ∫f dν_n → ∫f dν. But then |∫_A f dμ_n − ∫_A f dμ| < ε for large n, and hence |∫f dμ_n − ∫f dμ| < 3ε for large n. If μ(A) = 0, then ∫_A f dμ_n → 0, and the argument is even simpler.

The other fact needed below is this: If (28.1) holds, then there is a subsequence {μ_{n_k}} and a finite measure μ such that μ_{n_k} →_v μ as k → ∞. Indeed, let F_n(x) = μ_n(−∞, x]. Since the F_n are uniformly bounded because of (28.1), the proof of Helly's theorem shows there exists a subsequence {F_{n_k}} and a bounded, nondecreasing, right-continuous function F such that lim_k F_{n_k}(x) = F(x) at continuity points x of F. If μ is the measure for which μ(a, b] = F(b) − F(a) (Theorem 12.4), then clearly μ_{n_k} →_v μ.
The Possible Limits
Let X_{n1}, ..., X_{nr_n}, n = 1, 2, ..., be a triangular array as in the preceding section. The random variables in each row are independent, the means are 0, and the variances are finite:

(28.3)   σ²_{nk} = E[X²_{nk}] < ∞,   s²_n = Σ_{k=1}^{r_n} σ²_{nk}.

*This section may be omitted.
Assume s²_n > 0 and put S_n = X_{n1} + ··· + X_{nr_n}. Here it will be assumed that the total variance is bounded:

(28.4)   sup_n s²_n < ∞.

In order that the individual X_{nk} be small compared with S_n, assume that

(28.5)   lim_n max_{k≤r_n} σ²_{nk} = 0.

The arrays in the preceding section were normalized by replacing X_{nk} by X_{nk}/s_n. This has the effect of replacing s_n by 1, in which case of course (28.4) holds, and (28.5) is the same thing as max_k σ²_{nk}/s²_n → 0.

A distribution function F is infinitely divisible if for each n there is a distribution function F_n such that F is the n-fold convolution F_n * ··· * F_n (n copies) of F_n. The class of possible limit laws will turn out to consist of the infinitely divisible distributions with mean 0 and finite variance.† It will be possible to exhibit the characteristic functions of these laws in an explicit way.

Theorem 28.1. Suppose that

(28.6)   φ(t) = exp ∫_{R¹} (e^{itx} − 1 − itx) x⁻² μ(dx),

where μ is a finite measure. Then φ is the characteristic function of an infinitely divisible distribution with mean 0 and variance μ(R¹).

By (26.4₂), the integrand in (28.6) converges to −t²/2 as x → 0; take this as its value at x = 0. By (26.4₁), the integrand is at most t²/2 in modulus and so is integrable. The formula (28.6) is the canonical representation of φ, and μ is the canonical measure.
Before proceeding to the proof, consider three examples.
Example 28.1. If μ consists of a mass of σ² at the origin, (28.6) is e^{−σ²t²/2}, the characteristic function of a centered normal distribution F. It is certainly infinitely divisible: take F_n normal with variance σ²/n. ∎

†There do exist infinitely divisible distributions without moments (see Problems 28.3 and 28.4), but they do not figure in the theory of this section.
Example 28.2. Suppose that μ consists of a mass of λx² at x ≠ 0. Then (28.6) is exp λ(e^{itx} − 1 − itx); but this is the characteristic function of x(Z_λ − λ), where Z_λ has the Poisson distribution with mean λ. Thus (28.6) is the characteristic function of a distribution function F, and F is infinitely divisible: take F_n to be the distribution function of x(Z_{λ/n} − λ/n). ∎

Example 28.3. If φ_j(t) is given by (28.6) with μ_j for the measure, and if μ = Σ_{j=1}^{k} μ_j, then (28.6) is φ₁(t) ⋯ φ_k(t). It follows by the preceding two examples that (28.6) is a characteristic function if μ consists of finitely many point masses. It is easy to check in the preceding two examples that the distribution corresponding to φ(t) has mean 0 and variance μ(R¹), and since the means and variances add, the same must be true in the present example. ∎
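Example 28.2 invites a direct check: when μ is a single point mass, (28.6) reduces to a compound-Poisson characteristic function. The parameters below are arbitrary choices of mine.

```python
import cmath
import math

# Check (mine) of Example 28.2: with mu a mass of lam*x^2 at x, formula (28.6)
# gives exp(lam (e^{itx} - 1 - itx)), the characteristic function of
# x (Z_lam - lam) with Z_lam Poisson of mean lam.
lam, x, t = 2.0, 0.7, 1.3
canonical = cmath.exp(lam * (cmath.exp(1j * t * x) - 1 - 1j * t * x))
direct = sum(cmath.exp(1j * t * x * (k - lam)) * math.exp(-lam) * lam ** k / math.factorial(k)
             for k in range(60))    # Poisson expectation, negligible tail truncated
err = abs(canonical - direct)
print(f"difference: {err:.2e}")
```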
PROOF OF THEOREM 28.1. Let μ_k have mass μ(j2⁻ᵏ, (j+1)2⁻ᵏ] at j2⁻ᵏ for j = 0, ±1, ..., ±2²ᵏ. Then μ_k →_v μ. As observed in Example 28.3, if φ_k(t) is (28.6) with μ_k in place of μ, then φ_k is a characteristic function. For each t the integrand in (28.6) vanishes at ±∞; since sup_k μ_k(R¹) < ∞, φ_k(t) → φ(t) follows (see (28.2)). By Corollary 2 to Theorem 26.3, φ(t) is itself a characteristic function. Further, the distribution corresponding to φ_k(t) has second moment μ_k(R¹), and since this is bounded, it follows (Theorem 25.11) that the distribution corresponding to φ(t) has a finite second moment. Differentiation (use Theorem 16.8) shows that the mean is φ′(0) = 0 and the variance is −φ″(0) = μ(R¹). Thus (28.6) is always the characteristic function of a distribution with mean 0 and variance μ(R¹). If ψ_n(t) is (28.6) with μ/n in place of μ, then φ(t) = ψ_nⁿ(t), so that the distribution corresponding to φ(t) is indeed infinitely divisible. ∎
The representation (28.6) shows that the normal and Poisson distributions are special cases in a very large class of infinitely divisible laws.
Theorem 28.2. Every infinitely divisible distribution with mean 0 and finite variance is the limit law of S_n for some independent triangular array satisfying (28.3), (28.4), and (28.5).

The proof requires this preliminary result:

Lemma. If X and Y are independent and X + Y has a second moment, then X and Y have second moments as well.

PROOF.
Since X² + Y² ≤ (X + Y)² + 2|XY|, it suffices to prove |XY| integrable, and by Fubini's theorem applied to the joint distribution of X and Y it suffices to prove |X| and |Y| individually integrable. Since |Y| ≤ |x| + |x + Y|, E[|Y|] = ∞ would imply E[|x + Y|] = ∞ for each x; by Fubini's theorem again E[|Y|] = ∞ would therefore imply E[|X + Y|] = ∞, which is impossible. Hence E[|Y|] < ∞, and similarly E[|X|] < ∞. ∎

PROOF OF THEOREM 28.2. Let F be infinitely divisible with mean 0 and variance σ². If F is the n-fold convolution of F_n, then by the lemma (extended inductively) F_n has finite mean and variance, and these must be 0 and σ²/n. Take r_n = n and take X_{n1}, ..., X_{nn} independent, each with distribution function F_n. ∎
Theorem 28.3. If F is the limit law of S_n for an independent triangular array satisfying (28.3), (28.4), and (28.5), then F has characteristic function of the form (28.6) for some finite measure μ.
PROOF. The proof will yield information making it possible to identify the limit. Let φ_{nk}(t) be the characteristic function of X_{nk}. The first step is to prove that

(28.7)   Π_{k=1}^{r_n} φ_{nk}(t) − exp Σ_{k=1}^{r_n} (φ_{nk}(t) − 1) → 0

for each t. Since |z| ≤ 1 implies that |e^{z−1}| = e^{Re z − 1} ≤ 1, it follows by (27.5) that the difference δ_n(t) in (28.7) satisfies |δ_n(t)| ≤ Σ_{k=1}^{r_n} |φ_{nk}(t) − exp(φ_{nk}(t) − 1)|. Fix t. If φ_{nk}(t) − 1 = θ_{nk}, then |θ_{nk}| ≤ t²σ²_{nk}/2, and it follows by (28.4) and (28.5) that max_k |θ_{nk}| → 0 and Σ_k |θ_{nk}| = O(1). Therefore, for sufficiently large n, |δ_n(t)| ≤ Σ_k |1 + θ_{nk} − e^{θ_{nk}}| ≤ e² Σ_k |θ_{nk}|² ≤ e² max_k |θ_{nk}| · Σ_k |θ_{nk}| by (27.15). Hence (28.7).

If F_{nk} is the distribution function of X_{nk}, then, since the means are 0,

Σ_{k=1}^{r_n} (φ_{nk}(t) − 1) = Σ_{k=1}^{r_n} ∫_{R¹} (e^{itx} − 1) dF_{nk}(x) = Σ_{k=1}^{r_n} ∫_{R¹} (e^{itx} − 1 − itx) dF_{nk}(x).

Let μ_n be the finite measure satisfying

(28.8)   μ_n(−∞, x] = Σ_{k=1}^{r_n} ∫_{y≤x} y² dF_{nk}(y),

and put

(28.9)   φ_n(t) = exp ∫_{R¹} (e^{itx} − 1 − itx) x⁻² μ_n(dx).

Then (28.7) can be written

(28.10)   Π_{k=1}^{r_n} φ_{nk}(t) − φ_n(t) → 0.

By (28.8), μ_n(R¹) = s²_n, and this is bounded by assumption. Thus (28.1) holds, and some subsequence {μ_{n_k}} converges vaguely to a finite measure μ. Since the integrand in (28.9) vanishes at ±∞, φ_{n_k}(t) converges to (28.6). But, of course, lim_n φ_n(t) must coincide with the characteristic function of the limit law F, which exists by hypothesis. Thus F must have characteristic function of the form (28.6). ∎

Theorems 28.1, 28.2, and 28.3 together show that the possible limit laws are exactly the infinitely divisible distributions with mean 0 and finite variance, and they give explicitly the form the characteristic functions of such laws must have.

Characterizing the Limit
Theorem 28.4. Suppose that F has characteristic function (28.6) and that an independent triangular array satisfies (28.3), (28.4), and (28.5). Then S_n has limit law F if and only if μ_n →_v μ, where μ_n is defined by (28.8).

PROOF. Since (28.7) holds as before, S_n has limit law F if and only if φ_n(t) (defined by (28.9)) converges for each t to φ(t) (defined by (28.6)). If μ_n →_v μ, then φ_n(t) → φ(t) follows because the integrands in (28.9) and (28.6) vanish at ±∞ and because (28.1) follows from (28.4). Now suppose that S_n has limit law F.
PROBLEMS

28.10. A distribution function F is stable if for all a, a′ > 0 and b, b′ there exist a″ > 0 and b″ such that

F((x − b)/a) * F((x − b′)/a′) = F((x − b″)/a″).
28.11. Show that a stable law is infinitely divisible.

28.12. Show that the Poisson law, although infinitely divisible, is not stable.

28.13. Show that the normal and Cauchy laws are stable.

28.14. 28.10 ↑ Suppose that F has mean 0 and variance 1 and that the dependence of a″, b″ on a, a′, b, b′ is such that

F(x/σ₁) * F(x/σ₂) = F(x/√(σ₁² + σ₂²)).

Show that F is the standard normal distribution.
28.15. (a) Let Y_{nk} be independent random variables having the Poisson distribution with mean c n^α/|k|^{1+α}, where c > 0 and 0 < α < 2. Let Z_n = n⁻¹ Σ_{k=−n²}^{n²} k Y_{nk} (omit k = 0 in the sum), and show that if c is properly chosen then the characteristic function of Z_n converges to e^{−|t|^α}.
(b) Show for 0 < α < 2 that e^{−|t|^α} is the characteristic function of a symmetric stable distribution; it is called the symmetric stable law of exponent α. The case α = 2 is the normal law, and α = 1 is the Cauchy law.
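Part (a) can be probed numerically in the case α = 1, where taking c = 1/π makes the limit e^{−|t|} (the Cauchy case). The truncation and parameters below are my own choices.

```python
import math

# Numeric probe (mine) of 28.15(a) for alpha = 1 and c = 1/pi: the exponent of
# the characteristic function of Z_n, namely
#   sum_{k != 0} (c n / k^2) (e^{i t k / n} - 1),
# is real by the +-k symmetry and should approach -|t| as n grows.
def exponent(t, n, c=1 / math.pi):
    return sum(2 * c * n / k ** 2 * (math.cos(t * k / n) - 1)
               for k in range(1, n * n + 1))

t = 1.0
for n in (20, 80, 200):
    print(n, exponent(t, n))
gap = abs(exponent(t, 200) - (-t))
print(f"distance to -|t|: {gap:.3f}")
```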
SECTION 29. LIMIT THEOREMS IN R^k

If F_n and F are distribution functions on R^k, then F_n converges weakly to F, written F_n ⇒ F, if lim_n F_n(x) = F(x) for all continuity points x of F. The corresponding distributions μ_n and μ are in this case also said to converge weakly: μ_n ⇒ μ. If X_n and X are k-dimensional random vectors (possibly on different probability spaces), X_n converges in distribution to X, written X_n ⇒ X, if the corresponding distribution functions converge weakly. The definitions are thus exactly as for the line.
The Basic Theorems
The closure A⁻ of a set A in R^k is the set of limits of sequences in A; the interior is A° = R^k − (R^k − A)⁻; and the boundary is ∂A = A⁻ − A°. A Borel set A is a μ-continuity set if μ(∂A) = 0. The first theorem is the k-dimensional version of Theorem 25.8.

Theorem 29.1. For probability measures μ_n and μ on (R^k, ℛ^k), each of the following conditions is equivalent to the weak convergence of μ_n to μ:

(i) lim_n ∫f dμ_n = ∫f dμ for bounded continuous f;
(ii) lim sup_n μ_n(C) ≤ μ(C) for closed C;
(iii) lim inf_n μ_n(G) ≥ μ(G) for open G;
(iv) lim_n μ_n(A) = μ(A) for μ-continuity sets A.
PROOF. It will first be shown that (i) through (iv) are all equivalent.

(i) implies (ii): Consider the distance dist(x, C) = inf[|x − y|: y ∈ C] from x to C. It is continuous in x. Let

φ_j(t) = 1 if t ≤ 0,   φ_j(t) = 1 − jt if 0 ≤ t ≤ j⁻¹,   φ_j(t) = 0 if j⁻¹ ≤ t.

Then f_j(x) = φ_j(dist(x, C)) is continuous and bounded by 1, and f_j(x) ↓ I_C(x) as j ↑ ∞ because C is closed. If (i) holds, then lim sup_n μ_n(C) ≤ lim_n ∫f_j dμ_n = ∫f_j dμ. As j ↑ ∞, ∫f_j dμ ↓ ∫I_C dμ = μ(C).

(ii) is equivalent to (iii): Take C = R^k − G.

(ii) and (iii) imply (iv): From (ii) and (iii) follows

μ(A°) ≤ lim inf_n μ_n(A°) ≤ lim inf_n μ_n(A) ≤ lim sup_n μ_n(A) ≤ lim sup_n μ_n(A⁻) ≤ μ(A⁻).

Clearly (iv) follows from this.

(iv) implies (i): Suppose that f is continuous and |f(x)| is bounded by K. Given ε, choose reals a₀ < a₁ < ··· < a_l so that a₀ < −K < K < a_l, a_i − a_{i−1} < ε, and μ[x: f(x) = a_i] = 0. The last condition can be achieved because the sets [x: f(x) = a] are disjoint for different a. Put A_i = [x: a_{i−1} < f(x) ≤ a_i].

It must be shown that Δ_A F ≥ 0 (see (12.12)). Given ε and a rectangle A = (a₁, b₁] × ··· × (a_k, b_k], choose δ such that if z = (δ, ..., δ), then for each of the 2^k vertices x of A, x < r ≤ x + z implies |F(x) − G(r)| < ε/2^k. Now choose rational points r and s such that a < r ≤ a + z and b < s ≤ b + z. If B = (r₁, s₁] × ··· × (r_k, s_k], then |Δ_A F − Δ_B G| < ε. Since Δ_B G = lim_i Δ_B F_{n_i} ≥ 0 and ε is arbitrary, it follows that Δ_A F ≥ 0.
With the present interpretation of the symbols, the proof of Theorem 25.9 shows that F is continuous from above and lim_i F_{n_i}(x) = F(x) for continuity points x of F.

†The approach of this section carries over to general metric spaces; for this theory and its applications, see BILLINGSLEY¹ and BILLINGSLEY². Since Skorohod's theorem is no easier in R^k than in the general metric space, it is not treated here.
By Theorem 12.5, there is a measure μ on (R^k, ℛ^k) such that μ(A) = Δ_A F for rectangles A. By tightness, there is for given ε a t such that μ_n[y: −t < y_j ≤ t, j ≤ k] > 1 − ε for all n. Suppose that all coordinates of x exceed t: If r > x, then F_{n_i}(r) > 1 − ε and hence (r rational) G(r) ≥ 1 − ε, so that F(x) ≥ 1 − ε. Suppose, on the other hand, that some coordinate of x is less than −t: Choose a rational r such that x < r and some coordinate of r is less than −t; then F_{n_i}(r) < ε, hence G(r) ≤ ε, and so F(x) ≤ ε. Therefore, for every ε there is a t such that

(29.1)    F(x) ≥ 1 − ε if x_j > t for all j;    F(x) ≤ ε if x_j < −t for some j.

If S_x = [y: y_j ≤ x_j, j ≤ k] and B_s = [y: −s < y_j ≤ x_j, j ≤ k], then μ(S_x) = lim_s μ(B_s) = lim_s Δ_{B_s} F. Of the 2^k terms in the sum Δ_{B_s} F, all but F(x) go to 0 (s → ∞) because of the second part of (29.1). Thus μ(S_x) = F(x).† Because of the other part of (29.1), μ is a probability measure. Therefore, F_{n_i} ⇒ F and μ_{n_i} ⇒ μ. •

Obviously Theorem 29.3 implies that tightness is a sufficient condition that each subsequence of {μ_n} contain a further subsequence converging weakly to some probability measure. (An easy modification of the proof of Theorem 25.10 shows that tightness is necessary for this as well.) And clearly the corollary to Theorem 25.10 now goes through:

Corollary. If {μ_n} is a tight sequence of probability measures, and if each subsequence that converges weakly at all converges weakly to the probability measure μ, then μ_n ⇒ μ.
Characteristic Functions
Consider a random vector X = (X_1, ..., X_k) and its distribution μ in R^k. Let t·x = ∑_{u=1}^k t_u x_u denote inner product. The characteristic function of X and of μ is defined over R^k by

(29.2)    φ(t) = E[e^{it·X}] = ∫_{R^k} e^{it·x} μ(dx).

To a great extent its properties parallel those of the one-dimensional characteristic function and can be deduced by parallel arguments.

†This requires proof because there exist (Problem 12.10) functions F′ other than F for which μ(A) = Δ_A F′ holds for all rectangles A.
382   CONVERGENCE OF DISTRIBUTIONS
The inversion formula (26.16) takes this form: For a bounded rectangle A = [x: a_u < x_u ≤ b_u, u ≤ k] such that μ(∂A) = 0,

(29.3)    μ(A) = lim_{T→∞} (1/(2π)^k) ∫_{B_T} ∏_{u=1}^k [(e^{−it_u a_u} − e^{−it_u b_u})/(it_u)] φ(t) dt,

where B_T = [t ∈ R^k: |t_u| ≤ T, u ≤ k] and dt is short for dt_1 ⋯ dt_k. To prove (29.3), replace φ(t) by the middle term in (29.2) and reverse the integrals as in (26.17). The inner integral may be evaluated by Fubini's theorem in R^k, which gives for the integral in (29.3)

    I_T = ∫_{R^k} ∏_{u=1}^k [sgn(x_u − a_u) S(T|x_u − a_u|) − sgn(x_u − b_u) S(T|x_u − b_u|)] μ(dx).

Since the integrand converges to ∏_{u=1}^k ψ_{a_u, b_u}(x_u) (see (26.18)), (29.3) follows as in the case k = 1.

The proof that weak convergence implies (iii) in Theorem 29.1 shows that for probability measures μ and ν on R^k there exists a dense set D of reals such that μ(∂A) = ν(∂A) = 0 for all rectangles A whose vertices have coordinates in D. If μ(A) = ν(A) for such rectangles, then μ and ν are identical by Theorem 3.3. Thus the characteristic function φ uniquely determines the probability measure μ.

Further properties of the characteristic function can be derived from the one-dimensional case by means of the following device of Cramér and Wold. For t ∈ R^k, define h_t: R^k → R^1 by h_t(x) = t·x. For real α, [x: t·x ≤ α] is a half-space, and its μ-measure is

(29.4)    μh_t^{−1}(−∞, α].

By change of variable, the characteristic function of μh_t^{−1} is

(29.5)    ∫_{R^1} e^{isy} μh_t^{−1}(dy) = ∫_{R^k} e^{is(t·x)} μ(dx) = φ(st_1, ..., st_k).

To know the μ-measure of every half-space is (by (29.4)) to know each μh_t^{−1}, and hence is (by (29.5) for s = 1) to know φ(t) for every t; and to know the
characteristic function φ of μ is to know μ. Thus μ is uniquely determined by the values it gives to the half-spaces. This result, very simple in its statement, seems to require Fourier methods; no elementary proof is known.

If μ_n ⇒ μ for probability measures on R^k, then φ_n(t) → φ(t) for the corresponding characteristic functions by Theorem 29.1. But suppose that the characteristic functions converge pointwise. It follows by (29.5) that for each t the characteristic function of μ_n h_t^{−1} converges pointwise on the line to the characteristic function of μh_t^{−1}; by the continuity theorem for characteristic functions on the line, then, μ_n h_t^{−1} ⇒ μh_t^{−1}. Take the u-th component of t to be 1 and the others 0; then the μ_n h_t^{−1} are the marginals for the u-th coordinate. Since {μ_n h_t^{−1}} is weakly convergent, there is a bounded interval (a_u, b_u] such that μ_n[x ∈ R^k: a_u < x_u ≤ b_u] = μ_n h_t^{−1}(a_u, b_u] > 1 − ε/k for all n. But then μ_n(A) > 1 − ε for the bounded rectangle A = [x: a_u < x_u ≤ b_u, u = 1, ..., k]. The sequence {μ_n} is therefore tight. If a subsequence {μ_{n_i}} converges weakly to ν, then φ_{n_i}(t) converges to the characteristic function of ν, which is therefore φ(t). By uniqueness, ν = μ, so that μ_{n_i} ⇒ μ. By the corollary to Theorem 29.3, μ_n ⇒ μ. This proves the continuity theorem for k-dimensional characteristic functions: μ_n ⇒ μ if and only if φ_n(t) → φ(t) for all t.
The Cramer-Wold idea leads also to the following result, by means of which certain limit theorems can be reduced in a routine way to the one-dimensional case.
Theorem 29.4. For random vectors X_n = (X_{n1}, ..., X_{nk}) and Y = (Y_1, ..., Y_k), a necessary and sufficient condition for X_n ⇒ Y is that ∑_{u=1}^k t_u X_{nu} ⇒ ∑_{u=1}^k t_u Y_u for each (t_1, ..., t_k) in R^k.

PROOF. The necessity follows from a consideration of the continuous mapping h_t above; use Theorem 29.2. As for sufficiency, the condition implies by the continuity theorem for one-dimensional characteristic functions that

    E[exp(is ∑_{u=1}^k t_u X_{nu})] → E[exp(is ∑_{u=1}^k t_u Y_u)]

for each (t_1, ..., t_k) and all real s. Taking s = 1 shows that the characteristic function of X_n converges pointwise to that of Y. •

Normal Distributions in R^k
By Theorem 20.4 there is (on some probability space) a random vector X = (X_1, ..., X_k) with independent components each having the standard normal distribution. Since each X_u has density e^{−x²/2}/√(2π), X has density (see (20.25))

(29.6)    f(x) = (2π)^{−k/2} e^{−|x|²/2},

where |x|² = ∑_{u=1}^k x_u² is the square of the Euclidean norm. This distribution plays the role of the standard normal distribution in R^k. Its characteristic function is

(29.7)    E[e^{it·X}] = ∏_{u=1}^k e^{−t_u²/2} = e^{−|t|²/2}.

Let A = [a_{uv}] be a k × k matrix, and put Y = AX, where X is viewed as a column vector. Since E[X_α X_β] = δ_{αβ}, the matrix Σ = [σ_{uv}] of the covariances of Y has entries σ_{uv} = E[Y_u Y_v] = ∑_{α=1}^k a_{uα} a_{vα}. Thus Σ = AA′, where the prime denotes transpose. The matrix Σ is symmetric and nonnegative definite: ∑_{u,v} σ_{uv} x_u x_v = |A′x|² ≥ 0. View t also as a column vector with transpose t′, and note that t·x = t′x. The characteristic function of AX is thus

(29.8)    E[e^{it′(AX)}] = E[e^{i(A′t)′X}] = e^{−|A′t|²/2} = e^{−t′Σt/2}.

Define a centered normal distribution as any probability measure whose characteristic function has this form for some symmetric nonnegative definite Σ. If Σ is symmetric and nonnegative definite, then for an appropriate orthogonal matrix U, U′ΣU = D is a diagonal matrix whose diagonal elements are the eigenvalues of Σ and hence are nonnegative. If D_0 is the diagonal matrix whose elements are the square roots of those of D, and if A = UD_0, then Σ = AA′. Thus for every nonnegative definite Σ there exists a centered normal distribution (namely the distribution of AX) with covariance matrix Σ and characteristic function exp(−½t′Σt).

If Σ is nonsingular, so is the A just constructed. Since X has density (29.6), Y = AX has, by the Jacobian transformation formula (20.20), density f(A^{−1}x)|det A^{−1}|. From Σ = AA′ follows |det A^{−1}| = (det Σ)^{−1/2}. Moreover, Σ^{−1} = (A′)^{−1}A^{−1}, so that |A^{−1}x|² = x′Σ^{−1}x. Thus the centered normal distribution has density (2π)^{−k/2}(det Σ)^{−1/2} exp(−½x′Σ^{−1}x) if Σ is nonsingular. If Σ is singular, the A constructed above must be singular as well, so that AX is confined to some hyperplane of dimension k − 1 and the distribution can have no density. By (29.8) and the uniqueness theorem for characteristic functions in R^k, a centered normal distribution is completely determined by its covariance matrix.

Suppose the off-diagonal elements of Σ are 0, and let A be the diagonal matrix with the σ_{uu}^{1/2} along the diagonal. Then Σ = AA′, and if X has the standard normal distribution, the components X_i are independent and hence so are the components σ_{ii}^{1/2}X_i of AX. Therefore, the components of a normally distributed random vector are independent if and only if they are uncorrelated.

If M is a j × k matrix and Y has in R^k the centered normal distribution with covariance matrix Σ, then MY has in R^j the characteristic function exp(−½(M′t)′Σ(M′t)) = exp(−½t′(MΣM′)t) (t ∈ R^j). Hence MY has the centered normal distribution in R^j with covariance matrix MΣM′. Thus a linear transformation of a normal distribution is itself normal.
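The factorization Σ = AA′ is easy to check numerically. The sketch below (pure Python, with a hypothetical 2 × 2 covariance matrix) takes A to be the Cholesky factor of Σ rather than the UD_0 of the text; any A with AA′ = Σ yields the same centered normal distribution, since by (29.8) the characteristic function depends on A only through AA′.

```python
import random
import math

random.seed(7)

# Hypothetical symmetric, positive definite covariance matrix.
sigma = [[2.0, 0.6],
         [0.6, 1.0]]

# Cholesky factor A (lower triangular) with A A' = sigma.
a11 = math.sqrt(sigma[0][0])
a21 = sigma[1][0] / a11
a22 = math.sqrt(sigma[1][1] - a21 * a21)

def sample_y():
    # X has independent standard normal components; Y = A X.
    x1, x2 = random.gauss(0, 1), random.gauss(0, 1)
    return a11 * x1, a21 * x1 + a22 * x2

n = 20000
ys = [sample_y() for _ in range(n)]
cov = [[sum(y[i] * y[j] for y in ys) / n for j in range(2)]
       for i in range(2)]
print(cov)  # empirical covariance, close to sigma
```

The empirical covariance of the simulated Y recovers Σ up to Monte Carlo error, illustrating that the centered normal law is determined by Σ alone.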
These normal distributions are special in that all the first moments vanish. The general normal distribution is a translation of one of these centered distributions. It is completely determined by its means and covariances.

The Central Limit Theorem
Let X_n = (X_{n1}, ..., X_{nk}) be independent random vectors all having the same distribution. Suppose that E[X_{nu}²] < ∞; let the vector of means be c = (c_1, ..., c_k), where c_u = E[X_{nu}], and let the covariance matrix be Σ = [σ_{uv}], where σ_{uv} = E[(X_{nu} − c_u)(X_{nv} − c_v)]. Put S_n = X_1 + ⋯ + X_n.

Theorem 29.5. Under these assumptions, the distribution of the random vector (S_n − nc)/√n converges weakly to the centered normal distribution with covariance matrix Σ.
PROOF. Let Y = (Y_1, ..., Y_k) be a normally distributed random vector with 0 means and covariance matrix Σ. For given t = (t_1, ..., t_k), let Z_n = ∑_{u=1}^k t_u(X_{nu} − c_u) and Z = ∑_{u=1}^k t_u Y_u. By Theorem 29.4, it suffices to prove that n^{−1/2}∑_{i=1}^n Z_i ⇒ Z (for arbitrary t). But this is an instant consequence of the Lindeberg–Lévy theorem (Theorem 27.1). •
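A simulation sketch of Theorem 29.5 in the spirit of the Cramér–Wold reduction (all concrete choices hypothetical): the two coordinates of each vector are strongly dependent, yet the one-dimensional projection t·(S_n − nc)/√n behaves like a centered normal variable whose variance is the exact variance of t·(X − c).

```python
import random
import math

random.seed(1)

def draw_x():
    # Hypothetical iid vectors in R^2 with dependent coordinates:
    # U uniform on [0, 1], X = (U, U^2).
    u = random.random()
    return u, u * u

c = (0.5, 1.0 / 3.0)   # exact means E[U], E[U^2]
t = (1.0, -2.0)        # fixed projection direction

# exact variance of t.(X - c) = U - 2U^2 + 1/6:
# E[(U - 2U^2)^2] = 1/3 - 1 + 4/5 = 2/15, E[U - 2U^2] = -1/6
var_z = 2.0 / 15.0 - (1.0 / 6.0) ** 2

n, reps = 400, 2000
samples = []
for _ in range(reps):
    s0 = s1 = 0.0
    for _ in range(n):
        x0, x1 = draw_x()
        s0 += x0 - c[0]
        s1 += x1 - c[1]
    # projection of (S_n - n c)/sqrt(n) on t
    samples.append((t[0] * s0 + t[1] * s1) / math.sqrt(n))

frac_neg = sum(1 for z in samples if z <= 0.0) / reps  # ~ 1/2 in the limit
emp_var = sum(z * z for z in samples) / reps           # ~ t' Sigma t
print(frac_neg, emp_var)
```

The empirical variance of the projected sums matches t′Σt, and about half the samples fall below 0, as the limiting centered normal law requires.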
PROBLEMS
29.1. A real function f on R^k is everywhere upper semicontinuous (see Problem 13.8) if for each x and ε there is a δ such that |x − y| < δ implies that f(y) < f(x) + ε; f is lower semicontinuous if −f is upper semicontinuous.
(a) Use condition (iii) of Theorem 29.1, Fatou's lemma, and (21.9) to show that, if μ_n ⇒ μ and f is bounded and lower semicontinuous, then

(29.9)    lim inf_n ∫ f dμ_n ≥ ∫ f dμ.

(b) Show that, if (29.9) holds for all bounded, lower semicontinuous functions f, then μ_n ⇒ μ.
(c) Prove the analogous results for upper semicontinuous functions.
29.2. (a) Show for probability measures on the line that μ_n × ν_n ⇒ μ × ν if and only if μ_n ⇒ μ and ν_n ⇒ ν.
(b) Suppose that X_n and Y_n are independent and that X and Y are independent. Show that, if X_n ⇒ X and Y_n ⇒ Y, then (X_n, Y_n) ⇒ (X, Y) and hence that X_n + Y_n ⇒ X + Y.
(c) Show that part (b) fails without independence.
(d) If F_n ⇒ F and G_n ⇒ G, then F_n * G_n ⇒ F * G. Prove this by part (b) and also by characteristic functions.

29.3. (a) Show that {μ_n} is tight if and only if for each ε there is a compact set K such that μ_n(K) > 1 − ε for all n.
(b) Show that {μ_n} is tight if and only if each of the k sequences of marginal distributions is tight on the line.

29.4. Assume of (X_n, Y_n) that X_n ⇒ X and Y_n ⇒ c. Show that (X_n, Y_n) ⇒ (X, c). This is an example of Problem 29.2(b) where X_n and Y_n need not be assumed independent.

29.5. Prove analogues for R^k of the corollaries to Theorem 26.3.

29.6. Suppose that f(X) and g(Y) are uncorrelated for all bounded continuous f and g. Show that X and Y are independent. Hint: Use characteristic functions.

29.7. 20.16 ↑ Suppose that the random vector X has a centered k-dimensional normal distribution whose covariance matrix has 1 as an eigenvalue of multiplicity r and 0 as an eigenvalue of multiplicity k − r. Show that |X|² has the chi-squared distribution with r degrees of freedom.

29.8. Multinomial sampling. Let p_1, ..., p_k be positive and add to 1, and let Z_1, Z_2, ... be independent k-dimensional random vectors such that Z_n has with probability p_i a 1 in the i-th component and 0's elsewhere. Then f_n = (f_{n1}, ..., f_{nk}) = ∑_{m=1}^n Z_m is the frequency count for a sample of size n from a multinomial population with cell probabilities p_i. Put X_{ni} = (f_{ni} − np_i)/√(np_i) and X_n = (X_{n1}, ..., X_{nk}).
(a) Show that X_n has mean values 0 and covariances σ_{ij} = (δ_{ij}p_j − p_i p_j)/√(p_i p_j).
(b) Show that the chi-squared statistic ∑_{i=1}^k (f_{ni} − np_i)²/(np_i) has asymptotically the chi-squared distribution with k − 1 degrees of freedom.

29.9. A theorem of Poincaré. (a) Suppose that X_n = (X_{n1}, ..., X_{nn}) is uniformly distributed over the surface of the sphere of radius √n in R^n. Fix r, and show that X_{n1}, ..., X_{nr} are in the limit independent, each with the standard normal distribution. Hint: If the components of Y_n = (Y_{n1}, ..., Y_{nn}) are independent, each with the standard normal distribution, then X_n has the same distribution as √n Y_n/|Y_n|.
(b) Suppose that the distribution of X_n = (X_{n1}, ..., X_{nn}) is spherically symmetric in the sense that X_n/|X_n| is uniformly distributed over the unit sphere. Assume that |X_n|²/n ⇒ 1, and show that X_{n1}, ..., X_{nr} are asymptotically independent and normal.
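A quick simulation sketch for the multinomial chi-squared statistic of Problem 29.8 (hypothetical cell probabilities): the statistic has exact expected value ∑_i(1 − p_i) = k − 1, which the empirical mean should approximate even at moderate n.

```python
import random

random.seed(3)

p = [0.2, 0.3, 0.5]   # hypothetical cell probabilities
k, n, reps = len(p), 500, 2000

def chi_sq_stat():
    # one multinomial sample of size n, then the chi-squared statistic
    f = [0] * k
    for _ in range(n):
        u, i = random.random(), 0
        acc = p[0]
        while u > acc:
            i += 1
            acc += p[i]
        f[i] += 1
    return sum((f[i] - n * p[i]) ** 2 / (n * p[i]) for i in range(k))

stats = [chi_sq_stat() for _ in range(reps)]
mean = sum(stats) / reps
print(mean)  # expected value is exactly k - 1 = 2
```

For k = 3 the empirical mean settles near 2, consistent with a limiting chi-squared distribution with k − 1 degrees of freedom.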
29.10. Let X_n = (X_{n1}, ..., X_{nk}) be random vectors satisfying the mixing condition (27.19) with α_n = O(n^{−5}). Suppose that the sequence is stationary (the distribution of (X_n, X_{n+1}, ..., X_{n+j}) is the same for all n), that E[X_{nu}] = 0, and that the X_{nu} are uniformly bounded. Show that if S_n = X_1 + ⋯ + X_n, then S_n/√n has in the limit the centered normal distribution with covariances

    σ_{uv} = E[X_{1u}X_{1v}] + ∑_{j=1}^∞ E[X_{1u}X_{1+j,v}] + ∑_{j=1}^∞ E[X_{1+j,u}X_{1v}].

Hint: Use the Cramér–Wold device.

29.11. As in Example 27.6, let {Y_n} be a Markov chain with finite state space S = {1, ..., s}, say. Suppose the transition probabilities p_{uv} are all positive and the initial probabilities p_u are the stationary ones. Let f_{nu} be the number of i ≤ n for which Y_i = u. Show that the normalized frequency count (f_{n1} − np_1, ..., f_{ns} − np_s)/√n has in the limit a centered normal distribution.
Suppose that r_α ≥ 2 for each α and r_α > 2 for some α. Then r > 2u, and since |E[X_{ni}^{r_α}]| ≤ M_n^{r_α−2}σ_{ni}², it follows that A_n(r_1, ..., r_u) ≤ (M_n/s_n)^{r−2u}A_n(2, ..., 2). But this goes to 0 because (30.5) holds and because A_n(2, ..., 2) is bounded by 1 (it increases to 1 if the sum in (30.8) is enlarged to include all the u-tuples (i_1, ..., i_u)).

It remains only to check (30.9) for r_1 = ⋯ = r_u = 2. As just noted, A_n(2, ..., 2) is at most 1, and it differs from 1 by ∑ s_n^{−2u}σ_{ni_1}² ⋯ σ_{ni_u}², the sum extending over the (i_1, ..., i_u) with at least one repeated index. Since σ_{ni}² ≤ M_n², the terms for example with i_u = i_{u−1} sum to at most M_n²s_n^{−2} ∑ s_n^{−2(u−1)}σ_{ni_1}² ⋯ σ_{ni_{u−1}}² ≤ M_n²s_n^{−2}. Thus 1 − A_n(2, ..., 2) ≤ u²M_n²s_n^{−2} → 0. This proves that the moments (30.7) converge to those of the normal distribution and hence that S_n/s_n ⇒ N. •
Application to Sampling Theory

Suppose that n numbers x_{n1}, ..., x_{nn}, not necessarily distinct, are associated with the elements of a population of size n. Suppose that these numbers are normalized by the requirement

(30.10)    ∑_{h=1}^n x_{nh} = 0,    ∑_{h=1}^n x_{nh}² = 1,

and put M_n = max_{h≤n} |x_{nh}|. An ordered sample X_{n1}, ..., X_{nk_n} is taken, where the sampling is without replacement. By (30.10), E[X_{nk}] = 0 and E[X_{nk}²] = 1/n. Let s_n² = k_n/n be the fraction of the population sampled. If the X_{nk} were independent, which they are not, S_n = X_{n1} + ⋯ + X_{nk_n} would have variance s_n². If k_n is small in comparison with n, the effects of dependence should be small. It will be shown that S_n/s_n ⇒ N if

(30.11)    k_n/n → 0,    M_n/s_n → 0,    k_n → ∞.

Since M_n² ≥ n^{−1} by (30.10), the second condition here in fact implies the third.

The moments again have the form (30.7), but this time E[X_{ni_1}^{r_1} ⋯ X_{ni_u}^{r_u}] cannot be factored as in (30.8). On the other hand, this expected value is by symmetry the same for each of the (k_n)_u = k_n(k_n − 1) ⋯ (k_n − u + 1) choices of the indices i_α in the sum ∑′. Thus

    A_n(r_1, ..., r_u) = (k_n)_u s_n^{−r} E[X_{n1}^{r_1} ⋯ X_{nu}^{r_u}].

The problem again is to prove (30.9).
SECTION 30. THE METHOD OF MOMENTS   393
The proof goes by induction on u. Now A_n(r) = k_n s_n^{−r} n^{−1} ∑_{h=1}^n x_{nh}^r, so that A_n(1) = 0 and A_n(2) = 1. If r ≥ 3, then |x_{nh}^r| ≤ M_n^{r−2}x_{nh}², and so |A_n(r)| ≤ (M_n/s_n)^{r−2} → 0 by (30.11).

Next suppose as induction hypothesis that (30.9) holds with u − 1 in place of u. Since the sampling is without replacement, E[X_{n1}^{r_1} ⋯ X_{nu}^{r_u}] = ∑ x_{nh_1}^{r_1} ⋯ x_{nh_u}^{r_u}/(n)_u, where the summation extends over the u-tuples (h_1, ..., h_u) of distinct integers in the range 1 ≤ h_α ≤ n. In this last sum enlarge the range by requiring of (h_1, h_2, ..., h_u) only that h_2, ..., h_u be distinct, and then compensate by subtracting away the terms where h_1 = h_2, where h_1 = h_3, and so on. The result is

    E[X_{n1}^{r_1} ⋯ X_{nu}^{r_u}] = (n(n)_{u−1}/(n)_u) E[X_{n1}^{r_1}] E[X_{n2}^{r_2} ⋯ X_{nu}^{r_u}]
        − ((n)_{u−1}/(n)_u) ∑_{α=2}^u E[X_{n2}^{r_2} ⋯ X_{nα}^{r_1+r_α} ⋯ X_{nu}^{r_u}].

This takes the place of the factorization made possible in (30.8) by the assumed independence there. It gives

    A_n(r_1, ..., r_u) = (n(k_n − u + 1)/((n − u + 1)k_n)) A_n(r_1) A_n(r_2, ..., r_u)
        − ((k_n − u + 1)/(n − u + 1)) ∑_{α=2}^u A_n(r_2, ..., r_1 + r_α, ..., r_u).

By the induction hypothesis the last sum is bounded, and the factor in front of it goes to 0 by (30.11). As for the first term on the right, the factor in front goes to 1. If r_1 ≠ 2, then A_n(r_1) → 0 and A_n(r_2, ..., r_u) is bounded, and so A_n(r_1, ..., r_u) → 0. The same holds by symmetry if r_α ≠ 2 for some α other than 1. If r_1 = ⋯ = r_u = 2, then A_n(r_1) = 1, and A_n(r_2, ..., r_u) → 1 by the induction hypothesis. Thus (30.9) holds in all cases, and S_n/s_n ⇒ N follows by the method of moments.
Application to Number Theory
Let g(m) be the number of distinct prime factors of the integer m; for example, g(3⁴ × 5²) = 2. Since there are infinitely many primes, g(m) is unbounded above; for the same reason, it drops back to 1 for infinitely many m (for the primes and their powers). Since g fluctuates in an irregular way, it
is natural to inquire into its average behavior.

On the space Ω of positive integers, let P_n be the probability measure that places mass 1/n at each of 1, 2, ..., n, so that among the first n positive integers the proportion that are contained in a given set A is just P_n(A). The problem is to study P_n[m: g(m) ≤ x] for large n. If δ_p(m) is 1 or 0 according as the prime p divides m or not, then

(30.12)    g(m) = ∑_p δ_p(m).

Probability theory can be used to investigate this sum because under P_n the δ_p(m) behave somewhat like independent random variables. If p_1, ..., p_u are distinct primes, then by the fundamental theorem of arithmetic, δ_{p_1}(m) = ⋯ = δ_{p_u}(m) = 1, that is, each p_i divides m, if and only if the product p_1 ⋯ p_u divides m. The probability under P_n of this is just n^{−1} times the number of m in the range 1 ≤ m ≤ n that are multiples of p_1 ⋯ p_u, and this number is the integer part of n/(p_1 ⋯ p_u). Thus

(30.13)    P_n[m: δ_{p_1}(m) = ⋯ = δ_{p_u}(m) = 1] = (1/n)⌊n/(p_1 ⋯ p_u)⌋

for distinct p_i. Now let X_p be independent random variables (on some probability space, one variable for each prime p) satisfying

    P[X_p = 1] = 1/p,    P[X_p = 0] = 1 − 1/p.

If p_1, ..., p_u are distinct, then

(30.14)    P[X_{p_i} = 1, i = 1, ..., u] = 1/(p_1 ⋯ p_u).

For fixed p_1, ..., p_u, (30.13) converges to (30.14) as n → ∞. Thus the behavior of the X_p can serve as a guide to that of the δ_p(m).

If m ≤ n, (30.12) is ∑_{p≤n} δ_p(m), because no prime exceeding m can divide it. The idea† is to compare this sum with the corresponding sum ∑_{p≤n} X_p. This will require from number theory the elementary estimate‡

(30.15)    ∑_{p≤x} 1/p = log log x + O(1).

The mean and variance of ∑_{p≤n} X_p are ∑_{p≤n} p^{−1} and ∑_{p≤n} p^{−1}(1 − p^{−1}); since ∑_p p^{−2} converges, each of these two sums is asymptotically log log n.

†Compare Problems 2.18, 5.19, and 6.16.
‡See, for example, Problem 18.17, or HARDY & WRIGHT, Chapter XXII.
Comparing ∑_{p≤n} δ_p(m) with ∑_{p≤n} X_p then leads one to conjecture the Erdős–Kac central limit theorem for the prime divisor function:

Theorem 30.3. For all x,

(30.16)    P_n[m: (g(m) − log log n)/√(log log n) ≤ x] → (1/√(2π)) ∫_{−∞}^x e^{−u²/2} du.
PROOF. The argument uses the method of moments. The first step is to show that (30.16) is unaffected if the range of p in (30.12) is further restricted. Let {α_n} be a sequence going to infinity slowly enough that

(30.17)    (log α_n)/(log n) → 0

but fast enough that

(30.18)    ∑_{α_n < p ≤ n} 1/p = o((log log n)^{1/2}).

Because of (30.15), these two requirements are met if, for example, log α_n = (log n)/(log log n). Now define

(30.19)    g_n(m) = ∑_{p ≤ α_n} δ_p(m).

For a function f of positive integers, let

    E_n[f] = n^{−1} ∑_{m=1}^n f(m)

denote its expected value computed with respect to P_n. By (30.13) for u = 1, E_n[δ_p] = (1/n)⌊n/p⌋ ≤ 1/p. By (30.18) and Markov's inequality,

    P_n[m: g(m) − g_n(m) ≥ ε(log log n)^{1/2}] ≤ (ε(log log n)^{1/2})^{−1} ∑_{α_n < p ≤ n} 1/p → 0.

Therefore (Theorem 25.4), (30.16) is unaffected if g_n(m) is substituted for g(m).
Now compare (30.19) with the corresponding sum S_n = ∑_{p ≤ α_n} X_p. The mean and variance of S_n are ∑_{p ≤ α_n} p^{−1} and ∑_{p ≤ α_n} p^{−1}(1 − p^{−1}), and by (30.15) and (30.17) each is asymptotically log log n.
] 0. Prove this with G in place of (a)
E
=
(b)
g
30.11.
)
=
e
J2Tr - oo
=
(a) op] (b) (c)
g]
--+
30.12.
g( ) m
g.
� v L. 'TT'
E
=
e
- 00
CHAPTER 6

Derivatives and Conditional Probability

SECTION 31. DERIVATIVES ON THE LINE*

This section on Lebesgue's theory of derivatives for real functions of a real variable serves to introduce the general theory of Radon–Nikodym derivatives, which underlies the modern theory of conditional probability. The results here are interesting in themselves and will be referred to later for purposes of illustration and comparison, but they will not be required in subsequent proofs.

The Fundamental Theorem of Calculus
To what extent are the operations of integration and differentiation inverse to one another? A function F is by definition an indefinite integral of another function f on [a, b] if

(31.1)    F(x) − F(a) = ∫_a^x f(t) dt

for a ≤ x ≤ b; F is by definition a primitive of f if it has derivative f:

(31.2)    F′(x) = f(x)

for a ≤ x ≤ b. According to the fundamental theorem of calculus (see (17.5)), these concepts coincide in the case of continuous f:
Theorem 31.1. Suppose that f is continuous on [a, b].
(i) An indefinite integral of f is a primitive of f: if (31.1) holds for all x in [a, b], then so does (31.2).
(ii) A primitive of f is an indefinite integral of f: if (31.2) holds for all x in [a, b], then so does (31.1).

*This section may be omitted.
A basic problem is to investigate the extent to which this theorem holds if f is not assumed continuous. First consider part (i). Suppose f is integrable, so that the right side of (31.1) makes sense. If f is 0 for x < m and 1 for x ≥ m (a < m < b), then an F satisfying (31.1) has no derivative at m. It is thus too much to ask that (31.2) hold for all x. On the other hand, according to a famous theorem of Lebesgue, if (31.1) holds for all x, then (31.2) holds almost everywhere, that is, except for x in a set of Lebesgue measure 0. In this section almost everywhere will refer to Lebesgue measure only. This result, the most one could hope for, will be proved below (Theorem 31.3).

Now consider part (ii) of Theorem 31.1. Suppose that (31.2) holds almost everywhere, as in Lebesgue's theorem just stated. Does (31.1) follow? The answer is no: If f is identically 0, and if F(x) is 0 for x < m and 1 for x ≥ m (a < m < b), then (31.2) holds almost everywhere, but (31.1) fails for x ≥ m. The question was wrongly posed, and the trouble is not far to seek: If f is integrable and (31.1) holds, then

(31.3)    F(x + h) − F(x) = ∫_a^b I_{(x, x+h]}(t) f(t) dt → 0

as h ↓ 0 by the dominated convergence theorem. Together with a similar argument for h ↑ 0 this shows that F must be continuous. Hence the question becomes this: If F is continuous and f is integrable, and if (31.2) holds almost everywhere, does (31.1) follow? The answer, strangely enough, is still no: In Example 31.1 there is constructed a continuous, strictly increasing F for which F′(x) = 0 except on a set of Lebesgue measure 0.

    ≥ ∑′(F(a_{i−1}) − F(a_i)) + |∑″(F(a_i) − F(a_{i−1}))|
    = ∑′(F(a_{i−1}) − F(a_i)) + (F(b) − F(a)) + ∑′(F(a_{i−1}) − F(a_i)).

As all the differences in this last expression are nonnegative, the absolute value bars can be suppressed; therefore,

    ‖F‖_Δ ≥ F(b) − F(a) + 2∑′(F(a_{i−1}) − F(a_i)) ≥ F(b) − F(a) + 2α∑′(a_i − a_{i−1}).

A function F has at each x four derivates, the upper and lower right
derivatives

    D̄F(x) = lim sup_{h↓0} (F(x + h) − F(x))/h,    D̲F(x) = lim inf_{h↓0} (F(x + h) − F(x))/h,

and the upper and lower left derivatives

    FD̄(x) = lim sup_{h↓0} (F(x) − F(x − h))/h,    FD̲(x) = lim inf_{h↓0} (F(x) − F(x − h))/h.
404
DERIVATIVES AND CONDITIONAL PROBABILITY
There is a derivative at x if and only if these four quantities have a common value. Suppose that F has a finite derivative F′(x) at x. If u < x < v, then

    (F(v) − F(u))/(v − u) − F′(x)
    = ((v − x)/(v − u))[(F(v) − F(x))/(v − x) − F′(x)] + ((x − u)/(v − u))[(F(x) − F(u))/(x − u) − F′(x)].

Therefore,

(31.8)    (F(v) − F(u))/(v − u) → F′(x)

as u ↑ x and v ↓ x; that is to say, for each ε there is a δ such that u ≤ x ≤ v and 0 < v − u < δ together imply that the quantities on either side of the arrow differ by less than ε.

Suppose that F is measurable and that it is continuous except possibly at countably many points. This will be true if F is nondecreasing or is the difference of two nondecreasing functions. Let M be a countable, dense set containing all the discontinuity points of F, and let r_n(x) be the smallest number of the form k/n exceeding x. Since F is continuous off M and M is dense, the supremum of (F(y) − F(x))/(y − x) over x < y < r_n(x) is unchanged if y is restricted to M, and so it is a measurable function of x; D̄F(x) is the limit of these suprema as n → ∞ and is therefore measurable, and the same argument applies to the other three derivates.
Singular Functions

If f(x) is nonnegative and integrable, differentiating its indefinite integral ∫_{−∞}^x f(t) dt leads back to f(x) except perhaps on a set of Lebesgue measure 0. That is the content of Theorem 31.3. The converse question is this: If F(x) is nondecreasing and hence has almost everywhere a derivative F′(x), does integrating F′(x) lead back to F(x)? As stated before, the answer turns out to be no even if F(x) is assumed continuous:

Example 31.1. Let X_1, X_2, ... be independent, identically distributed random variables such that P[X_n = 0] = p_0 and P[X_n = 1] = p_1 = 1 − p_0, and let X = ∑_{n=1}^∞ X_n 2^{−n}. Let F(x) = P[X ≤ x] be the distribution function of X. For an arbitrary sequence u_1, u_2, ... of 0's and 1's, P[X_n = u_n, n = 1, 2, ...] = lim_n p_{u_1} ⋯ p_{u_n} = 0; since x can have at most two dyadic expansions x = ∑_n u_n 2^{−n}, P[X = x] = 0. Thus F is everywhere continuous. Of course, F(0) = 0 and F(1) = 1. For 0 ≤ k < 2^n, k2^{−n} has the form ∑_{i=1}^n u_i 2^{−i} for some n-tuple (u_1, ..., u_n) of 0's and 1's. Since F is continuous,

(31.15)    F((k + 1)2^{−n}) − F(k2^{−n}) = P[X_i = u_i, i ≤ n] = p_{u_1} ⋯ p_{u_n}.
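The distribution function F of Example 31.1 can be evaluated at dyadic rationals directly from (31.15). The sketch below (the values of p_0, p_1 are hypothetical) uses the equivalent self-similarity relations F(x) = p_0·F(2x) for 0 ≤ x ≤ 1/2 and F(x) = p_0 + p_1·F(2x − 1) for 1/2 ≤ x ≤ 1, which follow from (31.15) by conditioning on the first digit X_1.

```python
p0, p1 = 0.3, 0.7   # hypothetical digit probabilities, p0 + p1 = 1

def F(x, depth=60):
    # distribution function of X = sum_n X_n 2^{-n}, P[X_n = 1] = p1,
    # evaluated via the dyadic self-similarity relation
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    if depth == 0:
        return 0.0          # truncation error at most max(p0, p1)^60
    if x <= 0.5:
        return p0 * F(2.0 * x, depth - 1)
    return p0 + p1 * F(2.0 * x - 1.0, depth - 1)

print(F(0.5), F(0.25), F(0.75))  # p0, p0^2, p0 + p1*p0 by (31.15)
```

For p_1 ≠ 1/2 this F is continuous and strictly increasing yet its increments over dyadic intervals are products p_{u_1} ⋯ p_{u_n}, the mechanism behind its singularity.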
Theorem 31.4. Suppose that F and μ are related by (31.18), and let A be a Borel set such that F has a derivative F′(x) at each x in A.
(i) If F′(x) ≤ α for x ∈ A, then μ(A) ≤ αλ(A).
(ii) If F′(x) ≥ α for x ∈ A, then μ(A) ≥ αλ(A).

PROOF. It is no restriction to assume A bounded. Fix ε for the moment. Let E be a countable, dense set, and let A_n be the set of x in A such that

(31.19)    μ(I) ≤ (α + ε)λ(I)

holds for every interval I = (u, v] with u, v ∈ E, x ∈ I, and 0 < λ(I) < n^{−1}. Then A_n is a Borel set and (see (31.8)) A_n ↑ A under the hypothesis of (i). By Theorem 11.4 there exist disjoint intervals I_{nk} (open on the left, closed on the right) such that A_n ⊂ ⋃_k I_{nk} and

(31.20)    ∑_k λ(I_{nk}) ≤ λ(A_n) + ε.
It is no restriction to assume that each I_{nk} has endpoints in E, meets A_n, and satisfies λ(I_{nk}) < n^{−1}. Then (31.19) applies to each I_{nk}, and hence

    μ(A_n) ≤ ∑_k μ(I_{nk}) ≤ (α + ε)∑_k λ(I_{nk}) ≤ (α + ε)(λ(A_n) + ε).

In the extreme terms here let n → ∞ and then ε → 0; (i) follows.

To prove (ii), let the countable, dense set E contain all the discontinuity points of F, and use the same argument with μ(I) ≥ (α − ε)λ(I) in place of (31.19) and ∑_k μ(I_{nk}) ≤ μ(A_n) + ε in place of (31.20). Since E contains all the discontinuity points of F, it is again no restriction to assume that each I_{nk} has endpoints in E, meets A_n, and satisfies λ(I_{nk}) < n^{−1}. It follows that

    μ(A_n) + ε ≥ ∑_k μ(I_{nk}) ≥ (α − ε)∑_k λ(I_{nk}) ≥ (α − ε)λ(A_n).

Again let n → ∞ and then ε → 0. •

The measures μ and λ have disjoint supports if there exist Borel sets S_μ and S_λ such that

(31.21)    μ(R^1 − S_μ) = 0,    λ(R^1 − S_λ) = 0,    S_μ ∩ S_λ = ∅.
Suppose that F and J.L are related by (31.18). A necessary and sufficient condition for J.L and A to have disjoint supports is that F'(x) = 0 except on a set of Lebesgue measure 0. Theorem 31.5.
PROOF. By Theorem 31.4, J.L[x: l xl < a, F'(x) < E} < 2aE, and so {let E -7 0 and then a -7 oo) J.L[x: F'(x) = 0] = 0. If F'(x) = 0 outside a set of
Lebesgue measure O, then S,� [ x : F'(x) = 0] and S,.. = R 1 - S,� satisfy (3 1.21). Suppose that there exist S,_. and S,� satisfying (31.21). By the other half of Theorem 31.4, d[x: F'(x) > E] = d[ x : x E S,�, F'(x) > E] < J.L(S,�) = O, and so {let E -7 0) F'(x) = 0 except on a set of Lebesgue measure 0. • =
Example 31.2. Suppose that μ is discrete, consisting of a mass m_k at each of countably many points x_k. Then F(x) = ∑ m_k, the sum extending over the k for which x_k ≤ x. Certainly, μ and λ have disjoint supports, and so F′ must vanish except on a set of Lebesgue measure 0. This is directly obvious if the x_k have no limit points, but not, for example, if they are dense. •
Example 31.3. Consider again the distribution function F in Example 31.1. Here μ(A) = P[X ∈ A]. Since F is singular, μ and λ have disjoint supports. This fact has an interesting direct probabilistic proof.

For x in the unit interval, let d_1(x), d_2(x), ... be the digits in its nonterminating dyadic expansion, as in Section 1. If (k2^{−n}, (k + 1)2^{−n}] is the dyadic interval of rank n consisting of the reals whose expansions begin with the digits u_1, ..., u_n, then, by (31.15),

(31.22)    μ(k2^{−n}, (k + 1)2^{−n}] = p_{u_1} ⋯ p_{u_n}.

If the unit interval is regarded as a probability space under the measure μ, then the d_i(x) become random variables, and (31.22) says that these random variables are independent and identically distributed and μ[x: d_i(x) = 0] = p_0, μ[x: d_i(x) = 1] = p_1. Since these random variables have expected value p_1, the strong law of large numbers implies that their averages go to p_1 with probability 1:

(31.23)    μ[x ∈ (0, 1]: lim_n (1/n)∑_{i=1}^n d_i(x) = p_1] = 1.

On the other hand, by the normal number theorem,

(31.24)    λ[x ∈ (0, 1]: lim_n (1/n)∑_{i=1}^n d_i(x) = ½] = 1.

(Of course, (31.24) is just (31.23) for the special case p_0 = p_1 = ½; in this case μ and λ coincide in the unit interval.) If p_1 ≠ ½, the sets in (31.23) and (31.24) are disjoint, so that μ and λ do have disjoint supports. It was shown in Example 31.1 that if F′(x) exists at all (0 < x < 1), then it is 0. By part (i) of Theorem 31.4 the set where F′(x) fails to exist therefore has μ-measure 1; in particular, this set is uncountable. •

In the singular case, according to Theorem 31.5, F′ vanishes on a support of λ. It is natural to ask for the size of F′ on a support of μ. If B is the x-set where F has a finite derivative, and if (31.21) holds, then by Theorem 31.4, μ[x ∈ B: F′(x) ≤ n] = μ[x ∈ B ∩ S_μ: F′(x) ≤ n] ≤ nλ(S_μ) = 0, and hence μ(B) = 0. The next theorem goes further.

Theorem 31.6. Suppose that F and μ are related by (31.18) and that μ and λ have disjoint supports. Then, except for x in a set of μ-measure 0, FD̲(x) = ∞.
412
DERIVATIVES AND CONDITIONAL PROBABILITY
If μ has finite support, then clearly FD(x) = ∞ if μ{x} > 0, while DF(x) = 0 for all x. Since F is continuous from the right, FD and DF play different roles.†

PROOF. Let A_n be the set where FD(x) < n. The problem is to prove that μ(A_n) = 0, and by (31.21) it is enough to prove that μ(A_n ∩ S_μ) = 0. Further, by Theorem 12.3 it is enough to prove that μ(K) = 0 if K is a compact subset of A_n ∩ S_μ.

Fix ε. Since λ(K) = 0, there is an open G such that K ⊂ G and λ(G) < ε. If x ∈ K, then x ∈ A_n, and by the definition of FD and the right-continuity of F, there is an open interval I_x for which x ∈ I_x ⊂ G and μ(I_x) < nλ(I_x). By compactness, K has a finite subcover I_{x_1}, ..., I_{x_k}. If some three of these have a nonempty intersection, one of them must be contained in the union of the other two. Such superfluous intervals can be removed from the subcover, and it is therefore possible to assume that no point of K lies in more than two of the I_{x_i}. But then

μ(K) ≤ μ(⋃_i I_{x_i}) ≤ Σ_i μ(I_{x_i}) < n Σ_i λ(I_{x_i}) ≤ 2nλ(⋃_i I_{x_i}) ≤ 2nλ(G) < 2nε.

Since ε was arbitrary, μ(K) = 0. ∎
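The strong-law dichotomy in Example 31.3 is easy to watch numerically. The sketch below is ours, not the book's; the digit probability p_1 = 0.9, the seed, and the sample size are arbitrary choices. Sampling dyadic digits under μ (Bernoulli(p_1)) and under λ (Bernoulli(1/2)) shows the averages settling near p_1 and 1/2 respectively, which is why the two measures concentrate on disjoint sets.

```python
import random

def digit_average(p1, n, rng):
    """Average of n i.i.d. dyadic digits that equal 1 with probability p1.

    Under mu the digits of x are Bernoulli(p1); under Lebesgue measure
    they are Bernoulli(1/2).  The strong law drives the average to p1
    (resp. 1/2), the disjoint-supports argument of Example 31.3.
    """
    return sum(1 if rng.random() < p1 else 0 for _ in range(n)) / n

rng = random.Random(31)
avg_mu = digit_average(0.9, 100_000, rng)   # digits sampled under mu
avg_leb = digit_average(0.5, 100_000, rng)  # digits sampled under lambda
print(avg_mu, avg_leb)                      # near 0.9 and near 0.5
```

With p_1 ≠ 1/2 the two limit sets in (31.23) and (31.24) are disjoint, and each average lands in its own set.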
If A_n = [x: f(x) > n], then A_n ↓ [x: f(x) = ∞], a set of Lebesgue measure 0, and since f is integrable, the dominated convergence theorem implies that ∫_{A_n} f(x) dx < ε/2 for large n. Fix such an n and take δ = ε/2n. If λ(A) < δ, then

∫_A f(x) dx = ∫_{A−A_n} f(x) dx + ∫_{A∩A_n} f(x) dx ≤ nλ(A) + ε/2 < ε.

If F is given by (31.26), then F(b) − F(a) = ∫_a^b f(x) dx, and (31.27) has this consequence: For every ε there exists a δ such that for each finite collection [a_i, b_i], i = 1, ..., k, of nonoverlapping† intervals,

(31.28)    Σ_{i=1}^k |F(b_i) − F(a_i)| < ε    if Σ_{i=1}^k (b_i − a_i) < δ.

A function F with this property is said to be absolutely continuous.‡ A function of the form (31.26) (f integrable) is thus absolutely continuous. A continuous distribution function is uniformly continuous, and so for every ε there is a δ such that the implication in (31.28) holds provided that k = 1. The definition of absolute continuity requires this to hold whatever k may be, which puts severe restrictions on F. Absolute continuity of F can be characterized in terms of the measure μ:
Theorem 31.7. Suppose that F and μ are related by (31.18). Then F is absolutely continuous in the sense of (31.28) if and only if μ(A) = 0 for every A for which λ(A) = 0.
†Intervals are nonoverlapping if their interiors are disjoint. In this definition it is immaterial whether the intervals are regarded as closed or open or half-open, since this has no effect on (31.28).
‡The definition applies to all functions, not just to distribution functions. If F is a distribution function as in the present discussion, the absolute-value bars in (31.28) are unnecessary.
PROOF. Suppose that F is absolutely continuous and that λ(A) = 0. Given ε, choose δ so that (31.28) holds. There exists a countable disjoint union B = ⋃_k I_k of intervals such that A ⊂ B and λ(B) < δ. By (31.28) it follows that μ(⋃_{k=1}^n I_k) < ε for each n and hence that μ(A) ≤ μ(B) ≤ ε. Since ε was arbitrary, μ(A) = 0.

If F is not absolutely continuous, then there exists an ε such that for every δ some finite disjoint union A of intervals satisfies λ(A) < δ and μ(A) ≥ ε. Choose A_n so that λ(A_n) < n^{-2} and μ(A_n) ≥ ε. Then λ(lim sup_n A_n) = 0 by the first Borel-Cantelli lemma (Theorem 4.3, the proof of which does not require P to be a probability measure or even finite). On the other hand, μ(lim sup_n A_n) ≥ ε > 0 by Theorem 4.1 (the proof of which applies because μ is assumed finite). ∎
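Theorem 31.7 can be watched failing for the Cantor function, whose measure lives on the λ-null Cantor set. The sketch below is ours (the stage depth n = 8 is an arbitrary choice): over the 2^n stage-n construction intervals the total length (2/3)^n shrinks to 0 while the total increase of F stays exactly 1, so no δ can work in (31.28).

```python
from fractions import Fraction

def cantor_intervals(n):
    """The 2^n closed intervals left after n middle-thirds removals."""
    ivs = [(Fraction(0), Fraction(1))]
    for _ in range(n):
        ivs = [piece for a, b in ivs
               for piece in ((a, a + (b - a) / 3), (b - (b - a) / 3, b))]
    return ivs

def cantor_F(x):
    """Cantor function at a triadic rational x in [0, 1], computed exactly."""
    if x >= 1:
        return Fraction(1)
    total, scale = Fraction(0), Fraction(1, 2)
    while x > 0:
        x *= 3
        d = int(x)          # next ternary digit of x: 0, 1, or 2
        x -= d
        if d == 1:          # x meets a removed middle third; F is flat there
            return total + scale
        if d == 2:
            total += scale
        scale /= 2
    return total

n = 8
ivs = cantor_intervals(n)
lengths = sum(b - a for a, b in ivs)                     # (2/3)^n, tending to 0
variation = sum(cantor_F(b) - cantor_F(a) for a, b in ivs)  # exactly 1 at every stage
print(float(lengths), float(variation))
```

Exact rational arithmetic via `Fraction` keeps the endpoint values of F free of rounding, so the variation comes out as exactly 1.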
This result leads to a characterization of indefinite integrals.
Theorem 31.8. A distribution function F(x) has the form ∫_{−∞}^x f(t) dt for an integrable f if and only if it is absolutely continuous in the sense of (31.28).

PROOF. That an F of the form (31.26) is absolutely continuous was proved in the argument leading to the definition (31.28). For another proof, apply Theorem 31.7: if F has this form, then λ(A) = 0 implies that μ(A) = ∫_A f(t) dt = 0.
To go the other way, define for any distribution function F

(31.29)    F_ac(x) = ∫_{−∞}^x F′(t) dt

and

(31.30)    F_s(x) = F(x) − F_ac(x).

Then F_s is right-continuous, and by (31.9) it is both nonnegative and nondecreasing. Since F_ac comes from a density, it is absolutely continuous. By Theorem 31.3, F′_ac = F′ and hence F′_s = 0 except on a set of Lebesgue measure 0. Thus F has a decomposition

(31.31)    F(x) = F_ac(x) + F_s(x),
where F_ac has a density and hence is absolutely continuous and F_s is singular. This is called the Lebesgue decomposition.

Suppose that F is absolutely continuous. Then F_s of (31.30) must, as the difference of absolutely continuous functions, be absolutely continuous itself. If it can be shown that F_s is identically 0, it will follow that F = F_ac has the required form. It thus suffices to show that a distribution function that is both absolutely continuous and singular must vanish.
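A minimal numerical sketch of the decomposition (31.31), for an arbitrarily chosen mixed F (half a uniform density plus half an atom at 1/2; the example and all numbers are ours): integrating the a.e. derivative recovers F_ac, and the remainder F_s carries the jump.

```python
def F(x):
    # mixed distribution function on [0, 1]: uniform part plus an atom at 1/2
    return 0.5 * x + (0.5 if x >= 0.5 else 0.0)

def F_prime(x, h=1e-6):
    # two-sided difference quotient; exists and equals 0.5 except at the atom
    return (F(x + h) - F(x - h)) / (2 * h)

def F_ac(x, steps=10_000):
    # (31.29): integrate the derivative from 0 to x (midpoint Riemann sum)
    dx = x / steps
    return sum(F_prime((i + 0.5) * dx) * dx for i in range(steps))

x = 0.75
ac = F_ac(x)        # absolutely continuous part, about 0.5 * 0.75 = 0.375
s = F(x) - ac       # singular part, the jump of size 0.5 at the atom
print(ac, s)
```

The singular remainder here is purely discrete; Problem 31.20 splits the general F_s further into a continuous singular part and a jump part.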
SECTION 31. DERIVATIVES ON THE LINE 415
If a distribution function F is singular, then by Theorem 31.5 there are disjoint supports S_μ and S_λ. But if F is also absolutely continuous, then from λ(S_μ) = 0 it follows by Theorem 31.7 that μ(S_μ) = 0. But then μ(R¹) = 0, and so F(x) ≡ 0. ∎
This theorem identifies the distribution functions that are integrals of their derivatives as the absolutely continuous functions. Theorem 31.7, on the other hand, characterizes absolute continuity in a way that extends to spaces Ω without the geometric structure of the line necessary to a treatment involving distribution functions and ordinary derivatives.† The extension is studied in Section 32.

Functions of Bounded Variation
The remainder of this section briefly sketches the extension of the preceding theory to functions that are not monotone. The results are for simplicity given only for a finite interval [a, b] and for functions F on [a, b] satisfying F(a) = 0. If F(x) = ∫_a^x f(t) dt and f is integrable but not necessarily nonnegative, then

F(x) = ∫_a^x f⁺(t) dt − ∫_a^x f⁻(t) dt

exhibits F as the difference of two nondecreasing functions. The problem of characterizing indefinite integrals thus leads to the preliminary problem of characterizing functions representable as the difference of nondecreasing functions.

Now F is said to be of bounded variation over [a, b] if sup_Δ ‖F‖_Δ is finite, where ‖F‖_Δ is defined by (31.5) and Δ ranges over all partitions (31.4) of [a, b]. Clearly, a difference of nondecreasing functions is of bounded variation. But the converse holds as well: For every finite collection Γ of nonoverlapping intervals [x_i, y_i] in [a, b], put

P_Γ = Σ_i (F(y_i) − F(x_i))⁺,    N_Γ = Σ_i (F(y_i) − F(x_i))⁻.

Now define P(x) = sup_Γ P_Γ and N(x) = sup_Γ N_Γ, where the suprema extend over partitions Γ of [a, x]. If F is of bounded variation, then P(x) and N(x) are finite. For each such Γ, P_Γ = N_Γ + F(x). This gives the inequalities

P_Γ ≤ N(x) + F(x),    P(x) ≥ N_Γ + F(x),

which in turn lead to the inequalities

P(x) ≤ N(x) + F(x),    P(x) ≥ N(x) + F(x).

Thus

(31.32)    F(x) = P(x) − N(x)

gives the required representation: A function is the difference of two nondecreasing functions if and only if it is of bounded variation.
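The quantities P_Γ and N_Γ are easy to compute. In the sketch below (F = sin on [0, 5] and the partition sizes are arbitrary choices of ours) the telescoping identity P_Γ − N_Γ = F(x) − F(a) holds for every partition of [a, x], and refining the partition drives P_Γ up toward the positive variation P(x).

```python
import math

def pos_neg_variation(F, a, x, k):
    """P_Gamma and N_Gamma over the uniform k-interval partition of [a, x]."""
    pts = [a + (x - a) * i / k for i in range(k + 1)]
    diffs = [F(pts[i + 1]) - F(pts[i]) for i in range(k)]
    P = sum(d for d in diffs if d > 0)    # total rise over the partition
    N = -sum(d for d in diffs if d < 0)   # total fall over the partition
    return P, N

F = math.sin
P, N = pos_neg_variation(F, 0.0, 5.0, 100_000)
print(P - N, F(5.0))          # the identity P_Gamma - N_Gamma = F(x) - F(a)
print(P, 2 + math.sin(5.0))   # refined P_Gamma approximates P(x) for sin on [0, 5]
```

For sin on [0, 5] the exact positive variation is the total rise, 2 + sin 5, so the refined P_Γ can be checked against a closed form.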
†Theorems 31.3 and 31.8 do have geometric analogues in R^k; see RUDIN², Chapter 8.
If T_Γ = P_Γ + N_Γ, then T_Γ = Σ_i |F(y_i) − F(x_i)|. According to the definition (31.28), F is absolutely continuous if for every ε there exists a δ such that T_Γ < ε whenever the intervals in the collection Γ have total length less than δ. If F is absolutely continuous, take ε = 1, take the corresponding δ, and decompose [a, b] into a finite number, say n, of subintervals [u_{j−1}, u_j) of lengths less than δ. Any partition Δ of [a, b] can by the insertion of the u_j be split into n sets of intervals, each of total length less than δ, and it follows that ‖F‖_Δ ≤ n. Therefore, an absolutely continuous function is necessarily of bounded variation.

An absolutely continuous F thus has a representation (31.32). It follows by the definitions that P(y) − P(x) is at most sup_Γ T_Γ, where Γ ranges over the partitions of [x, y]. If [x_i, y_i] are nonoverlapping intervals, then Σ_i (P(y_i) − P(x_i)) is at most sup_Γ T_Γ, where now Γ ranges over the collections of intervals that partition each of the [x_i, y_i]. Therefore, if F is absolutely continuous, there exists for each ε a δ such that Σ_i (y_i − x_i) < δ implies Σ_i (P(y_i) − P(x_i)) < ε. In other words, P is absolutely continuous. Similarly, N is absolutely continuous.

Thus an absolutely continuous F is the difference of two nondecreasing absolutely continuous functions. By Theorem 31.8, each of these is an indefinite integral, which implies that F is an indefinite integral as well: For an F on [a, b] satisfying F(a) = 0, F is an indefinite integral if and only if it is absolutely continuous.
31.17. Let μ and F be as in Example 31.1, with p_0 ≠ 1/2. Let I_n(x) be the dyadic interval of rank n containing x, and put s_n(x) = Σ_{k=1}^n d_k(x), so that μ(I_n(x)) = p_0^{n−s_n(x)} p_1^{s_n(x)}.
(a) Show that μ(I_n(x))/2^{−n} → 0 for x in a set of λ-measure 1, while μ(I_n(x))/2^{−n} → ∞ for x in a set of μ-measure 1.
(b) Show that F′(x) = 0 if μ(I_n(x))/2^{−n} → 0.

31.18. Let a(x) be the distance from x to the nearest integer, and put F(x) = Σ_{k=0}^∞ a_k(x), where a_k(x) = 2^{−k} a(2^k x). Show that F is continuous but nowhere differentiable.

†A Lipschitz condition of order α holds at x if F(x + h) − F(x) = O(|h|^α) as h → 0; for α > 1 this implies F′(x) = 0, and for 0 < α ≤ 1 it is a smoothness condition stronger than continuity and weaker than differentiability.
31.19. Show (see (31.31)) that (apart from addition of constants) a function can have only one representation F_1 + F_2 with F_1 absolutely continuous and F_2 singular.

31.20. Show that the F_s in the Lebesgue decomposition can be further split into F_d + F_cs, where F_cs is continuous and singular and F_d increases only in jumps, in the sense that the corresponding measure is discrete. The complete decomposition is then F = F_ac + F_cs + F_d.

31.21. (a) Suppose that F assumes the value 0 in each of a sequence of intervals shrinking to 0 and that Σ_i |F(y_i)| = ∞ for some points y_i lying between these intervals. Show that F is of unbounded variation.
(b) Define F over [0, 1] by F(0) = 0 and F(x) = x^a sin x^{−1} for 0 < x ≤ 1. For which values of a is F of bounded variation?

31.22. If f is nonnegative and Lebesgue integrable, then by Theorem 31.3 and (31.8), except for x in a set of Lebesgue measure 0,

(31.35)    (v − u)^{−1} ∫_u^v f(t) dt → f(x)    if u ≤ x ≤ v and v − u → 0.

There is an analogue in which Lebesgue measure is replaced by a general probability measure μ: If f is nonnegative and integrable with respect to μ, then as h ↓ 0,

(31.36)    μ(x − h, x + h]^{−1} ∫_{(x−h, x+h]} f(t) μ(dt) → f(x)

on a set of μ-measure 1. Let F be the distribution function corresponding to μ, and put φ(u) = inf[x: u ≤ F(x)] for 0 < u < 1 (see (14.5)). Deduce (31.36) from (31.35) by change of variable and Problem 14.4.

SECTION 32. THE RADON-NIKODYM THEOREM
Theorem 32.1. There exist sets A⁺ and A⁻ such that A⁺ ∪ A⁻ = Ω, A⁺ ∩ A⁻ = ∅, φ(E) ≥ 0 for all E in ℱ with E ⊂ A⁺, and φ(E) ≤ 0 for all E in ℱ with E ⊂ A⁻.

A set A is positive if φ(E) ≥ 0 for E ⊂ A and negative if φ(E) ≤ 0 for E ⊂ A. The A⁺ and A⁻ in the theorem decompose Ω into a positive and a negative set. This is the Hahn decomposition. If φ(A) = ∫_A f dμ (see Example 32.1), the result is easy: take A⁺ = [f ≥ 0] and A⁻ = [f < 0].
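For a discrete signed measure the easy case just mentioned can be exhibited directly. A sketch (the five weights are arbitrary choices of ours): A⁺ collects the points of nonnegative weight, and the resulting φ⁺(A) = φ(A ∩ A⁺) is checked by brute force to attain the supremum of φ(E) over E ⊂ A, the characterization established after (32.2) below.

```python
from itertools import chain, combinations

# a finite signed measure phi(A) = sum of w(i) over A; the weights are arbitrary
w = {1: 2.0, 2: -1.5, 3: 0.5, 4: -0.25, 5: 1.0}

def phi(A):
    return sum(w[i] for i in A)

A_plus = {i for i in w if w[i] >= 0}    # positive set, the analogue of [f >= 0]
A_minus = set(w) - A_plus               # negative set, the analogue of [f < 0]

def phi_plus(A):                        # upper variation: phi(A intersect A+)
    return phi(A & A_plus)

def phi_minus(A):                       # lower variation: -phi(A intersect A-)
    return -phi(A & A_minus)

def subsets(s):
    s = sorted(s)
    return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))

A = {1, 2, 4, 5}
jordan = phi_plus(A) - phi_minus(A)               # the Jordan decomposition (32.2)
sup_phi = max(phi(set(E)) for E in subsets(A))    # sup of phi(E) over E subset of A
print(phi(A), jordan, sup_phi, phi_plus(A))
```

The brute-force supremum is feasible only because the space is finite; the point is that it lands exactly on φ(A ∩ A⁺).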
PROOF. Let α = sup[φ(A): A ∈ ℱ]. Suppose that there exists a set A⁺ satisfying φ(A⁺) = α (which implies that α is finite). Let A⁻ = Ω − A⁺. If A ⊂ A⁺ and φ(A) < 0, then φ(A⁺ − A) > α, an impossibility; hence A⁺ is a positive set. If A ⊂ A⁻ and φ(A) > 0, then φ(A⁺ ∪ A) > α, an impossibility; hence A⁻ is a negative set.

It is therefore only necessary to construct a set A⁺ for which φ(A⁺) = α. Choose sets A_n such that φ(A_n) → α, and let A = ⋃_n A_n. For each n consider the 2^n sets B_{ni} (some perhaps empty) that are intersections of the form ⋂_{k=1}^n A′_k, where each A′_k is either A_k or else A − A_k. The collection 𝓑_n = [B_{ni}: 1 ≤ i ≤ 2^n] of these sets partitions A. Clearly, 𝓑_n refines 𝓑_{n−1}: each B_{nj} is contained in exactly one of the B_{n−1,i}. Let C_n be the union of those B_{ni} in 𝓑_n for which φ(B_{ni}) > 0. Since A_n is the union of certain of the B_{ni}, it follows that φ(A_n) ≤ φ(C_n). Since the partitions 𝓑_1, 𝓑_2, ... are successively finer, m ≤ n implies that (C_m ∪ ⋯ ∪ C_{n−1} ∪ C_n) − (C_m ∪ ⋯ ∪ C_{n−1}) is the union (perhaps empty) of certain of the sets B_{ni}; the B_{ni} in this union must satisfy φ(B_{ni}) > 0 because they are contained in C_n. Therefore, φ(C_m ∪ ⋯ ∪ C_{n−1}) ≤ φ(C_m ∪ ⋯ ∪ C_n), so that by induction φ(A_m) ≤ φ(C_m) ≤ φ(C_m ∪ ⋯ ∪ C_n). If D_m = ⋃_{n=m}^∞ C_n, then by Lemma 1 (take E_i = C_m ∪ ⋯ ∪ C_{m+i}) φ(A_m) ≤ φ(D_m). Let A⁺ = ⋂_{m=1}^∞ D_m (note that A⁺ = lim sup_n C_n), so that D_m ↓ A⁺. By Lemma 1, α = lim_m φ(A_m) ≤ lim_m φ(D_m) = φ(A⁺). Thus A⁺ does have maximal φ-value. ∎
If φ⁺(A) = φ(A ∩ A⁺) and φ⁻(A) = −φ(A ∩ A⁻), then φ⁺ and φ⁻ are finite measures. Thus

(32.2)    φ(A) = φ⁺(A) − φ⁻(A)

represents the set function φ as the difference of two finite measures having disjoint supports. If E ⊂ A, then φ(E) ≤ φ⁺(E) ≤ φ⁺(A), and there is equality if E = A ∩ A⁺. Therefore, φ⁺(A) = sup_{E⊂A} φ(E). Similarly, φ⁻(A) = −inf_{E⊂A} φ(E). The measures φ⁺ and φ⁻ are called the upper and lower variations of φ, and the measure |φ| with value φ⁺(A) + φ⁻(A) at A is called the total variation. The representation (32.2) is the Jordan decomposition.
Absolute Continuity and Singularity
Measures μ and ν on (Ω, ℱ) are by definition mutually singular if they have disjoint supports, that is, if there exist sets S_μ and S_ν such that

(32.3)    μ(Ω − S_μ) = 0,    ν(Ω − S_ν) = 0,    S_μ ∩ S_ν = ∅.
In this case μ is also said to be singular with respect to ν and ν singular with respect to μ. Note that measures are automatically singular if one of them is identically 0. According to Theorem 31.5 a finite measure on R¹ with distribution function F is singular with respect to Lebesgue measure in the sense of (32.3) if and only if F′(x) = 0 except on a set of Lebesgue measure 0. In Section 31 the latter condition was taken as the definition of singularity, but of course it is the requirement of disjoint supports that can be generalized from R¹ to an arbitrary Ω.

The measure ν is absolutely continuous with respect to μ if for each A in ℱ, μ(A) = 0 implies ν(A) = 0. In this case ν is also said to be dominated by μ, and the relation is indicated by ν ≪ μ. If ν ≪ μ and μ ≪ ν, the measures are equivalent, indicated by ν ≡ μ. A finite measure on the line is by Theorem 31.7 absolutely continuous in this sense with respect to Lebesgue measure if and only if the corresponding distribution function F satisfies the condition (31.28). The latter condition, taken in Section 31 as the definition of absolute continuity, is again not the one that generalizes from R¹ to Ω.

There is an ε-δ idea related to the definition of absolute continuity given above. Suppose that for every ε there exists a δ such that
(32.4)    ν(A) < ε    if μ(A) < δ.
If this condition holds, μ(A) = 0 implies that ν(A) < ε for all ε, and so ν ≪ μ. Suppose, on the other hand, that this condition fails and that ν is finite. Then for some ε there exist sets A_n such that μ(A_n) < n^{-2} and ν(A_n) ≥ ε. If A = lim sup_n A_n, then μ(A) = 0 by the first Borel-Cantelli lemma (which applies to arbitrary measures), but ν(A) ≥ ε > 0 by the right-hand inequality in (4.9) (which applies because ν is finite). Hence ν ≪ μ fails, and so (32.4) follows if ν is finite and ν ≪ μ. If ν is finite, in order that ν ≪ μ it is therefore necessary and sufficient that for every ε there exist a δ satisfying (32.4). This condition is not suitable as a definition, because it need not follow from ν ≪ μ if ν is infinite.†
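The failure of (32.4) for infinite ν can be made concrete. This is a sketch and the choice is ours (compare Problem 32.3): μ puts mass 2^{−n} at the integer n and ν is counting measure. Then ν ≪ μ, since μ(A) = 0 forces A to be empty, yet sets of arbitrarily small μ-measure carry arbitrarily large ν-measure.

```python
def mu(A):
    """mu puts mass 2^-n at the integer n; mu is finite."""
    return sum(2.0 ** -n for n in A)

def nu(A):
    """nu is counting measure; nu is infinite but sigma-finite."""
    return len(list(A))

# nu << mu, yet no delta can work in (32.4): the blocks {k+1, ..., 2k}
# have mu-measure below 2^-k while their nu-measure k grows without bound.
for k in (5, 10, 20):
    A = list(range(k + 1, 2 * k + 1))
    print(k, mu(A), nu(A))
```

So for finite ν the ε-δ condition characterizes ν ≪ μ, but for infinite ν it is strictly stronger.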
The Main Theorem
If ν(A) = ∫_A f dμ, then certainly ν ≪ μ. The Radon-Nikodym theorem goes in the opposite direction:

Theorem 32.2. If μ and ν are σ-finite measures such that ν ≪ μ, then there exists a nonnegative f, a density, such that ν(A) = ∫_A f dμ for all A ∈ ℱ. For two such densities f and g, μ[f ≠ g] = 0.

†See Problem 32.3.
The uniqueness of the density up to sets of μ-measure 0 is settled by Theorem 16.10. It is only the existence that must be proved. The density f is integrable μ if and only if ν is finite. But since f is integrable μ over A if ν(A) < ∞, and since ν is assumed σ-finite, f < ∞ except on a set of μ-measure 0; and f can be taken finite everywhere. By Theorem 16.11, integrals with respect to ν can be calculated by the formula

(32.5)    ∫_A h dν = ∫_A h f dμ.
The density whose existence is to be proved is called the Radon-Nikodym derivative of ν with respect to μ and is often denoted dν/dμ. The term derivative is appropriate because of Theorems 31.3 and 31.8: For an absolutely continuous distribution function F on the line, the corresponding measure μ has with respect to Lebesgue measure the Radon-Nikodym derivative F′. Note that (32.5) can be written

(32.6)    ∫_A h dν = ∫_A h (dν/dμ) dμ.
Suppose that Theorem 32.2 holds for finite μ and ν (which is in fact enough for the probabilistic applications in the sections that follow). In the σ-finite case there is a countable decomposition of Ω into ℱ-sets A_n for which μ(A_n) and ν(A_n) are both finite. If

(32.7)    μ_n(A) = μ(A ∩ A_n),    ν_n(A) = ν(A ∩ A_n),

then ν ≪ μ implies ν_n ≪ μ_n, and so ν_n(A) = ∫_A f_n dμ_n for some density f_n. Since μ_n has density I_{A_n} with respect to μ (Example 16.9),

ν_n(A) = ∫_A f_n I_{A_n} dμ.

Thus Σ_n f_n I_{A_n} is the density sought. It is therefore enough to treat finite μ and ν. This requires a preliminary result.
Lemma 2. If μ and ν are finite measures and are not mutually singular, then there exist a set A and a positive ε such that μ(A) > 0 and εμ(E) ≤ ν(E) for all E ⊂ A.
PROOF. Let A_n⁺ ∪ A_n⁻ be a Hahn decomposition for the set function ν − n^{-1}μ; put M = ⋃_n A_n⁺, so that M^c = ⋂_n A_n⁻. Since M^c is in the negative set A_n⁻ for ν − n^{-1}μ, it follows that ν(M^c) ≤ n^{-1}μ(M^c); since this holds for all n, ν(M^c) = 0. Thus M supports ν, and from the fact that μ and ν are not mutually singular it follows that μ(M) > 0 and hence that μ(A_n⁺) > 0 for some n. Take A = A_n⁺ and ε = n^{-1}. ∎

Example 32.2. Suppose that (Ω, ℱ) = (R¹, ℛ¹), μ is Lebesgue measure λ, and ν(a, b] = F(b) − F(a). If ν and λ do not have disjoint supports, then by Theorem 31.5, λ[x: F′(x) > 0] > 0, and hence for some ε, A = [x: F′(x) > ε] satisfies λ(A) > 0. If E = (a, b] is a sufficiently small interval about an x in A, then ν(E)/λ(E) = (F(b) − F(a))/(b − a) > ε, which is the same thing as

ελ(E) < ν(E). ∎
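Example 32.2's quotients ν(E)/λ(E) can be watched converging to the density. A sketch with an arbitrarily chosen F (F(x) = x³ on [0, 1], so that the Radon-Nikodym derivative is f(x) = 3x²; the point x = 0.6 and the shrinking half-widths are also our choices):

```python
def F(x):
    """Distribution function on [0, 1] with density f(x) = 3x^2 (arbitrary choice)."""
    return x ** 3

def ratio(x, h):
    """nu(E)/lambda(E) for the small interval E = (x - h, x + h]."""
    return (F(x + h) - F(x - h)) / (2 * h)

x0 = 0.6
for h in (0.1, 0.01, 0.001):
    print(h, ratio(x0, h))   # tends to the density f(0.6) = 3 * 0.36 = 1.08
```

Here the quotient works out algebraically to 3x² + h², so the error is exactly h²; in general the martingale results cited above make such limits precise.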
Thus Lemma 2 ties in with derivatives and quotients ν(E)/μ(E) for "small" sets E. Martingale theory links Radon-Nikodym derivatives with such quotients; see Theorem 35.7 and Example 35.10.

PROOF OF THEOREM 32.2. Suppose that μ and ν are finite measures satisfying ν ≪ μ. Let 𝒢 be the class of nonnegative functions g such that ∫_E g dμ ≤ ν(E) for all E. If g and g′ lie in 𝒢, then max(g, g′) also lies in 𝒢 because

∫_E max(g, g′) dμ = ∫_{E∩[g≥g′]} g dμ + ∫_{E∩[g′>g]} g′ dμ ≤ ν(E ∩ [g ≥ g′]) + ν(E ∩ [g′ > g]) = ν(E).

Thus 𝒢 is closed under the formation of finite maxima. Suppose that functions g_n lie in 𝒢 and g_n ↑ g. Then ∫_E g dμ = lim_n ∫_E g_n dμ ≤ ν(E) by the monotone convergence theorem, so that g lies in 𝒢. Thus 𝒢 is closed under nondecreasing passages to the limit.

Let α = sup ∫ g dμ for g ranging over 𝒢 (α ≤ ν(Ω)). Choose g_n in 𝒢 so that ∫ g_n dμ > α − n^{-1}. If f_n = max(g_1, ..., g_n) and f = lim f_n, then f lies in 𝒢 and ∫ f dμ = lim_n ∫ f_n dμ ≥ lim_n ∫ g_n dμ = α. Thus f is an element of 𝒢 for which ∫ f dμ is maximal. Define ν_ac by ν_ac(E) = ∫_E f dμ and ν_s by ν_s(E) = ν(E) − ν_ac(E). Thus

(32.8)    ν(E) = ∫_E f dμ + ν_s(E).
If μ and ν_s are not mutually singular, then by Lemma 2 there exist a set A and a positive ε such that μ(A) > 0 and εμ(E) ≤ ν_s(E) for all E ⊂ A. Then for every E

∫_E (f + εI_A) dμ = ∫_{E−A} f dμ + ∫_{E∩A} f dμ + εμ(E ∩ A) ≤ ∫_{E−A} f dμ + ν_ac(E ∩ A) + ν_s(E ∩ A) = ∫_{E−A} f dμ + ν(E ∩ A) ≤ ν(E − A) + ν(E ∩ A) = ν(E).

In other words, f + εI_A lies in 𝒢; since ∫(f + εI_A) dμ = α + εμ(A) > α, this contradicts the maximality of f. Therefore, μ and ν_s are mutually singular, and there exists an S such that ν_s(S) = μ(S^c) = 0. But since ν ≪ μ, ν_s(S^c) ≤ ν(S^c) = 0, and so ν_s(Ω) = 0. The rightmost term in (32.8) thus drops out. ∎

Absolute continuity was not used until the last step of the proof, and what the argument shows is that ν always has a decomposition (32.8) into an absolutely continuous part and a singular part with respect to μ. This is the Lebesgue decomposition, and it generalizes the one in the preceding section (see (31.31)).

PROBLEMS
32.1. There are two ways to show that the convergence in (32.1) must be absolute: (a) Use the Jordan decomposition. (b) Use the fact that a series converges absolutely if it has the same sum no matter what order the terms are taken in.

32.2. If A⁺ ∪ A⁻ is a Hahn decomposition of φ, there may be other ones. Construct an example of this. Show that there is uniqueness to the extent that φ(A⁺ △ A_1⁺) = φ(A⁻ △ A_1⁻) = 0.

32.3. Show that absolute continuity does not imply the ε-δ condition (32.4) if ν is infinite. Hint: Let the space consist of the integers, let ν be counting measure, and let μ have mass n^{-2} at n. Note that μ is finite and ν is σ-finite.

32.4. Show that the Radon-Nikodym theorem fails if μ is not σ-finite, even if ν is finite. Hint: Let ℱ consist of the countable and the cocountable sets in an uncountable Ω, let μ be counting measure, and let ν(A) be 0 or 1 as A is countable or cocountable.
32.5. Let μ be the restriction to the σ-field of vertical strips [A × R¹: A ∈ ℛ¹] of planar Lebesgue measure λ_2. Define ν on this σ-field by ν(A × R¹) = λ_2(A × (0, 1)). Show that ν is absolutely continuous with respect to μ but has no density. Why does this not contradict the Radon-Nikodym theorem?

32.6. Let μ, ν, and ρ be σ-finite measures on (Ω, ℱ). Assume the Radon-Nikodym derivatives here are everywhere nonnegative and finite.
(a) Show that ν ≪ μ and μ ≪ ρ imply that ν ≪ ρ and

dν/dρ = (dν/dμ)(dμ/dρ).

(b) Show that ν ≡ μ implies

dν/dμ = I_{[dμ/dν > 0]} (dμ/dν)^{−1}.

(c) Suppose that μ ≪ ρ and ν ≪ ρ, and let A be the set where dν/dρ > 0 = dμ/dρ. Show that ν ≪ μ if and only if ρ(A) = 0, in which case

dν/dμ = I_{[dμ/dρ > 0]} (dν/dρ)/(dμ/dρ).
32.7. Show that there is a Lebesgue decomposition (32.8) in the σ-finite as well as the finite case. Prove that it is unique.

32.8. The Radon-Nikodym theorem holds if μ is σ-finite, even if ν is not. Assume at first that μ is finite (and ν ≪ μ).
(a) Let 𝒜 be the class of ℱ-sets B such that μ(E) = 0 or ν(E) = ∞ for each E ⊂ B. Show that 𝒜 contains a set B_0 of maximal μ-measure.
(b) Let 𝒞 be the class of sets in Ω_0 = Ω − B_0 that are countable unions of sets of finite ν-measure. Show that 𝒞 contains a set C_0 of maximal μ-measure.
(c) Let D_0 = Ω_0 − C_0. Deduce from the maximality of B_0 and C_0 that μ(D_0) = ν(D_0) = 0.
(d) Let ν_0(A) = ν(A ∩ C_0). Using the Radon-Nikodym theorem for the pair μ, ν_0, prove it for μ, ν.
(e) Now show that the theorem holds if μ is merely σ-finite.
(f) Show that if the density can be taken everywhere finite, then ν is σ-finite.

32.9. Let μ and ν be finite measures on (Ω, ℱ), and suppose that ℱ_0 is a σ-field contained in ℱ. Then the restrictions μ_0 and ν_0 of μ and ν to ℱ_0 are measures on (Ω, ℱ_0). Let ν_ac, ν_s, ν⁰_ac, ν⁰_s be, respectively, the absolutely continuous and singular parts of ν and ν_0 with respect to μ and μ_0. Show that ν⁰_ac(E) ≥ ν_ac(E) and ν⁰_s(E) ≤ ν_s(E) for E ∈ ℱ_0.

32.10. Suppose that μ, ν, ν_n are finite measures on (Ω, ℱ) and that ν(A) = Σ_n ν_n(A) for all A. Let ν_n(A) = ∫_A f_n dμ + ν′_n(A) and ν(A) = ∫_A f dμ + ν′(A) be the decompositions (32.8); here ν′ and ν′_n are singular with respect to μ.
(a) Show that f = Σ_n f_n except on a set of μ-measure 0 and that ν′(A) = Σ_n ν′_n(A) for all A.
(b) Show that ν ≪ μ if and only if ν_n ≪ μ for all n.
32.11. Suppose that φ(A) = ∫_A f dμ for an integrable f. Show that A⁺ = [ω: f(ω) ≥ 0] and A⁻ = [ω: f(ω) < 0] give a Hahn decomposition for φ, and that the three variations satisfy φ⁺(A) = ∫_A f⁺ dμ, φ⁻(A) = ∫_A f⁻ dμ, and |φ|(A) = ∫_A |f| dμ. Show, conversely, that if φ(A) = 0 whenever μ(A) = 0, then φ has such a representation. Hint: To construct f, start with (32.2).

32.12. A signed measure is a set function that satisfies (32.1) if A_1, A_2, ... are disjoint and may assume one of the values +∞ and −∞ but not both. Extend the Hahn and Jordan decompositions to signed measures.

32.13. Suppose that μ and ν are a probability measure and a σ-finite measure on the line and that ν ≪ μ. Show that the Radon-Nikodym derivative f satisfies

lim_{h↓0} ν(x − h, x + h]/μ(x − h, x + h] = f(x)

on a set of μ-measure 1.

32.14. Find in the unit interval uncountably many probability measures μ_p with supports S_p such that μ_p{x} = 0 for each x and the S_p are disjoint in pairs.

32.15. Let ℱ_0 be the field consisting of the finite and the cofinite sets in an uncountable Ω, and define φ on ℱ_0 by taking φ(A) to be the number of points in A if A is finite and the negative of the number of points in A^c if A is cofinite. Show that (32.1) holds (this is not true if Ω is countable). Show that there are no negative sets for φ (except the empty set), that there is no Hahn decomposition, and that φ does not have bounded range.
SECTION 33. CONDITIONAL PROBABILITY

The concepts of conditional probability and expected value with respect to a σ-field underlie much of modern probability theory. The difficulty in understanding these ideas has to do not with mathematical detail so much as with probabilistic meaning, and the way to get at this meaning is through calculations and examples, of which there are many in this section and the next.

The Discrete Case

Consider first the conditional probability of a set A with respect to another set B. It is defined of course by P(A|B) = P(A ∩ B)/P(B), unless P(B) vanishes, in which case it is not defined at all.

It is helpful to consider conditional probability in terms of an observer in possession of partial information.† A probability space (Ω, ℱ, P) describes

†As always, observer, information, know, and so on are informal, nonmathematical terms; see the related discussion in Section 4 (p. 57).
the working of a mechanism, governed by chance, which produces a result ω distributed according to P; P(A) is for the observer the probability that the point ω produced lies in A. Suppose now that ω lies in B and that the observer learns this fact and no more. From the point of view of the observer, now in possession of this partial information about ω, the probability that ω also lies in A is P(A|B) rather than P(A). This is the idea lying back of the definition. If, on the other hand, ω happens to lie in B^c and the observer learns of this, his probability instead becomes P(A|B^c). These two conditional probabilities can be linked together by the simple function

(33.1)    f(ω) = P(A|B) if ω ∈ B,    f(ω) = P(A|B^c) if ω ∈ B^c.

The observer learns whether ω lies in B or in B^c; his new probability for the event ω ∈ A is then just f(ω). Although the observer does not in general know the argument ω of f, he can calculate the value f(ω) because he knows which of B and B^c contains ω. (Note conversely that from the value f(ω) it is possible to determine whether ω lies in B or in B^c, unless P(A|B) = P(A|B^c), that is, unless A and B are independent, in which case the conditional probability coincides with the unconditional one anyway.)

The sets B and B^c partition Ω, and these ideas carry over to the general partition. Let B_1, B_2, ... be a finite or countable partition of Ω into ℱ-sets, and let 𝒢 consist of all the unions of the B_i. Then 𝒢 is the σ-field generated by the B_i. For A in ℱ, consider the function with values
(33.2)    f(ω) = P(A|B_i) = P(A ∩ B_i)/P(B_i)    if ω ∈ B_i, i = 1, 2, ....

If the observer learns which element B_i of the partition it is that contains ω, then his new probability for the event ω ∈ A is f(ω). The partition {B_i}, or equivalently the σ-field 𝒢, can be regarded as an experiment, and to learn which B_i it is that contains ω is to learn the outcome of the experiment. For this reason the function or random variable f defined by (33.2) is called the conditional probability of A given 𝒢 and is denoted P[A‖𝒢]. This is written P[A‖𝒢]_ω whenever the argument ω needs to be explicitly shown.

Thus P[A‖𝒢] is the function whose value on B_i is the ordinary conditional probability P(A|B_i). This definition needs to be completed, because P(A|B_i) is not defined if P(B_i) = 0. In this case P[A‖𝒢] will be taken to have any constant value on B_i; the value is arbitrary but must be the same over all of the set B_i. If there are nonempty sets B_i for which P(B_i) = 0, P[A‖𝒢] therefore stands for any one of a family of functions on Ω. A specific such function is for emphasis often called a version of the conditional
probability. Note that any two versions are equal except on a set of probability 0.

Example 33.1. Consider the Poisson process. Suppose that 0 < s < t, and let A = [N_s = 0] and B_i = [N_t = i], i = 0, 1, .... Since the increments are independent (Section 23), P(A|B_i) = P[N_s = 0]P[N_t − N_s = i]/P[N_t = i], and since they have Poisson distributions (see (23.9)), a simple calculation reduces this to

(33.3)    P(A|B_i) = (1 − s/t)^i,    i = 0, 1, 2, ....

Since i = N_t(ω) on B_i, this can be written

(33.4)    P[N_s = 0‖𝒢]_ω = (1 − s/t)^{N_t(ω)}.
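The simple calculation behind (33.3) can be checked directly from the Poisson probabilities. This is a sketch; the rate α = 2 and the times s = 1, t = 3 are arbitrary choices of ours.

```python
from math import exp, factorial

def poisson_pmf(lam, i):
    return exp(-lam) * lam ** i / factorial(i)

alpha, s, t = 2.0, 1.0, 3.0   # rate and times; arbitrary choices

def cond_prob(i):
    # P(A | B_i) = P[N_s = 0] P[N_t - N_s = i] / P[N_t = i], by independent increments
    return (poisson_pmf(alpha * s, 0) * poisson_pmf(alpha * (t - s), i)
            / poisson_pmf(alpha * t, i))

for i in range(5):
    print(i, cond_prob(i), (1 - s / t) ** i)   # the two columns agree
```

The rate α cancels entirely, as (33.3) says it must: given N_t = i, the i call times behave like i independent uniform points of [0, t].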
Here the experiment or observation corresponding to {B_i} or 𝒢 determines the number of events (telephone calls, say) occurring in the time interval [0, t]. For an observer who knows this number but not the locations of the calls within [0, t], (33.4) gives his probability for the event that none of them occurred before time s. Although this observer does not know ω, he knows N_t(ω), which is all he needs to calculate the right side of (33.4). ∎

Example 33.2. Suppose that X_0, X_1, ... is a Markov chain with state space S as in Section 8. The events

(33.5)    [X_0 = i_0, ..., X_n = i_n]

form a finite or countable partition of Ω as i_0, ..., i_n range over S. If 𝒢_n is the σ-field generated by this partition, then by the defining condition (8.2) for Markov chains, P[X_{n+1} = j‖𝒢_n]_ω = p_{i_n j} holds for ω in (33.5). The sets

(33.6)    [X_n = i],    i ∈ S,

also partition Ω, and they generate a σ-field 𝒢_n⁰ smaller than 𝒢_n. Now (8.2) also stipulates P[X_{n+1} = j‖𝒢_n⁰]_ω = p_{ij} for ω in (33.6), and the essence of the Markov property is that

(33.7)    P[X_{n+1} = j‖𝒢_n] = P[X_{n+1} = j‖𝒢_n⁰]. ∎

The General Case
· · · ,
as to know which sets in 𝒢 contain ω and which do not. This second way of looking at the matter carries over to the general σ-field 𝒢 contained in ℱ. (As always, the probability space is (Ω, ℱ, P).) The σ-field 𝒢 will not in general come from a partition as above. One can imagine an observer who knows for each G in 𝒢 whether ω ∈ G or ω ∉ G. Thus the σ-field 𝒢 can in principle be identified with an experiment or observation. This is the point of view adopted in Section 4; see p. 57. It is natural to try and define conditional probabilities P[A‖𝒢] with respect to the experiment 𝒢.

To do this, fix an A in ℱ and define a finite measure ν on 𝒢 by

ν(G) = P(A ∩ G),    G ∈ 𝒢.

Then P(G) = 0 implies that ν(G) = 0. The Radon-Nikodym theorem can be applied to the measures ν and P on the measurable space (Ω, 𝒢) because the first one is absolutely continuous with respect to the second.† It follows that there exists a function or random variable f, measurable 𝒢 and integrable with respect to P, such that P(A ∩ G) = ν(G) = ∫_G f dP for all G in 𝒢. Denote this function f by P[A‖𝒢]. It is a random variable with two properties:

(i) P[A‖𝒢] is measurable 𝒢 and integrable.
(ii) P[A‖𝒢] satisfies the functional equation
j P[ A II.:# ] dP = P( A n G) ,
(33.8)
G
G E .§.
There will in general be many such random variables P[A‖𝒢], but any two of them are equal with probability 1. A specific such random variable is called a version of the conditional probability.

If 𝒢 is generated by a partition B_1, B_2, ..., the function f defined by (33.2) is measurable 𝒢 because [ω: f(ω) ∈ H] is the union of those B_i over which the constant value of f lies in H. Any G in 𝒢 is a disjoint union G = ⋃_k B_{i_k}, and P(A ∩ G) = Σ_k P(A|B_{i_k})P(B_{i_k}), so that (33.2) satisfies (33.8) as well. Thus the general definition is an extension of the one for the discrete case.

Condition (i) in the definition above in effect requires that the values of P[A‖𝒢] depend only on the sets in 𝒢. An observer who knows the outcome of 𝒢 viewed as an experiment knows for each G in 𝒢 whether it contains ω or not; for each x he knows this in particular for the set [ω′: P[A‖𝒢]_{ω′} = x],
† Let P_0 be the restriction of P to 𝒢 (Example 10.4), and find on (Ω, 𝒢) a density f for ν with respect to P_0. Then, for G ∈ 𝒢, ν(G) = ∫_G f dP_0 = ∫_G f dP (Example 16.4). If g is another such density, then P[f ≠ g] = P_0[f ≠ g] = 0.
and hence he knows in principle the functional value P[A‖𝒢]_ω even if he does not know ω itself. In Example 33.1 a knowledge of N_t(ω) suffices to determine the value of (33.4); ω itself is not needed.

Condition (ii) in the definition has a gambling interpretation. Suppose that the observer, after he has learned the outcome of 𝒢, is offered the opportunity to bet on the event A (unless A lies in 𝒢, he does not yet know whether or not it occurred). He is required to pay an entry fee of P[A‖𝒢] units and will win 1 unit if A occurs and nothing otherwise. If the observer decides to bet and pays his fee, he gains 1 − P[A‖𝒢] if A occurs and −P[A‖𝒢] otherwise, so that his gain is

(1 − P[A‖𝒢])I_A + (−P[A‖𝒢])I_{Aᶜ} = I_A − P[A‖𝒢].

If he declines to bet, his gain is of course 0. Suppose that he adopts the strategy of betting if G occurs but not otherwise, where G is some set in 𝒢. He can actually carry out this strategy, since after learning the outcome of the experiment 𝒢 he knows whether or not G occurred. His expected gain with this strategy is his gain integrated over G:

∫_G (I_A − P[A‖𝒢]) dP.

But (33.8) is exactly the requirement that this vanish for each G in 𝒢. Condition (ii) requires then that each strategy be fair in the sense that the observer stands neither to win nor to lose on the average. Thus P[A‖𝒢] is the just entry fee, as intuition requires.
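The fairness requirement can be checked exactly in a small discrete model. The sample space, event, and partition below are hypothetical choices made only for illustration; P[A‖𝒢] is computed blockwise as P(A ∩ B_i)/P(B_i), the functional equation (33.8) is verified for every set in the generated σ-field, and the expected gain of each strategy "bet when G occurs" comes out to 0. A minimal sketch in Python:

```python
import itertools
from fractions import Fraction

# Hypothetical finite model: uniform P on six points, an event A, and a
# two-block partition generating the sigma-field G.
omega = set(range(6))
prob = {w: Fraction(1, 6) for w in omega}
A = {0, 1, 2}
partition = [frozenset({0, 1}), frozenset({2, 3, 4, 5})]

def P(S):
    return sum(prob[w] for w in S)

# A version of P[A || G]: on each block B, the constant P(A ∩ B)/P(B).
cond = {}
for B in partition:
    for w in B:
        cond[w] = P(A & B) / P(B)

# (33.8): integrating P[A || G] over any G in G recovers P(A ∩ G).
checks = []
for r in range(len(partition) + 1):
    for blocks in itertools.combinations(partition, r):
        G = frozenset().union(*blocks)
        checks.append(sum(cond[w] * prob[w] for w in G) == P(A & G))

# Fair-bet interpretation: the gain I_A - P[A || G] integrates to 0 over each G.
gains = [sum(((1 if w in A else 0) - cond[w]) * prob[w] for w in B)
         for B in partition]
print(all(checks), gains)
```

Exact rational arithmetic makes the fairness identity hold with no rounding error.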
Example 33.3. Suppose that A ∈ 𝒢, which will always hold if 𝒢 coincides with the whole σ-field ℱ. Then I_A satisfies conditions (i) and (ii), so that P[A‖𝒢] = I_A with probability 1. If A ∈ 𝒢, then to know the outcome of 𝒢 viewed as an experiment is in particular to know whether or not A has occurred. •
Example 33.4. If 𝒢 is {∅, Ω}, the smallest possible σ-field, every function measurable 𝒢 must be constant. Therefore, P[A‖𝒢]_ω = P(A) for all ω in this case. The observer learns nothing from the experiment 𝒢. •

According to these two examples, P[A‖{∅, Ω}] is identically P(A), whereas I_A is a version of P[A‖ℱ]. For any 𝒢, the function identically equal to P(A) satisfies condition (i) in the definition of conditional probability, whereas I_A satisfies condition (ii). Condition (i) becomes more stringent as 𝒢 decreases, and condition (ii) becomes more stringent as 𝒢 increases. The two conditions work in opposite directions and between them delimit the class of versions of P[A‖𝒢].
Example 33.5. Let Ω be the plane R² and let ℱ be the class ℛ² of planar Borel sets. A point of Ω is a pair (x, y) of reals. Let 𝒢 be the σ-field consisting of the vertical strips, the product sets E × R¹ = [(x, y): x ∈ E], where E is a linear Borel set. If the observer knows for each strip E × R¹ whether or not it contains (x, y), then, as he knows this for each one-point set E, he knows the value of x. Thus the experiment 𝒢 consists in the determination of the first coordinate of the sample point.

Suppose now that P is a probability measure on ℛ² having a density f(x, y) with respect to planar Lebesgue measure: P(A) = ∫∫_A f(x, y) dx dy. Let A be a horizontal strip R¹ × F = [(x, y): y ∈ F], F being a linear Borel set. The conditional probability P[A‖𝒢] can be calculated explicitly. Put

(33.9)  φ(x, y) = ∫_F f(x, t) dt / ∫_{R¹} f(x, t) dt.

Set φ(x, y) = 0, say, at points where the denominator here vanishes; these points form a set of P-measure 0. Since φ(x, y) is a function of x alone, it is measurable 𝒢. The general element of 𝒢 being E × R¹, it will follow that φ is a version of P[A‖𝒢] if it is shown that

(33.10)  ∫_{E×R¹} φ(x, y) dP(x, y) = P(A ∩ (E × R¹)).

Since A = R¹ × F, the right side here is P(E × F). Since P has density f, Theorem 16.11 and Fubini's theorem reduce the left side to

∫_E [∫_{R¹} φ(x, y) f(x, y) dy] dx = ∫_E [∫_F f(x, t) dt] dx = ∫∫_{E×F} f(x, y) dx dy = P(E × F).

Thus (33.9) does give a version of P[R¹ × F‖𝒢]. •

The right side of (33.9) is the classical formula for the conditional probability of the event R¹ × F (the event that y ∈ F) given the event {x} × R¹ (given the value of x). Since the event {x} × R¹ has probability 0, the formula P(A|B) = P(A ∩ B)/P(B) does not work here. The whole point of this section is the systematic development of a notion of conditional probability that covers conditioning with respect to events of probability 0. This is accomplished by conditioning with respect to collections of events, that is, with respect to σ-fields 𝒢.
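The computation in (33.10) can be checked numerically. The density f(x, y) = x + y on the unit square (which integrates to 1) and the strips E and F below are hypothetical choices for illustration; the left and right sides of (33.10) are approximated by midpoint Riemann sums and agree, both near 1/8. A sketch:

```python
# Numerical check of (33.9)-(33.10) for the hypothetical density
# f(x, y) = x + y on the unit square.  A = R x F is a horizontal strip,
# G = E x R a vertical strip, and phi depends on x alone.

def f(x, y):
    return x + y

N = 200
h = 1.0 / N
grid = [(i + 0.5) * h for i in range(N)]   # midpoint rule on [0, 1]
E = [x for x in grid if x <= 0.5]
F = [y for y in grid if y <= 0.5]

def phi(x):
    # (33.9): conditional probability of [y in F] given the value of x
    return sum(f(x, t) for t in F) / sum(f(x, t) for t in grid)

phi_at = {x: phi(x) for x in grid}

# Left side of (33.10): integrate phi over E x R against dP = f(x, y) dx dy.
lhs = sum(phi_at[x] * f(x, y) * h * h for x in E for y in grid)

# Right side: P(A ∩ (E x R)) = P(E x F), which is 1/8 for this density.
rhs = sum(f(x, y) * h * h for x in E for y in F)

print(lhs, rhs)
```

The agreement of the two sums is exactly the content of the functional equation for this 𝒢.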
Example 33.6. The set A is by definition independent of the σ-field 𝒢 if it is independent of each G in 𝒢: P(A ∩ G) = P(A)P(G). This being the same thing as P(A ∩ G) = ∫_G P(A) dP, A is independent of 𝒢 if and only if P[A‖𝒢] = P(A) with probability 1. •
The σ-field σ(X) generated by a random variable X consists of the sets [ω: X(ω) ∈ H] for H ∈ ℛ¹; see Theorem 20.1. The conditional probability of A given X is defined as P[A‖σ(X)] and is denoted P[A‖X]. Thus P[A‖X] = P[A‖σ(X)] by definition. From the experiment corresponding to the σ-field σ(X), one learns which of the sets [ω′: X(ω′) = x] contains ω and hence learns the value X(ω). Example 33.5 is a case of this: take X(x, y) = x for (x, y) in the sample space Ω = R² there.

This definition applies without change to a random vector, or, equivalently, to a finite set of random variables. It can be adapted to arbitrary sets of random variables as well. For any such set [X_t, t ∈ T], the σ-field σ[X_t, t ∈ T] it generates is the smallest σ-field with respect to which each X_t is measurable. It is generated by the collection of sets of the form [ω: X_t(ω) ∈ H] for t in T and H in ℛ¹. The conditional probability P[A‖X_t, t ∈ T] of A with respect to this set of random variables is by definition the conditional probability P[A‖σ[X_t, t ∈ T]] of A with respect to the σ-field σ[X_t, t ∈ T]. In this notation the property (33.7) of Markov chains becomes

(33.11)  P[X_{n+1} = j‖X_0, ..., X_n] = P[X_{n+1} = j‖X_n].

The conditional probability of [X_{n+1} = j] is the same for someone who knows the present state X_n as for someone who knows the present state X_n and the past states X_0, ..., X_{n−1} as well. •
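The equality (33.11) can be observed empirically. Below, a two-state chain with a hypothetical transition matrix is simulated; the relative frequency of X_{n+1} = 1 among steps with X_n = 1 is about p_{11} = 0.6 regardless of the value of X_{n−1}, which is the content of the Markov property. A sketch in Python:

```python
import random
from collections import Counter

random.seed(1)

# Hypothetical transition probabilities p_{ij} for a two-state chain.
p = {0: [0.7, 0.3], 1: [0.4, 0.6]}

path = [0]
for _ in range(200_000):
    path.append(random.choices([0, 1], weights=p[path[-1]])[0])

counts = Counter()
for a, b, c in zip(path, path[1:], path[2:]):
    counts[(a, b, c)] += 1      # observed (X_{n-1}, X_n, X_{n+1}) triples

# Conditional frequency of X_{n+1} = 1 given X_n = 1 and X_{n-1} = a:
# by (33.11) it should be near p[1][1] = 0.6 for both values of a.
freqs = {a: counts[(a, 1, 1)] / (counts[(a, 1, 0)] + counts[(a, 1, 1)])
         for a in (0, 1)}
print(freqs)
```

Conditioning on the extra past state leaves the empirical frequency essentially unchanged.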
Example 33.7. Let X and Y be random vectors of dimensions j and k, let μ be the distribution of X over R^j, and suppose that X and Y are independent. According to (20.30),

P([(X, Y) ∈ J] ∩ [X ∈ H]) = ∫_H P[(x, Y) ∈ J] μ(dx)

for H ∈ ℛ^j and J ∈ ℛ^{j+k}. This is a consequence of Fubini's theorem; it has a conditional-probability interpretation. For each x in R^j put

(33.12)  f(x) = P[(x, Y) ∈ J] = P[ω′: (x, Y(ω′)) ∈ J].

By Theorem 20.1(ii), f(X(ω)) is measurable σ(X), and since μ is the distribution of X, a change of variable gives

∫_{[X ∈ H]} f(X(ω)) P(dω) = ∫_H f(x) μ(dx) = P([(X, Y) ∈ J] ∩ [X ∈ H]).
Since [X ∈ H] is the general element of σ(X), this proves that

(33.13)  P[(X, Y) ∈ J‖X]_ω = f(X(ω))

with probability 1. •

The fact just proved can be written

P[(X, Y) ∈ J‖X]_ω = P[(X(ω), Y) ∈ J] = P[ω′: (X(ω), Y(ω′)) ∈ J].

Replacing ω′ by ω on the right here causes a notational collision like the one replacing y by x causes in ∫_F f(x, y) dy.

Suppose that X and Y are independent random variables and that Y has distribution function F. For J = [(u, v): max{u, v} ≤ m], (33.12) is 0 for m < x and F(m) for m ≥ x; if M = max{X, Y}, then (33.13) gives

(33.14)  P[M ≤ m‖X]_ω = I_{[X ≤ m]}(ω) F(m)

with probability 1. All equations involving conditional probabilities must be qualified in this way by the phrase "with probability 1," because the conditional probability is unique only to within a set of probability 0.

The following theorem is useful for checking conditional probabilities.
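A Monte Carlo check of (33.14). Taking X and Y standard normal is an assumption made here only for concreteness, so that F is the normal distribution function; the proposed version I_{[X ≤ m]}F(m) integrates over a generating set G = [X ≤ a] of σ(X) to the same value as P([M ≤ m] ∩ G), as (33.8) requires. A sketch:

```python
import math
import random

random.seed(7)

def F(t):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

m, a = 0.3, -0.2        # hypothetical levels; G = [X <= a] lies in sigma(X)
n = 200_000
lhs = rhs = 0.0
for _ in range(n):
    x, y = random.gauss(0, 1), random.gauss(0, 1)
    if x <= a:
        lhs += 1.0 if max(x, y) <= m else 0.0   # contributes to P([M <= m] ∩ G)
        rhs += F(m) if x <= m else 0.0          # proposed version (33.14)
lhs /= n
rhs /= n
print(lhs, rhs)
```

Any other generating set [X ≤ a] could be used in place of the one chosen here.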
Theorem 33.1. Let 𝒫 be a π-system generating the σ-field 𝒢, and suppose that Ω is a finite or countable union of sets in 𝒫. An integrable function f is a version of P[A‖𝒢] if it is measurable 𝒢 and if

(33.15)  ∫_G f dP = P(A ∩ G)

holds for all G in 𝒫.

PROOF. Apply Theorem 10.4. •

The condition that Ω is a finite or countable union of 𝒫-sets cannot be suppressed; see Example 10.5.
Example 33.8. Suppose that X and Y are independent random variables with a common distribution function F that is positive and continuous. What is the conditional probability of [X ≤ x] given the random variable M = max{X, Y}? Clearly it should be 1 if M ≤ x. Suppose that M > x. Since X ≤ x requires M = Y, the chance of which is ½ by symmetry, the conditional probability of [X ≤ x] should by independence be ½P[X ≤ x | X ≤ m] = ½F(x)/F(m), with the random variable M substituted for m. Intuition thus gives

(33.16)  P[X ≤ x‖M]_ω = I_{[M ≤ x]}(ω) + (F(x)/2F(M(ω))) I_{[x < M]}(ω).

It suffices to check (33.15) for sets G = [M ≤ m], because these form a π-system generating σ(M). The functional equation reduces to

P[M ≤ min{x, m}] + ∫_{[x < M ≤ m]} (F(x)/2F(M)) dP = P[M ≤ m, X ≤ x].

Since the other case is easy, suppose that x < m. Since the distribution of (X, Y) is product measure, it follows by Fubini's theorem and the assumed continuity of F that

(33.17)  ∫_{[x < M ≤ m]} (F(x)/2F(M)) dP = ∫_{(x, m]} (F(x)/2F(t)) dF²(t) = F(x)(F(m) − F(x)),

and the two sides of the functional equation agree: F²(x) + F(x)(F(m) − F(x)) = F(x)F(m) = P[M ≤ m, X ≤ x]. •

Similarly, P[∅‖𝒢]_ω = 0, and, if the A_n are disjoint, P[⋃_n A_n‖𝒢]_ω = Σ_n P[A_n‖𝒢]_ω. Therefore, P[A‖𝒢]_ω is a probability measure as A ranges over ℱ. Thus conditional probabilities behave like probabilities at points of positive probability. That they may not do so at points of probability 0 causes no problem because individual such points have no effect on the probabilities of sets. Of course, sets of points individually having probability 0 do have an effect, but here the global point of view reenters.
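The intuitive formula (33.16) can also be tested by simulation. Choosing X and Y standard normal is an assumption for concreteness (any positive continuous F would do); both sides of the functional equation over a generating set G = [M ≤ m] are estimated below and agree. A sketch:

```python
import math
import random

random.seed(11)

def F(t):
    """Standard normal distribution function (positive and continuous)."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

x, m = -0.5, 1.0        # hypothetical levels with x < m
n = 200_000
lhs = rhs = 0.0
for _ in range(n):
    u, v = random.gauss(0, 1), random.gauss(0, 1)
    M = max(u, v)
    if M <= m:                                  # the generating set G = [M <= m]
        lhs += 1.0 if u <= x else 0.0           # contributes to P([X <= x] ∩ G)
        rhs += 1.0 if M <= x else F(x) / (2.0 * F(M))   # version (33.16)
lhs /= n
rhs /= n
print(lhs, rhs)
```

Both estimates approximate F(x)F(m), in agreement with the computation in (33.17).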
Conditional Probability Distributions

Let X be a random variable on (Ω, ℱ, P), and let 𝒢 be a σ-field in ℱ.

Theorem 33.3. There exists a function μ(H, ω), defined for H in ℛ¹ and ω in Ω, with these two properties:

(i) For each ω in Ω, μ(·, ω) is a probability measure on ℛ¹.
(ii) For each H in ℛ¹, μ(H, ·) is a version of P[X ∈ H‖𝒢].

The probability measure μ(·, ω) is a conditional distribution of X given 𝒢. If 𝒢 = σ(Z), it is a conditional distribution of X given Z.

PROOF. For rationals r < s, P[X ≤ r‖𝒢] ≤ P[X ≤ s‖𝒢] with probability 1 by (33.23).
If P(B_i) = 0, the value of E[X‖𝒢] over B_i is constant but arbitrary.
Example 34.2. For an indicator I_A the defining properties of E[I_A‖𝒢] and P[A‖𝒢] coincide; therefore, E[I_A‖𝒢] = P[A‖𝒢] with probability 1. It is easily checked that, more generally, E[X‖𝒢] = Σ_i a_i P[A_i‖𝒢] with probability 1 for a simple function X = Σ_i a_i I_{A_i}. •
In analogy with the case of conditional probability, if [X_t, t ∈ T] is a collection of random variables, E[X‖X_t, t ∈ T] is by definition E[X‖𝒢] with σ[X_t, t ∈ T] in the role of 𝒢.
Example 34.3. Let 𝒥 be the σ-field of sets invariant under a measure-preserving transformation T on (Ω, ℱ, P). For f integrable, the limit f̂ in (24.7) is E[f‖𝒥]: Since f̂ is invariant, it is measurable 𝒥. If G is invariant, then the averages a_n in the proof of the ergodic theorem (p. 318) satisfy E[I_G a_n] = E[I_G f]. But since the a_n converge to f̂ and are uniformly integrable, E[I_G f̂] = E[I_G f]. •
Properties of Conditional Expectation

Theorem 34.1. Let 𝒫 be a π-system generating the σ-field 𝒢, and suppose that Ω is a finite or countable union of sets in 𝒫. An integrable function f is a version of E[X‖𝒢] if it is measurable 𝒢 and if

(34.3)  ∫_G f dP = ∫_G X dP

holds for all G in 𝒫.

PROOF. Apply Theorem 16.10(iii) to f and E[X‖𝒢] on (Ω, 𝒢, P). •

In most applications it is clear that Ω ∈ 𝒫. All the equalities and inequalities in the following theorem hold with probability 1.
Theorem 34.2. Suppose that X, Y, X_n are integrable.

(i) If X = a with probability 1, then E[X‖𝒢] = a.
(ii) For constants a and b, E[aX + bY‖𝒢] = aE[X‖𝒢] + bE[Y‖𝒢].
(iii) If X ≤ Y with probability 1, then E[X‖𝒢] ≤ E[Y‖𝒢].
(iv) |E[X‖𝒢]| ≤ E[|X|‖𝒢].
(v) If lim_n X_n = X with probability 1, |X_n| ≤ Y, and Y is integrable, then lim_n E[X_n‖𝒢] = E[X‖𝒢] with probability 1.

PROOF. If X = a with probability 1, the function identically equal to a satisfies conditions (i) and (ii) in the definition of E[X‖𝒢], and so (i) above follows by uniqueness. As for (ii), aE[X‖𝒢] + bE[Y‖𝒢] is integrable and measurable 𝒢, and

∫_G (aE[X‖𝒢] + bE[Y‖𝒢]) dP = a∫_G E[X‖𝒢] dP + b∫_G E[Y‖𝒢] dP = a∫_G X dP + b∫_G Y dP = ∫_G (aX + bY) dP

for all G in 𝒢, so that this function satisfies the functional equation.

If X ≤ Y with probability 1, then ∫_G (E[Y‖𝒢] − E[X‖𝒢]) dP = ∫_G (Y − X) dP ≥ 0 for all G in 𝒢. Since E[Y‖𝒢] − E[X‖𝒢] is measurable 𝒢, it must be nonnegative with probability 1 (consider the set G where it is negative). This proves (iii), which clearly implies (iv) as well as the fact that E[X‖𝒢] = E[Y‖𝒢] if X = Y with probability 1.

To prove (v), consider Z_n = sup_{k ≥ n} |X_k − X|. Now Z_n ↓ 0 with probability 1, and by (ii), (iii), and (iv), |E[X_n‖𝒢] − E[X‖𝒢]| ≤ E[Z_n‖𝒢]. It suffices, therefore, to show that E[Z_n‖𝒢] ↓ 0 with probability 1. By (iii) the sequence E[Z_n‖𝒢] is nonincreasing and hence has a limit Z; the problem is to prove that Z = 0 with probability 1, or, Z being nonnegative, that E[Z] = 0. But 0 ≤ Z_n ≤ 2Y, and so (34.1) and the dominated convergence theorem give E[Z] = ∫E[Z‖𝒢] dP ≤ ∫E[Z_n‖𝒢] dP = E[Z_n] → 0. •
The properties (33.21) through (33.28) can be derived anew from Theorem 34.2. Part (ii) shows once again that E[Σ_i a_i I_{A_i}‖𝒢] = Σ_i a_i P[A_i‖𝒢] for simple functions. If X is measurable 𝒢, then clearly E[X‖𝒢] = X with probability 1. The following generalization of this is used constantly. For an observer with the information in 𝒢, X is effectively a constant if it is measurable 𝒢:

Theorem 34.3. If X is measurable 𝒢, and if Y and XY are integrable, then

(34.4)  E[XY‖𝒢] = XE[Y‖𝒢]

with probability 1.
PROOF. It will be shown first that the right side of (34.4) is a version of the left side if X = I_{G_0} and G_0 ∈ 𝒢. Since I_{G_0}E[Y‖𝒢] is certainly measurable 𝒢, it suffices to show that it satisfies the functional equation ∫_G I_{G_0}E[Y‖𝒢] dP = ∫_G I_{G_0}Y dP, G ∈ 𝒢. But this reduces to ∫_{G ∩ G_0} E[Y‖𝒢] dP = ∫_{G ∩ G_0} Y dP, which holds by the definition of E[Y‖𝒢]. Thus (34.4) holds if X is the indicator of an element of 𝒢. It follows by Theorem 34.2(ii) that (34.4) holds if X is a simple function measurable 𝒢.

For the general X that is measurable 𝒢, there exist simple functions X_n, measurable 𝒢, such that |X_n| ≤ |X| and lim_n X_n = X (Theorem 13.5). Since |X_n Y| ≤ |XY| and |XY| is integrable, Theorem 34.2(v) implies that lim_n E[X_n Y‖𝒢] = E[XY‖𝒢] with probability 1. But E[X_n Y‖𝒢] = X_n E[Y‖𝒢] by the case already treated, and of course lim_n X_n E[Y‖𝒢] = XE[Y‖𝒢]. (Note that |X_n E[Y‖𝒢]| = |E[X_n Y‖𝒢]| ≤ E[|X_n Y|‖𝒢] ≤ E[|XY|‖𝒢], so that the limit XE[Y‖𝒢] is integrable.) Thus (34.4) holds in general. Notice that X has not been assumed integrable. •

*This topic may be omitted.  †See Problem 34.19.
If X_i is the function on R^k defined by X_i(x) = x_i, then under P_θ, X_1, ..., X_k are independent random variables, each uniformly distributed over [0, θ]. Let T(x) = max_{i ≤ k} X_i(x). If g_θ(t) is θ^{−k} for 0 ≤ t ≤ θ and 0 otherwise, and if h(x) is 1 or 0 according as all the x_i are nonnegative or not, then f_θ(x) = g_θ(T(x))h(x). The factorization criterion is thus satisfied, and T is a sufficient statistic.

Sufficiency is clear on intuitive grounds as well: θ is not involved in the conditional distribution of X_1, ..., X_k given T because, roughly speaking, a random one of them equals T and the others are independent and uniform over [0, T]. If this is true, the distribution of X_i given T ought to have a mass of k^{−1} at T and a uniform distribution of mass 1 − k^{−1} over [0, T], so that

(34.9)  E_θ[X_i‖T] = (1/k)T + (1 − 1/k)(T/2) = ((k+1)/2k)T.

For a proof of this fact, needed later, note that by (21.9)

(34.10)  ∫_{[T ≤ t]} X_i dP_θ = ∫_0^t u (t/θ)^{k−1} (du/θ) = t^{k+1}/2θ^k

if 0 ≤ t ≤ θ. On the other hand, P_θ[T ≤ t] = (t/θ)^k, so that under P_θ the distribution of T has density kt^{k−1}/θ^k over [0, θ]. Thus

(34.11)  ∫_{[T ≤ t]} ((k+1)/2k)T dP_θ = ((k+1)/2k) ∫_0^t u · (ku^{k−1}/θ^k) du = t^{k+1}/2θ^k.

Since (34.10) and (34.11) agree, (34.9) follows by Theorem 34.1. •
The essential ideas in the proof of Theorem 34.6 are most easily understood through a preliminary consideration of special cases.
Lemma 1. Suppose that [P_θ: θ ∈ Θ] is dominated by a probability measure P and that each P_θ has with respect to P a density g_θ that is measurable 𝒢. Then 𝒢 is sufficient, and P[A‖𝒢] is a version of P_θ[A‖𝒢] for each θ in Θ.

PROOF. For G in 𝒢, (34.4) gives

∫_G P[A‖𝒢] dP_θ = ∫_G g_θ P[A‖𝒢] dP = ∫_G E[g_θ I_A‖𝒢] dP = ∫_G g_θ I_A dP = P_θ(A ∩ G).

Therefore, P[A‖𝒢] (the conditional probability calculated with respect to P) does serve as a version of P_θ[A‖𝒢] for each θ in Θ. Thus 𝒢 is sufficient for the family
[P_θ: θ ∈ Θ], even for this family augmented by P (which might happen to lie in the family to start with). •

For the necessity, suppose first that the family is dominated by one of its members.

Lemma 2. Suppose that [P_θ: θ ∈ Θ] is dominated by P_{θ_0} for some θ_0 ∈ Θ. If 𝒢 is sufficient, then each P_θ has with respect to P_{θ_0} a density g_θ that is measurable 𝒢.

PROOF. Let p(A, ω) be the function in the definition of sufficiency, and take P_θ[A‖𝒢]_ω = p(A, ω) for all A ∈ ℱ, ω ∈ Ω, and θ ∈ Θ. Let d_θ be any density of P_θ with respect to P_{θ_0}. By a number of applications of (34.4),

∫_A E_{θ_0}[d_θ‖𝒢] dP_{θ_0} = P_θ(A),  A ∈ ℱ,

the next-to-last equality in the chain holding by sufficiency (the integrand on either side being p(A, ·)). Thus g_θ = E_{θ_0}[d_θ‖𝒢], which is measurable 𝒢, can serve as a density for P_θ with respect to P_{θ_0}. •
To complete the proof of Theorem 34.6 requires one more lemma of a technical sort.
Lemma 3. If [P_θ: θ ∈ Θ] is dominated by a σ-finite measure, then it is equivalent to some finite or countably infinite subfamily.

In many examples, the P_θ are all equivalent to each other, in which case the subfamily can be taken to consist of a single P_{θ_0}.

PROOF. Since μ is σ-finite, there is a finite or countable partition of Ω into ℱ-sets A_n such that 0 < μ(A_n) < ∞. Choose positive constants a_n, one for each A_n, in such a way that Σ_n a_n < ∞. The finite measure with value Σ_n a_n μ(A ∩ A_n)/μ(A_n) at A dominates μ. In proving the lemma it is therefore no restriction to assume the family dominated by a finite measure μ.

Each P_θ is dominated by μ and hence has a density f_θ with respect to it. Let S_θ = [ω: f_θ(ω) > 0]. Then P_θ(A) = P_θ(A ∩ S_θ) for all A, and P_θ(A) = 0 if and only if μ(A ∩ S_θ) = 0. In particular, S_θ supports P_θ.

Call a set B in ℱ a kernel if B ⊂ S_θ for some θ, and call a finite or countable union of kernels a chain. Let α be the supremum of μ(C) over chains C. Since μ is finite and a finite or countable union of chains is a chain, α is finite and μ(C) = α for some chain C. Suppose that C = ⋃_n B_n, where each B_n is a kernel, and suppose that B_n ⊂ S_{θ_n}. The problem is to show that [P_θ: θ ∈ Θ] is dominated by [P_{θ_n}: n = 1, 2, ...] and hence equivalent to it.

Suppose that P_{θ_n}(A) = 0 for all n. Then μ(A ∩ S_{θ_n}) = 0, as observed above. Since C ⊂ ⋃_n S_{θ_n}, μ(A ∩ C) = 0, and it follows that P_θ(A ∩ C) = 0
whatever θ may be. But suppose that P_θ(A − C) > 0. Then P_θ((A − C) ∩ S_θ) = P_θ(A − C) is positive, and so (A − C) ∩ S_θ is a kernel, disjoint from C, of positive μ-measure; this is impossible because of the maximality of C. Thus P_θ(A − C) is 0 along with P_θ(A ∩ C), and so P_θ(A) = 0. •
Suppose that [P_θ: θ ∈ Θ] is dominated by a σ-finite measure μ. By Lemma 3 it is equivalent to a finite or countable subfamily [P_{θ_n}: n = 1, 2, ...]. For positive constants c_n adding to 1, put

(34.12)  P = Σ_n c_n P_{θ_n}.

Then P is equivalent to [P_{θ_n}: n = 1, 2, ...] and hence to [P_θ: θ ∈ Θ], and all three are equivalent. If f_θ is a density for P_θ with respect to μ, then P has with respect to μ the density

(34.13)  h = Σ_n c_n f_{θ_n}.

PROOF OF SUFFICIENCY IN THEOREM 34.6. If each P_θ has density g_θ h with respect to μ, then by the construction (34.12), P has density fh with respect to μ, where f = Σ_n c_n g_{θ_n}. Put r_θ = g_θ/f if f > 0, and r_θ = 0 (say) if f = 0. If each g_θ is measurable 𝒢, the same is true of f and hence of the r_θ. Since P[f = 0] = 0 and P is equivalent to the entire family, P_θ[f = 0] = 0 for all θ. Therefore,

∫_A r_θ dP = ∫_A r_θ fh dμ = ∫_{A ∩ [f > 0]} r_θ fh dμ = ∫_{A ∩ [f > 0]} g_θ h dμ = P_θ(A).

Each P_θ thus has with respect to the probability measure P a density measurable 𝒢, and it follows by Lemma 1 that 𝒢 is sufficient. •

PROOF OF NECESSITY IN THEOREM
34.6. Let p(A, ω) be a function such that, for each A and θ, p(A, ·) is a version of P_θ[A‖𝒢], as required by the definition of sufficiency. For P as in (34.12) and G ∈ 𝒢,

(34.14)  ∫_G p(A, ω) P(dω) = Σ_n c_n ∫_G p(A, ω) P_{θ_n}(dω) = Σ_n c_n ∫_G P_{θ_n}[A‖𝒢] dP_{θ_n} = Σ_n c_n P_{θ_n}(A ∩ G) = P(A ∩ G).

Thus p(A, ·) serves as a version of P[A‖𝒢] as well, and 𝒢 is still sufficient if P is added to the family. Since P dominates the augmented family, Lemma 2 implies that each P_θ has with respect to P a density g_θ that is measurable 𝒢. But if h is the density of P with respect to μ (see (34.13)), then P_θ has density g_θ h with respect to μ. •
A sub-σ-field 𝒢_0 sufficient with respect to [P_θ: θ ∈ Θ] is minimal if, for each sufficient 𝒢, 𝒢_0 is essentially contained in 𝒢 in the sense that for each A in 𝒢_0 there is a B in 𝒢 such that P_θ(A △ B) = 0 for all θ in Θ. A sufficient 𝒢 represents a compression of the information in ℱ, and a minimal sufficient 𝒢_0 represents the greatest possible compression.

Suppose the densities f_θ of the P_θ with respect to μ have the property that f_θ(ω) is measurable 𝒞 × ℱ, where 𝒞 is a σ-field in Θ. Let π be a probability measure on 𝒞, and define P as ∫_Θ P_θ π(dθ), in the sense that P(A) = ∫_Θ ∫_A f_θ(ω) μ(dω) π(dθ) = ∫_Θ P_θ(A) π(dθ). Obviously, P ≪ [P_θ: θ ∈ Θ]. Assume that

(34.15)  [P_θ: θ ∈ Θ] ≪ P.

If π has mass c_n at θ_n, then P is given by (34.12), and of course, (34.15) holds if (34.13) does. Let r_θ be a density for P_θ with respect to P.

Theorem 34.7. If (34.15) holds, then 𝒢_0 = σ[r_θ: θ ∈ Θ] is a minimal sufficient sub-σ-field.

PROOF. That 𝒢_0 is sufficient follows by Theorem 34.6. Suppose that 𝒢 is sufficient. It follows by a simple extension of (34.14) that 𝒢 is still sufficient if P is added to the family, and then it follows by Lemma 2 that each P_θ has with respect to P a density g_θ that is measurable 𝒢. Since densities are essentially unique, P[g_θ = r_θ] = 1.

Let ℋ be the class of A in 𝒢_0 such that P(A △ B) = 0 for some B in 𝒢. Then ℋ is a σ-field containing each set of the form A = [r_θ ∈ H] (take B = [g_θ ∈ H]) and hence containing 𝒢_0. Since, by (34.15), P dominates each P_θ, 𝒢_0 is essentially contained in 𝒢, in the sense of the definition. •
Minimum-Variance Estimation*
To illustrate sufficiency, let g be a real function on Θ, and consider the problem of estimating g(θ). One possibility is that Θ is a subset of the line and g is the identity; another is that Θ is a subset of R^k and g picks out one of the coordinates. (This problem is considered from a slightly different point of view at the end of Section 19.) An estimate of g(θ) is a random variable Z, and the estimate is unbiased if E_θ[Z] = g(θ) for all θ. One measure of the accuracy of the estimate Z is E_θ[(Z − g(θ))²].

If 𝒢 is sufficient, it follows by linearity (Theorem 34.2(ii)) that E_θ[X‖𝒢] has for simple X a version that is independent of θ. Since there are simple X_n such that |X_n| ≤ |X| and X_n → X, the same is true of any X that is integrable with respect to each P_θ (use Theorem 34.2(v)). Suppose that 𝒢 is, in fact, sufficient, and denote by E[X‖𝒢] a version of E_θ[X‖𝒢] that is independent of θ.

Theorem 34.8. Suppose that E_θ[(Z − g(θ))²] < ∞ for all θ and that 𝒢 is sufficient. Then

(34.16)  E_θ[(E[Z‖𝒢] − g(θ))²] ≤ E_θ[(Z − g(θ))²]

for all θ. If Z is unbiased, then so is E[Z‖𝒢].

*This topic may be omitted.
PROOF. By Jensen's inequality (34.7) for φ(x) = (x − g(θ))², (E[Z‖𝒢] − g(θ))² ≤ E_θ[(Z − g(θ))²‖𝒢]. Applying E_θ to each side gives (34.16). The second statement follows from the fact that E_θ[E[Z‖𝒢]] = E_θ[Z]. •

This, the Rao–Blackwell theorem, says that E[Z‖𝒢] is at least as good an estimate as Z if 𝒢 is sufficient.

Example 34.5. Returning to Example 34.4, note that each X_i has mean θ/2 under P_θ, so that if X̄ = k^{−1}Σ_{i=1}^k X_i is the sample mean, then 2X̄ is an unbiased estimate of θ. But there is a better one. By (34.9), E_θ[2X̄‖T] = (k+1)T/k = T′, and by the Rao–Blackwell theorem, T′ is an unbiased estimate with variance at most that of 2X̄.

In fact, for an arbitrary unbiased estimate Z, E_θ[(T′ − θ)²] ≤ E_θ[(Z − θ)²]. To prove this, let δ = T′ − E[Z‖T]. By Theorem 20.1(ii), δ = f(T) for some Borel function f, and E_θ[f(T)] = 0 for all θ. Taking account of the density for T leads to ∫_0^θ f(x)x^{k−1} dx = 0, so that f(x)x^{k−1} integrates to 0 over all intervals. Therefore, f(x) along with f(x)x^{k−1} vanishes for x > 0, except on a set of Lebesgue measure 0, and hence P_θ[f(T) = 0] = 1 and P_θ[T′ = E[Z‖T]] = 1 for all θ. Therefore, E_θ[(T′ − θ)²] = E_θ[(E[Z‖T] − θ)²] ≤ E_θ[(Z − θ)²] for Z unbiased, and T′ has minimum variance among all unbiased estimates of θ. •
PROBLEMS

34.1. Work out for conditional expected values the analogues of Problems 33.4, 33.5, and 33.9.

34.2. In the context of Examples 33.5 and 33.12, show that the conditional expected value of Y (if it is integrable) given X is g(X), where

g(x) = ∫_{−∞}^{∞} y f(x, y) dy / ∫_{−∞}^{∞} f(x, y) dy.

34.3. Show that the independence of X and Y implies that E[Y‖X] = E[Y], which in turn implies that E[XY] = E[X]E[Y]. Show by examples in an Ω of three points that the reverse implications are both false.
34.4. (a) Let B be an event with P(B) > 0, and define a probability measure P_0 by P_0(A) = P(A|B). Show that P_0[A‖𝒢] = P[A ∩ B‖𝒢]/P[B‖𝒢] on a set of P_0-measure 1.
(b) Suppose that ℋ is generated by a partition B_1, B_2, ..., and let 𝒢 ∨ ℋ = σ(𝒢 ∪ ℋ). Show that with probability 1,

P[A‖𝒢 ∨ ℋ] = Σ_i I_{B_i} P[A ∩ B_i‖𝒢] / P[B_i‖𝒢].
34.5. The equation (34.5) was proved by showing that the left side is a version of the right side. Prove it by showing that the right side is a version of the left side.

34.6. Prove for bounded X and Y that E[YE[X‖𝒢]] = E[XE[Y‖𝒢]].

34.7. 33.9↑ Generalize Theorem 34.5 by replacing X with a random vector.

34.8. Assume that X is nonnegative but not necessarily integrable. Show that it is still possible to define a nonnegative random variable E[X‖𝒢], measurable 𝒢, such that (34.1) holds. Prove versions of the monotone convergence theorem and Fatou's lemma.

34.9. (a) Show for nonnegative X that E[X‖𝒢] = ∫_0^∞ P[X > t‖𝒢] dt with probability 1.
(b) Generalize Markov's inequality: P[|X| ≥ α‖𝒢] ≤ α^{−k}E[|X|^k‖𝒢] with probability 1.
(c) Similarly generalize Chebyshev's and Hölder's inequalities.

34.10. (a) Show that, if 𝒢_1 ⊂ 𝒢_2 and E[X²] < ∞, then E[(X − E[X‖𝒢_2])²] ≤ E[(X − E[X‖𝒢_1])²]. The dispersion of X about its conditional mean becomes smaller as the σ-field grows.
(b) Define Var[X‖𝒢] = E[(X − E[X‖𝒢])²‖𝒢]. Prove that Var[X] = E[Var[X‖𝒢]] + Var[E[X‖𝒢]].

34.11. Let 𝒢_1, 𝒢_2, 𝒢_3 be σ-fields in ℱ, let 𝒢_{ij} be the σ-field generated by 𝒢_i ∪ 𝒢_j, and let A_i be the generic set in 𝒢_i. Consider three conditions:
(i) P[A_3‖𝒢_{12}] = P[A_3‖𝒢_2] for all A_3.
(ii) P[A_1 ∩ A_3‖𝒢_2] = P[A_1‖𝒢_2]P[A_3‖𝒢_2] for all A_1 and A_3.
(iii) P[A_1‖𝒢_{23}] = P[A_1‖𝒢_2] for all A_1.
If 𝒢_1, 𝒢_2, and 𝒢_3 are interpreted as descriptions of the past, present, and future, respectively, (i) is a general version of the Markov property: the conditional probability of a future event A_3 given the past and present 𝒢_{12} is the same as the conditional probability given the present 𝒢_2 alone. Condition (iii) is the same with time reversed. And (ii) says that past and future events A_1 and A_3 are conditionally independent given the present 𝒢_2. Prove the three conditions equivalent.

34.12. 33.7 34.11↑ Use Example 33.10 to calculate P[N_s = k‖N_u, u ≥ t] (s < t) for the Poisson process.
34.13. Let L² be the Hilbert space of square-integrable random variables on (Ω, ℱ, P). For 𝒢 a σ-field in ℱ, let M_𝒢 be the subspace of elements of L² that are measurable 𝒢. Show that the operator P_𝒢 defined for X ∈ L² by P_𝒢X = E[X‖𝒢] is the perpendicular projection on M_𝒢.

34.14. ↑ Suppose in Problem 34.13 that 𝒢 = σ(Z) for a random variable Z in L². Let S_Z be the one-dimensional subspace spanned by Z. Show that S_Z may be much smaller than M_{σ(Z)}, so that E[X‖Z] (for X ∈ L²) is by no means the projection of X on Z. Hint: Take Z the identity function on the unit interval with Lebesgue measure.
34.15. ↑ Problem 34.13 can be turned around to give an alternative approach to conditional probability and expected value. For a σ-field 𝒢 in ℱ, let P_𝒢 be the perpendicular projection on the subspace M_𝒢. Show that P_𝒢X has for X ∈ L² the two properties required of E[X‖𝒢]. Use this to define E[X‖𝒢] for X ∈ L² and then extend it to all integrable X via approximation by random variables in L². Now define conditional probability.
34.16. Mixing sequences. A sequence A_1, A_2, ... of ℱ-sets in a probability space (Ω, ℱ, P) is mixing with constant α if

(34.17)  lim_n P(A_n ∩ E) = αP(E)

for every E in ℱ. Then α = lim_n P(A_n).
(a) Show that {A_n} is mixing with constant α if and only if

(34.18)  lim_n ∫_{A_n} X dP = α ∫ X dP

for each integrable X (measurable ℱ).
(b) Suppose that (34.17) holds for E ∈ 𝒫, where 𝒫 is a π-system, Ω ∈ 𝒫, and A_n ∈ σ(𝒫) for all n. Show that {A_n} is mixing. Hint: First check (34.18) for X measurable σ(𝒫) and then use conditional expected values with respect to σ(𝒫).
(c) Show that, if P_0 is a probability measure on (Ω, ℱ) and P_0 ≪ P, then mixing is preserved if P is replaced by P_0.
be 34.17. i Application of mixing to the central limit theorem. Let XI > X2 , random variables on P), independent and identically distributed with mean 0 and variance u 2, and pu t S, = X1 + · · · +X,. Then S,ju{n = N by the Lindeberg-Levy theorem. Show by the steps below that this still holds if P is replaced by any probability measure P0 on (!l, !T) that P dominates. For example, the central limit theorem applies to the sums [i: = 1 r1,(w) o f Rademacher functions if w i s chosen according to the unifo rm density over the unit interval, and this result shows that the same is true if w is chosen according to an arbitrary density. Let Y,, = S,juVn and Z, = (S, - S[1o , 1 )/uVn, and take fY to consist of the sets of the form [(XI ' . . . , Xk ) E H], � > 1, H E !JR k . Prove successively: (a) P[Y, < x] --+ P[N < x]. •
•
•
(D., !T,
P[IY, - Z,I > £] --+ 0. P[Z, <x] -+ P[N < x]. P(E n [Z, <x]) --+ P(E)P[N < x] for E E 9J. P(E n [Z, <x]) --+ P(E)P[N <x] for E E !T. (f) P0[Z, <x] --+ P[ N < x]. (g) P0[IY, - Z,I > E] -+ O. (h) P0[Y, <x] --+ P[N < x]. 34.18. Suppose that .:# is a sufficient subfield for the family of probability measures P6 , (} E E>, on !T). Suppose that for each (} and A, p(A, w) is a version of P6[AII.:#].,. and suppose further t hat for each p(- , w) is a probability (b) (c) (d) (e)
�
(D.,
w,
measure on F. Define Q_θ on F by Q_θ(A) = ∫_Ω p(A, ω)P_θ(dω), and show that Q_θ = P_θ.

The idea is that an observer with the information in G (but ignorant of ω itself) in principle knows the values p(A, ω), because each p(A, ·) is measurable G. If he has the appropriate randomization device, he can draw an ω′ from Ω according to the probability measure p(·, ω), and his ω′ will have the same distribution Q_θ = P_θ that ω has. Thus, whatever the value of the unknown θ, the observer can, on the basis of the information in G alone and without knowing ω itself, construct a probabilistic replica of ω.
34.19. 34.13 ↑ In the context of the discussion on p. 252, let F be the σ-field of sets of the form Θ × A for A ∈ G. Show that under the probability measure Q, g_0 is the conditional expected value of g given F.
34.20. (a) In Example 34.4, take π to have density e^{−θ} over Θ = (0, ∞). Show by Theorem 34.7 that T is a minimal sufficient statistic (in the sense that σ(T) is minimal). (b) Let P_θ be the distribution for samples of size n from a normal distribution with parameter θ = (m, σ²), σ² > 0, and let π put unit mass at (0, 1). Show that the sample mean and variance form a minimal sufficient statistic.
SECTION 35. MARTINGALES

Definition

Let X_1, X_2, ... be a sequence of random variables on a probability space (Ω, F, P), and let F_1, F_2, ... be a sequence of σ-fields in F. The sequence {(X_n, F_n): n = 1, 2, ...} is a martingale if these four conditions hold:

(i) F_n ⊂ F_{n+1};
(ii) X_n is measurable F_n;
(iii) E[|X_n|] < ∞;
(iv) with probability 1,

(35.1)  E[X_{n+1} || F_n] = X_n.

Alternatively, the sequence X_1, X_2, ... is said to be a martingale relative to the σ-fields F_1, F_2, ... . Condition (i) is expressed by saying that the F_n form a filtration, and condition (ii) by saying that the X_n are adapted to the filtration.

If X_n represents the fortune of a gambler after the nth play and F_n represents his information about the game at that time, (35.1) says that his expected fortune after the next play is the same as his present fortune. Thus a martingale represents a fair game, and sums of independent random variables with mean 0 give one example. As will be seen below, martingales arise in very diverse connections.
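The fair-game reading of (35.1) can be made concrete in a toy case. The sketch below (the coin-tossing model, the helper name, and the exhaustive check are illustrative choices, not from the text) verifies the martingale property and the constant-expectation consequence (35.5) for sums of independent ±1 variables by exact enumeration rather than simulation.

```python
from itertools import product

# A minimal sketch: for the fair coin-tossing game, X_n = d_1 + ... + d_n
# with independent d_k = +/-1, each with probability 1/2.  Conditioning on
# F_n = sigma(d_1, ..., d_n) amounts to conditioning on the full history.

def conditional_mean_next(history):
    """E[X_{n+1} | d_1, ..., d_n]: average X_{n+1} over the two equally
    likely one-step continuations of the given history."""
    x_n = sum(history)
    return sum(x_n + d for d in (+1, -1)) / 2.0

# (35.1): E[X_{n+1} || F_n] = X_n, checked for every history of length 3.
for history in product((+1, -1), repeat=3):
    assert conditional_mean_next(history) == sum(history)

# (35.5): E[X_1] = E[X_2] = ... = 0, computed exactly over all paths.
for n in (1, 2, 3, 4):
    paths = list(product((+1, -1), repeat=n))
    assert sum(sum(p) for p in paths) / len(paths) == 0.0
```

The same check fails for a biased coin, which is one way to see that it is fairness, not independence alone, that (35.1) encodes.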
The sequence X_1, X_2, ... is defined to be a martingale if it is a martingale relative to some sequence F_1, F_2, ... . In this case, the σ-fields G_n = σ(X_1, ..., X_n) always work: obviously G_n ⊂ G_{n+1}, and X_n is measurable G_n; and if (35.1) holds, then

E[X_{n+1} || G_n] = E[E[X_{n+1} || F_n] || G_n] = E[X_n || G_n] = X_n

by (34.5). For these special σ-fields G_n, (35.1) reduces to

(35.2)  E[X_{n+1} || X_1, ..., X_n] = X_n.

Since σ(X_1, ..., X_n) ⊂ F_n for all n if and only if X_n is measurable F_n for each n, the σ(X_1, ..., X_n) are the smallest σ-fields with respect to which the X_n are a martingale.

The essential condition is embodied in (35.1) and in its specialization (35.2). Condition (iii) is of course needed to ensure that E[X_{n+1} || F_n] exists. Condition (iv) says that X_n is a version of E[X_{n+1} || F_n]; since X_n is measurable F_n, the requirement reduces to

(35.3)  ∫_A X_{n+1} dP = ∫_A X_n dP,  A ∈ F_n.

Since the F_n are nested, A ∈ F_n implies that ∫_A X_n dP = ∫_A X_{n+1} dP = ··· = ∫_A X_{n+k} dP. Therefore, X_n, being measurable F_n, is a version of E[X_{n+k} || F_n]:

(35.4)  E[X_{n+k} || F_n] = X_n

with probability 1 for k ≥ 1. Note that for A = Ω, (35.3) gives

(35.5)  E[X_1] = E[X_2] = ··· .

The defining conditions for a martingale can also be given in terms of the differences

(35.6)  d_n = X_n − X_{n−1}
(d_1 = X_1). By linearity, (35.1) is the same thing as

(35.7)  E[d_{n+1} || F_n] = 0.

Note that, since X_k = d_1 + ··· + d_k and d_k = X_k − X_{k−1}, the sets X_1, ..., X_n and d_1, ..., d_n generate the same σ-field:

(35.8)  σ(X_1, ..., X_n) = σ(d_1, ..., d_n).

Example 35.1. Let d_1, d_2, ... be independent, integrable random variables such that E[d_n] = 0 for n ≥ 2. If F_n is the σ-field (35.8), then by independence E[d_{n+1} || F_n] = E[d_{n+1}] = 0, and the X_n = d_1 + ··· + d_n are a martingale relative to the F_n. If d is another random variable, independent of the d_n, and if F_n is replaced by σ(d, d_1, ..., d_n), then the X_n are still a martingale relative to the F_n. It is natural and convenient in the theory to allow σ-fields F_n larger than the minimal ones (35.8).
Example 35.2. Let (Ω, F, P) be a probability space, let ν be a finite measure on F, and let F_1, F_2, ... be a nondecreasing sequence of σ-fields in F. Suppose that P dominates ν when both are restricted to F_n; that is, suppose that A ∈ F_n and P(A) = 0 together imply that ν(A) = 0. There is then a density, or Radon–Nikodym derivative, X_n of ν with respect to P when both are restricted to F_n; X_n is a function that is measurable F_n and integrable with respect to P, and it satisfies

(35.9)  ν(A) = ∫_A X_n dP,  A ∈ F_n.

If A ∈ F_n, then A ∈ F_{n+1} as well, so that ∫_A X_{n+1} dP = ν(A); this and (35.9) give (35.3). Thus the X_n are a martingale with respect to the F_n.

Example 35.3. For a specialization of the preceding example, let P be Lebesgue measure on the σ-field F of Borel subsets of Ω = (0, 1], and let F_n be the finite σ-field generated by the partition of Ω into dyadic intervals (k2^{−n}, (k + 1)2^{−n}], 0 ≤ k < 2^n. If A ∈ F_n and P(A) = 0, then A is empty. Hence P dominates every finite measure ν on F_n. The Radon–Nikodym derivative is
One assumes W_n ≥ 0, and that W_n is measurable F_{n−1}, to exclude prevision: before the nth play the information available to the gambler is that in F_{n−1}, and his choice of stake W_n must be based on this alone. For simplicity take W_n bounded. Then W_n d_n is integrable, and it is measurable F_n if d_n is; and if X_n is a martingale, then E[W_n d_n || F_{n−1}] = W_n E[d_n || F_{n−1}] = 0 by (34.2). Thus

(35.17)  W_1 d_1 + ··· + W_n d_n

is a martingale relative to the F_n. The sequence W_1, W_2, ... represents a betting system, and transforming a fair game by a betting system preserves fairness; that is, transforming X_n into (35.17) preserves the martingale property. The various betting systems discussed in Section 7 give rise to various martingales, and these martingales are not in general sums of independent random variables; they are not in general the special martingales of Example 35.1. If W_n assumes only the values 0 and 1, the betting system is a selection system; see Section 7.

If the game is unfavorable to the gambler (that is, if X_n is a supermartingale†) and if W_n is nonnegative, bounded, and measurable F_{n−1}, then the same argument shows that (35.17) is again a supermartingale, is again unfavorable. Betting systems are thus of no avail in unfavorable games.

The stopping-time arguments of Section 7 also extend.‡ Suppose that {X_n} is a martingale relative to {F_n}; it may have come from another martingale

†There is a reversal of terminology here: a subfair game (Section 7) is against the gambler, while a submartingale favors him.
‡The notation has, of course, changed; the F_n and X of Section 7 have become X_n and A_n.
via transformation by a betting system. Let τ be a random variable taking on nonnegative integers as values, and suppose that

(35.18)  [τ = n] ∈ F_n.

If τ is the time the gambler stops, [τ = n] is the event that he stops just after the nth play, and (35.18) requires that his decision depend only on the information F_n available to him at that time. His fortune at time n for this stopping rule is

(35.19)  X_n* = X_n if n < τ, and X_n* = X_τ if n ≥ τ.

Here X_τ (which has value X_{τ(ω)}(ω) at ω) is the gambler's ultimate fortune, and it is his fortune for all times subsequent to τ. The problem is to show that X_0*, X_1*, ... is a martingale relative to F_0, F_1, ... . First,

E[|X_n*|] = Σ_{k=0}^{n−1} ∫_{[τ=k]} |X_k| dP + ∫_{[τ≥n]} |X_n| dP < ∞.

Since [τ > n] = Ω − [τ ≤ n] ∈ F_n,

[X_n* ∈ H] = ⋃_{k=0}^{n} [τ = k, X_k ∈ H] ∪ [τ > n, X_n ∈ H] ∈ F_n.

Moreover,

∫_A X_{n+1}* dP = Σ_{k=0}^{n} ∫_{A∩[τ=k]} X_k dP + ∫_{A∩[τ>n]} X_{n+1} dP

and

∫_A X_n* dP = Σ_{k=0}^{n} ∫_{A∩[τ=k]} X_k dP + ∫_{A∩[τ>n]} X_n dP.

Because of (35.3), the right sides here coincide if A ∈ F_n; this establishes (35.3) for the sequence X_0*, X_1*, ..., which is thus a martingale. The same kind of argument works for supermartingales.

Since X_n* = X_τ for n ≥ τ, X_n* → X_τ. As pointed out in Section 7, it is not always possible to integrate to the limit here. Let X_n = a + d_1 + ··· + d_n (X_0 = a), where the d_n are independent and assume the values ±1 with probability ½ each, and let τ be the smallest n for which d_1 + ··· + d_n = 1. Then E[X_n*] = a and X_τ = a + 1. On the other hand, if the X_n* are uniformly bounded or uniformly integrable, it is possible to integrate to the limit: E[X_τ] = E[X_0].
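The counterexample just described (with a = 0) is small enough to check exactly. In the sketch below (the function name and the horizon 6 are illustrative choices, not from the text), τ is the first time the walk reaches +1, X_n* is the stopped walk, and exhaustive enumeration confirms E[X_n*] = 0 for every n up to the horizon even though X_τ = 1 whenever stopping has occurred.

```python
from itertools import product

def stopped_value(path, n):
    """X_n* for the fair +/-1 walk stopped at tau = first time the
    partial sum hits +1 (the a = 0 case of the text's example)."""
    x = 0
    for d in path[:n]:
        x += d
        if x == 1:          # tau has occurred: the fortune is frozen at +1
            return 1
    return x

paths = list(product((+1, -1), repeat=6))
for n in range(1, 7):
    mean = sum(stopped_value(p, n) for p in paths) / len(paths)
    assert mean == 0.0      # E[X_n*] = E[X_0] = 0: stopping keeps the game fair
```

Uniform integrability fails only in the limit: with no horizon cap the stopped walk converges to +1 almost surely, so E[X_τ] = 1 ≠ 0 = E[X_0].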
Functions of Martingales
Convex functions of martingales are submartingales:

Theorem 35.1. (i) If X_1, X_2, ... is a martingale relative to F_1, F_2, ..., if φ is convex, and if the φ(X_n) are integrable, then φ(X_1), φ(X_2), ... is a submartingale relative to F_1, F_2, ... .
(ii) If X_1, X_2, ... is a submartingale relative to F_1, F_2, ..., if φ is nondecreasing and convex, and if the φ(X_n) are integrable, then φ(X_1), φ(X_2), ... is a submartingale relative to F_1, F_2, ... .

Proof. In the submartingale case, X_n ≤ E[X_{n+1} || F_n], and if φ is nondecreasing, then φ(X_n) ≤ φ(E[X_{n+1} || F_n]). In the martingale case, X_n = E[X_{n+1} || F_n], and so φ(X_n) = φ(E[X_{n+1} || F_n]). If φ is convex, then by Jensen's inequality (34.7) for conditional expectations, it follows that φ(E[X_{n+1} || F_n]) ≤ E[φ(X_{n+1}) || F_n]. ∎

Example 35.8 is the case of part (i) for φ(x) = |x|.
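The case φ(x) = |x| can be checked concretely; the sketch below (the fair ±1 walk and the helper name are illustrative choices) verifies the defining submartingale inequality pointwise over every history.

```python
from itertools import product

def cond_mean_abs_next(history):
    """E[|X_{n+1}| | d_1, ..., d_n] for the fair +/-1 walk."""
    x = sum(history)
    return sum(abs(x + d) for d in (+1, -1)) / 2.0

# E[|X_{n+1}| || F_n] >= |X_n| for every history of length up to 4;
# the inequality is strict exactly when the walk sits at 0, where the
# kink of |.| makes Jensen's inequality strict.
for n in (1, 2, 3, 4):
    for history in product((+1, -1), repeat=n):
        assert cond_mean_abs_next(history) >= abs(sum(history))
```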
Stopping Times
Let τ be a random variable taking as values positive integers or the special value ∞. It is a stopping time with respect to {F_n} if [τ = k] ∈ F_k for each finite k (see (35.18)) or, equivalently, if [τ ≤ k] ∈ F_k for each finite k. Define

(35.20)  F_τ = [A ∈ F: A ∩ [τ ≤ k] ∈ F_k, 1 ≤ k < ∞].

This is a σ-field, and the definition is unchanged if [τ ≤ k] is replaced by [τ = k] on the right. Since clearly [τ = j] ∈ F_τ for finite j, τ is measurable F_τ.

If τ(ω) < ∞ for all ω and F_n = σ(X_1, ..., X_n), then I_A(ω) = I_A(ω′) for all A in F_τ if and only if X_i(ω) = X_i(ω′) for i ≤ τ(ω) = τ(ω′): the information in F_τ consists of the values τ(ω), X_1(ω), ..., X_{τ(ω)}(ω).

Suppose now that τ_1 and τ_2 are two stopping times and τ_1 ≤ τ_2. If A ∈ F_{τ_1}, then A ∩ [τ_1 ≤ k] ∈ F_k, and hence A ∩ [τ_2 ≤ k] = (A ∩ [τ_1 ≤ k]) ∩ [τ_2 ≤ k] ∈ F_k: F_{τ_1} ⊂ F_{τ_2}.

Theorem 35.2. If X_1, ..., X_n is a submartingale with respect to F_1, ..., F_n, and τ_1, τ_2 are stopping times satisfying 1 ≤ τ_1 ≤ τ_2 ≤ n, then X_{τ_1}, X_{τ_2} is a submartingale with respect to F_{τ_1}, F_{τ_2}.
DERIVATIVES AND CONDITIONAL PROBABILITY
466
This is the optional sampling theorem. The proof will show that X_{τ_1}, X_{τ_2} is a martingale if X_1, ..., X_n is.

Proof. Since the X_{τ_i} are dominated by Σ_{k=1}^n |X_k|, they are integrable. It is required to show that E[X_{τ_2} || F_{τ_1}] ≥ X_{τ_1}, or

(35.21)  ∫_A X_{τ_2} dP ≥ ∫_A X_{τ_1} dP,  A ∈ F_{τ_1}.

But A ∈ F_{τ_1} implies that A ∩ [τ_1 < k ≤ τ_2] = (A ∩ [τ_1 ≤ k − 1]) ∩ [τ_2 ≤ k − 1]^c lies in F_{k−1}. If Δ_k = X_k − X_{k−1}, then

∫_A (X_{τ_2} − X_{τ_1}) dP = ∫_A Σ_{k=1}^n I_{[τ_1 < k ≤ τ_2]} Δ_k dP = Σ_{k=1}^n ∫_{A∩[τ_1 < k ≤ τ_2]} Δ_k dP ≥ 0

by the submartingale property. ∎
Inequalities
There are two inequalities that are fundamental to the theory of martingales.

Theorem 35.3. If X_1, ..., X_n is a submartingale, then for α > 0,

(35.22)  P[max_{k≤n} X_k ≥ α] ≤ (1/α) E[|X_n|].

This extends Kolmogorov's inequality: if S_1, S_2, ... are partial sums of independent random variables with mean 0, they form a martingale; if the variances are finite, then S_1², S_2², ... is a submartingale by Theorem 35.1(i), and (35.22) for this submartingale is exactly Kolmogorov's inequality (22.9).

Proof. Let τ_2 = n; let τ_1 be the smallest k such that X_k ≥ α, if there is one, and n otherwise. If M_k = max_{i≤k} X_i, then [M_n ≥ α] ∩ [τ_1 ≤ k] = [M_k ≥ α] ∈ F_k, and hence [M_n ≥ α] is in F_{τ_1}. By Theorem 35.2,

(35.23)  αP[M_n ≥ α] ≤ ∫_{[M_n ≥ α]} X_{τ_1} dP ≤ ∫_{[M_n ≥ α]} X_{τ_2} dP = ∫_{[M_n ≥ α]} X_n dP ≤ E[|X_n|],

so that P[M_n ≥ α] ≤ α^{−1} E[|X_n|]. ∎

The second fundamental inequality requires the notion of an upcrossing. Let [α, β] be an interval (α < β), and let X_1, ..., X_n be random variables. Inductively define variables τ_1, τ_2, ...:

τ_1 is the smallest j such that 1 ≤ j ≤ n and X_j ≤ α, and is n if there is no such j; τ_k for even k is the smallest j such that τ_{k−1} < j ≤ n and X_j ≥ β, and is n if there is no such j; τ_k for odd k exceeding 1 is the smallest j such that τ_{k−1} < j ≤ n and X_j ≤ α, and is n if there is no such j.

The number U of upcrossings of [α, β] by X_1, ..., X_n is the largest i such that X_{τ_{2i−1}} ≤ α < β ≤ X_{τ_{2i}}. In the diagram, n = 20 and there are three upcrossings.
[Diagram omitted: a sample path of X_1, ..., X_20 with the levels α and β marked, making three upcrossings of [α, β].]
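The recursive definition of the τ_k and of U translates directly into code. The sketch below (the function and the test path are invented for illustration) counts completed passages from ≤ α up to ≥ β, and then confirms the upcrossing inequality (35.24) exhaustively for a short fair walk.

```python
from itertools import product

def upcrossings(x, alpha, beta):
    """Number of upcrossings of [alpha, beta] by the finite sequence x:
    completed passages from a value <= alpha to a later value >= beta."""
    count, below = 0, False
    for v in x:
        if not below and v <= alpha:
            below = True
        elif below and v >= beta:
            count += 1
            below = False
    return count

# Three upcrossings of [0, 2]: (-1 -> 3), (0 -> 2), and (-2 -> 5).
assert upcrossings([1, -1, 3, 0, 1, 2, -2, 5], alpha=0, beta=2) == 3

# The bound (35.24), E[U] <= (E[|X_n|] + |alpha|)/(beta - alpha), checked
# exactly for the fair +/-1 walk with n = 8 and [alpha, beta] = [-1, 1].
n, alpha, beta = 8, -1, 1
walks = [[sum(p[:k]) for k in range(1, n + 1)]
         for p in product((+1, -1), repeat=n)]
mean_u = sum(upcrossings(w, alpha, beta) for w in walks) / len(walks)
bound = (sum(abs(w[-1]) for w in walks) / len(walks) + abs(alpha)) / (beta - alpha)
assert mean_u <= bound
```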
Theorem 35.4. For a submartingale X_1, ..., X_n, the number U of upcrossings of [α, β] satisfies

(35.24)  E[U] ≤ (E[|X_n|] + |α|) / (β − α).

Proof. Let Y_k = max{0, X_k − α} and θ = β − α. By Theorem 35.1(ii), Y_1, ..., Y_n is a submartingale. The τ_k are unchanged if in the definitions X_j ≤ α is replaced by Y_j = 0 and X_j ≥ β by Y_j ≥ θ, and so U is also the number of upcrossings of [0, θ] by Y_1, ..., Y_n.

If k is even and τ_{k−1} is a stopping time, then for j ≤ n,

[τ_k = j] = ⋃_{i=1}^{j−1} [τ_{k−1} = i, Y_{i+1} < θ, ..., Y_{j−1} < θ, Y_j ≥ θ]

lies in F_j, and [τ_k = n] = [τ_k ≤ n − 1]^c lies in F_n, and so τ_k is also a stopping time. With a similar argument for odd k, this shows that the τ_k are all stopping times. Since the τ_k are strictly increasing until they reach n, τ_n = n. Therefore,

Y_n = Y_{τ_n} ≥ Y_{τ_n} − Y_{τ_1} = Σ_{k=2}^n (Y_{τ_k} − Y_{τ_{k−1}}) = Σ_e + Σ_o,

where Σ_e and Σ_o are the sums over the even k and the odd k in the range 2 ≤ k ≤ n. By Theorem 35.2, Σ_o has nonnegative expected value, and therefore E[Y_n] ≥ E[Σ_e]. If Y_{τ_{2i−1}} = 0 < θ ≤ Y_{τ_{2i}} (which is the same thing as X_{τ_{2i−1}} ≤ α < β ≤ X_{τ_{2i}}), then the difference Y_{τ_{2i}} − Y_{τ_{2i−1}} appears in the sum Σ_e and is at least θ. Since there are U of these differences, Σ_e ≥ θU, and therefore E[Y_n] ≥ θE[U]. In terms of the original variables, this is

(β − α) E[U] ≤ ∫_{[X_n ≥ α]} (X_n − α) dP ≤ E[|X_n|] + |α|. ∎
In a sense, an upcrossing of [α, β] is easy: since the X_k form a submartingale, they tend to increase. But before another upcrossing can occur, the sequence must make its way back down below α, which it resists. Think of the extreme case where the X_k are strictly increasing constants. This is reflected in the proof: each of Σ_e and Σ_o has nonnegative expected value, but for Σ_e the proof uses the stronger inequality E[Σ_e] ≥ θE[U].

Convergence Theorems
The martingale convergence theorem, due to Doob, has a number of forms. The simplest one is this:

Theorem 35.5. Let X_1, X_2, ... be a submartingale. If K = sup_n E[|X_n|] < ∞, then X_n → X with probability 1, where X is a random variable satisfying E[|X|] ≤ K.

Proof. Fix α and β for the moment, and let U_n be the number of upcrossings of [α, β] by X_1, ..., X_n. By the upcrossing theorem, E[U_n] ≤ (E[|X_n|] + |α|)/(β − α) ≤ (K + |α|)/(β − α). Since U_n is nondecreasing and E[U_n] is bounded, it follows by the monotone convergence theorem that sup_n U_n is integrable and hence finite-valued almost everywhere.

Let X* and X_* be the limits superior and inferior of the sequence X_1, X_2, ...; they may be infinite. If X_* < α < β < X*, then U_n must go to infinity. Since sup_n U_n is finite with probability 1, P[X_* < α < β < X*] = 0.
Now

(35.25)  [X_* < X*] = ⋃ [X_* < α < β < X*],

where the union extends over all pairs of rationals α and β. The set on the left therefore has probability 0. Thus X_* and X* are equal with probability 1, and X_n converges to their common value X, which may be ±∞. By Fatou's lemma, E[|X|] ≤ lim inf_n E[|X_n|] ≤ K. Since it is integrable, X is finite with probability 1. ∎

If the X_n form a martingale, then by (35.16) applied to the submartingale |X_1|, |X_2|, ..., the E[|X_n|] are nondecreasing, so that K = lim_n E[|X_n|]. The hypothesis in the theorem that K be finite is essential.

Suppose that for each n the sequence X_{n1}, X_{n2}, ... is a martingale with respect to F_{n1}, F_{n2}, ... . Define Y_{nk} = X_{nk} − X_{n,k−1}, suppose the Y_{nk} have second moments, and put σ²_{nk} = E[Y²_{nk} || F_{n,k−1}] (F_{n0} = {∅, Ω}). The probability space may vary with n. If the martingale is originally defined only for 1 ≤ k ≤ r_n, take Y_{nk} = 0 and F_{nk} = F_{n,r_n} for k > r_n. Assume that Σ_{k=1}^∞ Y_{nk} and Σ_{k=1}^∞ σ²_{nk} converge with probability 1.
Theorem 35.12. Suppose that

(35.35)  Σ_{k=1}^∞ σ²_{nk} →_P σ²,

where σ is a positive constant, and that

(35.36)  Σ_{k=1}^∞ E[Y²_{nk} I_{[|Y_{nk}| ≥ ε]}] → 0

for each ε. Then Σ_{k=1}^∞ Y_{nk} ⇒ σN.
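For the simplest array, Y_nk = d_k/√n with bounded i.i.d. differences, the hypotheses (35.35) and (35.36) can be computed in closed form; the sketch below (the ±1 coin and the function name are illustrative choices, not from the text) evaluates both sums for that case, where σ = 1.

```python
# For Y_nk = d_k / sqrt(n), k = 1, ..., n, with d_k = +/-1 fair coin flips:
# sigma2_nk = 1/n, so the conditional variances sum to exactly 1 = sigma^2,
# and each |Y_nk| = 1/sqrt(n) surely, so the Lindeberg sum (35.36) is
# identically zero once 1/sqrt(n) < eps, i.e. for n > 1/eps^2.

def lindeberg_sum(n, eps):
    """Sum over k of E[Y_nk^2 I[|Y_nk| >= eps]] for the +/-1 array."""
    y = 1.0 / n ** 0.5
    return sum((y * y if y >= eps else 0.0) for _ in range(n))

assert abs(sum(1.0 / 100 for _ in range(100)) - 1.0) < 1e-9  # (35.35), n = 100
assert lindeberg_sum(100, eps=0.2) == 0.0                    # 1/10 < 0.2
assert lindeberg_sum(9, eps=0.2) > 0.9                       # nonzero for small n
```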
Proof of Theorem 35.11. The proof will be given for t going to infinity through the integers.† Let Y_{nk} = I_{[ν_n ≥ k]} d_k/√n and F_{nk} = F_k. From [ν_n ≥ k] = [Σ_{j=1}^{k−1} σ_j² < n] ∈ F_{k−2} follow E[Y_{nk} || F_{n,k−1}] = 0 and σ²_{nk} = E[Y²_{nk} || F_{n,k−1}] = I_{[ν_n ≥ k]} σ_k²/n. If K bounds the |Y_k|, then 1 ≤ Σ_{k=1}^∞ σ²_{nk} = n^{−1} Σ_{k=1}^{ν_n} σ_k² ≤ 1 + K²/n, so that (35.35) holds for σ = 1. For n large enough that K/√n < ε, the sum in (35.36) vanishes. Theorem 35.12 therefore applies, and Σ_{k=1}^{ν_n} d_k/√n = Σ_{k=1}^∞ Y_{nk} ⇒ N. ∎

†For the general case, first check that the proof of Theorem 35.12 goes through without change if n is replaced by a parameter going continuously to infinity.
Proof of Theorem 35.12. Assume at first that there is a constant c such that

(35.37)  Σ_{k=1}^∞ σ²_{nk} ≤ c,

which in fact suffices for the application to Theorem 35.11. Write S_k = Σ_{j=1}^k Y_{nj} (S_0 = 0), S_∞ = Σ_{j=1}^∞ Y_{nj}, L_k = Σ_{j=1}^k σ²_{nj} (L_0 = 0), and L_∞ = Σ_{j=1}^∞ σ²_{nj}; the dependence on n is suppressed in the notation. The problem is to prove E[e^{itS_∞}] → e^{−t²σ²/2}. Fix t, and let K_t bound t² and |t|³. By (27.15),

(35.38)  e^{itY_{nk}} = 1 + itY_{nk} − ½t²Y²_{nk} + θ_{nk},

where |θ_{nk}| ≤ K_t min{Y²_{nk}, |Y_{nk}|³}, and

(35.39)  e^{−t²σ²_{nk}/2} = 1 − ½t²σ²_{nk} + θ′_{nk},

where (use (27.15) and increase K_t) |θ′_{nk}| ≤ K_t σ⁴_{nk}. Because of the condition E[Y_{nk} || F_{n,k−1}] = 0 and the definition of σ²_{nk}, the right sides of (35.38) and (35.39), minus θ_{nk} and θ′_{nk} respectively, have the same conditional expected value given F_{n,k−1}. By (35.37), therefore, E[e^{itS_∞}] − e^{−t²σ²/2} goes to 0: the error terms θ_{nk} are controlled by the Lindeberg condition (35.36), and the θ′_{nk} by (35.35) and (35.37).

To remove the restriction (35.37), define A_{nk} = [Σ_{j=1}^k σ²_{nj} ≤ c] and A_{n∞} = [Σ_{j=1}^∞ σ²_{nj} ≤ c], and take Z_{nk} = Y_{nk} I_{A_{nk}}. From A_{nk} ∈ F_{n,k−1} follow E[Z_{nk} || F_{n,k−1}] = 0 and τ²_{nk} = E[Z²_{nk} || F_{n,k−1}] = I_{A_{nk}} σ²_{nk}. Since Σ_{j=1}^∞ τ²_{nj} is Σ_{j=1}^k σ²_{nj} on A_{nk} − A_{n,k+1} and Σ_{j=1}^∞ σ²_{nj} on A_{n∞}, the Z-array satisfies (35.37). Now P(A_{n∞}) → 1 by (35.35) if c > σ², and on A_{n∞}, τ²_{nk} = σ²_{nk} for all k, so that the Z-array satisfies (35.35). And it satisfies (35.36) because |Z_{nk}| ≤ |Y_{nk}|. Therefore, by the case already treated, Σ_{k=1}^∞ Z_{nk} ⇒ σN. But since Σ_{k=1}^∞ Y_{nk} coincides with this last sum on A_{n∞}, it, too, is asymptotically normal. ∎
PROBLEMS

35.1. Suppose that Δ_1, Δ_2, ... are independent random variables with mean 0. Let X_1 = Δ_1 and X_{n+1} = X_n + Δ_{n+1} f_n(X_1, ..., X_n), and suppose that the X_n are integrable. Show that {X_n} is a martingale. The martingales of gambling have this form.

35.2. Let Y_1, Y_2, ... be independent random variables with mean 0 and variance σ². Let X_n = (Σ_{k=1}^n Y_k)² − nσ² and show that {X_n} is a martingale.
35.3. Suppose that {Y_n} is a finite-state Markov chain with transition matrix [p_{ij}]. Suppose that Σ_j p_{ij} x(j) = λx(i) for all i (the x(i) are the components of a right eigenvector of the transition matrix). Put X_n = λ^{−n} x(Y_n) and show that {X_n} is a martingale.

35.4. Suppose that Y_1, Y_2, ... are independent, positive random variables and that E[Y_n] = 1. Put X_n = Y_1 ··· Y_n.
(a) Show that {X_n} is a martingale and converges with probability 1 to an integrable X.
(b) Suppose specifically that Y_n assumes the values 1/2 and 3/2 with probability 1/2 each. Show that X = 0 with probability 1. This gives an example where E[Π_{n=1}^∞ Y_n] ≠ Π_{n=1}^∞ E[Y_n] for independent, integrable, positive random variables. Show, however, that E[Π_{n=1}^∞ Y_n] ≤ Π_{n=1}^∞ E[Y_n] always holds.
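Problem 35.4(b) is small enough to explore numerically; the sketch below (the names and the horizon n = 10 are illustrative choices) computes E[X_n] exactly by enumeration and exhibits the negative drift of log X_n that forces X_n → 0.

```python
from itertools import product
from math import log

n = 10
total, count = 0.0, 0
for ys in product((0.5, 1.5), repeat=n):   # all 2^n equally likely outcomes
    x = 1.0
    for y in ys:
        x *= y                             # X_n = Y_1 ... Y_n
    total += x
    count += 1
mean = total / count
assert abs(mean - 1.0) < 1e-9              # E[X_n] = 1 for every n

# log X_n is a random walk with negative per-step drift, so X_n -> 0 a.s.
drift = (log(0.5) + log(1.5)) / 2.0        # = log(3/4)/2 < 0
assert drift < 0
```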
35.5. Suppose that X_1, X_2, ... is a martingale satisfying E[X_1] = 0 and E[X_n²] < ∞. Show that E[(X_{n+r} − X_n)²] = Σ_{k=1}^r E[(X_{n+k} − X_{n+k−1})²] (the variance of the sum is the sum of the variances). Assume that Σ_n E[(X_n − X_{n−1})²] < ∞ and prove that X_n converges with probability 1. Do this first by Theorem 35.5 and then (see Theorem 22.6) by Theorem 35.3.
35.6. Show that a submartingale X_n can be represented as X_n = Y_n + Z_n, where Y_n is a martingale and 0 ≤ Z_1 ≤ Z_2 ≤ ··· . Hint: Take X_0 = 0 and Δ_n = X_n − X_{n−1}, and define Z_n = Σ_{k=1}^n E[Δ_k || F_{k−1}].

... ≥ (2k)^{−1} Σ φ(y), the sum extending over the 2k nearest neighbors y. Show for k = 1 and k = 2 that a bounded superharmonic function is constant. Show for k ≥ 3 that there exist nonconstant bounded harmonic functions.
35.13. 32.7 32.9 ↑ Let (Ω, F, P) be a probability space, let ν be a finite measure on F, and suppose that F_n ↑ F_∞ ⊂ F. For n ≤ ∞, let X_n be the Radon–Nikodym derivative with respect to P of the absolutely continuous part of ν when P and ν are both restricted to F_n. The problem is to extend Theorem 35.7 by showing that X_n → X_∞ with probability 1.
(a) For n ≤ ∞, let

ν(A) = ∫_A X_n dP + σ_n(A),  A ∈ F_n,

be the decomposition of ν into absolutely continuous and singular parts with respect to P on F_n. Show that X_1, X_2, ... is a supermartingale and converges with probability 1.
(b) Let

σ_∞(A) = ∫_A Z_n dP + τ_n(A),  A ∈ F_n,

be the decomposition of σ_∞ into absolutely continuous and singular parts with respect to P on F_n. Let Y_n = E[X_∞ || F_n], and prove

ν(A) = ∫_A (Y_n + Z_n) dP + τ_n(A),  A ∈ F_n.

Conclude that Y_n + Z_n = X_n with probability 1. Since Y_n converges to X_∞, Z_n converges with probability 1 to some Z. Show that ∫_A Z dP ≤ σ_∞(A) for A ∈ F_n, and conclude that Z = 0 with probability 1.
35.14. (a) Show that {X_n} is a martingale with respect to {F_n} if and only if, for all n and all stopping times τ such that τ ≤ n, E[X_n || F_τ] = X_τ.
(b) Show that, if {X_n} is a martingale and τ is a bounded stopping time, then E[X_τ] = E[X_1].

35.15. 31.9 ↑ Suppose that F_n ↑ F_∞ and A ∈ F_∞, and prove that P[A || F_n] → I_A with probability 1. Compare Lebesgue's density theorem.

35.16. Theorems 35.6 and 35.9 have analogues in Hilbert space. For n ≤ ∞, let P_n be the perpendicular projection on a subspace M_n. Then P_n x → P_∞ x for all x if either (a) M_1 ⊂ M_2 ⊂ ··· and M_∞ is the closure of ⋃_n M_n, or (b) M_1 ⊃ M_2 ⊃ ··· and M_∞ = ⋂_n M_n.

35.17. Suppose that θ has an arbitrary distribution, and suppose that, conditionally on θ, the random variables Y_1, Y_2, ... are independent and normally distributed with mean θ and variance σ². Construct such a sequence {θ, Y_1, Y_2, ...}. Prove (35.31).
35.18. It is shown on p. 471 that optional stopping has no effect on likelihood ratios. This is not true of tests of significance. Suppose that X_1, X_2, ... are independent and identically distributed and assume the values 1 and 0 with probabilities p and 1 − p. Consider the null hypothesis that p = 1/2 and the alternative that p > 1/2. The usual .05-level test of significance is to reject the null hypothesis if

(35.40)  (Σ_{k=1}^n X_k − n/2) / (½√n) > 1.645.

For this test the chance of falsely rejecting the null hypothesis is approximately P[N > 1.645] ≈ .05 if n is large and fixed. Suppose that n is not fixed in advance of sampling, and show by the law of the iterated logarithm that, even if p is, in fact, 1/2, there are with probability 1 infinitely many n for which (35.40) holds.
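The inflation of the significance level under optional stopping is visible even at a tiny horizon. The sketch below (the horizon 12 and the names are illustrative choices, not from the text) computes exactly, under p = 1/2, the probability that the statistic of (35.40) exceeds 1.645 at the fixed time n = 12 versus at some time n ≤ 12.

```python
from itertools import product

N = 12

def z(s, n):
    """The statistic of (35.40): (s - n/2) / (sqrt(n)/2)."""
    return (s - n / 2.0) / (n ** 0.5 / 2.0)

paths = list(product((0, 1), repeat=N))

# Reject at the fixed time N only.
fixed = sum(1 for p in paths if z(sum(p), N) > 1.645) / len(paths)

# Reject if the statistic ever exceeds 1.645 at some n <= N.
ever = 0
for p in paths:
    s = 0
    for n, x in enumerate(p, start=1):
        s += x
        if z(s, n) > 1.645:
            ever += 1
            break
ever /= len(paths)

assert fixed < 0.08          # near the nominal level at the fixed time
assert ever > fixed          # sampling until significance inflates the level
```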
35.19. (a) Suppose that (35.32) and (35.33) hold. Suppose further that, for constants s_n², s_n^{−2} Σ_{k=1}^n σ_k² →_P 1 and s_n^{−2} Σ_{k=1}^n E[Y_k² I_{[|Y_k| ≥ εs_n]}] → 0, and show that s_n^{−1} Σ_{k=1}^n Y_k ⇒ N. Hint: Simplify the proof of Theorem 35.11.
(b) The Lindeberg–Lévy theorem for martingales. Suppose that ..., Y_{−1}, Y_0, Y_1, ... is stationary and ergodic (p. 494) and that E[Y_k || ..., Y_{k−2}, Y_{k−1}] = 0 and E[Y_k²] = σ² < ∞. Prove that Σ_{k=1}^n Y_k / √n is asymptotically normal. Hint: Use Theorem 36.4 and the remark following the statement of Lindeberg's Theorem 27.2.

35.20. 24.4 ↑ Suppose that the σ-field G in Problem 24.4 is trivial. Deduce from Theorem 35.9 that P[A || T^{−n}F] → P[A || G] = P(A) with probability 1, and conclude that T is mixing.
CHAPTER 7

Stochastic Processes

SECTION 36. KOLMOGOROV'S EXISTENCE THEOREM

Stochastic Processes

A stochastic process is a collection [X_t: t ∈ T] of random variables on a probability space (Ω, F, P). The sequence of gambler's fortunes in Section 7, the sequences of independent random variables in Section 22, the martingales in Section 35: all these are stochastic processes for which T = {1, 2, ...}. For the Poisson process [N_t: t ≥ 0] of Section 23, T = [0, ∞). For all these processes the points of T are thought of as representing time. In most cases, T is the set of integers and time is discrete, or else T is an interval of the line and time is continuous. For the general theory of this section, however, T can be quite arbitrary.

Finite-Dimensional Distributions

A process is usually described in terms of the distributions it induces in Euclidean spaces. For each k-tuple (t_1, ..., t_k) of distinct elements of T, the random vector (X_{t_1}, ..., X_{t_k}) has over R^k some distribution μ_{t_1...t_k}:

(36.1)  μ_{t_1...t_k}(H) = P[(X_{t_1}, ..., X_{t_k}) ∈ H],  H ∈ 𝓡^k.

These probability measures μ_{t_1...t_k} are the finite-dimensional distributions of the stochastic process [X_t: t ∈ T]. The system of finite-dimensional distributions does not completely determine the properties of the process. For example, the Poisson process [N_t: t ≥ 0] as defined by (23.5) has sample paths (functions N_t(ω) with ω fixed and t varying) that are step functions. But (23.28) defines a process that has the same finite-dimensional distributions and sample paths that are not step functions. Nevertheless, the first step in a general theory is to construct processes for given systems of finite-dimensional distributions.
Now (36.1) implies two consistency properties of the system μ_{t_1...t_k}. Suppose the H in (36.1) has the form H = H_1 × ··· × H_k (H_i ∈ 𝓡¹), and consider a permutation π of (1, 2, ..., k). Since [(X_{t_1}, ..., X_{t_k}) ∈ (H_1 × ··· × H_k)] and [(X_{t_{π1}}, ..., X_{t_{πk}}) ∈ (H_{π1} × ··· × H_{πk})] are the same event, it follows by (36.1) that

(36.2)  μ_{t_1...t_k}(H_1 × ··· × H_k) = μ_{t_{π1}...t_{πk}}(H_{π1} × ··· × H_{πk}).

For example, if μ_{s,t} = ν × ν′, then necessarily μ_{t,s} = ν′ × ν. The second consistency condition is

(36.3)  μ_{t_1...t_{k−1}}(H_1 × ··· × H_{k−1}) = μ_{t_1...t_k}(H_1 × ··· × H_{k−1} × R¹).

This is clear because (X_{t_1}, ..., X_{t_{k−1}}) lies in H_1 × ··· × H_{k−1} if and only if (X_{t_1}, ..., X_{t_{k−1}}, X_{t_k}) lies in H_1 × ··· × H_{k−1} × R¹.

Measures μ_{t_1...t_k} coming from a process [X_t: t ∈ T] via (36.1) necessarily satisfy (36.2) and (36.3). Kolmogorov's existence theorem says conversely that if a given system of measures satisfies the two consistency conditions, then there exists a stochastic process having these finite-dimensional distributions. The proof is a construction, one which is more easily understood if (36.2) and (36.3) are combined into a single condition.

Define φ_π: R^k → R^k by

φ_π(x_1, ..., x_k) = (x_{π^{−1}1}, ..., x_{π^{−1}k});

φ_π applies the permutation π to the coordinates (for example, if π sends x_3 to first position, then π^{−1}1 = 3). Since φ_π^{−1}(H_1 × ··· × H_k) = H_{π1} × ··· × H_{πk}, it follows from (36.2) that

μ_{t_1...t_k}(H) = μ_{t_{π1}...t_{πk}}(φ_π^{−1} H)

for rectangles H. But then

(36.4)  μ_{t_1...t_k} = μ_{t_{π1}...t_{πk}} φ_π^{−1}.

Similarly, if φ: R^k → R^{k−1} is the projection φ(x_1, ..., x_k) = (x_1, ..., x_{k−1}), then (36.3) is the same thing as

(36.5)  μ_{t_1...t_{k−1}} = μ_{t_1...t_k} φ^{−1}.

The conditions (36.4) and (36.5) have a common extension. Suppose that (u_1, ..., u_m) is an m-tuple of distinct elements of T and that each element of (t_1, ..., t_k) is also an element of (u_1, ..., u_m). Then (t_1, ..., t_k) must be the initial segment of some permutation of (u_1, ..., u_m); that is, k ≤ m and there
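Both consistency conditions are finitary and can be verified mechanically for a small discrete process; in the sketch below (the two-coin process, the dictionary representation, and the helper name are all invented for illustration), mu plays the role of (36.1), and the two assertions are (36.2) and (36.3).

```python
from itertools import product

def mu(indices, joint):
    """Finite-dimensional distribution of (X_t : t in indices), extracted
    from a joint law given as a dict mapping outcome tuples to probabilities."""
    out = {}
    for omega, p in joint.items():
        key = tuple(omega[t] for t in indices)
        out[key] = out.get(key, 0.0) + p
    return out

# Joint law of (X_1, X_2) with X_1 a fair coin and X_2 = X_1 + another coin.
joint = {}
for c1, c2 in product((0, 1), repeat=2):
    key = (c1, c1 + c2)
    joint[key] = joint.get(key, 0.0) + 0.25

mu12, mu21 = mu((0, 1), joint), mu((1, 0), joint)
# (36.2): permuting the time indices permutes the coordinates of the law.
assert all(abs(p - mu21.get((b, a), 0.0)) < 1e-12 for (a, b), p in mu12.items())

# (36.3): integrating out the last coordinate gives the lower-order law.
marg = {}
for (a, b), p in mu12.items():
    marg[(a,)] = marg.get((a,), 0.0) + p
assert all(abs(marg[k] - v) < 1e-12 for k, v in mu((0,), joint).items())
```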
... (x_1, ..., x_k) of real numbers, and so R^T can be identified with k-dimensional Euclidean space R^k. If T = {1, 2, ...}, a real function on T is a sequence (x_1, x_2, ...) of real numbers. If T is an interval, R^T consists of all real functions, however irregular, on the interval. The theory of R^T is an elaboration of the theory of the analogous but simpler space S^∞ of Section 2 (p. 27).

Whatever the set T may be, an element of R^T will be denoted x. The value of x at t will be denoted x(t) or x_t, depending on whether x is viewed as a function of t with domain T or as a vector with components indexed by the elements t of T. Just as R^k can be regarded as the Cartesian product of k copies of the real line, R^T can be regarded as a product space, a product of copies of the real line, one copy for each t in T.

For each t define a mapping Z_t: R^T → R¹ by

(36.8)  Z_t(x) = x(t) = x_t.

The Z_t are called the coordinate functions or projections. When later on a probability measure has been defined on R^T, the Z_t will be random variables, the coordinate variables. Frequently, the value Z_t(x) is instead denoted Z(t, x). If x is fixed, Z(·, x) is a real function on T and is, in fact, nothing other than x(·), that is, x itself. If t is fixed, Z(t, ·) is a real function on R^T and is identical with the function Z_t defined by (36.8).
There is a natural generalization to R^T of the idea of the σ-field of k-dimensional Borel sets. Let 𝓡^T be the σ-field generated by all the coordinate functions Z_t, t ∈ T: 𝓡^T = σ[Z_t: t ∈ T]. It is generated by the sets of the form

[x ∈ R^T: Z_t(x) ∈ H]

for t ∈ T and H ∈ 𝓡¹. If T = {1, 2, ..., k}, then 𝓡^T coincides with 𝓡^k.

Consider the class 𝓡_0^T consisting of the sets of the form

(36.9)  A = [x ∈ R^T: (Z_{t_1}(x), ..., Z_{t_k}(x)) ∈ H] = [x ∈ R^T: (x_{t_1}, ..., x_{t_k}) ∈ H],

where k is an integer, (t_1, ..., t_k) is a k-tuple of distinct points of T, and H ∈ 𝓡^k. Sets of this form, elements of 𝓡_0^T, are called finite-dimensional sets, or cylinders. Of course, 𝓡_0^T generates 𝓡^T. Now 𝓡_0^T is not a σ-field, does not coincide with 𝓡^T (unless T is finite), but the following argument shows that it is a field.
[Figure: If T is an interval, the cylinder [x ∈ R^T: α_1 < x(t_1) < β_1, α_2 < x(t_2) < β_2] consists of the functions that go through the two gates shown; y lies in the cylinder and z does not (they need not be continuous functions, of course).]
The complement of (36.9) is R^T − A = [x ∈ R^T: (x_{t_1}, ..., x_{t_k}) ∈ R^k − H], and so 𝓡_0^T is closed under complementation. Suppose that A is given by (36.9) and B is given by

(36.10)  B = [x ∈ R^T: (x_{s_1}, ..., x_{s_j}) ∈ I],

where I ∈ 𝓡^j. Let (u_1, ..., u_m) be an m-tuple containing all the t_α and all the s_β. Now (t_1, ..., t_k) must be the initial segment of some permutation of (u_1, ..., u_m), and if ψ is as in (36.6) and H′ = ψ^{−1}H, then H′ ∈ 𝓡^m and A is given by

(36.11)  A = [x ∈ R^T: (x_{u_1}, ..., x_{u_m}) ∈ H′]

as well as by (36.9). Similarly, B can be put in the form

(36.12)  B = [x ∈ R^T: (x_{u_1}, ..., x_{u_m}) ∈ I′],

where I′ ∈ 𝓡^m. But then

(36.13)  A ∪ B = [x ∈ R^T: (x_{u_1}, ..., x_{u_m}) ∈ H′ ∪ I′].

Since H′ ∪ I′ ∈ 𝓡^m, A ∪ B is a cylinder. This proves that 𝓡_0^T is a field such that 𝓡^T = σ(𝓡_0^T).

The Z_t are measurable functions on the measurable space (R^T, 𝓡^T). If P is a probability measure on 𝓡^T, then [Z_t: t ∈ T] is a stochastic process on (R^T, 𝓡^T, P), the coordinate-variable process.

Kolmogorov's Existence Theorem
If J.L 1 1 1 k are a system of distributions satisfying the consistency conditions (36.2) ana (36.3), then there is a probability measure P on � T such that the coordinate-variable process [Z 1: t E T] on (R T, � T, P) has the f.L 1 1 1k as its finite-dimensional distributions. Thecrem 36.2. Tf J.L 1 1 • 1 k are a system of distributions satisfying the consis tency conditions (36.2) and (36.3), then there exists on some probability space (f!, !F, P) a stochastic process [X1 : t E T] having the Jl- 1 1 • 1 k as its finite dimensional distributions. Theorem 36.1.
For many purposes the underlying probability space is irrelevant, the joint distributions of the variables in the process being all that matters, so that the two theorems are equally useful. As a matter of fact, they are equivalent. Obviously, the first implies the second. To prove the converse, suppose that the process $[X_t\colon t \in T]$ on $(\Omega, \mathscr{F}, P)$ has finite-dimensional distributions $\mu_{t_1 \cdots t_k}$, and define a map $\zeta\colon \Omega \to R^T$ by the requirement

(36.14) $Z_t(\zeta(\omega)) = X_t(\omega), \qquad t \in T.$

For each $\omega$, $\zeta(\omega)$ is an element of $R^T$, a real function on $T$, and the
requirement is that $X_t(\omega)$ be its value at $t$. Clearly,
(36.15) $\zeta^{-1}[x \in R^T\colon (Z_{t_1}(x), \ldots, Z_{t_k}(x)) \in H] = [\omega \in \Omega\colon (Z_{t_1}(\zeta(\omega)), \ldots, Z_{t_k}(\zeta(\omega))) \in H] = [\omega \in \Omega\colon (X_{t_1}(\omega), \ldots, X_{t_k}(\omega)) \in H];$

since the $X_t$ are random variables, measurable $\mathscr{F}$, this set lies in $\mathscr{F}$ if $H \in \mathscr{R}^k$. Thus $\zeta^{-1}A \in \mathscr{F}$ for $A \in \mathscr{R}_0^T$, and so (Theorem 13.1) $\zeta$ is measurable $\mathscr{F}/\mathscr{R}^T$. By (36.15) and the assumption that $[X_t\colon t \in T]$ has finite-dimensional distributions $\mu_{t_1 \cdots t_k}$, $P\zeta^{-1}$ (see (13.7)) satisfies

(36.16) $P\zeta^{-1}[x \in R^T\colon (Z_{t_1}(x), \ldots, Z_{t_k}(x)) \in H] = P[\omega \in \Omega\colon (X_{t_1}(\omega), \ldots, X_{t_k}(\omega)) \in H] = \mu_{t_1 \cdots t_k}(H).$

Thus the coordinate-variable process $[Z_t\colon t \in T]$ on $(R^T, \mathscr{R}^T, P\zeta^{-1})$ also has finite-dimensional distributions $\mu_{t_1 \cdots t_k}$. Therefore, to prove either of the two versions of Kolmogorov's existence theorem is to prove the other one as well.

Example 36.1. Suppose that $T$ is finite, say $T = \{1, 2, \ldots, k\}$. Then $(R^T, \mathscr{R}^T)$ is $(R^k, \mathscr{R}^k)$, and taking $P = \mu_{1,2,\ldots,k}$ satisfies the requirements of Theorem 36.1.
Example 36.2. Suppose that $T = \{1, 2, \ldots\}$ and

(36.17) $\mu_{1, \ldots, k} = \mu_1 \times \cdots \times \mu_k,$

where $\mu_1, \mu_2, \ldots$ are probability distributions on the line. The consistency conditions are easily checked, and the probability measure $P$ guaranteed by Theorem 36.1 is product measure on the product space $(R^T, \mathscr{R}^T)$. But by Theorem 20.4 there exists on some $(\Omega, \mathscr{F}, P)$ an independent sequence $X_1, X_2, \ldots$ of random variables with respective distributions $\mu_1, \mu_2, \ldots$; then (36.17) is the distribution of $(X_1, \ldots, X_k)$. For the special case (36.17), Theorem 36.2 (and hence Theorem 36.1) was thus proved in Section 20. The existence of independent sequences with prescribed distributions was the measure-theoretic basis of all the probabilistic developments in Chapters 4, 5, and 6: even dependent processes like the Poisson were constructed from independent sequences. The existence of independent sequences can also be made the basis of a proof of Theorems 36.1 and 36.2 in their full generality; see the second proof below.
Example 36.3. The preceding example has an analogue in the space $S^\infty$ of sequences (2.15). Here the finite set $S$ plays the role of $R^1$, the $z_n(\cdot)$ are analogues of the $Z_n(\cdot)$, and the product measure defined by (2.21) is the analogue of the product measure specified by (36.17) with $\mu_i = \mu$. See also Example 24.2. The theory for $S^\infty$ is simple because $S$ is finite: see Theorem 2.3 and the lemma it depends on.

Example 36.4. If $T$ is a subset of the line, it is convenient to use the order structure of the line and take the $\mu_{s_1 \cdots s_k}$ to be specified initially only for $k$-tuples $(s_1, \ldots, s_k)$ that are in increasing order:

(36.18) $s_1 < s_2 < \cdots < s_k.$
It is natural, for example, to specify the finite-dimensional distributions for the Poisson process for increasing sequences of time points alone; see (23.27). Assume that the $\mu_{s_1 \cdots s_k}$ for $k$-tuples satisfying (36.18) have the consistency property (36.19). For given $s_1, \ldots, s_k$ satisfying (36.18), take $(X_{s_1}, \ldots, X_{s_k})$ to have distribution $\mu_{s_1 \cdots s_k}$. If $t_1, \ldots, t_k$ is a permutation of $s_1, \ldots, s_k$, take $\mu_{t_1 \cdots t_k}$ to be the distribution of $(X_{t_1}, \ldots, X_{t_k})$:

(36.20) $\mu_{t_1 \cdots t_k}(H) = P[(X_{t_1}, \ldots, X_{t_k}) \in H].$

This unambiguously defines a collection of finite-dimensional distributions. Are they consistent? If $t_{\pi 1}, \ldots, t_{\pi k}$ is a permutation of $t_1, \ldots, t_k$, then it is also a permutation of $s_1, \ldots, s_k$, and by the definition (36.20), $\mu_{t_{\pi 1} \cdots t_{\pi k}}$ is the distribution of $(X_{t_{\pi 1}}, \ldots, X_{t_{\pi k}})$.

Suppose that $A_1 \supset A_2 \supset \cdots$ and $P(A_n) \ge \epsilon > 0$ for all $n$. The problem is to show that $\bigcap_n A_n$ must be nonempty. Since $A_n \in \mathscr{R}_0^T$, and since the index set involved in the specification of a cylinder can always be permuted and
expanded, there exists a sequence $t_1, t_2, \ldots$ of points in $T$ for which

$A_n = [x \in R^T\colon (x_{t_1}, \ldots, x_{t_n}) \in H_n],$

where† $H_n \in \mathscr{R}^n$. Of course, $P(A_n) = \mu_{t_1 \cdots t_n}(H_n)$. By Theorem 12.3 (regularity), there exists inside $H_n$ a compact set $K_n$ such that $\mu_{t_1 \cdots t_n}(H_n - K_n) < \epsilon/2^{n+1}$. If $B_n = [x \in R^T\colon (x_{t_1}, \ldots, x_{t_n}) \in K_n]$, then $P(A_n - B_n) < \epsilon/2^{n+1}$. Put $C_n = \bigcap_{k=1}^n B_k$. Then $C_n \subset B_n \subset A_n$ and $P(A_n - C_n) < \epsilon/2$, so that $P(C_n) > \epsilon/2 > 0$. Therefore, $C_n \subset C_{n-1}$ and $C_n$ is nonempty.

Choose a point $x^{(n)}$ of $R^T$ in $C_n$. If $n \ge k$, then $x^{(n)} \in C_n \subset C_k \subset B_k$, and hence $(x^{(n)}_{t_1}, \ldots, x^{(n)}_{t_k}) \in K_k$. Since $K_k$ is bounded, the sequence $\{x^{(1)}_{t_k}, x^{(2)}_{t_k}, \ldots\}$ is bounded for each $k$. By the diagonal method [A14] select an increasing sequence $n_1, n_2, \ldots$ of integers such that $\lim_i x^{(n_i)}_{t_k}$ exists for each $k$. There is in $R^T$ some point $x$ whose $t_k$th coordinate is this limit for each $k$. But then, for each $k$, $(x_{t_1}, \ldots, x_{t_k})$ is the limit as $i \to \infty$ of $(x^{(n_i)}_{t_1}, \ldots, x^{(n_i)}_{t_k})$ and hence lies in $K_k$. But that means that $x$ itself lies in $B_k$ and hence in $A_k$. Thus $x \in \bigcap_{k=1}^\infty A_k$, which completes the proof.‡
The second proof of Kolmogorov's theorem goes in two stages, first for countable T, then for general T.*
SECOND PROOF FOR COUNTABLE $T$. The result for countable $T$ will be proved in its second formulation, Theorem 36.2. It is no restriction to enumerate $T$ as $\{t_1, t_2, \ldots\}$ and then to identify $t_n$ with $n$; in other words, it is no restriction to assume that $T = \{1, 2, \ldots\}$. Write $\mu_n$ in place of $\mu_{1,2,\ldots,n}$. By Theorem 20.4 there exists on a probability space $(\Omega, \mathscr{F}, P)$ (which can be taken to be the unit interval) an independent sequence $U_1, U_2, \ldots$ of random variables each uniformly distributed over $(0,1)$. Let $F_1$ be the distribution function corresponding to $\mu_1$. If the "inverse" $g_1$ of $F_1$ is defined over $(0,1)$ by $g_1(s) = \inf[x\colon s \le F_1(x)]$, then $X_1 = g_1(U_1)$ has distribution $\mu_1$ by the usual argument: $P[g_1(U_1) \le x] = P[U_1 \le F_1(x)] = F_1(x)$.
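The quantile argument is easy to check numerically. In the sketch below (the helper name `quantile` is ours, not the text's), $g(u) = \inf[x\colon u \le F(x)]$ is computed by bisection for the exponential distribution function, and the empirical distribution of $g(U)$ is compared with $F$:

```python
import random
import math

def quantile(F, u, lo=0.0, hi=1e6, tol=1e-9):
    """Numerical inverse g(u) = inf[x : u <= F(x)] for a continuous,
    increasing distribution function F, located by bisection."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if F(mid) >= u:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2.0

# Exponential distribution function F(x) = 1 - e^{-x}, x >= 0.
F = lambda x: 1.0 - math.exp(-x)

random.seed(0)
n = 20000
sample = [quantile(F, random.random()) for _ in range(n)]

# g(U) should have distribution function F: compare at x = 1.
empirical = sum(1 for x in sample if x <= 1.0) / n
print(empirical, F(1.0))
```

The bisection stands in for the infimum in the definition of $g_1$; for distribution functions with a closed-form inverse one would of course invert directly.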
The problem is to construct $X_2, X_3, \ldots$ inductively in such a way that

(36.23) $X_k = h_k(U_1, \ldots, U_k)$

for a Borel function $h_k$, and $(X_1, \ldots, X_n)$ has the distribution $\mu_n$. Assume that $X_1, \ldots, X_{n-1}$ have been defined ($n \ge 2$): they have joint distribution $\mu_{n-1}$ and (36.23) holds for $k \le n-1$. The idea now is to construct an appropriate conditional distribution function $F_n(x \mid x_1, \ldots, x_{n-1})$; here $F_n(x \mid X_1(\omega), \ldots, X_{n-1}(\omega))$ will have the value $P[X_n \le x \,\|\, X_1, \ldots, X_{n-1}]_\omega$ would have if $X_n$ were already defined. If $g_n(\cdot \mid x_1, \ldots, x_{n-1})$

†In general, $A_n$ will involve indices $t_1, \ldots, t_{a_n}$, where $a_1 \le a_2 \le \cdots$. For notational simplicity $a_n$ is taken as $n$. As a matter of fact, this can be arranged anyway: take $A'_{a_n} = A_n$, $A'_k = [x\colon (x_{t_1}, \ldots, x_{t_k}) \in R^k] = R^T$ for $k < a_1$, and $A'_k = [x\colon (x_{t_1}, \ldots, x_{t_k}) \in H_n \times R^{k-a_n}] = A_n$ for $a_n \le k < a_{n+1}$. Now relabel $A'_n$ as $A_n$.

‡The last part of the argument is, in effect, the proof that a countable product of compact sets is compact.

*This second proof, which may be omitted, uses the conditional-probability theory of Section 33.
is the "inverse" function, then $X_n(\omega) = g_n(U_n(\omega) \mid X_1(\omega), \ldots, X_{n-1}(\omega))$ will by the usual argument have the right conditional distribution given $X_1, \ldots, X_{n-1}$, so that $(X_1, \ldots, X_{n-1}, X_n)$ will have the right distribution over $R^n$.

To construct the conditional distribution function, apply Theorem 33.3 in $(R^n, \mathscr{R}^n, \mu_n)$ to get a conditional distribution of the last coordinate of $(x_1, \ldots, x_n)$ given the first $n-1$ of them. This will have (Theorem 20.1) the form $\nu(H; x_1, \ldots, x_{n-1})$; it is a probability measure as $H$ varies over $\mathscr{R}^1$, and the defining property is that integrating it over $M$ with respect to $\mu_n$ recovers $\mu_n[x \in R^n\colon (x_1, \ldots, x_{n-1}) \in M,\ x_n \in H]$. Since the integrand involves only $x_1, \ldots, x_{n-1}$, and since $\mu_n$ by consistency projects to $\mu_{n-1}$ under the map $(x_1, \ldots, x_n) \to (x_1, \ldots, x_{n-1})$, a change of variable gives

$\int_M \nu(H; x_1, \ldots, x_{n-1})\,d\mu_{n-1}(x_1, \ldots, x_{n-1}) = \mu_n[x \in R^n\colon (x_1, \ldots, x_{n-1}) \in M,\ x_n \in H].$

Define $F_n(x \mid x_1, \ldots, x_{n-1}) = \nu((-\infty, x]; x_1, \ldots, x_{n-1})$. Then $F_n(\cdot \mid x_1, \ldots, x_{n-1})$ is a probability distribution function over the line, $F_n(x \mid \cdot)$ is a Borel function over $R^{n-1}$, and

$\int_M F_n(x \mid x_1, \ldots, x_{n-1})\,d\mu_{n-1}(x_1, \ldots, x_{n-1}) = \mu_n[x \in R^n\colon (x_1, \ldots, x_{n-1}) \in M,\ x_n \le x].$

Put $g_n(u \mid x_1, \ldots, x_{n-1}) = \inf[x\colon u \le F_n(x \mid x_1, \ldots, x_{n-1})]$ for $0 < u < 1$. Since $F_n(x \mid x_1, \ldots, x_{n-1})$ is nondecreasing and right-continuous in $x$, $g_n(u \mid x_1, \ldots, x_{n-1}) \le x$ if and only if $u \le F_n(x \mid x_1, \ldots, x_{n-1})$. Set $X_n = g_n(U_n \mid X_1, \ldots, X_{n-1})$. Since $(X_1, \ldots, X_{n-1})$ has distribution $\mu_{n-1}$ and by (36.23) is independent of $U_n$, an application of (20.30) gives

$P[(X_1, \ldots, X_{n-1}) \in M,\ X_n \le x] = P[(X_1, \ldots, X_{n-1}) \in M,\ U_n \le F_n(x \mid X_1, \ldots, X_{n-1})] = \int_M F_n(x \mid x_1, \ldots, x_{n-1})\,d\mu_{n-1}(x_1, \ldots, x_{n-1}) = \mu_n[x \in R^n\colon (x_1, \ldots, x_{n-1}) \in M,\ x_n \le x].$

Thus $(X_1, \ldots, X_n)$ has distribution $\mu_n$. Note that $X_n$, as a function of $X_1, \ldots, X_{n-1}$ and $U_n$, is a function of $U_1, \ldots, U_n$ because (36.23) was assumed to hold for $k \le n-1$. Hence (36.23) holds for $k = n$ as well.
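The inductive step $X_n = g_n(U_n \mid X_1, \ldots, X_{n-1})$ can be illustrated with a toy two-step case in which the conditional quantile function is explicit. The distributions below are our own choice, not ones from the text: $X_1$ is uniform on $(0,1)$, and $X_2$ is conditionally exponential with rate $1 + X_1$, so $g_2(u \mid x_1) = -\log(1-u)/(1+x_1)$:

```python
import random
import math

def g1(u):
    # Quantile of mu_1 = Uniform(0,1): F1(x) = x, so g1(u) = u.
    return u

def g2(u, x1):
    # Conditional quantile of X2 given X1 = x1, chosen here to be
    # exponential with rate (1 + x1): F2(t | x1) = 1 - exp(-(1 + x1) t),
    # so g2(u | x1) = -log(1 - u) / (1 + x1).
    return -math.log(1.0 - u) / (1.0 + x1)

random.seed(1)
n = 50000
pairs = []
for _ in range(n):
    u1, u2 = random.random(), random.random()
    x1 = g1(u1)         # X1 = g1(U1)
    x2 = g2(u2, x1)     # X2 = g2(U2 | X1), hence a function of (U1, U2)
    pairs.append((x1, x2))

mean_x2 = sum(x2 for _, x2 in pairs) / n
# E[X2] = E[E[X2 | X1]] = E[1/(1 + X1)] = log 2
print(mean_x2, math.log(2.0))
```

Each $X_2$ is a Borel function of $(U_1, U_2)$, exactly the form (36.23) requires, and the sample mean agrees with $E[1/(1+X_1)] = \log 2$.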
SECOND PROOF FOR GENERAL $T$. Consider $(R^T, \mathscr{R}^T)$ once again. If $S \subset T$, let $\mathscr{R}_S^T = \sigma[Z_t\colon t \in S]$. Then $\mathscr{R}_S^T \subset \mathscr{R}^T$. Suppose that $S$ is countable. By the case just treated, there exists a process $[X_t\colon t \in S]$ on some $(\Omega, \mathscr{F}, P)$ (the space and the process depend on $S$) such that $(X_{t_1}, \ldots, X_{t_k})$ has distribution $\mu_{t_1 \cdots t_k}$ for every $k$-tuple $(t_1, \ldots, t_k)$ from $S$. Define a map $\zeta\colon \Omega \to R^T$ by requiring that

$Z_t(\zeta(\omega)) = \begin{cases} X_t(\omega) & \text{if } t \in S, \\ 0 & \text{if } t \notin S. \end{cases}$

Now (36.15) holds as before if $t_1, \ldots, t_k$ all lie in $S$, and so $\zeta$ is measurable $\mathscr{F}/\mathscr{R}_S^T$. Further, (36.16) holds for $t_1, \ldots, t_k$ in $S$. Put $P_S = P\zeta^{-1}$ on $\mathscr{R}_S^T$. Then $P_S$ is a probability measure on $(R^T, \mathscr{R}_S^T)$, and

(36.24) $P_S[x \in R^T\colon (Z_{t_1}(x), \ldots, Z_{t_k}(x)) \in H] = \mu_{t_1 \cdots t_k}(H)$

if $H \in \mathscr{R}^k$ and $t_1, \ldots, t_k$ all lie in $S$. (The various spaces $(\Omega, \mathscr{F}, P)$ and processes $[X_t\colon t \in S]$ now become irrelevant.) If $S_0 \subset S_1 \cap S_2$, and if $A$ is a cylinder (36.9) for which the $t_1, \ldots, t_k$ lie in $S_0$, then $P_{S_1}(A)$ and $P_{S_2}(A)$ coincide, their common value being $\mu_{t_1 \cdots t_k}(H)$. Since these cylinders generate $\mathscr{R}_{S_0}^T$, $P_{S_1}(A) = P_{S_2}(A)$ for all $A$ in $\mathscr{R}_{S_0}^T$. If $A$ lies both in $\mathscr{R}_{S_1}^T$ and $\mathscr{R}_{S_2}^T$, then $P_{S_1}(A) = P_{S_1 \cup S_2}(A) = P_{S_2}(A)$. Thus $P(A) = P_S(A)$ consistently defines a set function on the class $\bigcup_S \mathscr{R}_S^T$, the union extending over the countable subsets $S$ of $T$. If $A_n$ lies in this union and $A_n \in \mathscr{R}_{S_n}^T$ ($S_n$ countable), then $S = \bigcup_n S_n$ is countable and $\bigcup_n A_n$ lies in $\mathscr{R}_S^T$. Thus $\bigcup_S \mathscr{R}_S^T$ is a $\sigma$-field and so must coincide with $\mathscr{R}^T$. Therefore, $P$ is a probability measure on $\mathscr{R}^T$, and by (36.24) the coordinate process has under $P$ the required finite-dimensional distributions.
The Inadequacy of $\mathscr{R}^T$

Theorem 36.3. Let $[X_t\colon t \in T]$ be a family of real functions on $\Omega$.

(i) If $A \in \sigma[X_t\colon t \in T]$ and $\omega \in A$, and if $X_t(\omega) = X_t(\omega')$ for all $t \in T$, then $\omega' \in A$.

(ii) If $A \in \sigma[X_t\colon t \in T]$, then $A \in \sigma[X_t\colon t \in S]$ for some countable subset $S$ of $T$.
PROOF. Define $\zeta\colon \Omega \to R^T$ by $Z_t(\zeta(\omega)) = X_t(\omega)$. Let $\mathscr{F} = \sigma[X_t\colon t \in T]$. By (36.15), $\zeta$ is measurable $\mathscr{F}/\mathscr{R}^T$, and hence $\mathscr{F}$ contains the class $[\zeta^{-1}M\colon M \in \mathscr{R}^T]$. The latter class is a $\sigma$-field, however, and by (36.15) it contains the sets $[\omega \in \Omega\colon (X_{t_1}(\omega), \ldots, X_{t_k}(\omega)) \in H]$, $H \in \mathscr{R}^k$, and hence contains the $\sigma$-field $\mathscr{F}$ they generate. Therefore,

(36.25) $\sigma[X_t\colon t \in T] = [\zeta^{-1}M\colon M \in \mathscr{R}^T].$

This is an infinite-dimensional analogue of Theorem 20.1(i).
As for (i), the hypotheses imply that $\omega \in A = \zeta^{-1}M$ and $\zeta(\omega) = \zeta(\omega')$, so that $\omega' \in A$ certainly follows.

For $S \subset T$, let $\mathscr{F}_S = \sigma[X_t\colon t \in S]$; (ii) says that $\mathscr{F} = \mathscr{F}_T$ coincides with $\mathscr{G} = \bigcup_S \mathscr{F}_S$, the union extending over the countable subsets $S$ of $T$. If $A_1, A_2, \ldots$ lie in $\mathscr{G}$, then $A_n$ lies in $\mathscr{F}_{S_n}$ for some countable $S_n$, and so $\bigcup_n A_n$ lies in $\mathscr{G}$ because it lies in $\mathscr{F}_S$ for $S = \bigcup_n S_n$. Thus $\mathscr{G}$ is a $\sigma$-field, and since it contains the sets $[X_t \in H]$, it contains the $\sigma$-field $\mathscr{F}$ they generate. (This part of the argument was used in the second proof of the existence theorem.)
From this theorem it follows that various important sets lie outside the class $\mathscr{R}^T$. Suppose that $T = [0, \infty)$. Of obvious interest is the subset $C$ of $R^T$ consisting of the functions continuous over $[0, \infty)$. But $C$ is not in $\mathscr{R}^T$. For suppose it were. By part (ii) of the theorem (let $\Omega = R^T$ and put $[Z_t\colon t \in T]$ in the role of $[X_t\colon t \in T]$), $C$ would lie in $\sigma[Z_t\colon t \in S]$ for some countable $S \subset [0, \infty)$. But then by part (i) of the theorem (let $\Omega = R^T$ and put $[Z_t\colon t \in S]$ in the role of $[X_t\colon t \in T]$), if $x \in C$ and $Z_t(x) = Z_t(y)$ for all $t \in S$, then $y \in C$. From the assumption that $C$ lies in $\mathscr{R}^T$ thus follows the existence of a countable set $S$ such that, if $x \in C$ and $x(t) = y(t)$ for all $t$ in $S$, then $y \in C$. But whatever countable set $S$ may be, for every continuous $x$ there obviously exist functions $y$ that have discontinuities but agree with $x$ on $S$. Therefore, $C$ cannot lie in $\mathscr{R}^T$.

What the argument shows is this: A set $A$ in $R^T$ cannot lie in $\mathscr{R}^T$ unless there exists a countable subset $S$ of $T$ with the property that, if $x \in A$ and $x(t) = y(t)$ for all $t$ in $S$, then $y \in A$. Thus $A$ cannot lie in $\mathscr{R}^T$ if it effectively involves all the points $t$ in the sense that, for each $x$ in $A$ and each $t$ in $T$, it is possible to move $x$ out of $A$ by changing its value at $t$ alone. And $C$ is such a set.

For another example, consider the set of functions $x$ over $T = [0, \infty)$ that are nondecreasing and assume as values $x(t)$ only nonnegative integers:

(36.26) $[x \in R^{[0,\infty)}\colon x(s) \le x(t),\ s \le t;\ x(t) \in \{0, 1, \ldots\},\ t \ge 0].$
This, too, lies outside $\mathscr{R}^T$.

In Section 23 the Poisson process was defined as follows: Let $X_1, X_2, \ldots$ be independent and identically distributed with the exponential distribution (the probability space $\Omega$ on which they are defined may by Theorem 20.4 be taken to be the unit interval with Lebesgue measure). Put $S_0 = 0$ and $S_n = X_1 + \cdots + X_n$. If $S_n(\omega) < S_{n+1}(\omega)$ for $n \ge 0$ and $S_n(\omega) \to \infty$, put $N(t, \omega) = N_t(\omega) = \max[n\colon S_n(\omega) \le t]$ for $t \ge 0$; otherwise, put $N(t, \omega) = N_t(\omega) = 0$ for $t \ge 0$. Then the stochastic process $[N_t\colon t \ge 0]$ has the finite-dimensional distributions described by the equations (23.27). The function $N(\cdot, \omega)$ is the path function or sample function† corresponding to $\omega$, and by the construction every path function lies in the set (36.26). This is a good thing if the

†Other terms are realization of the process and trajectory.
process is to be a model for, say, calls arriving at a telephone exchange: The sample path represents the history of the calls, its value at $t$ being the number of arrivals up to time $t$, and so it ought to be nondecreasing and integer-valued.

According to Theorem 36.1, there exists for $T = [0, \infty)$ a measure $P$ on $\mathscr{R}^T$ such that the coordinate process $[Z_t\colon t \ge 0]$ on $(R^T, \mathscr{R}^T, P)$ has the finite-dimensional distributions of the Poisson process. This time, does the path function $Z(\cdot, x)$ lie in the set (36.26) with probability 1? Since $Z(\cdot, x)$ is just $x$ itself, the question is whether the set (36.26) has $P$-measure 1. But this set does not lie in $\mathscr{R}^T$, and so it has no measure at all. An application of Kolmogorov's existence theorem will always yield a stochastic process with prescribed finite-dimensional distributions, but the process may lack certain path-function properties that it is reasonable to require of it as a model for some natural phenomenon. The special construction of Section 23 gets around the difficulty in the case of the Poisson process.
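The Section 23 construction is easy to carry out numerically. In the sketch below (function names are ours), the path $N(\cdot, \omega)$ built from exponential interarrival times is, by construction, a member of the set (36.26): nondecreasing and nonnegative-integer-valued.

```python
import random

random.seed(2)

def poisson_path(rate, horizon, rng):
    """Arrival times S_n = X_1 + ... + X_n from i.i.d. exponential
    interarrival times; N(t) = max[n : S_n <= t]."""
    arrivals = []
    s = 0.0
    while True:
        s += rng.expovariate(rate)
        if s > horizon:
            break
        arrivals.append(s)

    def N(t):
        return sum(1 for a in arrivals if a <= t)

    return N, arrivals

N, arrivals = poisson_path(rate=1.0, horizon=10.0, rng=random)

# Sample the path on a grid: it is nondecreasing and integer-valued.
values = [N(t / 10.0) for t in range(0, 101)]
print(values[:10])
```

This is exactly the path-function property that the coordinate process obtained from Theorem 36.1 alone cannot be said to have.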
Stationarity and Ergodicity*

Consider the space $(R^\infty, \mathscr{R}^\infty)$ of two-sided sequences $x = (\ldots, Z_{-1}(x), Z_0(x), Z_1(x), \ldots)$, with the shift $T$ defined by $Z_k(Tx) = Z_{k+1}(x)$. Let $X = (\ldots, X_{-1}, X_0, X_1, \ldots)$ be a two-sided sequence of random variables on $(\Omega, \mathscr{F}, P)$, and define $\zeta\colon \Omega \to R^\infty$ by (36.14). The measure $P\zeta^{-1} = PX^{-1}$ can be viewed as the distribution of $X$. Suppose that $X$ is stationary in the sense that, for each $k \ge 1$ and $H \in \mathscr{R}^k$, $P[(X_{n+1}, \ldots, X_{n+k}) \in H]$ is the same for all $n = 0, \pm 1, \pm 2, \ldots$. Then (use (36.16) and the lemma on p. 311) the shift preserves $P\zeta^{-1}$. The process $X$ is defined to be ergodic if under $P\zeta^{-1}$ the shift is ergodic in the sense of Section 24. In the ergodic case, it follows by the ergodic theorem that

(36.27) $\lim_n \frac{1}{n} \sum_{k=0}^{n-1} f(T^k x) = \int_{R^\infty} f\,d(P\zeta^{-1})$

*This topic, which requires Section 24, may be omitted.
on a set of $P\zeta^{-1}$-measure 1, provided $f$ is measurable $\mathscr{R}^\infty$ and integrable. Carry (36.27) back to $(\Omega, \mathscr{F}, P)$ by the inverse set mapping $\zeta^{-1}$. Then

(36.28) $\lim_n \frac{1}{n} \sum_{k=0}^{n-1} f(\ldots, X_{k-1}, X_k, X_{k+1}, \ldots) = E[f(\ldots, X_{-1}, X_0, X_1, \ldots)]$

with probability 1: (36.28) holds at $\omega$ if and only if (36.27) holds at $x = \zeta(\omega) = X(\omega)$. It is understood that on the left in (36.28), $X_k$ is the center coordinate (the 0th coordinate) of the argument of $f$, and on the right, $X_0$ is the center coordinate: For stationary, ergodic $X$ and integrable $f$, (36.28) holds with probability 1.

If the $X_k$ are independent, then the $Z_k$ are independent under $P\zeta^{-1}$. In this case, $\lim_n P\zeta^{-1}(A \cap T^{-n}B) = P\zeta^{-1}(A)P\zeta^{-1}(B)$ for $A$ and $B$ in $\mathscr{R}_0^\infty$, because for large enough $n$ the cylinders $A$ and $T^{-n}B$ depend on disjoint sets of time indices and hence are independent. But then it follows by approximation (Corollary 1 to Theorem 11.4) that the same limit holds for all $A$ and $B$ in $\mathscr{R}^\infty$. For invariant $B$, taking $A = B$ gives $P\zeta^{-1}(B) = (P\zeta^{-1}(B))^2$, so that $P\zeta^{-1}(B)$ is 0 or 1, and the shift is ergodic under $P\zeta^{-1}$: If $X$ is stationary and independent, then it is ergodic.

If $f$ depends on just one coordinate of $x$, then (36.28) is in the independent case a consequence of the strong law of large numbers, Theorem 22.1. But (36.28) follows by the ergodic theorem even if $f$ involves all the coordinates in some complicated way.

Consider now a measurable real function $\varphi$ on $R^\infty$. Define $\psi\colon R^\infty \to R^\infty$ by
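The force of (36.28) is that $f$ may involve several coordinates at once. A minimal numerical illustration (the choice of $f$ is ours): for an i.i.d. uniform sequence, which is stationary and ergodic, and $f(x) = Z_0(x)Z_1(x)$, the time average of $X_k X_{k+1}$ should converge to $E[X_0 X_1] = E[X_0]E[X_1] = \tfrac14$:

```python
import random

random.seed(3)
n = 200000
# An i.i.d. sequence is stationary and ergodic.
xs = [random.random() for _ in range(n + 1)]

# f depends on two coordinates of the shifted sequence: f(x) = Z_0(x) Z_1(x),
# so the k-th summand in (36.28) is X_k * X_{k+1}.
avg = sum(xs[k] * xs[k + 1] for k in range(n)) / n

# The limit is E[X_0 X_1] = (1/2)(1/2) = 1/4.
print(avg)
```

With $f$ a function of one coordinate this would just be the strong law; with this $f$ it is an instance of the ergodic theorem.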
$\psi(x) = (\ldots, \varphi(T^{-1}x), \varphi(x), \varphi(Tx), \ldots);$

here $\varphi(x)$ is the center coordinate: $Z_k(\psi(x)) = \varphi(T^k x)$. It is easy to show that $\psi$ is measurable $\mathscr{R}^\infty/\mathscr{R}^\infty$ and commutes with the shift in the sense of Example 24.6. Therefore, $T$ preserves $P\zeta^{-1}\psi^{-1}$ if it preserves $P\zeta^{-1}$, and it is ergodic under $P\zeta^{-1}\psi^{-1}$ if it is ergodic under $P\zeta^{-1}$. This translates immediately into a result on stochastic processes. Define $Y = (\ldots, Y_{-1}, Y_0, Y_1, \ldots)$ in terms of $X$ by

(36.29) $Y_n = \varphi(\ldots, X_{n-1}, X_n, X_{n+1}, \ldots),$

that is to say, $Y(\omega) = \psi(X(\omega)) = \psi\zeta(\omega)$. Since $P\zeta^{-1}$ is the distribution of $X$, $P\zeta^{-1}\psi^{-1} = P(\psi\zeta)^{-1} = PY^{-1}$ is the distribution of $Y$:
Theorem 36.4. If $X$ is stationary and ergodic, in particular if the $X_n$ are independent and identically distributed, then $Y$ as defined by (36.29) is stationary and ergodic.

This theorem fails if $Y$ is not defined in terms of $X$ in a time-invariant way, that is, if the $\varphi$ in (36.29) is not the same for all $n$: If $\varphi_n(x) = Z_{-n}(x)$ and $\varphi$ is replaced by $\varphi_n$ in (36.29), then $Y_n = X_0$; in this case $Y$ happens to be stationary, but it is not ergodic if the distribution of $X_0$ does not concentrate at a single point.
Example 36.5. The autoregressive model. Let $\varphi(x) = \sum_{k=0}^\infty \beta^k Z_{-k}(x)$ on the set where the series converges, and take $\varphi(x) = 0$ elsewhere. Suppose that $|\beta| < 1$ and that the $X_n$ are independent and identically distributed with finite second moments. Then by Theorem 22.6, $Y_n = \sum_{k=0}^\infty \beta^k X_{n-k}$ converges with probability 1, and by Theorem 36.4, the process $Y$ is ergodic. Note that $Y_{n+1} = \beta Y_n + X_{n+1}$ and that $X_{n+1}$ is independent of $Y_n$. This is the linear autoregressive model of order 1.
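A quick simulation of Example 36.5 (the parameter values and noise distribution are our own choices): iterating $Y_{n+1} = \beta Y_n + X_{n+1}$ with standard normal $X_n$, the ergodicity of $Y$ shows up as convergence of time averages to the stationary moments, here mean 0 and variance $\sum_k \beta^{2k} = 1/(1-\beta^2)$:

```python
import random

random.seed(4)
beta = 0.5
n = 200000

y = 0.0
ys = []
for _ in range(n):
    x = random.gauss(0.0, 1.0)   # X_{n+1}, independent of Y_n
    y = beta * y + x             # Y_{n+1} = beta * Y_n + X_{n+1}
    ys.append(y)

# Time averages along a single path approximate the stationary moments.
mean = sum(ys) / n
var = sum(v * v for v in ys) / n
print(mean, var)  # compare with 0 and 1/(1 - beta^2) = 4/3
```

A single realization suffices precisely because the process is ergodic; averaging over independent replications is not needed.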
The Hewitt-Savage Theorem*

Change notation: Let $(R^\infty, \mathscr{R}^\infty)$ be the product space with $\{1, 2, \ldots\}$ as the index set, the space of one-sided sequences. Let $P$ be a probability measure on $\mathscr{R}^\infty$. If the coordinate variables $Z_n$ are independent under $P$, then by Theorem 22.3, $P(A)$ is 0 or 1 for each $A$ in the tail $\sigma$-field $\mathscr{T}$. If the $Z_n$ are also identically distributed under $P$, a stronger result holds.

Let $\mathscr{I}_n$ be the class of $\mathscr{R}^\infty$-sets $A$ that are invariant under permutations of the first $n$ coordinates: if $\pi$ is a permutation of $\{1, \ldots, n\}$, then $x$ lies in $A$ if and only if $(Z_{\pi 1}(x), \ldots, Z_{\pi n}(x), Z_{n+1}(x), \ldots)$ does. Then $\mathscr{I}_n$ is a $\sigma$-field. Let $\mathscr{I} = \bigcap_{n=1}^\infty \mathscr{I}_n$ be the $\sigma$-field of $\mathscr{R}^\infty$-sets invariant under all finite permutations of coordinates. Then $\mathscr{I}$ is larger than $\mathscr{T}$, since, for example, the $x$-set where $\sum_{k=1}^n Z_k(x) \ge c_n$ infinitely often lies in $\mathscr{I}$ but not in $\mathscr{T}$. The Hewitt-Savage theorem is a zero-one law for $\mathscr{I}$ in the independent, identically distributed case.

Theorem 36.5. If the $Z_n$ are independent and identically distributed under $P$, then $P(A)$ is 0 or 1 for each $A$ in $\mathscr{I}$.
PROOF. By Corollary 1 to Theorem 11.4, there are for given $A$ and $\epsilon$ an $n$ and a set $U = [(Z_1, \ldots, Z_n) \in H]$ ($H \in \mathscr{R}^n$) such that $P(A \triangle U) < \epsilon$. Let $V = [(Z_{n+1}, \ldots, Z_{2n}) \in H]$. If the $Z_k$ are independent and identically distributed, then $P(A \triangle U)$ is the same as

$P([(Z_{n+1}, \ldots, Z_{2n}, Z_1, \ldots, Z_n, Z_{2n+1}, Z_{2n+2}, \ldots) \in A] \triangle V).$

But if $A \in \mathscr{I}_{2n}$, this is in turn the same as $P(A \triangle V)$. Therefore, $P(A \triangle U) = P(A \triangle V) < \epsilon$, and $P(A \triangle (U \cap V)) \le P(A \triangle U) + P(A \triangle V) < 2\epsilon$. Since $U$ and $V$ are independent and have the same probability, $P(U \cap V) = P(U)P(V) = P^2(U)$. Now $P(A)$ is within $\epsilon$ of $P(U)$, so $P^2(A)$ is within $2\epsilon$ of $P^2(U) = P(U \cap V)$, which is in turn within $2\epsilon$ of $P(A)$. Therefore, $|P^2(A) - P(A)| < 4\epsilon$ for all $\epsilon$, and so $P(A)$ must be 0 or 1.
PROBLEMS

36.1. Suppose that $[X_t\colon t \in T]$ is a stochastic process on $(\Omega, \mathscr{F}, P)$ and $A \in \mathscr{F}$. Show that there is a countable subset $S$ of $T$ for which $P[A \,\|\, X_t,\ t \in T] = P[A \,\|\, X_t,\ t \in S]$ with probability 1. Replace $A$ by a random variable and prove a similar result.
36.2. Let $T$ be arbitrary and let $K(s, t)$ be a real function over $T \times T$. Suppose that $K$ is symmetric in the sense that $K(s, t) = K(t, s)$ and nonnegative-definite in the sense that $\sum_{i,j=1}^k K(t_i, t_j) x_i x_j \ge 0$ for $k \ge 1$, $t_1, \ldots, t_k$ in $T$, and $x_1, \ldots, x_k$ real. Show that there exists a process $[X_t\colon t \in T]$ for which $(X_{t_1}, \ldots, X_{t_k})$ has the centered normal distribution with covariances $K(t_i, t_j)$, $i, j = 1, \ldots, k$.

*This topic may be omitted.
36.3. Let $L$ be a Borel set on the line, let $\mathscr{L}$ consist of the Borel subsets of $L$, and let $L^T$ consist of all maps from $T$ into $L$. Define the appropriate notion of cylinder, and let $\mathscr{L}^T$ be the $\sigma$-field generated by the cylinders. State a version of Theorem 36.1 for $(L^T, \mathscr{L}^T)$. Assume $T$ countable, and prove this theorem not by imitating the previous proof but by observing that $L^T$ is a subset of $R^T$ and lies in $\mathscr{R}^T$.

36.4. Suppose that the random variables $X_1, X_2, \ldots$ assume the values 0 and 1 and $P[X_n = 1 \text{ i.o.}] = 1$. Let $\mu$ be the distribution over $(0, 1]$ of $\sum_{n=1}^\infty X_n/2^n$. Show that on the unit interval with the measure $\mu$, the digits of the nonterminating dyadic expansion form a stochastic process with the same finite-dimensional distributions as $X_1, X_2, \ldots$.
36.5. 36.3↑ There is an infinite-dimensional version of Fubini's theorem. In the construction in Problem 36.3, let $L = I = (0, 1)$, $T = \{1, 2, \ldots\}$, let $\mathscr{I}$ consist of the Borel subsets of $I$, and suppose that each $k$-dimensional distribution is the $k$-fold product of Lebesgue measure over the unit interval. Then $I^T$ is a countable product of copies of $(0, 1)$, its elements are sequences $x = (x_1, x_2, \ldots)$ of points of $(0, 1)$, and Kolmogorov's theorem ensures the existence on $(I^T, \mathscr{I}^T)$ of a product probability measure $\pi$: $\pi[x\colon x_i \le a_i,\ i \le n] = a_1 \cdots a_n$ for $0 \le a_i \le 1$. Let $I^n$ denote the $n$-dimensional unit cube.

(a) Define $\psi\colon I^n \times I^T \to I^T$ by

$\psi((x_1, \ldots, x_n), (y_1, y_2, \ldots)) = (x_1, \ldots, x_n, y_1, y_2, \ldots).$

Show that $\psi$ is measurable $\mathscr{I}^n \times \mathscr{I}^T/\mathscr{I}^T$ and $\psi^{-1}$ is measurable $\mathscr{I}^T/\mathscr{I}^n \times \mathscr{I}^T$. Show that $(\lambda_n \times \pi)\psi^{-1} = \pi$, where $\lambda_n$ is $n$-dimensional Lebesgue measure restricted to $I^n$.

(b) Let $f$ be a function measurable $\mathscr{I}^T$ and, for simplicity, bounded. Define $f_n$ by integrating the first $n$ coordinates out of $f$; in other words, integrate out the coordinates one by one. Show by Problem 34.18, martingale theory, and the zero-one law that

(36.30) $\lim_n f_n(x) = \int_{I^T} f\,d\pi$

except for $x$ in a set of $\pi$-measure 0.

(c) Adopting the point of view of part (a), let $g_n(x_1, \ldots, x_n)$ be the result of integrating the variable $(y_{n+1}, y_{n+2}, \ldots)$ out (with respect to $\pi$) from $f(x_1, \ldots, x_n, y_{n+1}, \ldots)$. This may suggestively be written as

$g_n(x_1, \ldots, x_n) = \int_{I^T} f(x_1, \ldots, x_n, y_{n+1}, y_{n+2}, \ldots)\,d\pi(y).$

Show that $g_n(x_1, \ldots, x_n) \to f(x_1, x_2, \ldots)$ except for $x$ in a set of $\pi$-measure 0.
36.6. (a) Let $T$ be an interval of the line. Show that $\mathscr{R}^T$ fails to contain the sets of: linear functions, polynomials, constants, nondecreasing functions, functions of bounded variation, differentiable functions, analytic functions, functions continuous at a fixed $t_0$, Borel measurable functions. Show that it fails to contain the set of functions that: vanish somewhere in $T$, satisfy $x(s) \le x(t)$ for some pair with $s < t$, have a local maximum anywhere, fail to have a local maximum.

(b) Let $C$ be the set of continuous functions on $T = [0, \infty)$. Show that $A \in \mathscr{R}^T$ and $A \subset C$ imply that $A = \varnothing$. Show, on the other hand, that $A \in \mathscr{R}^T$ and $C \subset A$ do not imply that $A = R^T$.
36.7. Not all systems of finite-dimensional distributions can be realized by stochastic processes for which $\Omega$ is the unit interval. Show that there is on the unit interval with Lebesgue measure no process $[X_t\colon t \ge 0]$ for which the $X_t$ are independent and assume the values 0 and 1 with probability $\tfrac12$ each. Compare Problem 1.1.
36.8. Here is an application of the existence theorem in which $T$ is not a subset of the line. Let $(N, \mathscr{N}, \nu)$ be a measure space, and take $T$ to consist of the $\mathscr{N}$-sets of finite $\nu$-measure. The problem is to construct a generalized Poisson process, a stochastic process $[X_A\colon A \in T]$ such that (i) $X_A$ has the Poisson distribution with mean $\nu(A)$ and (ii) $X_{A_1}, \ldots, X_{A_n}$ are independent if $A_1, \ldots, A_n$ are disjoint. Hint: To define the finite-dimensional distributions, generalize this construction: For $A, B$ in $T$, consider independent random variables $Y_1, Y_2, Y_3$ having Poisson distributions with means $\nu(A \cap B^c)$, $\nu(A \cap B)$, $\nu(A^c \cap B)$, and take $\mu_{A,B}$ to be the distribution of $(Y_1 + Y_2, Y_2 + Y_3)$.
A Brownian motion or Wiener process is a stochastic process [ W, : t > 0], on some (f!, !F, P), with these three properties: {i) The process starts at 0:
P[ W0 = 0] = 1 . {ii) The increments are independent : If
( 37.1)
(37 .2)
then
( 37 .3)
P [ W,. - W, .
0] corresponding to the Jl-1 1 Taking w; = 0 for t = 0 shows that there exists on some (f!, !F, P) a process [w;: t > 0] with the finite-dimensional distributions specified by the conditions (i), (ii), and (iii). ,
•
k
·
•
•
' . k
Continuity of Paths
If the Brownian motion process is to represent the motion of a particle, it is natural to require that the path functions W( · , w ) be continuous. But Kolmogorov 's theorem does not guarantee continuity. Indeed, for T = [0, oo), the space (f!, !F) in the proof of Kolmogorov's theorem is (RT, ar), and as shown in the last section, the set of continuous functions does not lie in .§RT.
SECTION
37.
501
BROWNIAN MOTION
A special construction gets around this difficulty. The idea is to use for dyadic rational t the random variables � as already defined and then to
redefine the other � in such a way as to ensure continuity. To carry th is through requires proving that with probability 1 the sample path is uniformly continuous for dyadic rational arguments in bounded intervals. Fix a space (f!, !F, P) and on it a process [ � : t > 0] having the finite motion. Let D be the set dimensional distributions prescribed for Brownian n of nonnegative dyadic rationals, let In k = [k2 - ,(k + 2)z - ], and put n
Mn k ( w ) = sup I W( 1 , w ) - W( k 2 - n , w ) l r e In k n D
(37.8 )
Suppose it is shown that f P[Mn > n - 1 ] converges. The first Borel - Cantelli i.o.] ha� probability 0. But lemma will then imply that B = [Mn > n suppose w lies outside B. Then for every t and E there exists an n such that t < n, 2n - 1 < E, and Mn(w) < n - 1 . Take 8 = 2 - n . Suppose that r and r' are dyadic rationals in [0, t ] and l r - r'l < 8. Then rn and r' must for some k < n2 n lie in a common interval In k (length 2 X 2 - ), in which case iW(r, w) - W(r', w)l < 2Mn k(w) < 2Mn( w) < 2 n - 1 < €. Therefore, uJ $. B implies that W( r, w) is for every t uniformly continuous as r ranges over the dyadic rationals in [0, t ], and hence it will have a continuous extension to [0, oo). To prove L:P[Mn > n - 1 ] < oo, use Etemadi's maximal ineq�ality (22.10), which applies because of the independence of the increments. This, together with Markov's inequality, gives -1
_P [ �� I W( t + 8i2 - m ) - W( t ) I > a ) m ) - W( t) l :? a/3] < 3 imax P[ I W( t + 8i2 s; m 2
(see (21.7) for the moments of the normal distribution). The sets on the left here increase with m , and letting m � oo leads to
P sup I W( t + r8) - W( t) l > a
(37.9 )
O :s; r :s; l reD
Therefore,
n)2 X 2K( 2 P [ Mn > n - 1 ] < n2 n 1 4 (n- )
and L:P[ M > n - 1] does converge. n
4 Kn 5
STOCHASTIC PROCESSES
502
Therefore, there exists a measurable set B such that P(B ) = 0 and such that for w outside B, W(r, w ) is uniformly continuous as r ranges over the dyadic rationals in any bounded interval. If w $. B and r decreases to t through dyadic rational values, then W(r, w) has the Cauchy property and hence converges. Put
�, ( w ) = W' ( t, w ) =
lim W( r, w )
if w $. B,
0
if
r
!1
(J)
E B'
where r decreases to t through the set D of dyadic rationals. By construc tion, W'(t, w) is continuous in t for each w in fl. If w $. B, then W(r, w ) = W'( r, w ) for dyadic rationals, and W'( · , w ) is the continuous extension to all of [0, oo). The next thing is to show that the �, have the same joint distributions as the � . It is convenient to prove this by a lemma which will be used again further on.
Let Xn and X be k-dimensional random vectors, and Let Fn( x) be the distribution function of Xn. If Xn � X with probability 1 and Fn(x) � F(x) for all x, then F(x) is the distribution function of X. Lemma 1.
PROOF. t Le t X have distribution function Theorem 4. 1, if h > 0, then n
< lim infn Fn (
X1
+
H. By two applications of
h , . . . , Xk + h )
It follows by continuity from above that F and H agree.
•
Now, for $0 \le t_1 < \cdots < t_k$, choose dyadic rationals $r_i(n)$ decreasing to the $t_i$. Apply Lemma 1 with $(W_{r_1(n)}, \ldots, W_{r_k(n)})$ and $(W_{t_1}', \ldots, W_{t_k}')$ in the roles of $X_n$ and $X$, and with the distribution function with density (37.6) in the role of $F$. Since (37.6) is continuous in the $t_i$, it follows by Scheffé's theorem that $F_n(x) \to F(x)$, and by construction $X_n \to X$ with probability 1. By the lemma, $(W_{t_1}', \ldots, W_{t_k}')$ has distribution function $F$, which of course is also the distribution function of $(W_{t_1}, \ldots, W_{t_k})$.

Thus $[W_t'\colon t \ge 0]$ is a stochastic process, on the same probability space as $[W_t\colon t \ge 0]$, which has the finite-dimensional distributions required for Brownian motion and moreover has a continuous sample path $W'(\cdot, \omega)$ for every $\omega$.

†The lemma is an obvious consequence of the weak-convergence theory of Section 29; the point of the special argument is to keep the development independent of Chapters 5 and 6.
BROWNIAN MOTION
503
By enlarging the set B in the definition of �'(w) to include all the w for which W(O, w) =I= 0, one can also ensure that W'(O, w) = 0. Now discard the original random variables � and relabel �, as � . The new [ �: t > 0] is a stochastic process satisfying conditions (i), (ii), and (iii) for Brownian motion and this one as well: (iv) For each w, W(t, w) is continuous in t and W(O, w) = 0. From now on, by a Brownian motion will be meant a process satisfying (iv") as well as {i), (ii), and (iii). What has been proved is this:
There exist processes [ � : t > 0] satisfying conditions (i), (ii), (iii), and (iv)-Brownian motion processes. Theorem 37.1.
In the construction above, W,. for dyadic r was used to define � in general. F0 1 that reason it suffices to apply Kolmogorov's theorem for a countable index set. By the second proof of that theorem, the space (f!, !F, P) can be taken as the unit interval with Lebesgue measure. The next section treats a general scheme for dealing with path-function questions by in effect replacing an uncountabk time set by a countable one. �easurable Processes
Let T be a Borel set on the line, let [X_t: t ∈ T] be a stochastic process on an (Ω, ℱ, P), and consider the mapping

(37.10) (t, ω) → X_t(ω) = X(t, ω)

carrying T × Ω into R¹. Let 𝒯 be the σ-field of Borel subsets of T. The process is said to be measurable if the mapping (37.10) is measurable 𝒯 × ℱ/ℛ¹. In the presence of measurability, each sample path X(·, ω) is measurable 𝒯 by Theorem 18.1. Then, for example, ∫_a^b φ(X(t, ω)) dt makes sense if (a, b) ⊂ T and φ is a Borel function, and by Fubini's theorem it is integrable if

∫_a^b E[|φ(X_t)|] dt < ∞.
Hence the usefulness of this result:

Theorem 37.2. Brownian motion is measurable.

PROOF. Put X_n(t, ω) = W(k2^{-n}, ω) for (k−1)2^{-n} < t ≤ k2^{-n}, k = 1, 2, ..., and X_n(0, ω) = W(0, ω). Each X_n is measurable 𝒯 × ℱ, being constant in t on each dyadic interval, and by continuity of the paths X_n(t, ω) → W(t, ω) for every (t, ω); hence W is measurable. •

Irregularity of Brownian Motion Paths

For a Brownian motion [W_t: t ≥ 0] define
(37.11) W'_t = c^{-1}W_{c²t},

where c > 0. Since t → c²t is an increasing function, it is easy to see that the process [W'_t: t ≥ 0] has independent increments. Moreover, W'_t − W'_s = c^{-1}(W_{c²t} − W_{c²s}), and for s < t this is normally distributed with mean 0 and variance c^{-2}(c²t − c²s) = t − s. Since the paths W'(·, ω) all start from 0 and are continuous, [W'_t: t ≥ 0] is another Brownian motion. In (37.11) the time scale is contracted by the factor c², but the space scale only by the factor c.

That the transformation (37.11) preserves the properties of Brownian motion implies that the paths, although continuous, must be highly irregular. It seems intuitively clear that for c large enough the path W(·, ω) must with probability nearly 1 have somewhere in the time interval [0, c] a chord with slope exceeding, say, 1. But then W'(·, ω) has in [0, c^{-1}] a chord with slope exceeding c. Since the W'_t are distributed as the W_t, this makes it plausible that W(·, ω) must in arbitrarily small intervals [0, δ] have chords with arbitrarily great slopes, which in turn makes it plausible that W(·, ω) cannot be differentiable at 0. More generally, mild irregularities in the path will become ever more extreme under the transformation (37.11) with ever larger values of c. It is shown below that, in fact, the paths are with probability 1 nowhere differentiable. Also interesting in this connection is the transformation

(37.12) W''_t = tW_{1/t} if t > 0, W''_0 = 0.

Again it is easily checked that the increments are independent and normally distributed with the means and variances appropriate to Brownian motion. Moreover, the path W''(·, ω) is continuous except possibly at t = 0. But (37.9) holds with W''_t in place of W_t because it depends only on the finite-dimensional distributions, and by the continuity of W''(·, ω) over (0, ∞) the supremum is the same if not restricted to dyadic rationals. Therefore, P[sup_{s ≤ n^{-3}} |W''_s| > n^{-1}] ≤ K/n², and it follows by the first Borel-Cantelli lemma that W''(·, ω) is continuous also at 0 for ω outside a set M of probability 0. For ω ∈ M, redefine W''(t, ω) ≡ 0; then [W''_t: t ≥ 0] is a Brownian motion, and (37.12) holds with probability 1.
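The variance bookkeeping behind (37.11), c^{-2}(c²t) = t, can be spot-checked numerically; here W(c²t) is sampled directly from its N(0, c²t) distribution and rescaled (the setup and names are ours):

```python
import math
import random

rng = random.Random(1)
c, t = 3.0, 0.7
# W'(t) = c^{-1} W(c^2 t): sample W(c^2 t) ~ N(0, c^2 t) and divide by c.
samples = [rng.gauss(0.0, math.sqrt(c * c * t)) / c for _ in range(4000)]
mean = sum(samples) / len(samples)
var = sum(x * x for x in samples) / len(samples)
# mean should be near 0 and var near t, so W'(t) is again N(0, t).
```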
The behavior of W(·, ω) near 0 can be studied through the behavior of W''(·, ω) near ∞ and vice versa. Since (W''_t − W''_0)/t = W_{1/t}, W''(·, ω) cannot have a derivative at 0 if W(·, ω) has no limit at ∞. Now, in fact,

(37.13) inf_n W_n = −∞, sup_n W_n = +∞
with probability 1. To prove this, note that W_n = X_1 + ··· + X_n, where the X_k = W_k − W_{k−1} are independent. Consider

[sup_n W_n < ∞] = ⋃_{u=1}^∞ ⋂_{m=1}^∞ [max_{i ≤ m} W_i ≤ u];
this is a tail set and hence by the zero-one law has probability 0 or 1. Now −X_1, −X_2, ... have the same joint distributions as X_1, X_2, ..., and so this event has the same probability as

[inf_n W_n > −∞] = ⋃_{u=1}^∞ ⋂_{m=1}^∞ [max_{i ≤ m} (−W_i) ≤ u].
If these two sets have probability 1, so has [sup_n |W_n| < ∞], so that P[sup_n |W_n| ≤ x] > 0 for some x. But P[|W_n| ≤ x] = P[|W_1| ≤ x/n^{1/2}] → 0. This proves (37.13).

Since (37.13) holds with probability 1, W''(·, ω) has with probability 1 upper and lower right derivatives of +∞ and −∞ at t = 0. The same must be true of every Brownian motion. A similar argument shows that, for each fixed t, W(·, ω) is nondifferentiable at t with probability 1. In fact, W(·, ω) is nowhere differentiable:

Theorem 37.3. For ω outside a set of probability 0, W(·, ω) is nowhere differentiable.
PROOF. The proof is direct; it makes no use of the transformations (37.11) and (37.12). Let

X_{nk} = max{|W((k+i)2^{-n}) − W((k+i−1)2^{-n})|: i = 1, 2, 3}.

By independence and the fact that the differences here have the distribution of 2^{-n/2}W_1, P[X_{nk} ≤ ε] = P³[|W_1| ≤ 2^{n/2}ε]; since the standard normal density is bounded by 1, P[X_{nk} ≤ ε] ≤ (2·2^{n/2}ε)³. If Y_n = min_{k ≤ n2^n} X_{nk}, then P[Y_n ≤ K2^{-n}] ≤ n2^n(2·2^{n/2}·K2^{-n})³ = 8K³n2^{-n/2} → 0.
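The blow-up of difference quotients that the proof exploits is easy to see in simulation: the largest slope over dyadic increments of order n grows on the order of 2^{n/2} (a numerical sketch with our own names, not part of the proof):

```python
import math
import random

def max_dyadic_slope(n, rng):
    # Largest |increment| / 2^-n over the 2^n dyadic increments of order n on [0, 1].
    dt = 2.0 ** -n
    best = 0.0
    for _ in range(2 ** n):
        step = rng.gauss(0.0, math.sqrt(dt))  # one dyadic increment ~ N(0, 2^-n)
        best = max(best, abs(step) / dt)
    return best

rng = random.Random(2)
slopes = [max_dyadic_slope(n, rng) for n in (2, 6, 10)]
# The maxima increase with n: no finite derivative can exist anywhere.
```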
It is easily checked that [W'_t: t ≥ 0] has the finite-dimensional distributions appropriate to Brownian motion. As the other properties are obvious, it is in fact a Brownian motion. Let
(37.18) ℱ_{t_0} = σ[W_s: s ≤ t_0].

The random variables (37.17) are independent of ℱ_{t_0}. To see this, suppose that 0 < s_1 < ··· < s_j ≤ t_0 and 0 < t_1 < ··· < t_k. Put u_i = t_0 + t_i. Since the increments are independent, (W'_{t_1}, W'_{t_2} − W'_{t_1}, ..., W'_{t_k} − W'_{t_{k−1}}) = (W_{u_1} − W_{t_0}, W_{u_2} − W_{u_1}, ..., W_{u_k} − W_{u_{k−1}}) is independent of (W_{s_1}, W_{s_2} − W_{s_1}, ..., W_{s_j} − W_{s_{j−1}}). But then (W'_{t_1}, ..., W'_{t_k}) is independent of (W_{s_1}, ..., W_{s_j}). By Theorem 4.2, (W'_{t_1}, ..., W'_{t_k}) is independent of ℱ_{t_0}. Thus

(37.19) P([(W'_{t_1}, ..., W'_{t_k}) ∈ H] ∩ A) = P[(W'_{t_1}, ..., W'_{t_k}) ∈ H]P(A) = P[(W_{t_1}, ..., W_{t_k}) ∈ H]P(A),
where the second equality follows because (37.17) is a Brownian motion. This holds for all H in ℛ^k and A in ℱ_{t_0}.

The problem now is to prove all this when t_0 is replaced by a stopping time τ, a nonnegative random variable for which

(37.20) [ω: τ(ω) ≤ t] ∈ ℱ_t, t ≥ 0.

It will be assumed that τ is finite, at least with probability 1. Since [τ = t] = [τ ≤ t] − ⋃_n [τ ≤ t − n^{-1}], (37.20) implies that

(37.21) [ω: τ(ω) = t] ∈ ℱ_t, t ≥ 0.
The conditions (37.20) and (37.21) are analogous to the conditions (7.18) and (35.18), which prevent prevision on the part of the gambler. Now ℱ_{t_0} contains the information on the past of the Brownian motion up to time t_0, and the analogue for τ is needed. Let ℱ_τ consist of all measurable sets M for which

(37.22) M ∩ [ω: τ(ω) ≤ t] ∈ ℱ_t

for all t. (See (35.20) for the analogue in discrete time.) Note that ℱ_τ is a σ-field and τ is measurable ℱ_τ. Since M ∩ [τ = t] = M ∩ [τ ≤ t] ∩ [τ = t],

(37.23) M ∩ [ω: τ(ω) = t] ∈ ℱ_t

for M in ℱ_τ. For example, τ = inf[t: W_t = a] is a stopping time, and [inf_{s ≤ τ} W_s > −1] is in ℱ_τ.
STOCHASTIC PROCESSES
Theorem 37.5. Let τ be a stopping time, and put

(37.24) W*_t = W_{τ+t} − W_τ, t ≥ 0.

Then [W*_t: t ≥ 0] is a Brownian motion, and it is independent of ℱ_τ; that is, σ[W*_t: t ≥ 0] is independent of ℱ_τ:

(37.25) P([(W*_{t_1}, ..., W*_{t_k}) ∈ H] ∩ M) = P[(W*_{t_1}, ..., W*_{t_k}) ∈ H]P(M) = P[(W_{t_1}, ..., W_{t_k}) ∈ H]P(M)

for H in ℛ^k and M in ℱ_τ.
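Theorem 37.5 can be illustrated by a discretized simulation. In the sketch below (our construction; τ is taken as the first exit time of (−1, 1)) the restarted process W*_t = W_{τ+t} − W_τ is checked to have the mean 0 and variance t of a fresh Brownian motion:

```python
import math
import random

rng = random.Random(3)
dt = 1e-3
post = []
for _ in range(2000):
    w = 0.0
    while abs(w) < 1.0:          # run until tau, the first exit from (-1, 1)
        w += rng.gauss(0.0, math.sqrt(dt))
    # W*(t) = W(tau + t) - W(tau): accumulate increments after tau up to t = 0.5
    w_star = sum(rng.gauss(0.0, math.sqrt(dt)) for _ in range(500))
    post.append(w_star)
mean = sum(post) / len(post)
var = sum(x * x for x in post) / len(post)
# W*(0.5) should look like N(0, 0.5), independent of the pre-tau history.
```

In a step-by-step simulation the post-τ increments are fresh draws by construction, so this only illustrates the bookkeeping of (37.24), not the content of the proof.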
That the transformation (37.24) preserves Brownian motion is the strong Markov property.† Part of the conclusion is that the W*_t are random variables.

PROOF. Suppose first that τ has countable range V, and let t_0 be the general point of V. Since

[ω: W*_t(ω) ∈ H] = ⋃_{t_0 ∈ V} [ω: W_{t_0+t}(ω) − W_{t_0}(ω) ∈ H, τ(ω) = t_0],

W*_t is a random variable. Also,

P([(W*_{t_1}, ..., W*_{t_k}) ∈ H] ∩ M) = Σ_{t_0 ∈ V} P([(W*_{t_1}, ..., W*_{t_k}) ∈ H] ∩ M ∩ [τ = t_0]).

If M ∈ ℱ_τ, then M ∩ [τ = t_0] ∈ ℱ_{t_0} by (37.23). Further, if τ = t_0, then W*_t coincides with W'_t as defined by (37.17). Therefore, (37.19) reduces this last sum to

Σ_{t_0 ∈ V} P[(W_{t_1}, ..., W_{t_k}) ∈ H]P(M ∩ [τ = t_0]) = P[(W_{t_1}, ..., W_{t_k}) ∈ H]P(M).

This proves the first and third terms in (37.25) equal; to prove equality with the middle term, simply consider the case M = Ω.

† Since the Brownian motion has independent increments, it is a Markov process (see Examples 33.9 and 33.10); hence the terminology.
Thus the theorem holds if τ has countable range. For the general τ, put

(37.26) τ_n = k2^{-n} if (k−1)2^{-n} < τ ≤ k2^{-n}, k = 1, 2, ..., and τ_n = 0 if τ = 0.

If k2^{-n} ≤ t < (k+1)2^{-n}, then [τ_n ≤ t] = [τ ≤ k2^{-n}] ∈ ℱ_{k2^{-n}} ⊂ ℱ_t. Thus each τ_n is a stopping time. Suppose that M ∈ ℱ_τ and k2^{-n} ≤ t < (k+1)2^{-n}. Then M ∩ [τ_n ≤ t] = M ∩ [τ ≤ k2^{-n}] ∈ ℱ_{k2^{-n}} ⊂ ℱ_t. Thus ℱ_τ ⊂ ℱ_{τ_n}. Let W^{(n)}_t(ω) = W_{τ_n(ω)+t}(ω) − W_{τ_n(ω)}(ω); that is, let W^{(n)}_t be the W*_t corresponding to the stopping time τ_n. If M ∈ ℱ_τ, then M ∈ ℱ_{τ_n}, and by an application of (37.25) to the discrete case already treated,

P([(W^{(n)}_{t_1}, ..., W^{(n)}_{t_k}) ∈ H] ∩ M) = P[(W_{t_1}, ..., W_{t_k}) ∈ H]P(M).

But τ_n(ω) ↓ τ(ω) for each ω, and by continuity of the sample paths, W^{(n)}_t(ω) → W*_t(ω) for each ω. Condition on M and apply Lemma 1 with (W^{(n)}_{t_1}, ..., W^{(n)}_{t_k}) for X_n, (W*_{t_1}, ..., W*_{t_k}) for X, and the distribution function of (W_{t_1}, ..., W_{t_k}) for F = F_n. Then (37.25) follows. •

If ℱ* = σ[W*_t: t ≥ 0], then according to (37.25) (and Theorem 4.2) the σ-fields ℱ_τ and ℱ* are independent:

(37.28) P(A ∩ B) = P(A)P(B), A ∈ ℱ_τ, B ∈ ℱ*.

For fixed t define τ_n by (37.26) but with t2^{-n} in place of 2^{-n} at each occurrence. Then [W_τ ≤ x] ∩ [τ ≤ t] is the limit superior of the sets [W_{τ_n} ≤ x] ∩ [τ ≤ t], each of which lies in ℱ_t. This proves that [W_τ ≤ x] ∩ [τ ≤ t] lies in ℱ_t and hence that W_τ is measurable ℱ_τ. Since τ is measurable ℱ_τ,
(37.29) P([(τ, W_τ) ∈ H] ∩ B) = P[(τ, W_τ) ∈ H]P(B), B ∈ ℱ*,

for planar Borel sets H.

The Reflection Principle
For a stopping time τ, define

(37.30) W''_t = W_t if t ≤ τ, W''_t = 2W_τ − W_t if t > τ.

The sample path for [W''_t: t ≥ 0] is the same as the sample path for [W_t: t ≥ 0] up to τ, and beyond that it is reflected through the point W_τ. See the figure.
[Figure: a Brownian path W(·, ω) together with its reflection W''(·, ω) through the level W_τ after the stopping time τ.]
The process defined by (37.30) is a Brownian motion, and to prove it, one need only check th� finite-dimensional distributions: ?[( � 1 , , �) E H] = E H ]. By the argument starting with (37.26), it is enough to P [( �" , . . . , �") k consider the case where r has countable range, and for this it is enough to check the equation when the sets are intersected with [ r = t0 ]. Consider for notational simplicity a pair of points: • • •
I
(37.31 ) If s < t < t0 , this holds because the two events are identical. Suppose next that s < t0 < t. Since [ r = t0] lies in !F, , it follows by the independence of the increments, symmetry, and the definition (37.30) that 0
P [ r = t0 , (Ws , W, J E / , � - W,0 E l] = P [ r = t0, ( �, �0 ) E l , - (� - �J E l]
K
If
=
I X J, this is
P [ r = to , ( � , � � - � ) E K ] = P [ r = t 0, ( w;' , �� , �, - �� ) E K ] , o,
o
and by 1r-A it follows for all K E !JP3• For the appropriate K, this gives (37.31). The remaining case, t0 < s < t, is similar. These ideas can be used to derive in a very simple way the distribution of M = sup s � · Suppose that x > O. Let r = inf[s: � > x], define W" by and (37.30), ati"d put r" = inf[s: w;' > x] and M;' = SUPs I w;' . Since W" is another Brownian motion, reflection through the point W7 = x shows ,
< r
:S
r" = T
SECTION
that
37.
BROWNIAN MOTION
5 13
P[ M, > X ] = P[ T < t] = P[ T < t ' w, < X ] + P[ T < t ' w, > X ] = P[ < t ' W," < X ] + P[ T < t ' w, > X ] P [ r" < t, W, > X ] + P[ r < t , W, > X ] = P[ T < t ' w, > X] + P[ T < t' w, > X ] = 2P[ w, > X ] . T11
=
Therefore,
(37 .32)
!"'
2 e - u 2 12 du . -/2 7r x;{r
P[ M, > x ] =
-
This argument, an application of the reflection principle,t becomes quite transparent when referred to the diagram. Skorohod Embedding •
Suppose that Xp X2 , are independent and identically distributed random variables with mean 0 and variance u 2 A powerful method, due to Skoro hod, of studying the partial sums sn = XI + . . . + xn is to construct an increasing sequence r0 = O, r 1 , r 2 , of stopping times such that W(rn) has the Same distribution aS Sn. The differenCeS Tk - Tk - l Will turn OUt tO be independent and identically distributed with mean u 2 , so that by the law of large numbers n - l rn = n - I [Z = 1( rk - rk _ 1 ) is likely to be near u 2 • But if rn is near nu 2 , then by the continuity of Brownian motion paths W(rn) will be near W(nu 2 ), and so the distribution of Sn/uVn, which coincides with the distribution of W(rn)/uVn, will be near the distribution of W(na 2 )juVn -that is, will be near the standard normal distribution. The method will thus yield another proof of the central limit theorem, one independent of the characteristic-function arguments of Section 27. of max k < n Sk fuVn But it will also give more. For example, the distribution is exactly the distribution of max k p W( rk )ju{n , and this in turn is near the distribution of sup, ,;n u W(t)juvn , which can be written down explicitly because of (37.32). It will thus be possible to derive the limiting distribution of max k ,; n Sk . The joint behavior of the partial sums is closely related to the behavior of Brownian motion paths. The Skorohod construction involves the class !T of stopping times for which •
•
•
•
•
•
•
2
(37.33 ) (37.34 )
E[WT] = 0, E[r] = E [ W72 ] ,
t see Problem 37. 18 for another application. * The rest of this section, which requires martingale theory, may be omitted.
514
STOCHASTIC PROCESSES
and
Lemma
0. Suppose t and A E !f;. Since Brownian motion has independent increments,
PROOF.
that s
f Y9 , t dP = f e 9 Ws - 9 2s/2 dP . E [ e 9( W, - Ws) - 92(1 - s)/2 ] ' A
A
and a calculation with moment generating functions (see Example 21.2) shows that
s < t ' A E �-
(37.36)
This says that for 0 fixed, [ Y9, 1 : t > 0] is a continuous-time martingale adapted to the u-fields !F,. It is the moment-generating-function martingale associated with the Brownian motion. Let f(O, t) denote the right side of (37.36). By Theorem 16.8,
:O f( 0 , t ) = £ Y9 ( W, - 0 t ) dP, a2 2 - t ] dP, Ot Y t = O, ) f( ) ( j W, , 9 [ ao 2 ,t
A
•
iJ 4 , [ ( W, - Ot ) 4 - 6( W, - O t) 2 t + 3t 2 ] dP. Y f( f ) , t 8 9 ao 4 A ·
Differentiate the other side of the equation (37.36) the same way and set 0 = 0. The result is
f � dP = f W, dP, A
A
fA ( Ws4 - 6W/s + 3s 2 ) dP = f ( W,4 - 6W, 2 t + 3t 2 ) dP, A
s < t, A E �' s < t , A E g;;,
s < t , A E g;;,
This gives three more martingales: If Z, is any of the three random variables
(37 .37)
w, ,
SECTION
then
37.
BROWNIAN MOTION
Z0 = 0, Z,
is integrable and measurable !F,, and
j Zs dP = j Z, dP,
(37.38)
5 15
A
A
s < t, A E �-
In particular, E[ Z,] = [Z0] = 0. If r is a stopping time with finite range (37.38) implies that
{t 1 ,
• • •
, tm}
bounded by
t,
then
Suppose that r is bounded by t but does not necessarily have finite range. n n n n Put Tn = k2 - t if (k - l)T t < T ::;; k2- t, 1 k 2 , and put Tn = 0 if r =- 0. Then rn is a stopping time and E[ Z-r" ] = 0. For each of the three possibilities {37.37) for Z, SUPs :s;riZsl is integrable because of (37.32). It therefore follows by the dominated convergence theorem that £[27] = lim" E[ ZT " ] = 0. Thus £[ Z7 ] = 0 for every bounded stopping time -r. The three cases {17.37) give
<
E [ WT4 ] - 6£ 1 !2 [ WT4 ] £ 1 ! 2 [ r 2 ] + 3£(r 2 ]. If C = E 1 1 2 [ W74] and x = E 1 1 2 [ r 2], the inequality is 0 > q(x) = 3x 2 - 6Cx + C 2 . Each zero of q is at most 2C, and q is negative only between these two zeros. Therefore, x < 2C, which implies (37.35). •
Suppose that r and rn are stopping times, that each rn is a member of !T, and that rn � r with probability 1. Then r is a member of Y if (i) E[ W74" ] < E[ W74] < oo for all n, or if {ii) the W74" are uniformly integrable. Lemma 3.
PROOF. Since Brownian motion paths are continuous, W7 " � W7 with probability 1. Each of the two hypotheses {i) and {ii) implies that £[ �4] " is bounded and hence that £[ r,;J is bounded, and it follows (see (16.28)) that the sequences { rn }, {W7" }, and {W72" } are uniformly integrable. Hence (37.33) and (37.34) for r follow by Theorem 16.14 from the same relations for the rn The first hypothesis implies that lim infn £[ �4] " E[W74], and the second implies that limn E[ W74] = E[ W74]. In either case it follows by Fatou's lemma that £[ r 2 ] lim infn El r,;] 4 lim infn E[ W74] • " 4E[ W74].
and put .9;< 1 > = u[ �< l ): 0 < s < t ] and 1 sr-< > = u[ w,( l >: t > 0]. Let 8 1 be the stopping time of Theorem 37.6, so that W8\1 > and X1 have the same distribution. Let �� I ) be the class of M such that M n [81 < t ] E .9;< 1> for all t. Now put w. = w,li) and (8 , W8�2>) are independent. PROOF.
I
•
2
t This is obvious from the weak-convergence poi nt of view.
-
520
STOCHASTIC PROCESSES
class of M such that M n [8 < t] E .9; < 2> for all t. If 2 and sz- is the u-field generated by these random variables, then again SZ";,�2> and sz-(3) are independent. These two u-fields are contained in sz-( 2), which is independent of �� I )· Therefore, the three u-fields SZ";,� 1>, ��2>, sz-(3) are independent. The procedure therefore extends inductively to give independent, identically distributed random vectors (8n ' W,(8"n ) ). If Tn 8 1 + . . . + 8n ' then W ( l ) W,(81l ) + . . . + W,(8"n ) has the dis• tribution of XI + . . . + xn .
sz-;, be the WI(3 ) w./l2 l - w./l2< 2> Let
=
T"
=
=
In variance •
If E[ xn CT 2' then, since the random variables Tn - Tn I of Theorem 37.7 are independent and identically distributed, the strong law of large numbers (Theorem 22. 1) applies and hence so does the weak one: =
( 37 .40) (If E[X:] < oo, so that the rn - rn l have second moments, this follows immediately by Chebyshev 's inequality.) Now Sn has the distribution of W(rn), and rn is near nu 2 by (37.40); hence Sn should have nearly the distribution of W(nu 2 ), namely the normal distribution with mean 0 and variance nu 2• To prove this, choose an increasing s�quence of integers Nk such that P[ln - 1 rn - u 2 1 > k- 1 ] < k- 1 for n > Nk , and put En k - 1 for Nk < n < Nk + 1 • Then En � 0 and P[ln - 1 rn - u 2 1 > t: n ] < En. By two applications of (37.32), =
s.( < )
�
[ """:!f,;W(
P I W(
'•) I >
En]
+P
]
[ -nu
sup
lr
nn
2 1 :S
w·./n
< En + 4P [ I W( Enn) I � EuVn] ,
and it follows by Chebyshev's inequality that limn 8n(E) distributed as W(rn),
P
[ W��
2)
]
0. Since
[ u� < x ] < p [ w�{,( ) < x + E l + 8n( E ) .
< x - E - 8n(E) < P
*This topic may be om itted.
=
sn
] IS
SECTION
37.
521
BROWNIAN MOTION
Here W(nu 2 )ju..[rl can be replaced by a random variable N with the standard normal distribution, and lett ing n � oo and then € � 0 shows that
This gives a new proof of the central limit theorem for independent, identi cally distributed random variables with second moments (the Lindeberg -Uvy theorem-Theorem 27.1). Observe that none of the convergence theory of Chapter 5 has been used. This proof of the central limit theorem is an application of the invariance principle: Sn has nearly the distribution of W(nu 2 ), and the distribution of the latter does not depend on { ary with) the distribution common to the Xn . More can be said if the Xn have fourth moments. For each n, define a stochastic process [Y,.{t): 0 < t < 1] by Y,.(O, w) = 0 and v
(37.41 ) Yn( t , w ) =
1 k 1 . k S < t f < C (w) 1 n - n' uvn k
k = 1, . . . , n.
If kjn = t > 0 and n is large, then k is large, too, and Y,.(t) = t 1 1 2Sk /u..fk is by the central limit theorem approximately normally distributed with mean 0 and variance t. Since the Xn are independent, the increments of 07.41) should be approximately independent, and so the process should behave approximately as a Brownian motion does. Let rn be the stopping times of Theorem 37.7, and in analogy with (37.41) put Zn(O) = 0 and
(37.42) Zn( t ) = 1r W(rk ) I. f k n- 1· < t < nk- , uvn
k = 1 , . . . , n.
By construction, the finite-dimensional distributions of [ Y,.(t ): 0 < t < 1] coin cide with those of [Zn{t): 0 < t < 1]. It will be shown that the latter process nearly coincides with [ W(tnu 2 )juVn : 0 < t < 1], which is itself a Brownian motion over the time interval [0, 1]-see (37.11). Put Wn(t) = W(tnu 2 )ju..[rl . Let Bn (8) be the event that I rk - ku 2 1 > 8nu 2 for some k < n. By Kolmogorov's inequality (22.9),
(37.43) If (k -
1)n - 1 < t < kn - 1
and
n > 8 - 1 , then
STOCHASTIC PROCESSES
522
on the event ( Bn( 8 )Y, and so I Zn( t ) - W,.( t ) l = wn
( n�2 ) - Wn( t )
€ ts
I
(
< P ( Bn( o ) ) + P sup
]
J
sup I W( s) - W( t ) l > E .
1 :$ 1 1.< - r l :$ 2/l
Let n � oo and then 8 � 0; it follows by (37.43) and the continuity of Brownian motion paths that
[
]
lim P sup i Zn( t ) - W ( t ) l > € = 0
( 37.44)
n
n
r :S I
for positive E. Since the processes (37.41) and (37.42) have the same finite dimensional distributions, this proves the following general invariance princi ple or functional central Limit theorem.
Suppose that X1 , X2 , . . . are independent, identically dis tributed random variables with mean 0, variance u 2 , and finite fourth mo ments, and define Y ( t) by (37.41). There exist (on another probability space), for each n, processes [Zn(t): 0 < t < 1] and [W,.(t): 0 < t < 1] such that the first has the same finite-dimensional distributions as [ Yn( t ): 0 < t < 1], ihe second is a Brownian motion, and P[sup, 5 1 1 Zn(t) - W,.{t)l > €] � 0 for positive €. Theorem 37.8.
n
As an application, consider the maximum Mn = max k 5 n Sk . Now Mn fuVn = sup, Yn( t ) has the same distribution as sup, Zn(t ), and it follows by (37.44) that
[
]
P sup Zn( t ) - sup Wn( t ) > € � o . 1 :$ 1
1 :$ 1
But P[sup, s: 1 Wn(t) > x] = P[sup, 5 1 W(t) > x] = 2P[ N > x] for x > O by (37.32). Therefore, ( 37.45)
]
< x � 2 P [ N < x ],
X > 0.
SECTION 37.
BROWNIAN MOTION
523
PROBLEMS
37.1. 36. 2 j Show that K(s, t) = min{s, t} is nonnegative-definite; use Problem 36.2 to prove the existence of a process with the fi nite-dimensional distributions prescribed for Brownian motion. 37.2. Let X(t) be independent, standard normal variables, one for each dyadic rational t (Theorem 20.4; the unit interval can be used as the probability space). Let W(O) = O and W(n) = Ef:= 1 X(k). Suppose that W(t) is already defined for dyadic rationals of rank 'l, and put
Prove by induction that the W( t) for dyadic t have the finite-dimensional distributions prescnbed for Brownian motion. Now construct a Brownian motion with continuous paths by the argument leading to Theorem 37. 1. This avoids an appeal to Kolmogorov' s existence theorem.
37.3.
n
j
n
For each n define new variables Wn(t) by setting Wn(kj2 ) = W(kj2 ) for dyadics of order n and interpolating linearly in between. Set Bn = sup, , n1Wn + l(t) - Wn(t)l, and show that
The construction in the preceding problem makes it clear that the difference here is normal with variance 1/2n + xnl both converge, and conclude that outside a set of probability 0, Wn(t, w) converges uniformly over bounded intervals. Replace W(t, w) by lim n Wn( t, w ). This gives another construction of a Brownian motion with continuous paths.
37.4. 36.6 i Let T = [0, oo), and let P be a p1 0bability measure on ( RT, �T) having the finite-dimensional distributions prescribed for Brownian motion. Let C consist of the continuous elements of RT. (a) Show that P* (C) = 0, or P*(RT- C) = 1 (see (3.9) and (3.10)). Thus completing (RT, �T, P) will not give C probability 1. (b) Show that P*( C) = l. 37.5. Suppose that [W,: t > 0] is some stochastic process having independent, sta tionary increments satisfying E[ W,] = 0 and E[ W, 2 ] = t. Show that if the finite-dimensional distributions are pr eserved by the transformation (37. 11), then they must be those of Brownian motion. 37.6. Show that n I > oul W,: s > t 1 contains only sets of probability 0 and 1 . Do the same for n , >Ou( W,: 0 t €]; give examples of sets in this u-field.
a ]. Show that the distri6ution of Ta has over (0, oo) the density
a l_ - a 2 f2 t h a ( t ) - r;;- _ 32 v2rr t / e
(37 .46)
Show that £['Ta l = oo. Show that 'Ta has the same distribution as a 2 IN 2 , where N is a standard normal variable.
37.12. i (a) Show by the !.trong Markov property that Ta and Ta + /3 - Ta are independent and that the latter has t he same distribution as Tw Conclude that h a * h13 = h a + f3 · Show that {3Ta has the same distribution as Ta " (b)
,_fjj
Show that each h a is stable-see Problem 28. 10.
37.13. i Suppose that X 1 , X2 , • • • are in dependent and each has the distribution (37.46). (a) Show that (X1 + · · +Xn)ln2 also has the distribution (37.46). Contrast this with the law of large numbers. (b) Show that P[n - 2 max k !; n Xk < x] -+ exp( -a yf2/rr x ) for x > O. Relate this to Theorem 14.3. ·
37.14. 37. 1 1 i Let p(s, t) be the probability that a Brownian path has at least one zero in (s, t). From (37.46) and the Markov property deduce (37 .47 )
,; p(s, t ) = - arccos !.. .
2
t
1T
Hint: Condition with respect to Hj.
37 .15. i (a) Show that the probability of no zero in ( t, 1) is (21 rr) arcsin ..[t and hence that the position of the last zero preceding 1 is distributed over (0, 1) with density rr - 1 (t(l - t)) - 1 12 • (b) Similarly calculate the distribution of the position of the first zero follow ing time 1 . (c) Calculate the joint distribution of the two zeros in (a) and (b). 37.16. i (a) Show by Theorem 37.8 that inf s u s r Yn(u) and inf s u s r Zn(u) both converge in distribution to inf s u s 1 W(u) for 0 < s < t < 1. Prove a similar result for the supremum. (b) Let An(s, t) be the event that the position at time k in a symmetric random walk, is 0 for at least one k in the range sn < k < tn, and show that P(An(s, t )) -+ (21rr) arccos ..fi7i. s
s
Sk,
s
SECTION 37. (c) Let
BROWNIAN MOTION
525
Tn be the maximum k such that k < n and Sk = 0. Show that Tnfn has
asymptotically the distribution with density 7T- 1 (t(l - t)) - 1 12 over (0, l). As this density is larger at the ends of the interval than in the middle, the last time during a night's play a gambler was even is more likely to be either early or late than to be around midnight.
37.17. i Show that p(s, t) =p(t- ' , s - ') = p(cs, ct). Check this by (37.47) and also by the fact that the transformations (37. 1 1) and (37.12) preserve the properties of Brownian nlotion. 37.18. Deduce by the reflection principle that ( M, w;) has density
[
1
2 2(2y -x) (2y x ) --'---i=:==-. 0 for all w and that V has a continuous distribution. Define Example 38. 7.
Il l { f( t ) = �
if t * 0, if t = 0,
and put X(t, w) = f(t - V(w )). If [X; : t > 0] is any separable process with the same finite-dimensional distributions as [X,: t > 0], then X'( · , w) must with probability 1 assume the value 1 somewhere. In this case (38.11) holds for • a < 0 and b = 1, and equality in (38.12) cannot be avoided. If
( 38 .1 4)
sup i X ( t , w ) l < oo , I,
w
then (38.11) holds for some a and b. To treat the case in which (38. 14) fails, it is necessary to allow for the possibility of infinite values. If x(t) is oo or - oo, replace the third condition in (38.2) by x(tn ) � oo or x(t) � - oo. This extends the definition of separability to functions x that may assume infinite values and to processes [X,: t > 0] for which X(t, w) = + oo is a possibility.
If [X, : t > 0] is a finite-valued process on (f!, !F, P), there exists on the same space a separable process [X;: t > 0] such that P[X; X, ] 1 for each t. Theorem 38.1.
=
=
STOCHASTIC PROCESSES
532
It is assumed for convenience here that X(t, w) is finite for all t and w, although this is not really necessary. But in some cases infinite values for certain X'(t, w) cannot be avoided-see Example 38.8. PROOF. If (38.14) holds, the result is an immediate consequence of Lemma 2. The definition of separability allows an exceptional set N of probability 0; in the construction of Lemma 2 this set is actually empty, but it is clear from the definition this could be arranged anyway. The case in which (38.14) may fa il could be treated by tracing through the preceding proofs, making slight changes to allow for infinite values. A simple argument makes this unnecessary. Let g be a continuous, strictly increasing mapping of R 1 onto (0, 1). Let Y(t, w) g(X(t, w)). Lemma 2 applie� to [Y,: t > 0]; there exists a separable process [ Y,': t > 0] such that P[ Y,' Y,] 1. Since 0 < Y(t, w) < 1, Lemma 2 ensures 0 Y'(t, w) .::;; 1. Define =
- oo X'( t, w) = g - 1 ( Y '( t , w ) ) + oo Then [X;: each t.
t > 0]
0] is separable and has the finite-dimensional distributions of [X,: t > 0], • then X'( · , w) must with probability 1 assume the value oo for some t. Combining Theorem 38.1 with Kolmogorov' s existence theorem shows that
fior any consistent system offinite-dimensional distributions 11- r k there exists a separable process with the 11- r k as finite-dimensional distributions. As shown in Example 38.4, this leads to another construction of Brownian motion with I
1
I
1
continuous paths.
Consequences of Separability
The of a fact The
next theorem implies in effect that, if the finite-dimensional distributions process are such that it "should" have continuous paths, then it will in have continuous paths if it is separable. Example 38.4 illustrates this. same thing holds for properties other than continuity.
SECTION
38.
NONDENUMBERABLE PROBABILITIES
533
Let RT be the set of functions on T = [O, oo) with values that are ordinary reals or else oo or -oo. Thus R T is an enlargement of the R T of Section 36, an enlargement necessary because separability sometimes forces infinite values. Define the function Z, on R T by Z,(x) Z(t, x) = x(t ). This is just an extension of the coordinate function (36.8). Let �T be the u-field in R T generated by the Z, t > 0. Suppose that A is a subset of R T, not necessarily in � T. For D c T [0, oo), let A0 consist of those elements x of R T that agree on D with some element y of A : =
=
A D = u n [ x E R T: x(t ) y ( t )] .
(38. 15)
=
y EA t E D
Of course, A c A 0. Let S0 denote the set of x in R T that are separable whh respect to D. In the following theorem, [X,: t > 0] and [X;: t > 0] are processes on spaces (0, !F, P) and (0', ::7', P'), which may be distinct; the path functions are X( · , w) and X'( - , w').
Suppose of A that for each countable, dense subset T [0, oo), the set (38.15) satisfies Theorem 38.2.
D
of
=
(38.16 )
-T
A0 E � ,
If [X,: t > 0] and [X;: t > 0] have the same finite-dimensional distributions, if [w: X( · , w) EA] Lies in !Y and has ?-measure 1, and if [X;: t > O] is separable, then [w': X'( · , w') EA] contains an !F'-set of P'-measJ,tre 1. If (0', .'F', P' ) is complete, then of course !F'-set of ?'-measure 1.
[w': X'(· , w') EA]
is itself an
[X;: t > 0] is separable with respect to D. The [w': X'( · , ) EA0] - [w': X'( · , w') EA] is by (38.16) a subset of [w': X'( · , w') E R T - S0], which is contained in an !7'-set of N' of ?'-mea sure 0. Since the two processes have the same finite-dimensional distribu tions and hence induce the same distribution on (R T, � T ), and since A 0 lies in �T, it follows that P'[w': X'( · , w') EA0] = P[w: X(· , w) EA0] > P[w: X{ · , w) EA] = l. Thus the subset [w': X'( · , w') EA0] - N' of [w': X'( ·, w') E A ] lies in !F' and has P'-measure 1. • PROOF. difference
Suppose that '
w
Consider the set C of finite-valued, continuous functions on T. If x E S 0 and y E C, and if x and y agree on a dense D, then x and y agree everywhere: x y. Therefore, C0 n S0 c C. Further, Example 38.9.
=
C0 = n U n ( x E RT: I x(s ) l < oo, i x (t ) l < oo, i x( s ) - x( t) I < E ] , €, 1
li
s
.
STOCHASTIC PROCESSES
534
where ε and δ range over the positive rationals, t ranges over D, and the inner intersection extends over the s in D satisfying |s − t| < δ. Hence C_0 ∈ ℛ^T. Thus C satisfies the condition (38.16). Theorem 38.2 now implies that if a process has continuous paths with probability 1, then any separable process having the same finite-dimensional distributions has continuous paths outside a set of probability 0. In particular, a Brownian motion with continuous paths was constructed in the preceding section, and so any separable process with the finite-dimensional distributions of Brownian motion has continuous paths outside a set of probability 0. The argument in Example 38.4 now becomes supererogatory. •

Example 38.10. There is a somewhat similar argument for the step functions of the Poisson process. Let Z^+ be the set of nonnegative integers; let E consist of the nondecreasing functions x in R^T such that x(t) ∈ Z^+ for all t and such that for every n ∈ Z^+ there exists a nonempty interval I such that x(t) = n for t ∈ I. Then

E_0 = ⋂_{s, t ∈ D, s < t} [x: x(s) ≤ x(t)] ∩ ⋂_{t ∈ D} ⋃_{n=0}^∞ [x: x(t) = n] ∩ ⋂_{n=0}^∞ ⋃_I ⋂_{t ∈ D ∩ I} [x: x(t) = n],

where I ranges over the open intervals with rational endpoints. Thus E_0 ∈ ℛ^T. Clearly, E_0 ∩ S_0 ⊂ E, and so Theorem 38.2 applies. In Section 23 was constructed a Poisson process with paths in E, and therefore any separable process with the same finite-dimensional distributions will have paths in E except for a set of probability 0. •

Example 38.11. For E as in Example 38.10, let E_0 consist of the elements of E that are right-continuous; a function in E need not lie in E_0, although at each t it must be continuous from one side or the other. The Poisson process as defined in Section 23 by N_t = max[n: S_n ≤ t] (see (23.5)) has paths in E_0. But if N'_t = max[n: S_n < t], then [N'_t: t ≥ 0] is separable and has the same finite-dimensional distributions, but its paths are not in E_0. Thus E_0 does not satisfy the hypotheses of Theorem 38.2. Separability does not help distinguish between continuity from the right and continuity from the left. •
Example 38.12. The class of sets A satisfying (38.16) is closed under the formation of countable unions and intersections but is not closed under complementation. Define X_t and Y_t as in Example 38.1, and let C be the set of continuous paths. Then [Y_t: t ≥ 0] and [X_t: t ≥ 0] have the same finite-dimensional distributions, and the latter is separable; Y(·, ω) is in R^T − C for each ω, and X(·, ω) is in R^T − C for no ω. •
Example 38.13. As a final example, consider the set J of functions with discontinuities of at most the first kind: x is in J if it is finite-valued, if x(t+) = lim_{s↓t} x(s) exists (finite) for t ≥ 0 and x(t−) = lim_{s↑t} x(s) exists (finite) for t > 0, and if x(t) lies between x(t+) and x(t−) for t > 0. Continuous and right-continuous functions are special cases.
SECTION 38. NONDENUMERABLE PROBABILITIES 535

Let V denote the general system

(38.17)  r_1 < s_1 ≤ r_2 < s_2 ≤ ··· ≤ r_k < s_k;  a_1, a_2, ..., a_k,

where k is an integer and where the r_i, s_i, and a_i are rational. Define

I(D, V, ε) = ⋂_{i=1}^{k} [x: a_i < x(t) < a_i + ε, t ∈ (r_i, s_i) ∩ D]
  ∩ ⋂_{i=2}^{k} [x: min{a_{i−1}, a_i} < x(t) < max{a_{i−1}, a_i} + ε, t ∈ (s_{i−1}, r_i) ∩ D].

Let 𝒱_m be the class of systems (38.17) that have the fixed value m for s_k and satisfy r_i − s_{i−1} < δ. It will be shown that

(38.18)  J_0 = ⋂_ε ⋂_δ ⋂_{m=1}^{∞} ⋃_{V ∈ 𝒱_m} I(D, V, ε),

where ε and δ range over the positive rationals. From this it will follow that J_0 ∈ ℛ^T. It will also be shown that J_0 ∩ S_0 ⊂ J, so that J satisfies the hypothesis of Theorem 38.2.

Suppose that y ∈ J. For fixed ε, let H be the set of nonnegative h for which there exist finitely many points t_i such that 0 = t_0 < t_1 < ··· < t_r = h and |y(t) − y(t')| < ε for t and t' in the same interval (t_{i−1}, t_i). If h_n ∈ H and h_n ↑ h, then from the existence of y(h−) follows h ∈ H. Hence H is closed. If h ∈ H, from the existence of y(h+) it follows that H contains points to the right of h. Therefore, H = [0, ∞). From this it follows that the right side of (38.18) contains J_0. Suppose that x is a member of the right side of (38.18). It is not hard to deduce that for each t the limits
(38.19)  lim_{s↓t, s∈D} x(s),  lim_{s↑t, s∈D} x(s)

exist and that x(t) lies between them if t ∈ D. For t ∈ D take y(t) = x(t), and for t ∉ D take y(t) to be the first limit in (38.19). Then y ∈ J and hence x ∈ J_0. This argument also shows that J_0 ∩ S_0 ⊂ J. •
Appendix
Gathered here for easy reference are certain definitions and results from set theory and real analysis required in the text. Although there are many newer books, HAUSDORFF (the early sections) on set theory and HARDY on analysis are still excellent for the general background assumed here.

Set Theory
A1. The empty set is denoted by ∅. Sets are variable subsets of some space that is fixed in any one definition, argument, or discussion; this space is denoted either generically by Ω or by some special symbol (such as R^k for Euclidean k-space). A singleton is a set consisting of just one point or element. That A is a subset of B is expressed by A ⊂ B. In accordance with standard usage, A ⊂ B does not preclude A = B; A is a proper subset of B if A ⊂ B and A ≠ B. The complement of A is always relative to the overall space Ω; it consists of the points of Ω not contained in A and is denoted by A^c. The difference between A and B, denoted by A − B, is A ∩ B^c; here B need not be contained in A, and if it is, then A − B is a proper difference. The symmetric difference A △ B = (A ∩ B^c) ∪ (A^c ∩ B) consists of the points that lie in one of the sets A and B but not in both. Classes of sets are denoted by script letters. The power set of Ω is the class of all subsets of Ω; it is denoted 2^Ω.

A2. The set of ω that lie in A and satisfy a given property p(ω) is denoted [ω ∈ A: p(ω)]; if A = Ω, this is usually shortened to [ω: p(ω)].
A3. In this book, to say that a collection [A_θ: θ ∈ Θ] is disjoint always means that it is pairwise disjoint: A_θ ∩ A_θ' = ∅ if θ and θ' are distinct elements of the index set Θ. To say that A meets B, or that B meets A, is to say that they are not disjoint: A ∩ B ≠ ∅. The collection [A_θ: θ ∈ Θ] covers B if B ⊂ ⋃_θ A_θ. The collection is a decomposition or partition of B if it is disjoint and B = ⋃_θ A_θ.

A4. By A_n ↑ A is meant A_1 ⊂ A_2 ⊂ ··· and A = ⋃_n A_n; by A_n ↓ A is meant A_1 ⊃ A_2 ⊃ ··· and A = ⋂_n A_n.

A5. The indicator, or indicator function, of a set A is the function on Ω that assumes the value 1 on A and 0 on A^c; it is denoted I_A. The alternative term "characteristic function" is reserved for the Fourier transform (see Section 26).
A6. De Morgan's laws are (⋃_θ A_θ)^c = ⋂_θ A_θ^c and (⋂_θ A_θ)^c = ⋃_θ A_θ^c. These and the other facts of basic set theory are assumed known: a countable union of countable sets is countable, and so on.
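De Morgan's laws are easy to check concretely; a small sketch (the space and the indexed family below are arbitrary illustrative choices):

```python
# De Morgan's laws verified on a concrete finite family of sets.
Omega = set(range(20))
family = [set(range(0, 10)), set(range(5, 15)), {1, 17, 19}]

def complement(s):
    # Complement relative to the overall space Omega, as in A1.
    return Omega - s

union = set().union(*family)
intersection = set.intersection(*family)

# (union of A_theta)^c equals the intersection of the A_theta^c
law1 = complement(union) == set.intersection(*[complement(a) for a in family])
# (intersection of A_theta)^c equals the union of the A_theta^c
law2 = complement(intersection) == set().union(*[complement(a) for a in family])
```

The same computation works for any finite family; the countable case is the statement in A6.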
A7. If T: Ω → Ω' is a mapping of Ω into Ω' and A' is a set in Ω', the inverse image of A' is T^{−1}A' = [ω ∈ Ω: Tω ∈ A']. It is easily checked that each of these statements is equivalent to the next: ω ∈ Ω − T^{−1}A', ω ∉ T^{−1}A', Tω ∉ A', Tω ∈ Ω' − A', ω ∈ T^{−1}(Ω' − A'). Therefore, Ω − T^{−1}A' = T^{−1}(Ω' − A'). Simple considerations of this kind show that ⋃_θ T^{−1}A'_θ = T^{−1}(⋃_θ A'_θ) and ⋂_θ T^{−1}A'_θ = T^{−1}(⋂_θ A'_θ), and that A' ∩ B' = ∅ implies T^{−1}A' ∩ T^{−1}B' = ∅ (the reverse implication is false unless TΩ = Ω').

If f maps Ω into another space, f(ω) is the value of the function f at an unspecified value of the argument ω. The function f itself (the rule defining the mapping) is sometimes denoted f(·). This is especially convenient for a function f(ω, t) of two arguments: for each fixed t, f(·, t) denotes the function on Ω with value f(ω, t) at ω.

A8. The axiom of choice. Suppose that [A_θ: θ ∈ Θ] is a decomposition of Ω into nonempty sets. The axiom of choice says that there exists a set (at least one set) C that contains exactly one point from each A_θ: C ∩ A_θ is a singleton for each θ in Θ. The existence of such sets C is assumed in "everyday" mathematics, and the axiom of choice may even seem to be simply true. A careful treatment of set theory, however, is based on an explicit list of such axioms and a study of the relationships between them; see HALMOS or DUDLEY. A few of the problems require Zorn's lemma, which is equivalent to the axiom of choice; see DUDLEY or KAPLANSKY.

The Real Line
A9. The real line is denoted by R^1; x ∨ y = max{x, y} and x ∧ y = min{x, y}. For real x, ⌊x⌋ is the integer part of x, and sgn x is +1, 0, or −1 as x is positive, 0, or negative. It is convenient to be explicit about open, closed, and half-open intervals:

(a, b) = [x: a < x < b],  [a, b] = [x: a ≤ x ≤ b],
(a, b] = [x: a < x ≤ b],  [a, b) = [x: a ≤ x < b].

A10. Of course x_n → x means lim_n x_n = x; x_n ↑ x means x_1 ≤ x_2 ≤ ··· and x_n → x; x_n ↓ x means x_1 ≥ x_2 ≥ ··· and x_n → x. A sequence {x_n} is bounded if and only if every subsequence {x_{n_k}} contains a further subsequence {x_{n_{k(j)}}} that converges to some x: lim_j x_{n_{k(j)}} = x. If {x_n} is not bounded, then for each k there is an n_k for which |x_{n_k}| > k; no subsequence of {x_{n_k}} can converge. The implication in the other direction is a simple consequence of the fact that every bounded sequence contains a convergent subsequence.
If {x_n} is bounded, and if each subsequence that converges at all converges to x, then lim_n x_n = x. If x_n does not converge to x, then |x_{n_k} − x| ≥ ε for some positive ε and some increasing sequence {n_k} of integers; some subsequence of {x_{n_k}} converges, but the limit cannot be x.

A11. A set G is defined as open if for each x in G there is an open interval I such that x ∈ I ⊂ G. A set F is defined as closed if F^c is open. The interior of A, denoted A°, consists of the x in A for which there exists an open interval I such that x ∈ I ⊂ A. The closure of A, denoted A⁻, consists of the x for which there exists a sequence {x_n} in A with x_n → x. The boundary of A is ∂A = A⁻ − A°. The basic facts of real analysis are assumed known: A is open if and only if A = A°; A is closed if and only if A = A⁻; A is closed if and only if it contains all limits of sequences in it; x lies in ∂A if and only if there is a sequence {x_n} in A and a sequence {y_n} in A^c such that x_n → x and y_n → x; and so on.
Every open set G on the line is a countable, disjoint union of open intervals. To see this, define points x and y of G to be equivalent if x ≤ y and [x, y] ⊂ G or y ≤ x and [y, x] ⊂ G. This is an equivalence relation. Each equivalence class is an interval ...

... and lim_k x_{1, n_{1,k}} exists. Look next at

(3)  x_{2, n_{1,1}}, x_{2, n_{1,2}}, x_{2, n_{1,3}}, ....

As a subsequence of the second row of (1), (3) is bounded. Select from it a convergent subsequence x_{2, n_{2,1}}, x_{2, n_{2,2}}, ...; here {n_{2,k}} is an increasing sequence of integers, a subsequence of {n_{1,k}}, and lim_k x_{2, n_{2,k}} exists.
A20. ... = 0 for rational r_0 > x_0. It is thus no restriction to assume that g(x_0) > 0. Let I be an open interval in which g is bounded above. Given a number M, choose n so that ng(x_0) > M, and then choose a rational r so that nx_0 + r lies in I. If r > 0, then g(r + nx_0) = g(r) + g(nx_0) = g(nx_0) = ng(x_0). If r < 0, then ng(x_0) = g(nx_0) = g((−r) + (nx_0 + r)) = g(−r) + g(nx_0 + r) = g(nx_0 + r). In either case, g(nx_0 + r) = ng(x_0); of course this is trivial if r = 0. Since g(nx_0 + r) = ng(x_0) > M and M was arbitrary, g is not bounded above in I, a contradiction. Obviously, the same proof works if f is bounded below in some interval.

Corollary. Let U be a real function on (0, ∞) and suppose that U(x + y) = U(x)U(y) for x, y > 0. Suppose further that there is some interval on which U is bounded above. Then either U(x) = 0 for x > 0, or else there is an A such that U(x) = e^{Ax} for x > 0.

PROOF. Since U(x) = U²(x/2), U is nonnegative. If U(x) = 0, then U(x/2) = 0, and so U vanishes at points arbitrarily near 0. If U vanishes at a point, it must by the functional equation vanish everywhere to the right of that point. Hence U is identically 0 or else everywhere positive. In the latter case, the theorem applies to f(x) = log U(x), this function being bounded above in some interval, and so f(x) = Ax for A = log U(1). •
A21. A number-theoretic fact.

Theorem. Suppose that M is a set of positive integers closed under addition and that M has greatest common divisor 1. Then M contains all integers exceeding some n_0.

PROOF. Let M_1 consist of all the integers m, −m, and m − m' with m and m' in M. Then M_1 is closed under addition and subtraction (it is a subgroup of the group of integers). Let d be the smallest positive element of M_1. If n ∈ M_1, write n = qd + r, where 0 ≤ r < d. Since r = n − qd lies in M_1, r must actually be 0. Thus M_1 consists of the multiples of d. Since d divides all the integers in M_1 and hence all the integers in M, and since M has greatest common divisor 1, d = 1. Thus M_1 contains all the integers. Write 1 = m − m' with m and m' in M (if 1 itself is in M, the proof is easy), and take n_0 = (m + m')². Given n > n_0, write n = q(m + m') + r, where 0 ≤ r < m + m'. From n > n_0 ≥ (r + 1)(m + m') follows q = (n − r)/(m + m') > r. But n = q(m + m') + r(m − m') = (q + r)m + (q − r)m', and since q + r > q − r > 0, n lies in M. •
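The theorem is easy to check numerically; the generating pair {5, 8} below is an arbitrary choice with greatest common divisor 1, and n_0 = (m + m')² is the bound from the proof:

```python
from math import gcd

# A numerical sketch of A21: with gcd(5, 8) = 1 and n0 = (5 + 8)^2 = 169,
# every integer above n0 should be a nonnegative combination 5a + 8b.
m, mp = 5, 8
assert gcd(m, mp) == 1
n0 = (m + mp) ** 2

def in_M(n):
    # n lies in the additive closure M iff n = a*m + b*mp with a, b >= 0.
    return any((n - a * m) % mp == 0 for a in range(n // m + 1))

representable = all(in_M(n) for n in range(n0 + 1, n0 + 500))
```

For this pair the sharp threshold is in fact much smaller (27 is the largest integer not in M), so the bound (m + m')² from the proof is generous but correct.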
A22. One- and two-sided derivatives.

Theorem. Suppose that f and g are continuous on [0, ∞) and g is the right-hand derivative of f on (0, ∞): f⁺(t) = g(t) for t > 0. Then f⁺(0) = g(0) as well, and g is the two-sided derivative of f on (0, ∞).

PROOF. It suffices to show that F(t) = f(t) − f(0) − ∫₀ᵗ g(s) ds vanishes for t ≥ 0. By assumption, F is continuous on [0, ∞) and F⁺(t) = 0 for t > 0. Suppose that F(t_0) > F(t_1), where 0 ≤ t_0 < t_1. Then G(t) = F(t) − (t − t_0)(F(t_1) − F(t_0))/(t_1 − t_0) is continuous on [0, ∞), G(t_0) = G(t_1), and G⁺(t) > 0 on (0, ∞). But then the maximum of G over [t_0, t_1] must occur at some interior point; since G⁺ ≤ 0 at a local maximum, this is impossible. Similarly F(t_0) < F(t_1) is impossible. Thus F is constant over (0, ∞) and by continuity is constant over [0, ∞). Since F(0) = 0, F vanishes on [0, ∞). •
A23. A differential equation. The equation f'(t) = Af(t) + g(t) (t ≥ 0; g continuous) has the particular solution f_0(t) = e^{At} ∫₀ᵗ g(s) e^{−As} ds; for an arbitrary solution f, (f(t) − f_0(t)) e^{−At} has derivative 0 and hence equals f(0) identically. All solutions thus have the form f(t) = e^{At}[f(0) + ∫₀ᵗ g(s) e^{−As} ds].
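The solution formula can be checked numerically; the constants below (A = 2, f(0) = 1, g ≡ 1) are arbitrary choices for which the integral has a closed form:

```python
import math

# Sketch of A23 with g constant: f(t) = e^{At}[f(0) + integral_0^t g(s)e^{-As} ds].
A, f0 = 2.0, 1.0

def f(t):
    integral = (1.0 - math.exp(-A * t)) / A   # integral of e^{-As} on [0, t] for g = 1
    return math.exp(A * t) * (f0 + integral)

# Verify f' = A f + 1 by central differences at a few points.
h = 1e-6
errs = [abs((f(t + h) - f(t - h)) / (2 * h) - (A * f(t) + 1.0))
        for t in (0.3, 0.7, 1.5)]
max_err = max(errs)
```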
A24. A trigonometric identity. If z ≠ 1 and z ≠ 0, then

∑_{k=−l}^{l} z^k = (z^{−l} − z^{l+1})/(1 − z),

and hence

∑_{l=0}^{m−1} ∑_{k=−l}^{l} z^k = (1/(1 − z)) [ (1 − z^{−m})/(1 − z^{−1}) − z(1 − z^m)/(1 − z) ]
 = (2 − z^{−m} − z^{m}) / ((1 − z)(1 − z^{−1}))
 = (z^{m/2} − z^{−m/2})² / (z^{1/2} − z^{−1/2})².

Take z = e^{ix}. If x is not an integral multiple of 2π, then

∑_{l=0}^{m−1} ∑_{k=−l}^{l} e^{ikx} = (sin ½mx)² / (sin ½x)².

If x = 2πn, the left-hand side here is m², which is the limit of the right-hand side as x → 2πn.
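The closed form is easy to check numerically; m and x below are arbitrary test values:

```python
import cmath
import math

# Check: sum_{l=0}^{m-1} sum_{k=-l}^{l} e^{ikx} = (sin(mx/2)/sin(x/2))^2.
def double_sum(m, x):
    return sum(cmath.exp(1j * k * x) for l in range(m) for k in range(-l, l + 1))

m, x = 7, 1.3
lhs = double_sum(m, x)
rhs = (math.sin(m * x / 2) / math.sin(x / 2)) ** 2
gap = abs(lhs - rhs)

# At an integral multiple of 2*pi the double sum equals m^2, the limiting value.
gap_at_zero = abs(double_sum(m, 0.0) - m ** 2)
```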
Infinite Series
A25. Nonnegative series. Suppose x_1, x_2, ... are nonnegative. If E is a finite set of integers, then E ⊂ {1, 2, ..., n} for some n, so that by nonnegativity ∑_{k∈E} x_k ≤ ∑_{k=1}^n x_k. The set of partial sums ∑_{k=1}^n x_k thus has the same supremum as the larger set of sums ∑_{k∈E} x_k (E finite). Therefore, the nonnegative series ∑_{k=1}^∞ x_k converges if and only if the sums ∑_{k∈E} x_k for finite E are bounded, in which case the sum is the supremum: ∑_{k=1}^∞ x_k = sup_E ∑_{k∈E} x_k.
A26. Dirichlet's theorem. Since the supremum in A25 is invariant under permutations, so is ∑_{k=1}^∞ x_k: If the x_k are nonnegative and y_k = x_{f(k)} for some one-to-one map f of the positive integers onto themselves, then ∑_k x_k and ∑_k y_k diverge or converge together and in the latter case have the same sum.
A27. Double series. Suppose that x_{ij}, i, j = 1, 2, ..., are nonnegative. The ith row gives a series ∑_j x_{ij}, and if each of these converges, one can form the series ∑_i ∑_j x_{ij}. Let the terms x_{ij} be arranged in some order as a single infinite series ∑_{ij} x_{ij}; by Dirichlet's theorem, the sum is the same whatever order is used. Suppose each ∑_j x_{ij} converges and ∑_i ∑_j x_{ij} converges. If E is a finite set of the pairs (i, j), there is an n for which ∑_{(i,j)∈E} x_{ij} ≤ ∑_{i≤n} ∑_{j≤n} x_{ij} ≤ ∑_{i≤n} ∑_j x_{ij} ≤ ∑_i ∑_j x_{ij}; hence ∑_{ij} x_{ij} converges and has sum at most ∑_i ∑_j x_{ij}. On the other hand, if ∑_{ij} x_{ij} converges, then ∑_{i≤m} ∑_{j≤n} x_{ij} ≤ ∑_{ij} x_{ij}; letting n → ∞ and then m → ∞ shows that each ∑_j x_{ij} converges and that ∑_i ∑_j x_{ij} ≤ ∑_{ij} x_{ij}. Therefore, in the nonnegative case, ∑_{ij} x_{ij} converges if and only if the ∑_j x_{ij} all converge and ∑_i ∑_j x_{ij} converges, in which case ∑_{ij} x_{ij} = ∑_i ∑_j x_{ij}. By symmetry, ∑_{ij} x_{ij} = ∑_j ∑_i x_{ij}. Thus the order of summation can be reversed in a nonnegative double series: ∑_i ∑_j x_{ij} = ∑_j ∑_i x_{ij}.
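For a nonnegative double series the two iterated sums agree; a small sketch with x_{ij} = 2^{−i} 3^{−j} (an arbitrary convergent example), truncated at a depth where the tails are far below the tolerance:

```python
# Iterated sums of a nonnegative double series, rows first and columns first.
N = 60

def x(i, j):
    return 2.0 ** (-i) * 3.0 ** (-j)

row_first = sum(sum(x(i, j) for j in range(1, N)) for i in range(1, N))
col_first = sum(sum(x(i, j) for i in range(1, N)) for j in range(1, N))

# Exact value: (sum_{i>=1} 2^-i) * (sum_{j>=1} 3^-j) = 1 * (1/2).
exact = 0.5
```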
A28. The Weierstrass M-test.

Theorem. Suppose that lim_n x_{nk} = x_k for each k and that |x_{nk}| ≤ M_k, where ∑_k M_k < ∞. Then ∑_k x_k and all the ∑_k x_{nk} converge, and lim_n ∑_k x_{nk} = ∑_k x_k.

PROOF. The series of course converge absolutely, since ∑_k M_k < ∞. Now |∑_k x_{nk} − ∑_k x_k| ≤ ∑_{k≤k_0} |x_{nk} − x_k| + 2 ∑_{k>k_0} M_k. Given ε, choose k_0 so that ∑_{k>k_0} M_k < ε/3, and then choose n_0 so that n ≥ n_0 implies |x_{nk} − x_k| < ε/3k_0 for k ≤ k_0. Then n ≥ n_0 implies |∑_k x_{nk} − ∑_k x_k| < ε. •

A29. Power series. The principal fact needed is this: If f(x) = ∑_{k=0}^∞ a_k x^k converges in the range |x| < r, then it is differentiable there and

(8)  f'(x) = ∑_{k=1}^∞ k a_k x^{k−1}.
For a simple proof, choose r_0 and r_1 so that |x| < r_0 < r_1 < r. If |h| < r_0 − |x|, so that |x ± h| ≤ r_0, then the mean-value theorem gives (here 0 < θ_h < 1)

(9)  |((x + h)^k − x^k)/h − k x^{k−1}| = |k(x + θ_h h)^{k−1} − k x^{k−1}| ≤ 2k r_0^{k−1}.

Since 2k r_0^{k−1}/r_1^{k−1} goes to 0, it is bounded by some M, and if M_k = |a_k| · M r_1^{k−1}, then ∑_k M_k < ∞ and |a_k| times the left member of (9) is at most M_k for |h| < r_0 − |x|. By the M-test [A28] (applied with h → 0 instead of n → ∞),

lim_{h→0} (f(x + h) − f(x))/h = ∑_{k=1}^∞ k a_k x^{k−1}.

Hence (8). Repeated application of (8) gives

f^{(j)}(x) = ∑_{k=j}^∞ k(k − 1) ··· (k − j + 1) a_k x^{k−j}.

For x = 0, this is a_j = f^{(j)}(0)/j!, the formula for the coefficients in a Taylor series. This shows in particular that the values of f(x) for |x| < r determine the coefficients a_k.
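Term-by-term differentiation is easy to see at work on the geometric series (an arbitrary example inside its interval of convergence, truncated where the tail is negligible):

```python
# f(x) = sum_k x^k = 1/(1-x) on |x| < 1, and, by (8),
# f'(x) = sum_k k x^{k-1} = 1/(1-x)^2.
x = 0.37
N = 400
f_series = sum(x ** k for k in range(N))
fprime_series = sum(k * x ** (k - 1) for k in range(1, N))

err_f = abs(f_series - 1.0 / (1.0 - x))
err_fprime = abs(fprime_series - 1.0 / (1.0 - x) ** 2)
```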
A30. Cesàro averages. If x_n → x, then n^{−1} ∑_{k=1}^n x_k → x. To prove this, let M bound |x_k|, and given ε, choose k_0 so that |x − x_k| < ε/2 for k ≥ k_0. If n ≥ k_0 and n ≥ 4k_0M/ε, then

|x − n^{−1} ∑_{k=1}^n x_k| ≤ n^{−1} ∑_{k=1}^{k_0−1} 2M + n^{−1} ∑_{k=k_0}^{n} ε/2 < ε.
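A quick numerical sketch of A30; the convergent sequence below (limit 2) is an arbitrary illustration:

```python
# Cesaro averages: if x_n -> x then n^{-1}(x_1 + ... + x_n) -> x as well.
x_limit = 2.0
N = 100000

total = 0.0
for n in range(1, N + 1):
    total += x_limit + (-1) ** n / n   # x_n = 2 + (-1)^n / n converges to 2
cesaro = total / N
gap = abs(cesaro - x_limit)
```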
A31. Dyadic expansions. Define a mapping T of the unit interval Ω = (0, 1] into itself by

Tω = 2ω if 0 < ω ≤ ½,  Tω = 2ω − 1 if ½ < ω ≤ 1.

Define a function d_1 on Ω by

d_1(ω) = 0 if 0 < ω ≤ ½,  d_1(ω) = 1 if ½ < ω ≤ 1,

and let

(10)  d_i(ω) = d_1(T^{i−1}ω).

Then ∑_{i=1}^n d_i(ω)/2^i < ω for each n, and ω = ∑_{i=1}^∞ d_i(ω)/2^i: the d_i(ω) are the digits of the nonterminating dyadic expansion of ω.
... ≥ |D_x(x' − x)| − ½β|x' − x| ≥ ½β|x' − x|. Thus

(19)  |x' − x| ≤ (2/β)|Tx' − Tx|  for x, x' ∈ Q⁻.

This shows that T is one-to-one on Q⁻. Since x_0 does not lie in the compact set ∂Q, inf_{x∈∂Q} |Tx − Tx_0| = d > 0. Let Y be the open ball with center y_0 = Tx_0 and radius d/2 (Figure (vi)). Fix a y in Y. The problem is to show that y = Tx for some x in Q°, which means finding an x such that φ(x) = |y − Tx|² = ∑_i (y_i − t_i(x))² vanishes. By compactness, the minimum of φ on Q⁻ is achieved there. If x ∈ ∂Q (and y ∈ Y), then 2|y − y_0| < d ≤ |Tx − y_0| ≤ |Tx − y| + |y − y_0|, so that |y − Tx_0| < |y − Tx|. Therefore, φ(x_0) < φ(x) for x ∈ ∂Q, and so the minimum occurs in Q° rather than on ∂Q. At the minimizing point, ∂φ/∂x_j = −∑_i 2(y_i − t_i(x)) t_{ij}(x) = 0, and since D_x is nonsingular, it follows that y = Tx: each y in Y is the image under T of some point x in Q°. By (19), this x is unique (although it is possible that y = Tz for some z outside Q). Let X = Q° ∩ T^{−1}Y. Then X is open and T is a one-to-one map of X onto Y.

Now let T^{−1} denote the inverse point transformation on Y. By (19), T^{−1} is continuous. To prove differentiability, consider in Y a fixed point y and a variable point y' such that y' → y and y' ≠ y. Let x = T^{−1}y and x' = T^{−1}y'; then x' is a function of y', x' → x, and x' ≠ x. Define u by Tx' − Tx = D_x(x' − x) + u; then u is a function of x' and hence of y', and |u|/|x' − x| → 0 by (17). Apply D_x^{−1}: D_x^{−1}(Tx' − Tx) = x' − x + D_x^{−1}u, or T^{−1}y' − T^{−1}y = D_x^{−1}(y' − y) − D_x^{−1}u. By (18) and (19),

|T^{−1}y' − T^{−1}y − D_x^{−1}(y' − y)| / |y' − y| ≤ (|D_x^{−1}u| / |x' − x|) · (|x' − x| / |y' − y|) ≤ (2/β) · |D_x^{−1}u| / |x' − x|.
2. Since 0 < 1/(a_n + t) < 1, the induction hypothesis (use (26)) gives a_k(x) = a_k for k < n and T^{n−1}x = 1/(a_n + t). Now apply the case n = 1 to T^{n−1}x. (If a_n = 1 and t = 0, then a_k(x) = a_k for k ≤ n − 2, a_{n−1}(x) = a_{n−1} + 1, and T^{n−1}x = 0.)

Consider now the infinite case. Assume that

(34)  x = 1/(a_1 + 1/(a_2 + 1/(a_3 + ···)))

converges, where the a_n are positive integers. Then

(35)  a_n(x) = a_n,  n ≥ 1.

To prove this, let n → ∞ in (25): the continued fraction t = 1/(a_2 + 1/(a_3 + ···)) converges and x = 1/(a_1 + t). It follows by induction (use (26)) that

(36)  0 < T^{n−1}x < 1,  n ≥ 2.

Hence 0 < x < 1, and the same must be true of t. Therefore, a_1 and t are the integer and fractional parts of 1/x, which proves (35) for n = 1. Apply the same argument to Tx, and continue. The x defined by (34) is irrational: otherwise, T^n x = 0 for some n, which contradicts (35) and (36).

Thus the value of an infinite simple continued fraction uniquely determines the partial quotients. The same is almost true of finite simple continued fractions. Since (31) and (32) imply (33), it follows that if x is given by (30), then any continued fraction of n terms that represents x must indeed match (30) term for term. But, for example, 1/(1 + 1/3) = 1/(1 + 1/(2 + 1/1)). This is always possible: replace a_n(x) in (30) (where a_n(x) ≥ 2) by a_n(x) − 1 followed by a final partial quotient 1. Apart from this ambiguity, the representation is unique, and the representation (30) that results from repeated application of T to a rational x never ends with a partial quotient of 1.†

†See ROCKETT & SZÜSZ for more on continued fractions.
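The repeated application of T to a rational can be carried out in exact arithmetic; a short sketch (the sample value 3/4 is an arbitrary choice):

```python
from fractions import Fraction

# Partial quotients via the map x -> 1/x - floor(1/x); for a rational x in
# (0, 1) the expansion produced this way never ends with the quotient 1.
def partial_quotients(x):
    qs = []
    while x != 0:
        inv = 1 / x               # exact, since x is a Fraction
        a = int(inv)              # integer part of 1/x
        qs.append(a)
        x = inv - a               # fractional part: the next iterate Tx
    return qs

def cf_value(qs):
    v = Fraction(0)
    for a in reversed(qs):
        v = 1 / (a + v)
    return v

qs = partial_quotients(Fraction(3, 4))
# The ambiguity in the text: [1, 3] and [1, 2, 1] represent the same number.
same = cf_value([1, 3]) == cf_value([1, 2, 1]) == Fraction(3, 4)
```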
Notes on the Problems
These notes consist of hints, solutions, and references to the literature. As a rule a solution is complete in proportion to the frequency with which it is needed for the solution of subsequent problems.
Section 1

1.1. (a) Each point of the discrete space lies in one of the four sets A_1 ∩ A_2, A_1^c ∩ A_2, A_1 ∩ A_2^c, A_1^c ∩ A_2^c and hence would have probability at most 2^{−2}; continue. (b) If, for each i, B_i is A_i or A_i^c, then B_1 ∩ ··· ∩ B_n has probability at most ∏_{i=1}^n (1 − a_i) ≤ exp[−∑_{i=1}^n a_i].

1.3. (b) Suppose A is trifling and let A⁻ be its closure. Given ε choose intervals (a_k, b_k], k = 1, ..., n, such that A ⊂ ⋃_{k=1}^n (a_k, b_k] and ∑_{k=1}^n (b_k − a_k) < ε/2. If x_k = a_k − ε/2n, then A⁻ ⊂ ⋃_{k=1}^n [x_k, b_k] and ∑_{k=1}^n (b_k − x_k) < ε. For the other parts of the problem, consider the set of rationals in (0, 1).

1.4. (a) Cover A_r(i) by (r − 1)^n intervals of length r^{−n}. (c) Go to the base r^k. Identify the digits in the base r with the keys of the typewriter. The monkey is certain eventually to reproduce the eleventh edition of the Britannica and even, unhappily, the fifteenth.
1.5. (a) The set A_3(1) is itself uncountable, since a point in it is specified by a sequence of 0's and 2's (excluding the countably many that end in 0's). (b) For sequences u_1, ..., u_n of 0's, 1's, and 2's, let M_{u_1 ··· u_n} consist of the points in (0, 1] whose nonterminating base-3 expansions start out with those digits. Then A_3(1) = (0, 1] − ⋃ M_{u_1 ··· u_n}, where the union extends over n ≥ 1 and the sequences u_1, ..., u_n containing at least one 1. The set described in part (b) is [0, 1] − ⋃ M°_{u_1 ··· u_n}, where the union is as before, and this is the closure of A_3(1). From this representation of C, it is not hard to deduce that it can be defined as the set of points in [0, 1] that can be written in base 3 without any 1's if terminating expansions are also allowed. For example, C contains ⅔ = .1222··· = .2000··· because it is possible to avoid 1 in the expansion. (c) Given ε and an ω in C, choose ω' in A_3(1) within ε/2 of ω; now define ω″ by changing from 2 to 0 some digit of ω' far enough out that ω″ differs from ω' by at most ε/2. •
1.7. The interchange of limit and integral is justified because the series ∑_k r_k(ω) 2^{−k} converges uniformly in ω (integration to the limit is studied systematically in Section 16). There is a direct derivation of (1.40): let n → ∞ in sin t = 2^n sin(2^{−n}t) ∏_{k=1}^n cos(2^{−k}t), which follows by induction from the half-angle formula.
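The halving product converges quickly; a numerical check (t = 1.1 and the depth 25 are arbitrary choices):

```python
import math

# sin t = 2^n sin(t/2^n) * prod_{k=1}^n cos(t/2^k), by repeated halving.
def halving_product(t, n):
    p = (2.0 ** n) * math.sin(t / 2.0 ** n)
    for k in range(1, n + 1):
        p *= math.cos(t / 2.0 ** k)
    return p

t = 1.1
gap = abs(halving_product(t, 25) - math.sin(t))
```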
1.10. (a) Given m and a subinterval (a, b] of (0, 1], choose a dyadic interval I in (a, b], and then choose in I a dyadic interval J of order n > m such that n^{−1}s_n(ω) is near 1 for ω ∈ J. This is possible because to specify J is to specify the first n dyadic digits of the points in J: choose the first digits in such a way that J ⊂ I, and take the following ones to be 1, with n so large that n^{−1}s_n(ω) is near 1 for ω ∈ J. (b) A countable union of sets of the first category is also of the first category; (0, 1] = N ∪ N^c would be of the first category if N^c were. For Baire's theorem, see ROYDEN, p. 139.
1.11. (a) If x = p_0/q_0 ≠ p/q, then |x − p/q| = |p_0q − pq_0|/q_0q ≥ 1/q_0q. (c) The rational ∑_{k=1}^n 1/2^{a(k)} has denominator 2^{a(n)} and approximates x to within 2/2^{a(n+1)}.
Section 2

2.3. (b) Let Ω consist of four points, and let 𝒜 consist of the empty set, Ω itself, and all six of the two-point sets.

2.4. (b) For example, take Ω to consist of the integers, and let 𝓕_n be the σ-field generated by the singletons {k} with k ≤ n. As a matter of fact, any example in which 𝓕_n is a proper subclass of 𝓕_{n+1} for all n will do, because it can be shown that in this case ⋃_n 𝓕_n necessarily fails to be a σ-field; see A. Broughton and B. W. Huff: A comment on unions of sigma-fields, Amer. Math. Monthly, 84 (1977), 553-554.

2.5. (b) The class in question is certainly contained in f(𝒜) and is easily seen to be closed under the formation of finite intersections. But (⋃_{i=1}^m ⋂_{j=1}^{n_i} A_{ij})^c = ⋂_{i=1}^m ⋃_{j=1}^{n_i} A_{ij}^c, and ⋃_{j=1}^{n_i} A_{ij}^c = ⋃_{j=1}^{n_i} [A_{ij}^c ∩ ⋂_{l<j} A_{il}] has the required form.
2.8. If 𝒜* is the smallest class over 𝒜 closed under the formation of countable unions and intersections, clearly 𝒜* ⊂ σ(𝒜). To prove the reverse inclusion, first show that the class of A such that A^c ∈ 𝒜* is closed under the formation of countable unions and intersections and contains 𝒜 and hence contains 𝒜*.

2.9. Note that ⋃_n B_n ∈ σ(⋃_n 𝒜_{B_n}).

2.10. (a) Show that the class of A for which I_A(ω) = I_A(ω') is a σ-field. See Example 4.8.

2.11. (b) Suppose that 𝓕 is the σ-field of the countable and the cocountable sets in Ω. Suppose that 𝓕 is countably generated and Ω is uncountable. Show that 𝓕 is generated by a countable class of singletons; if Ω_0 is the union of these, then 𝓕 must consist of the sets B and B ∪ Ω_0^c with B ⊂ Ω_0, and these do not include the singletons in Ω_0^c, which is uncountable because Ω is. (c) Let 𝓕_1 consist of the Borel sets in Ω = (0, 1], and let 𝓕_2 consist of the countable and the cocountable sets there.
2.12. Suppose that A_1, A_2, ... is an infinite sequence of distinct sets in a σ-field 𝓕, and let 𝒢 consist of the nonempty sets of the form ⋂_{n=1}^∞ B_n, where B_n = A_n or B_n = A_n^c, n = 1, 2, .... Each A_n is the union of the 𝒢-sets it contains, and since the A_n are distinct, 𝒢 must be infinite. But there are uncountably many distinct countable unions of 𝒢-sets, and they all lie in 𝓕.
2.18. For this and the subsequent problems on applications of probability theory to arithmetic, the only number theory required is the fundamental theorem of arithmetic and its immediate consequences. The other problems on stochastic arithmetic are 4.15, 4.16, 5.19, 5.20, 6.16, 18.17, 25.15, 30.9, 30.10, 30.11, and 30.12. See also Theorem 30.3. (b) Let A consist of the even integers, let C_k = [m: u_k < m ≤ u_{k+1}], and let B consist of the even integers in C_1 ∪ C_3 ∪ ··· together with the odd integers in C_2 ∪ C_4 ∪ ···; take u_k to increase very rapidly with k and consider A ∩ B. (c) If c is the least common multiple of a and b, then M_a ∩ M_b = M_c. From M_a ∈ 𝓜 conclude in succession that M_a ∩ M_b ∈ 𝓜, M_{a_1} ∩ ··· ∩ M_{a_n} ∩ M_{g_1}^c ∩ ··· ∩ M_{g_m}^c ∈ 𝒟, f(𝓜) ⊂ 𝒟. By the same sequence of steps, show how D on 𝓜 determines D on f(𝓜). (d) If B_1 = M_a − ⋃_{p≤l} M_{ap}, then a ∈ B_1 and (the inclusion-exclusion formula requires only finite additivity) D(B_1) = a^{−1} − ∑_p (ap)^{−1} + ···

Section 3

3.2. (a) ... then A ⊂ B, B ∈ 𝓕, and P*(A) = P(B). For (3.10), argue by complementation. (b) Suppose that P*(A) = P_*(A) and choose 𝓕-sets A_1 and A_2 in such a way that A_1 ⊂ A ⊂ A_2 and P(A_1) = P(A_2). Given E, choose an 𝓕-set B in such a way that E ⊂ B and P*(E) = P(B). Then P*(A ∩ E) + P*(A^c ∩ E) ≤ P(A_2 ∩ B) + P(A_1^c ∩ B). Now use (2.7) to bound the last sum by P(B) + P(A_2 − A_1) = P*(E).
3.3. First note the general fact that P* agrees with P on 𝓕_0 if and only if P is countably additive there, a condition not satisfied in parts (b) and (e). Using Problem 3.2 simplifies the analysis of P* and 𝓜(P*) in the other parts. Note in parts (b) and (e) that, if P* and P_* are defined by (3.1) and (3.2), then, since P_*(A) = 0 for all A, (3.4) holds for all A and (3.3) holds for no A. Countable additivity thus plays an essential role in Problem 3.2.

3.6. (c) Split E^c by A: P_0(E) = 1 − P_0(E^c) = 1 − P(A ∩ E^c) − P_0(A^c ∩ E^c) = 1 − P(A ∩ E^c) − P(A^c) = P(A) − P(A − E).

3.7. (b) Apply (3.13): For A ∈ 𝓕_0, Q(A) = P(H ∩ A) + P_0(H^c ∩ A) = P(H ∩ A) + P(A) − P_0(A − (H^c ∩ A)) = P(A). (c) If A_1 and A_2 are disjoint 𝓕_0-sets, then by (3.12),

Apply (3.13) to the three terms in this equation, successively using A_1 ∪ A_2, A_1, and A_2 for A:

P_0(H^c ∩ (A_1 ∪ A_2)) = P_0(H^c ∩ A_1) + P_0(H^c ∩ A_2).

But for these two equations to hold it is enough that H ∩ A_1 ∩ A_2 = ∅ in the first case and H^c ∩ A_1 ∩ A_2 = ∅ in the second (replacing A_1 by A_1 ∩ A_2^c changes nothing).
3.8. By using Banach limits (BANACH, p. 34) one can similarly prove that density D on the class 𝒟 (Problem 2.18) extends to a finitely additive probability on the class of all subsets of Ω = {1, 2, ...}.

3.14. The argument is based on cardinality. Since the Cantor set C has Lebesgue measure 0, 2^C is contained in the class 𝓛 of Lebesgue sets in (0, 1]. But C is uncountable: card 𝓑 = card(0, 1] < card 2^C ≤ card 𝓛.

3.18. (a) Since the A ⊕ r are disjoint Borel sets, ∑_r λ(A ⊕ r) ≤ 1, and so the common value λ(A) of the λ(A ⊕ r) must be 0. Similarly, if A is a Borel set contained in some H ⊕ r, then λ(A) = 0. (b) If the E ∩ (H ⊕ r) are all Borel sets, they all have Lebesgue measure 0, and so E is a Borel set of Lebesgue measure 0.

3.19. (b) Given A_1, B_1, ..., A_{n−1}, B_{n−1}, note that their union C_n is nowhere dense, so that I contains an interval I_n disjoint from C_n. Choose in I_n disjoint, nowhere dense sets A_n and B_n of positive measure. (c) Note that A_n and B_n are disjoint and that A_n ∪ B_n ⊂ G.

3.20. (a) If I_n are disjoint open intervals with union G, then b^{−1}λ(A) ≥ ∑_n λ(I_n) ≥ ∑_n b^{−1}λ(A ∩ I_n) ≥ b^{−1}λ(A).
Section 4

4.1. Let r be the quantity on the right in (4.30), assumed finite. Suppose that x < r; then x < ⋁_{k=n}^∞ x_k for n ≥ 1 and hence x < x_k for some k ≥ n: x < x_n i.o. Suppose that x < x_n i.o.; then x ≤ ⋁_{k=n}^∞ x_k for n ≥ 1: x ≤ r. It follows that r = sup[x: x < x_n i.o.], which is easily seen to be the supremum of the limit points of the sequence. The argument for (4.31) is similar.

4.10. The class 𝓕 is the σ-field generated by 𝒢 ∪ {H} (Problem 2.7(a)). If (H ∩ G_1) ∪ (H^c ∩ G_2) = (H ∩ G'_1) ∪ (H^c ∩ G'_2), then G_1 △ G'_1 ⊂ H^c and G_2 △ G'_2 ⊂ H; consistency now follows because λ_*(H) = λ_*(H^c) = 0. If A_n = (H ∩ G_1^{(n)}) ∪ (H^c ∩ G_2^{(n)}) are disjoint, then G_1^{(m)} ∩ G_1^{(n)} ⊂ H^c and G_2^{(m)} ∩ G_2^{(n)} ⊂ H for m ≠ n, and therefore (see Problem 2.17) P(⋃_n A_n) = (1/2)λ(⋃_n G_1^{(n)}) + (1/2)λ(⋃_n G_2^{(n)}) = ∑_n ((1/2)λ(G_1^{(n)}) + (1/2)λ(G_2^{(n)})) = ∑_n P(A_n). The intervals with rational endpoints generate 𝒢.
4.14. Show as in Problem 1.1(b) that the maximum of P(B_1 ∩ ··· ∩ B_n), where B_i is A_i or A_i^c, goes to 0. Let A_x = [ω: ∑_n I_{A_n}(ω) 2^{−n} ≤ x], show that P(A ∩ A_x) is continuous in x, and proceed as in Problem 2.19(a).
4.15. Calculate D(F_1) by (2.36) and the inclusion-exclusion formula, and estimate P_n(F_1 − F) by subadditivity; now use 0 ≤ P_n(F_1) − P_n(F) = P_n(F_1 − F). For the calculation of the infinite product, see HARDY & WRIGHT, p. 246.
5.9. Check (5.39) for f(x, y) = −x^{1/p} y^{1/q}.

5.10. Check (5.39) for f(x, y) = −(x^{1/p} + y^{1/p})^p.

5.19. For (5.43) use (2.36) and the fundamental theorem of arithmetic: since the p_i are distinct, the p_i individually divide m if and only if their product does. For (5.44) use inclusion–exclusion. For (5.47), use (5.29) (see Problem 5.12).
5.20. (a) By (5.47), E_n[α_p] ≤ Σ_{k=1}^∞ p⁻ᵏ ≤ 2/p. And, of course, n⁻¹ log n! = E_n[log] = Σ_p E_n[α_p] log p.
(b) Use (5.48) and the fact that E_n[α_p − δ_p] ≤ Σ_{k≥2} p⁻ᵏ.
(c) By (5.49), π(x) ≥ x^{1/2} for large x, and hence log π(x) ≍ log x and π(x) ≍ x/log π(x). Apply this with x = p_r, and note that π(p_r) = r.
Section 6

6.3. Since for given values of X_{n1}(ω), …, X_{n,k−1}(ω) there are for X_{nk}(ω) the k possible values 0, 1, …, k − 1, the number of values of (X_{n1}(ω), …, X_{nn}(ω)) is n!. Therefore, the map ω → (X_{n1}(ω), …, X_{nn}(ω)) is one-to-one, and the X_{nk}(ω) determine ω. It follows that if 0 ≤ x_i < i for 1 ≤ i ≤ k, then the number of permutations ω satisfying X_{ni}(ω) = x_i, 1 ≤ i ≤ k, is just (k + 1)(k + 2) ⋯ n, so that P[X_{ni} = x_i, 1 ≤ i ≤ k] = 1/k!. It now follows by induction on k that X_{n1}, …, X_{nk} are independent and P[X_{nk} = x] = k⁻¹ (0 ≤ x < k). Now calculate

E[X_{nk}] = (0 + 1 + ⋯ + (k − 1))/k = (k − 1)/2,

E[S_n] = Σ_{k=1}^n (k − 1)/2 = n(n − 1)/4,

Var[X_{nk}] = (0² + 1² + ⋯ + (k − 1)²)/k − ((k − 1)/2)² = (k² − 1)/12,

Var[S_n] = Σ_{k=1}^n (k² − 1)/12 = (2n³ + 3n² − 5n)/72.

Apply Chebyshev's inequality.
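These moments are easy to check by Monte Carlo. The sketch below is my own (reading S_n, as is standard for this problem, as the inversion count of a uniform random permutation):

```python
import random

def inversions(perm):
    # count pairs (i, j) with i < j and perm[i] > perm[j]
    n = len(perm)
    return sum(1 for i in range(n) for j in range(i + 1, n) if perm[i] > perm[j])

random.seed(0)
n, trials = 8, 20000
samples = []
for _ in range(trials):
    p = list(range(n))
    random.shuffle(p)
    samples.append(inversions(p))

mean = sum(samples) / trials
var = sum((s - mean) ** 2 for s in samples) / trials
print(mean, n * (n - 1) / 4)                      # both near 14
print(var, (2 * n**3 + 3 * n**2 - 5 * n) / 72)    # both near 16.33
```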
6.7. (a) If k² ≤ n < (k + 1)², let a_n = k²; if M bounds the |x_n|, then

|s_n/n − s_{a_n}/a_n| ≤ (1/n)|s_n − s_{a_n}| + |s_{a_n}|(1/a_n − 1/n) ≤ M(n − a_n)/n + M a_n (n − a_n)/(a_n n) ≤ 2M(n − a_n)/n → 0.
6.16. From (5.53) and (5.54) it follows that a_n = Σ_p n⁻¹⌊n/p⌋ → ∞. The left side of (6.8) is

(1/n)⌊n/(pq)⌋ − (1/n)⌊n/p⌋ · (1/n)⌊n/q⌋ ≤ 1/(pq) − (1/p − 1/n)(1/q − 1/n) ≤ (1/n)(1/p + 1/q).
Section 7
7.3. If one grants that there are only countably many effective rules, the result is an immediate consequence of the mathematics of this and the preceding sections: C is a countable intersection of ℱ-sets of measure 1. The argument proves in particular the nontrivial fact that collectives exist.

7.7. If n ≤ T, then W_n = W_{n−1} − X_{n−1} = W₁ − S_{n−1}, and T is the smallest n for which S_{n−1} = W₁. Use (7.8) for the question of whether the game terminates. Now

F_T = F₀ + Σ_{k=1}^{T−1} (W₁ − S_{k−1}) X_k = F₀ + W₁ S_{T−1} − ½(S²_{T−1} − (T − 1)).
7.8. Let x₁, …, x_r be the initial pattern and put Σ₀ = x₁ + ⋯ + x_r. Define Σ_n = Σ_{n−1} − W_n X_n, L₀ = k, and L_n = L_{n−1} − (3X_n + 1)/2. Then T is the smallest n such that L_n ≤ 0, and T is by the strong law finite with probability 1 if E[3X_n + 1] = 6(p − ⅓) > 0. For n ≤ T, Σ_n is the sum of the pattern used to determine W_{n+1}. Since F_n − F_{n−1} = Σ_{n−1} − Σ_n, it follows that F_n = F₀ + Σ₀ − Σ_n and F_T = F₀ + Σ₀.
Section 8
8.8. (b) With probability 1 the population either dies out or goes to infinity. If, for example, p_{k0} = 1 − p_{k,k+1} = 1/k², then extinction and explosion each have positive probability.
8.9. To prove that x_i ≡ 0 is the only possibility in the persistent case, use Problem 8.5, or else argue directly: if x_i = Σ_{j≠i₀} p_{ij} x_j for i ≠ i₀, and K bounds the |x_i|, … that P and N are both nonempty. For i₀ ∈ P and j₀ ∈ N choose n so that p_{i₀j₀}⁽ⁿ⁾ > 0. Then …, and apply Theorem 8.8.
In FELLER, Volume 1, the renewal theorem is proved by purely analytic means and is then used as the starting point for the theory of Markov chains. Here the procedure is the reverse.
8.19. The transition probabilities are p_{0r} = 1 and p_{i,r−i+1} = p, p_{i,r−i} = q, 1 ≤ i ≤ r; the stationary probabilities are u₁ = ⋯ = u_r = q⁻¹u₀ = (r + q)⁻¹. The chance of getting wet is u₀p, of which the maximum is 2r + 1 − 2√(r(r + 1)). For r = 5 this is .046, the pessimal value of p being .523. Of course, u₀p < 1/4r. In more reasonable climates fewer umbrellas suffice: if p = .25 and r = 3, then u₀p = .050; if p = .1 and r = 2, then u₀p = .031. At the other end of the scale, if p = .8 and r = 3, then u₀p = .050; and if p = .9 and r = 2, then u₀p = .043.
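A numerical check of these figures (my own sketch, not part of the text): the chance of getting wet is u₀p = pq/(r + q) with q = 1 − p, and a grid search over p recovers both the pessimal p ≈ .523 and the closed-form maximum 2r + 1 − 2√(r(r + 1)) for r = 5.

```python
import math

def wet(p, r):
    # stationary chance of getting wet: u0 * p with u0 = q/(r + q)
    q = 1.0 - p
    return p * q / (r + q)

r = 5
ps = [i / 100000.0 for i in range(1, 100000)]
p_star = max(ps, key=lambda p: wet(p, r))
closed = 2 * r + 1 - 2 * math.sqrt(r * (r + 1))

print(round(p_star, 3), round(wet(p_star, r), 3))   # ≈ 0.523, 0.046
print(round(closed, 3))                             # ≈ 0.046
```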
8.22. For the last part, consider the chain with state space C_m and transition probabilities p_{ij} for i, j ∈ C_m (show that they do add to 1).

8.23. Let C′ = S − (T ∪ C), and take U = T ∪ C′ in (8.51). The probability of absorption in C is the probability of ever entering it, and for initial states i in T ∪ C′ these probabilities are the minimal solution of

y_i = Σ_{j∈T} p_{ij} y_j + Σ_{j∈C′} p_{ij} y_j + Σ_{j∈C} p_{ij},  i ∈ T ∪ C′,
0 ≤ y_i ≤ 1,  i ∈ T ∪ C′.

Since the states in C′ (C′ = ∅ is possible) are persistent and C is closed, it is impossible to move from C′ to C. Therefore, in the minimal solution of the system above, y_i = 0 for i ∈ C′. This gives the system (8.55). It also gives, for the minimal solution, Σ_{j∈T} p_{ij} y_j + Σ_{j∈C} p_{ij} = 0, i ∈ C′. This makes probabilistic sense: for an i in C′, not only is it impossible to move to a j in C, it is impossible to move to a j in T for which there is positive probability of absorption in C.
8.24. Fix on a state i, and let S_ν consist of those j for which p_{ij}⁽ⁿ⁾ > 0 for some n congruent to ν modulo t. Choose k so that p_{ji}⁽ᵏ⁾ > 0; if p_{ij}⁽ᵐ⁾ and p_{ij}⁽ⁿ⁾ are positive, then t divides m + k and n + k, so that m and n are congruent modulo t. The S_ν are thus well defined.

8.25. Show that Theorem 8.6 applies to the chain with transition probabilities p_{ij}⁽ᵗ⁾.

8.27. (a) From PC = CΛ follows Pc_u = λ_u c_u, from RP = ΛR follows r_u P = λ_u r_u, and from RC = I follows r_u c_v = δ_{uv}. Clearly Λⁿ is diagonal and P = CΛR. Hence

p_{ij}⁽ⁿ⁾ = Σ_u c_{iu} λ_uⁿ r_{uj} = Σ_u λ_uⁿ (c_u r_u)_{ij} = Σ_u λ_uⁿ (A_u)_{ij}.
By Problem 8.26, there are scalars ρ and γ such that r₁ = ρr₀ = ρ(π₁, …, π_s) and c₁ = γc₀, where c₀ is the column vector of 1's. From r₁c₁ = 1 follows ργ = 1, and hence A₁ = c₁r₁ = c₀r₀ has rows (π₁, …, π_s). Of course, (8.56) gives the exact rate of convergence. It is useful for numerical work. Since P_i([τ < n] ∩ [X_k > 0, k ≥ 1]) → f_{i0} > 0, there is an n of the kind required in the last part of the problem. And now E_i[f(X_τ)] ≤ P_i[τ < n = u_n] f(i + n) + 1 − P_i[τ < n = u_n] = 1 − P_i[τ < n = u_n] f_{i+n,0}.
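A small numerical illustration of this spectral decomposition (my own sketch; the two-state chain and its closed form are standard, not taken from the text): for P = [[1−a, a], [b, 1−b]] one has Pⁿ = A₁ + λⁿA₂ with λ = 1 − a − b and A₁ having the stationary rows (b, a)/(a + b), and matrix powers match it term for term.

```python
a, b = 0.3, 0.1
lam = 1.0 - a - b

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

P = [[1 - a, a], [b, 1 - b]]
Pn = [[1.0, 0.0], [0.0, 1.0]]
n = 20
for _ in range(n):
    Pn = matmul(Pn, P)

s = a + b
closed = [[(b + a * lam**n) / s, (a - a * lam**n) / s],
          [(b - b * lam**n) / s, (a + b * lam**n) / s]]
err = max(abs(Pn[i][j] - closed[i][j]) for i in range(2) for j in range(2))
pi = (b / s, a / s)
print(err)          # essentially machine precision
print(Pn[0], pi)    # rows of P^n already close to (0.25, 0.75)
```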
8.37. If i ≥ 1, n₁ < n₂, (i, …, i + n₁) ∈ I_{n₁}, and (i, …, i + n₂) ∈ I_{n₂}, then P_i[τ = n₁, τ = n₂] ≥ P_i[X_k = i + k, k ≤ n₂] > 0, which is impossible.

Section 9

9.3. See
BAHADUR.
9.7. Because of Theorem 9.6 there are for P[M_n > a] bounds of the same order as the ones for P[S_n > a] used in the proof of (9.36).
Section 10

10.7. Let μ₁ be counting measure on the σ-field of all subsets of a countably infinite Ω, let μ₂ = 2μ₁, and let 𝒟 consist of the cofinite sets. Granted the existence of Lebesgue measure λ on ℛ¹, one can construct another example: let μ₁ = λ and μ₂ = 2λ, and let 𝒟 consist of the half-infinite intervals (−∞, x]. There are similar examples with a field ℱ₀ in place of 𝒟. Let Ω consist of the rationals in (0, 1], let μ₁ be counting measure, let μ₂ = 2μ₁, and let ℱ₀ consist of finite disjoint unions of "intervals" [r ∈ Ω: a < r ≤ b].

Section 11

11.4. (b) If (f, g] ⊂ ⋃_k (f_k, g_k], then (f(ω), g(ω)] ⊂ ⋃_k (f_k(ω), g_k(ω)] for all ω, and Theorem 1.3 gives g(ω) − f(ω) ≤ Σ_k (g_k(ω) − f_k(ω)). If h_n = (g − f − Σ_{k≤n}(g_k − f_k)) ∨ 0, then h_n ↓ 0 and g − f ≤ Σ_{k≤n}(g_k − f_k) + h_n. The positivity and continuity of Λ now give ν₀(f, g] ≤ Σ_k ν₀(f_k, g_k]. A similar, easier argument shows that Σ_k ν₀(f_k, g_k] ≤ ν₀(f, g] if the (f_k, g_k] are disjoint subsets of (f, g].

11.5. (b) From (11.7) it follows that [f ≥ 1] ∈ ℱ₀ for f in ℒ. Since ℒ is linear, [f ≥ x] and [f ≤ −x] are in ℱ₀ for f ∈ ℒ and x > 0. Since the sets [x, ∞) and (−∞, −x] for x > 0 generate ℛ¹, each f in ℒ is measurable σ(ℱ₀). Hence ℱ ⊂ σ(ℱ₀). It is easy to show that ℱ₀ is a semiring and is in fact closed under the formation of proper differences. It can happen that Ω ∉ ℱ₀ — for example, in the case where Ω = {1, 2} and ℒ consists of the f with f(1) = 0. (See Jürgen Kindler: A simple proof of the Daniell–Stone representation theorem, Amer. Math. Monthly 90 (1983), 396–397.)
Section 12

12.4. (a) If θ_n = θ_m, then θ_{n−m} = 0 and n = m because θ is irrational. Split G into finitely many intervals of length less than ε; one of them must contain points θ_{2n} and θ_{2m} with θ_{2n} < θ_{2m}. If k = m − n, then 0 < θ_{2m} ⊖ θ_{2n} = θ_{2k} < ε, and the points θ_{2kl} for 1 ≤ l ≤ ⌊1/θ_{2k}⌋ form a chain in which the distance from each to the next is less than ε, the first is to the left of ε, and the last is to the right of 1 − ε.
(c) If θ_{2k+1} = s₁ ⊖ s₂ with s₁ = θ_{2n₁} and s₂ = θ_{2n₂} in the subgroup, then θ_{2k+1} = θ_{2(n₁−n₂)}, which is impossible.
12.5. (a) The S ⊕ θ_n are disjoint, and (2n + 1)u + k = (2n + 1)u′ + k′ with |k|, |k′| ≤ n is impossible if u ≠ u′. (b) The A ⊕ θ_n are disjoint, contained in G, and have the same Lebesgue measure.

12.6. See Example 2.10 (which applies to any finite measure).

12.8. By Theorem 12.3 and Problem 2.19(b), A contains two disjoint compact sets of arbitrarily small positive measure. Construct inductively compact sets K_{u₁⋯u_n} (each u_i is 0 or 1) such that 0 < μ(K_{u₁⋯u_n}) < 3⁻ⁿ and K_{u₁⋯u_n0} and K_{u₁⋯u_n1} are disjoint subsets of K_{u₁⋯u_n}. Take K = ⋂_n ⋃_{u₁⋯u_n} K_{u₁⋯u_n}. The Cantor set is a special case.
Section 13

13.3. If f = Σ_i x_i I_{A_i} and A_i ∈ T⁻¹𝒳′, take A′_i in 𝒳′ so that A_i = T⁻¹A′_i, and set φ = Σ_i x_i I_{A′_i}. For the general f measurable T⁻¹𝒳′, there exist simple functions f_n, measurable T⁻¹𝒳′, such that f_n(ω) → f(ω) for each ω. Choose φ_n, measurable 𝒳′, so that f_n = φ_n T. Let C′ be the set of ω′ for which φ_n(ω′) has a finite limit, and define φ(ω′) = lim_n φ_n(ω′) for ω′ ∈ C′ and φ(ω′) = 0 for ω′ ∉ C′. Theorem 20.1(ii) is a special case.

13.7. The class of Borel functions contains the continuous functions and is closed under pointwise passages to the limit and hence contains 𝒢. By imitating the proof of the π–λ theorem, show that if f and g lie in 𝒢, then so do f + g, fg, f − g, f ∨ g (note that, for example, [g: f + g ∈ 𝒢] is closed under passages to the limit). If f_n(x) is 1 or 1 − n(x − a) or 0 as x ≤ a or a ≤ x ≤ a + n⁻¹ or a + n⁻¹ ≤ x, then f_n is continuous and f_n(x) → I_{(−∞, a]}(x). Show that [A: I_A ∈ 𝒢] is a λ-system. Conclude that 𝒢 contains all indicators of Borel sets, all simple Borel functions, all Borel functions.

13.13. Let B = {b₁, …, b_k}, where k < n, E_i = C − b_i⁻¹A, and E = ⋃_{i=1}^k E_i. Then E = C − ⋂_{i=1}^k b_i⁻¹A. Since μ is invariant under rotations, μ(E_i) = 1 − μ(A) < n⁻¹, and hence μ(E) < 1. Therefore C − E = ⋂_{i=1}^k b_i⁻¹A is nonempty. Use any θ in C − E.
Section 14

14.3. (b) Since u ≤ F(x) is equivalent to φ(u) ≤ x, it follows that u ≤ F(φ(u)). And since F(x) < u is equivalent to x < φ(u), it follows further that F(φ(u) − ε) < u for positive ε.

14.4. (a) If 0 < u < v < 1, then P[u ≤ F(X) < v, X ∈ C] = P[φ(u) ≤ X < φ(v), X ∈ C]. If φ(u) ∈ C, this is at most P[φ(u) ≤ X < φ(v)] = F(φ(v)−) − F(φ(u)−) = F(φ(v)−) − F(φ(u)) ≤ v − u; if φ(u) ∉ C, it is at most P[φ(u) < X < φ(v)] = F(φ(v)−) − F(φ(u)) ≤ v − u. Thus P[F(X) ∈ [u, v), X ∈ C] ≤ λ[u, v) if 0 < u < v < 1. This is true also for u = 0 (let u ↓ 0 and note that P[F(X) = 0] = 0) and for v = 1 (let v ↑ 1). The finite disjoint unions of intervals [u, v) in [0, 1) form a field there, and by addition P[F(X) ∈ A, X ∈ C] ≤ λ(A) for A in this field. By the monotone class theorem, the inequality holds for all Borel sets in [0, 1). Since P[F(X) = 1, X ∈ C] = 0, this holds also for A = {1}.

14.5. The sufficiency is easy. To prove necessity, choose continuity points x_i of F in such a way that x₀ < x₁ < ⋯ < x_k, F(x₀) < ε, F(x_k) > 1 − ε, and x_i − x_{i−1} < ε. If n exceeds some n₀, |F(x_i) − F_n(x_i)| < ε/2 for all i. Suppose that x_{i−1} ≤ x ≤ x_i. Then F_n(x) ≤ F_n(x_i) ≤ F(x_i) + ε/2 ≤ F(x + ε) + ε/2. Establish a similar inequality going the other direction, and give special arguments for the cases x < x₀ and x > x_k.
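The quantile relation in 14.3(b) is exactly what makes inverse-CDF sampling work. A sketch of my own (the exponential distribution is an arbitrary example, not the book's): with φ(u) = inf[x: u ≤ F(x)] and U uniform, φ(U) has distribution F.

```python
import math, random

random.seed(1)
n = 50000
# F(x) = 1 - e^{-x}  =>  quantile function phi(u) = -log(1 - u)
samples = sorted(-math.log(1.0 - random.random()) for _ in range(n))

# Kolmogorov-style distance between empirical and true CDFs
dist = max(abs((i + 1) / n - (1.0 - math.exp(-x)))
           for i, x in enumerate(samples))
print(dist)   # small, on the order of 1/sqrt(n)
```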
Section 15

15.1. Suppose there is an ℱ-partition {A_i} such that Σ_i [sup_{A_i} f] μ(A_i) < ∞. Then a_i = sup_{A_i} f < ∞ for i in the set I of indices for which μ(A_i) > 0. If a = max_I a_i, then μ[f > a] = Σ_i μ(A_i ∩ [f > a]) ≤ Σ_I μ(A_i ∩ [f > a_i]) = 0. And A_i ∩ [f > 0] = ∅ for i outside the set J of indices for which μ(A_i) < ∞, so that μ[f > 0] = Σ μ(A_i ∩ [f > 0]) ≤ Σ_J μ(A_i) < ∞.
15.4. Let (Ω, ℱ⁺, μ⁺) be the completion (Problems 3.10 and 10.5) of (Ω, ℱ, μ). If g is measurable ℱ, [f ≠ g] ⊂ A, A ∈ ℱ, μ(A) = 0, and H ∈ ℛ¹, then [f ∈ H] = (Aᶜ ∩ [f ∈ H]) ∪ (A ∩ [f ∈ H]) = (Aᶜ ∩ [g ∈ H]) ∪ (A ∩ [f ∈ H]) lies in ℱ⁺, and hence f is measurable ℱ⁺.
(a) Since f is measurable ℱ⁺, it will be enough to prove that for each (finite) ℱ⁺-partition {B_i} there is an ℱ-partition {A_i} such that Σ_i [inf_{A_i} f] μ(A_i) ≥ Σ_i [inf_{B_i} f] μ⁺(B_i), and to prove the dual relation for the upper sums. Choose (Problem 3.2) ℱ-sets A_i so that A_i ⊂ B_i and μ(A_i) = μ_*(B_i) = μ⁺(B_i). For the partition consisting of the A_i together with (⋃_i A_i)ᶜ, the lower sum is at least Σ_i [inf_{B_i} f] μ(A_i) = Σ_i [inf_{B_i} f] μ⁺(B_i).
(b) Choose successively finer ℱ-partitions {A_{ni}} in such a way that the corresponding upper and lower sums differ by at most 1/n³. Let g_n and f_n have values inf_{A_{ni}} f and sup_{A_{ni}} f on A_{ni}. Use Markov's inequality — since μ(Ω) is finite, it may as well be 1 — to show that μ[f_n − g_n > 1/n] ≤ 1/n², and then use the first Borel–Cantelli lemma to show that f_n − g_n → 0 almost everywhere. Take g = lim_n g_n.
Section 16
16.4. (a) By Fatou's lemma,

∫(b − lim sup_n f_n) dμ = ∫ lim inf_n (b_n − f_n) dμ ≤ lim inf_n ∫(b_n − f_n) dμ = ∫ b dμ − lim sup_n ∫ f_n dμ,

and therefore lim sup_n ∫ f_n dμ ≤ ∫ lim sup_n f_n dμ.
16.6. For ω ∈ A and small enough complex h,

|f(ω, z₀ + h) − f(ω, z₀)| = |∫_{z₀}^{z₀+h} f′(ω, z) dz| ≤ |h| g(ω, z₀).

16.8. Use the fact that ∫_A |f| dμ ≤ aμ(A) + ∫_{[|f|≥a]} |f| dμ.
16.9. If μ(A) < δ implies ∫_A |f_n| dμ < ε for all n, and if a ≥ δ⁻¹ sup_n ∫ |f_n| dμ, then μ[|f_n| ≥ a] ≤ a⁻¹ ∫ |f_n| dμ ≤ δ and hence ∫_{[|f_n|≥a]} |f_n| dμ < ε for all n. For the reverse implication adapt the argument in the preceding note.
16.10. (b) Suppose that the f_n are nonnegative and satisfy condition (ii) and μ is nonatomic. Choose δ so that μ(A) < δ implies ∫_A f_n dμ < 1 for all n. If μ[f_n = ∞] > 0, there is an A such that A ⊂ [f_n = ∞] and 0 < μ(A) < δ; but then ∫_A f_n dμ = ∞. Since μ[f_n = ∞] = 0, there is an a such that μ[f_n > a] < δ ≤ μ[f_n ≥ a]. Choose B ⊂ [f_n = a] in such a way that A = [f_n > a] ∪ B satisfies μ(A) = δ. Then aδ = aμ(A) ≤ ∫_A f_n dμ < 1 and ∫ f_n dμ ≤ 1 + aμ(Aᶜ) < 1 + δ⁻¹μ(Ω).

16.12. (b) Suppose that f ∈ ℒ and f ≥ 0. If f_n = (1 − n⁻¹)f ∨ 0, then f_n ∈ ℒ and f_n ↑ f, so that ν(f_n, f] = Λ(f − f_n) ↓ 0. Since ν(f₁, f] < ∞, it follows that ν[(ω, t): f(ω) = t] = 0. The disjoint union ⋃_n (f_n, f_{n+1}] increases to B, where B ⊂ (0, f] and (0, f] − B ⊂ [(ω, t): f(ω) = t]. Therefore ν(0, f] = lim_n Λ(f_n) = Λ(f).
Section 17

17.1. (a) Let A_ε be the set of x such that for every δ there are points y and z satisfying |y − x| < δ, |z − x| < δ, and |f(y) − f(z)| ≥ ε. Show that A_ε is closed and D_f is the union of the A_ε.
(c) Given ε and η, choose a partition into intervals I_i for which the corresponding upper and lower sums differ by at most εη. By considering those I_i whose interiors meet A_ε, show that εη ≥ ελ(A_ε), so that λ(A_ε) ≤ η.
(d) Let M bound |f| and, given ε, find an open G such that D_f ⊂ G and λ(G) < ε/M. Take C = [0, 1] − G and show by compactness that there is a δ such that |f(y) − f(x)| < ε if x (but perhaps not y) lies in C and |y − x| < δ. If [0, 1] is decomposed into intervals I_i with λ(I_i) < δ, and if x_i ∈ I_i, let g be the function with value f(x_i) on I_i. Let Σ′ denote summation over those i for which I_i meets C, and let Σ″ denote summation over the other i. Show that

|∫ f(x) dx − Σ_i f(x_i)λ(I_i)| ≤ ∫ |f(x) − g(x)| dx ≤ Σ′ 2ελ(I_i) + Σ″ 2Mλ(I_i) ≤ 4ε.
17.10. (c) Do not overlook the possibility that points in (0, 1) − K converge to a point in K.

17.11. (b) Apply the bounded convergence theorem to f_n(x) = (1 − n dist(x, [s, t]))⁺.
(c) The class of Borel sets B in [u, v] for which f = I_B satisfies (17.8) is a λ-system.
(e) Choose simple f_n such that 0 ≤ f_n ↑ f. To (17.8) for f = f_n, apply the monotone convergence theorem on the right and the dominated convergence theorem on the left.

17.12. If g(x) is the distance from x to [a, b], then f_n = (1 − ng) ∨ 0 ↓ I_{[a,b]} and f_n ∈ ℒ; since the continuous functions are measurable ℛ¹, it follows that σ(ℒ) ⊂ ℛ¹. If f_n(x) ↓ 0 for each x, then the compact sets [x: f_n(x) ≥ ε] decrease to ∅ and hence one of them is empty; thus the convergence is uniform.
17.13. The linearity and positivity of Λ are certainly elementary facts, and for the continuity property, note that if 0 ≤ f ≤ ε and f vanishes outside [a, b], then elementary considerations show that 0 ≤ Λ(f) ≤ ε(b − a).

Section 18

18.2. First, 𝒳 × 𝒳 is generated by the sets of the forms {x} × X and X × {x}. If the diagonal E lies in 𝒳 × 𝒳, then there must be a countable S in X such that E lies in the σ-field ℱ generated by the sets of these two forms for x in S. If 𝒫 consists of Sᶜ and the singletons in S, then ℱ is the class of unions of sets in the partition [P₁ × P₂: P₁, P₂ ∈ 𝒫]. But E ∈ ℱ is impossible.

18.3. Consider A × B, where A consists of a single point and B lies outside the completion of ℛ¹ with respect to λ.
18.17. Put f(p) = p⁻¹ log p, and put f_n = 0 if n is not a prime. In the notation of (18.17), F(x) = log x + φ(x), where φ is bounded because of (5.51). If G(x) = 1/log x, then

Σ_{p≤x} 1/p = F(x)/log x + ∫₂ˣ F(t)/(t log² t) dt
= 1 + φ(x)/log x + ∫₂ˣ dt/(t log t) + ∫₂ˣ φ(t)/(t log² t) dt
= log log x + 1 − log log 2 + ∫₂^∞ φ(t)/(t log² t) dt − ∫ₓ^∞ φ(t)/(t log² t) dt + φ(x)/log x.
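Numerically the bounded terms above settle down to the classical Mertens constant M ≈ 0.2615 (a known value, not derived in the text). A quick sieve check of my own:

```python
import math

N = 200000
sieve = [True] * (N + 1)
sieve[0] = sieve[1] = False
for i in range(2, int(N ** 0.5) + 1):
    if sieve[i]:
        for j in range(i * i, N + 1, i):
            sieve[j] = False

# sum_{p <= N} 1/p - log log N  ->  M ≈ 0.2615
s = sum(1.0 / p for p in range(2, N + 1) if sieve[p])
print(s - math.log(math.log(N)))   # ≈ 0.2615
```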
Section 19

19.3. See BANACH, p. 34.

19.4. (a) Take f = 0 and f_n = I_{(0,1/n)}. (b) Take f = 0, and let {f_n} be an infinite orthonormal set. Use the fact that Σ_n (f_n, g)² ≤ ‖g‖².

19.5. Take f_n = … ≥ 1/3; then B₀(6s) ≤ 1 ≤ 3B_ε(s). In either case, B₀(6s) ≤ 3B_ε(s).
Section 23

23.3. Note that A_t cannot exceed t. If 0 ≤ u < t and v ≥ 0, then

P[A_t ≥ u, B_t ≥ v] = P[N_{t+v} − N_{t−u} = 0] = e^{−αu} e^{−αv}.
23.4. (a) Use (20.37) and the distributions of A_t and B_t. (b) A long interarrival interval has a better chance of covering t than a short one does.

23.6. The probability in question — that an independent rate-β process registers exactly j events over an interval with the gamma density α^k x^{k−1} e^{−αx}/Γ(k) — is

∫₀^∞ e^{−βx} ((βx)^j/j!) (α^k x^{k−1}/Γ(k)) e^{−αx} dx = ((j + k − 1)!/(j!(k − 1)!)) α^k β^j/(α + β)^{k+j}.
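The inspection-paradox claim in 23.4(b) is easy to see by simulation (my own sketch, not part of the text): the interarrival interval covering a fixed time t has mean length about 2/α, twice that of a typical interval, precisely because long intervals are more likely to cover t.

```python
import random

random.seed(2)
alpha, t, trials = 1.0, 50.0, 20000
lengths = []
for _ in range(trials):
    s = 0.0
    while True:
        gap = random.expovariate(alpha)   # interarrival times, rate alpha
        if s + gap > t:                   # this gap covers time t
            lengths.append(gap)
            break
        s += gap

mean_len = sum(lengths) / trials
print(mean_len)   # ≈ 2/alpha = 2, not 1/alpha = 1
```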
23.8. Let M_t be the given process and put φ(t) = E[M_t]. Since there are no fixed discontinuities, φ(t) is continuous. Let ψ(u) = inf[t: u ≤ φ(t)], and show that N_u = M_{ψ(u)} is an ordinary Poisson process and M_t = N_{φ(t)}.
2 it /3")• 31 15 noon � l (.!.2 + .!.e 2 •
31.18. For x fixed, let u_n and v_n be the pair of successive dyadic rationals of order n (v_n − u_n = 2⁻ⁿ) for which u_n ≤ x < v_n. Show that

(f(v_n) − f(u_n))/(v_n − u_n) = Σ_{k=0}^{n−1} a_k⁻(x),

where a_k⁻ is the left-hand derivative. Since a_k⁻(x) = ±1 for all x and k, the difference ratio cannot have a finite limit.
31.22. Let A be the x-set where (31.35) fails if f is replaced by fφ; then A has Lebesgue measure 0. Let G be the union of all open sets of μ-measure 0; represent G as a countable disjoint union of open intervals, and let B be G together with any endpoints of zero μ-measure of these intervals. Let D be the set of discontinuity points of F. If F(x) ∉ A, x ∉ B, and x ∉ D, then F(x − h) < F(x) ≤ F(x + h), F(x ± h) → F(x), and …

Now x − ε ≤ φ(F(x)) ≤ x follows from F(x − ε) < F(x), and hence φ(F(x)) = x. If λ is Lebesgue measure restricted to (0, 1), then μ = λφ⁻¹, and (31.36) follows by change of variable. But (31.36) is easy if x ∈ D, and hence it holds outside B ∪ (Dᶜ ∩ F⁻¹A). But μ(B) = 0 by construction and μ(Dᶜ ∩ F⁻¹A) = 0 by Problem 14.4.
Section 32
32.7. Define μ_n and ν_n as in (32.7), and write ν_n = ν_ac⁽ⁿ⁾ + ν_s⁽ⁿ⁾, where ν_ac⁽ⁿ⁾ is absolutely continuous and ν_s⁽ⁿ⁾ singular with respect to μ_n. Take ν_ac = Σ_n ν_ac⁽ⁿ⁾ and ν_s = Σ_n ν_s⁽ⁿ⁾. Suppose that ν_ac(E) + ν_s(E) = ν′_ac(E) + ν′_s(E) for all E in ℱ. Choose an S in ℱ that supports ν_s and ν′_s and satisfies μ(S) = 0. Then

ν_ac(E) = ν_ac(E ∩ Sᶜ) = ν_ac(E ∩ Sᶜ) + ν_s(E ∩ Sᶜ) = ν′_ac(E ∩ Sᶜ) + ν′_s(E ∩ Sᶜ) = ν′_ac(E ∩ Sᶜ) = ν′_ac(E).

A similar argument shows that ν_s(E) = ν′_s(E).
32.8. (a) Show that the class in question is closed under the formation of countable unions; choose sets B_n in it such that μ(B_n) → sup_B μ(B) (< ∞), and take B₀ = ⋃_n B_n.
(b) The same argument.
(c) Suppose μ(D₀) > 0. The maximality of B₀ implies that B₀ ∪ D₀ contains an E such that μ(E) > 0 and ν(E) < ∞. Since B₀ ∩ E ⊂ B₀, μ(B₀ ∩ E) = 0 (ν(E) < ∞ rules out ν(B₀ ∩ E) = ∞). Therefore, μ(D₀ ∩ E) > 0 and ν(D₀ ∩ E) < ∞, which contradicts the maximality of C₀.
(d) Take the density to be ∞ on D₀.

32.9. Define f and ν_s as in (32.8), and let f⁰ and ν_s⁰ be the corresponding function and measure for ℱ₀: ν(E) = ∫_E f⁰ dμ + ν_s⁰(E) for E ∈ ℱ₀, and there is an ℱ₀-set S₀ such that ν_s⁰(Ω − S₀) = 0 and μ(S₀) = 0. If E ∈ ℱ₀, it follows that ∫_E f⁰ dμ = ∫_{E−S₀} f⁰ dμ = ∫_{E−S₀} f⁰ dμ₀ = ν₀(E − S₀) = ν(E − S₀) ≥ ∫_{E−S₀} f dμ = ∫_E f dμ.
It is instructive to consider the extreme case ℱ₀ = {∅, Ω}, in which ν₀ is absolutely continuous with respect to μ₀ (provided μ(Ω) > 0) and hence ν_s⁰ vanishes.
Section 33

33.2. (a) To prove independence, check the covariance. Now use Example 33.7. (b) Use the fact that R and Θ are independent (Example 20.2). (c) As the single event [X = Y] = [X − Y = 0] = [Θ = π/4] ∪ [Θ = 5π/4] has probability 0, the conditional probabilities have no meaning, and strictly speaking there is nothing to resolve. But whether it is natural to regard the degrees of freedom as one or as two depends on whether the 45° line through the origin is regarded as an element of the decomposition of the plane into 45° lines or whether it is regarded as the union of two elements of the decomposition of the plane into rays from the origin. Borel's paradox can be explained the same way: The equator is an element of the decomposition of the sphere into lines of constant latitude; the Greenwich meridian is an element of the decomposition of the sphere into great circles with common poles. The decomposition matters, which is to say the σ-field matters.

33.3. (a) If the guard says, "1 is to be executed," then the conditional probability that 3 is also to be executed is 1/(1 + p). The "paradox" comes from assuming that p must be ½, in which case the conditional probability is indeed ⅔. But if p ≠ ½, then the guard does give prisoner 3 some information.
(b) Here "one" and "other" are undefined, and the problem ignores the possibility that you have been introduced to a girl. Let the sample space consist of eight points,

bbo: α/4, bby: (1 − α)/4, bgo: β/4, bgy: (1 − β)/4, gbo: γ/4, gby: (1 − γ)/4, ggo: δ/4, ggy: (1 − δ)/4.

For example, bgo is the event (probability β/4) that the older child is a boy, the younger is a girl, and the child you have been introduced to is the older; and ggy is the event (probability (1 − δ)/4) that both children are girls and the one you have been introduced to is the younger. Note that the four sex distributions do have probability ¼. If the child you have been introduced to is a boy, then the conditional probability that the other child is also a boy is p = 1/(2 + β − γ). If β = 1 and γ = 0 (the parents present a son if they have one), then p = ⅓. If β = γ (the parents are indifferent), then p = ½. Any p between ⅓ and 1 is possible. This problem shows again that one must keep in mind the entire experiment the sub-σ-field 𝒢 represents, not just one of the possible outcomes of the experiment.
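The formula p = 1/(2 + β − γ) can be checked by simulating the presentation model (my own code; the parameters β, γ are the ones in the note):

```python
import random

def estimate(beta, gamma, trials=200000, seed=3):
    # beta: prob. the older child is presented when the pair is (boy, girl)
    # gamma: prob. the older child is presented when the pair is (girl, boy)
    rng = random.Random(seed)
    shown_boy = other_boy = 0
    for _ in range(trials):
        older = rng.choice("bg")
        younger = rng.choice("bg")
        if older == "b" and younger == "g":
            shown = "b" if rng.random() < beta else "g"
        elif older == "g" and younger == "b":
            shown = "g" if rng.random() < gamma else "b"
        else:
            shown = older  # both children have the same sex
        if shown == "b":
            shown_boy += 1
            other_boy += (older == "b" and younger == "b")
    return other_boy / shown_boy

print(estimate(1.0, 0.0))   # ≈ 1/3
print(estimate(0.5, 0.5))   # ≈ 1/2
```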
33.6. There is no problem, unless the notation gives rise to the illusion that p(A|x) is P(A ∩ [X = x])/P[X = x].

33.15. If N is a standard normal variable, then
Section 34

34.3. If (X, Y) takes the values (0, 0), (1, −1), and (1, 1) with probability ⅓ each, then X and Y are dependent but E[Y‖X] = E[Y] = 0. If (X, Y) takes the values (−1, 1), (0, −2), and (1, 1) with probability ⅓ each, then E[X] = E[Y] = E[XY] = 0 and so E[XY] = E[X]E[Y], but E[Y‖X] = Y ≠ 0 = E[Y]. Of course, this is another example of dependent but uncorrelated random variables.
34.4. First show that ∫ f dP₀ = ∫_B f dP/P(B) and that P[B‖𝒢] > 0 on a set of P₀-measure 1. Let G be the general set in 𝒢.
(a) Since

P(B) ∫_G P₀[A‖𝒢] dP₀ = P(B) P₀(A ∩ G) = P(A ∩ B ∩ G) = ∫_G P[A ∩ B‖𝒢] dP,

it follows that P₀[A‖𝒢] P[B‖𝒢] = P[A ∩ B‖𝒢] holds on a set of P-measure 1.
(b) If P_i(A) = P(A|B_i), then

∫_{G∩B_i} P_i[A‖𝒢] dP = ∫_{G∩B_i} P[A‖𝒢 ∨ ℋ] dP.

Therefore, ∫_C I_{B_i} P_i[A‖𝒢] dP = ∫_C I_{B_i} P[A‖𝒢 ∨ ℋ] dP if C = G ∩ B_i, and of course this holds for C = G ∩ B_j if j ≠ i. But C's of this form constitute a π-system generating 𝒢 ∨ ℋ, and hence I_{B_i} P_i[A‖𝒢] = I_{B_i} P[A‖𝒢 ∨ ℋ] on a set of P-measure 1. Now use the result in part (a).
34.9. All such results can be proved by imitating the proofs for the unconditional case or else by using Theorem 34.5 (for part (c), as generalized in Problem 34.7). For part (a), it must be shown that it is possible to take the integral measurable 𝒢.

34.10. (a) If Y = X − E[X‖𝒢₁], then X − E[X‖𝒢₂] = Y − E[Y‖𝒢₂], and E[(Y − E[Y‖𝒢₂])²‖𝒢₂] = E[Y²‖𝒢₂] − E²[Y‖𝒢₂] ≤ E[Y²‖𝒢₂]. Take expected values.

34.11. First prove that
From this and (i) deduce (ii). From (ii) and the preceding equation deduce … The sets A₁ ∩ A₂ form a π-system generating 𝒢₁₂.
34.16. (a) Obviously (34.18) implies (34.17). If (34.17) holds, then clearly (34.18) holds for X simple. For the general X, choose simple X_k such that lim_k X_k = X and |X_k| ≤ |X|. Note that

|∫_{A_n} X dP − α ∫ X dP| ≤ |∫_{A_n} X_k dP − α ∫ X_k dP| + 2E[|X − X_k|];

let n → ∞ and then let k → ∞.
(b) If Ω ∈ 𝒫, then the class of E satisfying (34.17) is a λ-system, and so by the π–λ theorem and part (a), (34.18) holds if X is measurable σ(𝒫). Since A_n ∈ σ(𝒫), it follows that

∫_{A_n} X dP = ∫_{A_n} E[X‖σ(𝒫)] dP → α ∫ E[X‖σ(𝒫)] dP = α ∫ X dP.

(c) Replace X by X dP₀/dP in (34.18).

34.17. (a) The Lindeberg–Lévy theorem. (b) Chebyshev's inequality. (c) Theorem 25.4. (d) Independence of the X_n. (e) Problem 34.16(b). (f) Problem 34.16(c). (g) Part (b) here and the ε-δ definition of absolute continuity. (h) Theorem 25.4 again.

Section 35

35.4. (b) Let S_n be the number of k such that 1 ≤ k ≤ n and Y_k = 1. Then X_n = 3^{S_n}/2ⁿ. Take logarithms and use the strong law of large numbers.
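A simulation sketch of 35.4(b) (the identification X_n = 3^{S_n}/2ⁿ, with S_n the number of successes in n fair coin tosses, is my reading of the garbled text): E[X_n] = 1 for every n, yet X_n → 0 almost surely, since E[log(3^{Y₁}/2)] = ½ log(3/4) < 0.

```python
import math, random

random.seed(4)
n, paths = 2000, 1000
finals = []
for _ in range(paths):
    log_x = 0.0            # track log X_n to avoid underflow
    for _ in range(n):
        log_x += math.log(1.5) if random.random() < 0.5 else math.log(0.5)
    finals.append(log_x)

# every simulated path has collapsed: log X_n is large and negative
print(max(finals))                     # far below 0
print(sum(finals) / paths)             # ≈ n/2 * log(3/4) ≈ -288
```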
35.9. Let K bound |X₁| and the |X_n − X_{n−1}|. Bound |X_τ| by Kτ. Write ∫_{τ≤k} X_τ dP = Σ_{i=1}^k ∫_{τ=i} X_i dP = Σ_{i=1}^k (∫_{τ≥i} X_i dP − ∫_{τ≥i+1} X_i dP). Transform the last integral by the martingale property and reduce the expression to E[X₁] − ∫_{τ>k} X_{k+1} dP. Now

|∫_{τ>k} X_{k+1} dP| ≤ K(k + 1) P[τ > k] ≤ K(k + 1) k⁻¹ ∫_{τ>k} τ dP → 0.
35.13. (a) By the result in Problem 32.9, X₁, X₂, … is a supermartingale. Since E[|X_n|] = E[X_n] ≤ ν(Ω), Theorem 35.5 applies.
(b) If A ∈ ℱ_n, then ∫_A (Y_n + Z_n) dP + σ_n(A) = ν(A) = ∫_A X_n dP + σ_n(A). Since the Lebesgue decomposition is unique (Problem 32.7), Y_n + Z_n = X_n with probability 1. Since X_n and Y_n converge, so does Z_n. If A ∈ ℱ_k and n ≥ k, then ∫_A Z_n dP ≤ σ_∞(A), and by Fatou's lemma, the limit Z satisfies ∫_A Z dP ≤ σ_∞(A). This holds for A in ⋃_k ℱ_k and hence (monotone class theorem) for A in ℱ_∞. Choose A so that P(A) = 1 and σ_∞(A) = 0: E[Z] = ∫_A Z dP ≤ σ_∞(A) = 0. It can happen that σ_n(Ω) = 0 and σ_∞(Ω) = ν(Ω) > 0, in which case σ_n does not converge to σ_∞ and the X_n cannot be integrated to the limit.
35.17. For a very general result, see J. L. Doob: Application of the theory of martingales, Le Calcul des Probabilités et ses Applications (Colloques Internationaux du Centre de la Recherche Scientifique, Paris, 1949).

Section 36

36.5. (b) Show by part (a) and Problem 34.18 that f_n is the conditional expected value of f with respect to the σ-field ℱ_{n+1} generated by the coordinates x_{n+1}, x_{n+2}, …. By Theorem 35.9, (36.30) will follow if each set in ⋂_n ℱ_n has π-measure either 0 or 1, and here the zero–one law applies.
(c) Show that g_n is the conditional expected value of f with respect to the σ-field generated by the coordinates x₁, …, x_n, and apply Theorem 35.6.

36.7. Let ℒ be the countable set of simple functions Σ_i a_i I_{A_i} for a_i rational and {A_i} a finite decomposition of the unit interval into subintervals with rational endpoints. Suppose that the X_t exist, and choose (Theorem 17.1) Y_t in ℒ so that E[|X_t − Y_t|] is small. From the lower bound on E[|X_s − X_t|], conclude that E[|Y_s − Y_t|] > 0 for s ≠ t. But there are only countably many of the Y_t. It does no good to replace Lebesgue measure by some other measure on the unit interval.

Section 37

37.1. If t₁, …, t_k are in increasing order and t₀ = 0, then

Σ_{i,j} K(t_i, t_j) x_i x_j = Σ_{i,j} x_i x_j Σ_{l≤min(i,j)} (t_l − t_{l−1}) = Σ_l (t_l − t_{l−1}) (Σ_{i≥l} x_i)² ≥ 0.
37.4. (a) Use Problem 36.6(b). (b) Let [W_t: t ≥ 0] be a Brownian motion on (Ω, ℱ, P₀), where W(·, ω) ∈ C for every ω. Define ζ: Ω → R^T by Z_t(ζ(ω)) = W_t(ω). Show that ζ is measurable ℱ/ℛ^T and P = P₀ζ⁻¹. If C ⊂ A ∈ ℛ^T, then P(A) = P₀(ζ⁻¹A) = P₀(Ω) = 1.
37.5. Consider W(1) = Σ_{k=1}^n (W(k/n) − W((k − 1)/n)) for notational convenience. Since

n ∫_{[|W(1/n)| ≥ ε]} W²(1/n) dP → 0,

the Lindeberg theorem applies.
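The same dyadic increments illustrate the quadratic variation of Brownian motion (my own sketch, not from the text): Σ_k (W(k/n) − W((k−1)/n))² concentrates at 1 as n grows.

```python
import random

random.seed(5)
n = 100000
# independent N(0, 1/n) increments of Brownian motion on [0, 1]
qv = sum(random.gauss(0.0, (1.0 / n) ** 0.5) ** 2 for _ in range(n))
print(qv)   # ≈ 1, with fluctuations of order sqrt(2/n)
```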
37.14. By symmetry,

p(s, t) = 2P[W_s > 0, inf_{s≤u≤t} (W_u − W_s) ≤ −W_s];

W_s and the infimum here are independent because of the Markov property, and so by (20.30) (and symmetry again)

p(s, t) = 2 ∫₀^∞ P[τ_x ≤ t − s] (2πs)^{−1/2} e^{−x²/2s} dx.

Reverse the order of integration, use ∫₀^∞ x e^{−x²t/2} dx = 1/t, and put v = (s/(s + u))^{1/2}:

p(s, t) = (1/π) ∫₀^{t−s} s^{1/2}/((u + s) u^{1/2}) du = (2/π) ∫_{√(s/t)}^1 dv/√(1 − v²) = (2/π) arccos √(s/t).
Bibliography
HALMOS and SAKS have been the strongest measure-theoretic and DOOB and FELLER the strongest probabilistic influences on this book, and the spirit of KAC's small volume has been very important.

AUBREY: Brief Lives, John Aubrey; ed. O. L. Dick. Secker and Warburg, London, 1949.
BAHADUR: Some Limit Theorems in Statistics, R. R. Bahadur. SIAM, Philadelphia, 1971.
BANACH: Théorie des Opérations Linéaires, S. Banach. Monografje Matematyczne, Warsaw, 1932.
BERGER: Statistical Decision Theory, 2nd ed., James O. Berger. Springer-Verlag, New York, 1985.
BHATTACHARYA & WAYMIRE: Stochastic Processes with Applications, Rabi N. Bhattacharya and Edward C. Waymire. Wiley, New York, 1990.
BILLINGSLEY 1: Convergence of Probability Measures, Patrick Billingsley. Wiley, New York, 1968.
BILLINGSLEY 2: Weak Convergence of Measures: Applications in Probability, Patrick Billingsley. SIAM, Philadelphia, 1971.
BIRKHOFF & MAC LANE: A Survey of Modern Algebra,
4th ed., Garrett Birkhoff and Saunders Mac Lane. Macmillan, New York, 1977.

List of Symbols

A ∪ B, 536; A ⊂ B, 536; ℱ-set, 20; ℱ₀, 20; σ(𝒜), 21; 𝒥, 22; ℬ, 22; (Ω, ℱ, P), 23; A_n ↑ A, 536; A_n ↓ A, 536; λ, 25, 43, 168; ∧, 537; ∨, 537; f(𝒜), 33; P(A), 35; D(A), 35; P*, 37, 47; P_*, 37, 47; 𝒮, 27, 41; 𝒢, 41; λ*, 44;
P(B|A), 51; lim sup_n A_n, 52; lim inf_n A_n, 52; lim_n A_n, 52; i.o., 53; 𝒯, 62, 287; R¹, 537; Rᵏ, 539; [X = x], 67; σ(X), 68, 255; μ, 73, 160, 256; E[X], 76, 273; Var[X], 78, 275; E_n[f], 87; s_n(a), 93; τ, 99, 133, 464, 508;
p_{ij}, 111; S, 111; α_i, 111; π_i, 124; M(t), 146, 278, 285; ℛᵏ, 158; 𝒜_n, 𝒜₀, 159; x_k ↑ x, 160, 537; x = Σ_k x_k, 160; μ*, 165; ℳ(μ*), 165; λ₁, 168; λ_k, 171; F, 175, 177; Δ_A F, 176; T⁻¹A′, 537; ℱ/ℱ′, 182; F_n ⇒ F, 191, 327, 378; dF(x), 228; X × Y, 231; 𝒳 × 𝒴, 231; μ × ν, 233;
*, 266
Xₙ →_P X, 70, 268, 330
‖f‖, 249
‖f‖_p, 241
Lᵖ, 241
μₙ ⇒ μ, 327, 378
Xₙ ⇒ X, 329, 378
Xₙ ⇒ a, 331
∂A, 538
φ(t), 342
μₙ →_v μ, 371
F_s, 414
F_ac, 414
ν ≪ μ, 422
dν/dμ, 423
ν_s, 424
ν_ac, 424
P[A‖𝒢], 428, 430
P[A‖X_t, t ∈ T], 433
E[X‖𝒢], 445
R^T, 484
ℛ^T, 485
W(t), 498
Index
Here An refers to paragraph n of the Appendix (p. 536); u.v refers to Problem v in Section u, or else to a note on it (Notes on the Problems, p. 552); the other references are to pages. Greek letters are alphabetized by their Roman equivalents (m for μ, and so on). Names in the bibliography are not indexed separately.

Absolute continuity, 413, 422
Absolutely continuous part, 425
Absolute moment, 274
Absorbing state, 112
Adapted σ-fields, 458
Additive set function, 420
Additivity: countable, 23, 161; finite, 23, 161
Admissible, 248, 252
Affine transformation, 172
Algebra, 19
Almost everywhere, 60
Almost surely, 60
α-mixing, 363, 29.10
Aperiodic, 125
Approximation of measure, 168
Area over the curve, 79
Area under the curve, 203
Asymptotic equipartition property, 91, 144
Asymptotic relative frequency, 8
Atom, 271
Autoregression, 495
Axiom of choice, 21, 45
Baire category, 1.10, A15
Baire function, 13.7
Banach limits, 3.8, 19.3
Banach space, 243
Banach-Tarski paradox, 180
Bayes estimation, 475
Bayes risk, 248, 251
Benford law, 25.3
Beppo Levi theorem, 16.3
Bernoulli-Laplace model of diffusion, 112
Bernoulli shift, 311
Bernoulli trials, 75
Bernstein polynomial, 87
Betting system, 98
Binary digit, 3
Binomial distribution, 256
Blackwell-Rao theorem, 455
Bold play, 102
Boole's inequality, 25
Borel, 9
Borel-Cantelli lemmas, 59, 60
Borel function, 183
Borel normal number theorem, 9
Borel paradox, 441
Borel set, 22, 158
Boundary, A11
Bounded convergence theorem, 210
Bounded variation, 415
Branching process, 461
Britannica, 552
Brownian motion, 498
Burstin's theorem, 22.14
Canonical measure, 372
Canonical representation, 372
Cantelli inequality, 5.5
Cantelli theorem, 6.6
Cantor function, 31.2, 31.15
Cantor set, 1.5
Cardinality of σ-fields, 2.12, 2.22
Cartesian product, 231
Category, A15, 1.10
Cauchy distribution, 20.14, 348
Cauchy equation, A20, 14.7
Cavalieri principle, 18.8
Central limit theorem, 291, 357, 385, 391, 34.17, 475
Cesàro averages, A30, 20.23
Change of variable, 215, 224, 225, 274
Characteristic function, 342
Chebyshev inequality, 5, 80, 276
Chernoff theorem, 151
Chi-squared distribution, 20.15
Chi-squared statistic, 29.8
Circular Lebesgue measure, 13.12, 313
Class of sets, 18
Closed set, A11
Closed set of states, 8.21
Closed support, 12.9
Closure, A11
Cocountable set, 21
Cofinite set, 20
Collective, 109
Compact, A13
Complement, A1
Completely normal number, 6.13
Complete space or measure, 44, 10.5
Completion, 3.10, 10.5
Complex functions, integration of, 218
Compound Poisson distribution, 28.3
Compound Poisson process, 32.7
Concentrated, 161
Conditional distribution, 439, 449
Conditional expected value, 133, 445
Conditional probability, 51, 427, 33.5
Congruent by dissection, 179
Conjugate index, 242
Conjugate space, 244
Consistency conditions for finite-dimensional distributions, 483
Content, 3.15
Continued-fraction transformation, 319, A36
Continuity from above, 25
Continuity of paths, 500
Continuum hypothesis, 46
Conventions involving ∞, 160
Convergence in distribution, 329, 378
Convergence in mean, 243
Convergence in measure, 268
Convergence in probability, 70, 268, 330
Convergence with probability 1, 70, 330
Convergence of random series, 289
Convergence of types, 193
Convex functions, A32
Convolution, 266
Coordinate function, 27, 484
Coordinate variable, 484
Countable, 8
Countable additivity, 23, 161
Countable subadditivity, 25, 162
Countably generated σ-field, 2.11
Countably infinite, 8
Counting measure, 161
Coupled chain, 126
Coupon problem, 362
Covariance, 277
Cover, A3
Cramér-Wold theorem, 383
Cylinder, 27, 485
Daniell-Stone theorem, 11.14, 16.12
Darboux-Young definition, 15.2
δ-distribution, 192
Decision theory, 247
Decomposition, A3
de Finetti theorem, 473
Definite integral, 200
Degenerate distribution function, 193
Delta method, 359
DeMoivre-Laplace theorem, 25.11, 358
DeMorgan law, A6
Dense, A15
Density of measure, 213, 422
Density point, 31.9
Density of random variable or distribution, 257, 260
Density of set of integers, 2.18
Denumerable probabilities, 51
Dependent random variables, 363
Derivatives of integrals, 402
Diagonal method, 29, A14
Difference equation, A19
Difference set, A1
Diophantine approximation, 13, 324
Dirichlet theorem, 13, A26
Discontinuity of the first kind, 534
Discrete measure, 23, 161
Discrete random variable, 256
Discrete space, 1.1, 23, 5.16
Disjoint, A3
Disjoint supports, 410, 421
Distribution: of random variable, 73, 187, 256; of random vector, 259
Distribution function, 175, 188, 256, 259, 409
Dominated convergence theorem, 78, 209
Dominated measure, 422
Double exponential distribution, 348
Double integral, 233
Double series, A27
Doubly stochastic matrix, 8.20
Dual space, 245
Dubins-Savage theorem, 102
Dyadic expansion, 3, A31
Dyadic interval, 4
Dyadic transformation, 313
Dynkin's π-λ theorem, 42
ε-δ definition of absolute continuity, 422
Egorov theorem, 13.9
Eigenvalues, 8.26
Empirical distribution function, 268
Empty set, A1
Entropy, 57, 6.14, 8.31, 31.17
Equicontinuous, 355
Equivalence class, 58
Equivalent measures, 422
Erdős-Kac central limit theorem, 395
Ergodic theorem, 314
Erlang density, 23.2
Essential supremum, 241
Estimation, 251, 452
Etemadi, 282, 288, 22.15
Euclidean distance, A16
Euclidean space, A1, A16
Euler function, 2.18
Event, 18
Excessive function, 134
Exchangeable, 473
Existence of independent sequences, 73, 265
Existence of Markov chains, 115
Expected value, 76, 273
Exponential convergence, 131, 8.18
Exponential distribution, 189, 258, 297, 348
Extension of measure, 36, 166, 11.1
Extremal distribution, 195
Factorization and sufficiency, 450
Fair game, 92, 463
Fatou lemma, 209
Field, 19, 2.5
Filtration, 458
Finite additivity, 20, 23, 2.15, 3.8, 161
Finite or countable, 8
Finite-dimensional distributions, 308, 482
Finite-dimensional sets, 485
Finitely additive field, 20
Finite subadditivity, 24, 162
First Borel-Cantelli lemma, 59
First category, 1.10, A15
First passage, 118
Fixed discontinuity, 303
Fourier representation, 250
Fourier series, 351, 26.30
Fourier transform, 342
Frequency, 8
Fubini theorem, 233
Functional central limit theorem, 522
Fundamental in probability, 20.21
Fundamental set, 320
Fundamental theorem of calculus, 224, 400
Fundamental theorem of Diophantine approximation, 324
Gambling policy, 98
Gamma distribution, 20.17
Gamma function, 18.18
Generated σ-field, 21
Glivenko-Cantelli theorem, 269
Goncharov's theorem, 361
Hahn decomposition, 420
Hamel basis, 14.7
Hardy-Ramanujan theorem, 6.16
Heine-Borel theorem, A13, A17
Hewitt-Savage zero-one law, 496
Hilbert space, 249
Hitting time, 136
Hölder's inequality, 80, 5.9, 242, 276
Hypothesis testing, 151
Identically distributed, 85
Inadequacy of ℛ^T, 492
Inclusion-exclusion formula, 24, 163
Indefinite integral, 400
Independent classes, 55
Independent events, 53
Independent increments, 299, 498
Independent random variables, 71, 261
Independent random vectors, 263
Indicator, A5
Infinitely divisible distributions, 371
Infinitely often, 53
Infinite series, A25
Information, 57
Initial digit problem, 25.3
Initial probabilities, 111
Inner boundary, 64
Inner measure, 37, 3.2
Integrable, 200, 206
Integral, 199
Integral with respect to Lebesgue measure, 221
Integrals of derivatives, 412
Integration by parts, 236
Integration over sets, 212
Integration with respect to a density, 214
Interior, A11
Interval, A9
Invariance principle, 520
Invariant set, 313
Inverse image, A7, 182
Inversion formula, 346
Irreducible chain, 119
Irregular paths, 504
Iterated integral, 233
Jacobian, 225, 261, 545
Jensen inequality, 80, 276, 449
Jordan decomposition, 421
Jordan measurable, 3.15
k-dimensional Borel set, 158
k-dimensional Lebesgue measure, 171, 177, 17.14, 20.4
Kolmogorov existence theorem, 483
Kolmogorov zero-one law, 63, 287
Landau notation, A18
Laplace distribution, 348
Laplace transform, 285
Large deviations, 148
Lattice distribution, 26.1
Law of the iterated logarithm, 153
Law of large numbers: strong, 9, 11, 85, 282; weak, 5, 11, 86, 284
Lebesgue decomposition, 414, 425
Lebesgue density theorem, 31.9, 35.15
Lebesgue function, 31.3
Lebesgue integrable, 221, 225
Lebesgue measure, 25, 43, 167, 171, 177
Lebesgue set, 45
Leibniz formula, 17.8
Lévy distance, 14.5, 25.4, 26.16
Likelihood ratio, 461, 471
Limit inferior, 52, 4.1
Limit of sets, 52
Lindeberg condition, 359
Lindeberg-Lévy theorem, 357
Linear Borel set, 158
Linear functional, 244
Linearity of expected value, 77
Linearity of the integral, 206
Linearly independent reals, 14.7, 30.8
Lipschitz condition, 418
Log-normal distribution, 388
Lower integral, 204, 228
Lower semicontinuous, 29.1
Lower variation, 421
Lᵖ-space, 241
λ-system, 41
Lusin theorem, 17.10
Lyapounov condition, 362
Lyapounov inequality, 81, 277
Mapping theorem, 344, 380
Marginal distribution, 261
Markov chain, 111, 363, 367, 29.11, 429
Markov inequality, 80, 276
Markov process, 435, 510
Markov shift, 312
Markov time, 133
Martingale, 101, 458, 514
Martingale central limit theorem, 475
Martingale convergence theorem, 468
Maximal ergodic theorem, 317
Maximal inequality, 287
Maximal solution, 122
μ-continuity set, 335, 378
m-dependent, 6.11, 364
Mean value, 26.17
Measurable mapping, 182
Measurable process, 503
Measurable rectangle, 231
Measurable with respect to a σ-field, 68, 225
Measurable set, 20, 38, 165
Measurable space, 161
Measure, 22, 160
Measure-preserving transformation, 311
Measure space, 161
Meets, A3
Method of moments, 388, 30.6
Minimal sufficient field, 454
Minimum-variance estimation, 454
Minkowski inequality, 5.10, 242
Mixing, 24.3, 363, 29.10, 34.16
Mixture, 473
μ*-measurable, 165
Moment, 274
Moment generating function, 1.6, 146, 278, 284, 390
Monotone, 24, 162, 206
Monotone class, 43, 3.12
Monotone class theorem, 43
Monotone convergence theorem, 208
M-test, 210, A28
Multidimensional central limit theorem, 385
Multidimensional characteristic function, 381
Multidimensional distribution, 259
Multidimensional normal distribution, 383
Multinomial sampling, 29.8
Negative part, 200, 254
Negligible set, 8, 1.3, 1.9, 44
Neyman-Pearson lemma, 19.7
Nonatomic, 2.19
Nondenumerable probabilities, 526
Nonmeasurable set, 45, 12.4
Nonnegative series, A25
Norm, 243
Normal distribution, 258, 383
Normal number, 8, 18, 86, 6.13
Normal number theorem, 9, 6.9
Nowhere dense, A15
Nowhere differentiable, 31.18, 505
Null persistent, 130
Number theory, 393
Open set, A11
Optimal stopping, 133
Optional sampling theorem, 466
Order of dyadic interval, 4
Orthogonal projection, 250
Orthonormal, 249
Ottaviani inequality, 22.15
Outer boundary, 64
Outer measure, 37, 3.2, 165
Pairwise disjoint, A3
Partial-fraction expansion, 20.14
Partial information, 57
Partition, A3
Path function, 308, 493, 500
Payoff function, 133
Peano curve, 179
Perfect set, A15
Period, 125
Permut