132 PROBABILITY

A function φ on S is excessive, or superharmonic, if

    φ(i) ≥ Σ_j p_ij φ(j),    i ∈ S.†

In terms of conditional expectation the requirement is φ(i) ≥ E_i[φ(X_1)].

† Compare the conditions (7.28) and (7.35).
Lemma 4. The value function v is excessive.

PROOF. Given ε, choose for each j in S a "good" stopping time τ_j satisfying E_j[f(X_{τ_j})] ≥ v(j) − ε. By (8.43), [τ_j = n] = [(X_0, ..., X_n) ∈ I_{jn}] for some set I_{jn} of (n + 1)-long sequences of states. Set τ = n + 1 (n ≥ 0) on the set [X_1 = j] ∩ [(X_1, ..., X_{n+1}) ∈ I_{jn}]; that is, take one step and then from the new state X_1 add on the "good" stopping time for that state. Then τ is a stopping time, and

    E_i[f(X_τ)] = Σ_{n=0}^∞ Σ_j Σ_k P_i[X_1 = j, (X_1, ..., X_{n+1}) ∈ I_{jn}, X_{n+1} = k] f(k)
                = Σ_{n=0}^∞ Σ_j Σ_k p_ij P_j[(X_0, ..., X_n) ∈ I_{jn}, X_n = k] f(k)
                = Σ_j p_ij E_j[f(X_{τ_j})].

Therefore, v(i) ≥ E_i[f(X_τ)] ≥ Σ_j p_ij (v(j) − ε) = Σ_j p_ij v(j) − ε. Since ε was arbitrary, v is excessive. ∎
Lemma 5. Suppose that φ is excessive.
(i) For all stopping times τ, φ(i) ≥ E_i[φ(X_τ)].
(ii) For all pairs of stopping times satisfying σ ≤ τ, E_i[φ(X_σ)] ≥ E_i[φ(X_τ)].

Part (i) says that for an excessive payoff function, τ ≡ 0 represents an optimal strategy.

PROOF. To prove (i), put τ_N = min{τ, N}. Then τ_N is a stopping time, and

(8.46)    E_i[φ(X_{τ_N})] = Σ_{n=0}^{N−1} Σ_k P_i[τ = n, X_n = k] φ(k) + Σ_k P_i[τ ≥ N, X_N = k] φ(k).
Since [τ ≥ N] = [τ < N]^c ∈ ℱ_{N−1}, the final sum here is by (8.13)

    Σ_k Σ_j P_i[τ ≥ N, X_{N−1} = j, X_N = k] φ(k) = Σ_k Σ_j P_i[τ ≥ N, X_{N−1} = j] p_jk φ(k)
        ≤ Σ_j P_i[τ ≥ N, X_{N−1} = j] φ(j).

Substituting this into (8.46) leads to E_i[φ(X_{τ_N})] ≤ E_i[φ(X_{τ_{N−1}})]. Since τ_0 ≡ 0 and E_i[φ(X_0)] = φ(i), E_i[φ(X_{τ_N})] ≤ φ(i) for all N. But for τ(ω) finite, φ(X_{τ_N(ω)}(ω)) → φ(X_{τ(ω)}(ω)) (there is equality for large N), and so E_i[φ(X_{τ_N})] → E_i[φ(X_τ)] by Theorem 5.3.

The proof of (ii) is essentially the same. If τ_N = min{τ, σ + N}, then τ_N is a stopping time, and

    E_i[φ(X_{τ_N})] = Σ_{m=0}^∞ Σ_{n=0}^{N−1} Σ_k P_i[σ = m, τ = m + n, X_{m+n} = k] φ(k)
        + Σ_{m=0}^∞ Σ_k P_i[σ = m, τ ≥ m + N, X_{m+N} = k] φ(k).

Since [σ = m, τ ≥ m + N] = [σ = m] − [σ = m, τ < m + N] ∈ ℱ_{m+N−1}, again E_i[φ(X_{τ_N})] ≤ E_i[φ(X_{τ_{N−1}})] ≤ ⋯ ≤ E_i[φ(X_{τ_0})]. Since τ_0 = σ, part (ii) follows from part (i) by another passage to the limit. ∎
Lemma 6. If an excessive function φ dominates the payoff function f, then it dominates the value function v as well.

By definition, to say that g dominates h is to say that g(i) ≥ h(i) for all i.

PROOF. By Lemma 5, φ(i) ≥ E_i[φ(X_τ)] ≥ E_i[f(X_τ)] for all Markov times τ, and so φ(i) ≥ v(i) for all i. ∎

Since τ ≡ 0 is a stopping time, v dominates f. Lemmas 4 and 6 immediately characterize v:

Theorem 8.10. The value function v is the minimal excessive function dominating f.
There remains the problem of constructing the optimal strategy τ. Let M be the set of states i for which v(i) = f(i); M, the support set, is nonempty, since it at least contains those i that maximize f. Let A = ∩_{n=0}^∞ [X_n ∉ M] be the event that the system never enters M. The following argument shows that P_i(A) = 0 for each i. As this is trivial if M = S, assume that M ≠ S. Choose δ > 0 so that f(i) ≤ v(i) − δ for i ∈ S − M. Now E_i[f(X_τ)] = Σ_{n=0}^∞ Σ_k P_i[τ = n, X_n = k] f(k); replacing the f(k) by v(k) or v(k) − δ according as k ∈ M or k ∈ S − M gives E_i[f(X_τ)] ≤ E_i[v(X_τ)] − δP_i[X_τ ∈ S − M] ≤ E_i[v(X_τ)] − δP_i(A) ≤ v(i) − δP_i(A), the last inequality by Lemmas 4 and 5. Since this holds for every Markov time, taking the supremum over τ gives P_i(A) = 0. Whatever the initial state, the system is thus certain to enter the support set M.

Let τ_0(ω) = min[n: X_n(ω) ∈ M] be the hitting time for M. Then τ_0 is a Markov time, and τ_0 = 0 if X_0 ∈ M. It may be that X_n(ω) ∉ M for all n, in which case τ_0(ω) = ∞, but as just shown, the probability of this is 0.
Theorem 8.11. The hitting time τ_0 is optimal: E_i[f(X_{τ_0})] = v(i) for all i.

PROOF. By the definition of τ_0, f(X_{τ_0}) = v(X_{τ_0}). Put φ(i) = E_i[f(X_{τ_0})] = E_i[v(X_{τ_0})]. The first step is to show that φ is excessive. If τ_1 = min[n: n ≥ 1, X_n ∈ M], then τ_1 is a Markov time and

    E_i[v(X_{τ_1})] = Σ_{n=1}^∞ Σ_{k∈M} P_i[X_1 ∉ M, ..., X_{n−1} ∉ M, X_n = k] v(k)
        = Σ_{n=1}^∞ Σ_{k∈M} Σ_{j∈S} p_ij P_j[X_0 ∉ M, ..., X_{n−2} ∉ M, X_{n−1} = k] v(k)
        = Σ_j p_ij E_j[v(X_{τ_0})].

Since τ_0 ≤ τ_1, E_i[v(X_{τ_0})] ≥ E_i[v(X_{τ_1})] by Lemmas 4 and 5. This shows that φ is excessive. And φ(i) ≤ v(i) by the definition (8.44). If φ(i) ≥ f(i) is proved, it will follow by Theorem 8.10 that φ(i) ≥ v(i) and hence that φ(i) = v(i).

Since τ_0 = 0 for X_0 ∈ M, i ∈ M implies that φ(i) = E_i[f(X_0)] = f(i). Suppose that φ(i) < f(i) for some values of i in S − M, and choose i_0 to maximize f(i) − φ(i). Then ψ(i) = φ(i) + f(i_0) − φ(i_0) dominates f and is excessive, being the sum of a constant and an excessive function. By Theorem 8.10, ψ must dominate v, so that ψ(i_0) ≥ v(i_0), or f(i_0) ≥ v(i_0). But this implies that i_0 ∈ M, a contradiction. ∎

The optimal strategy need not be unique. If f is constant, for example, all strategies have the same value.
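The construction above can be checked numerically on a small chain. The sketch below (illustrative code, not part of the text; all names are invented) runs the value iteration v ← max{f, Pv}, whose iterates increase to the minimal excessive function dominating f on a finite state space, i.e. to the value function of Theorem 8.10. The chain used is the symmetric walk with absorbing barriers (cf. Example 8.2):

```python
# Value iteration for optimal stopping on a finite Markov chain:
# v_{k+1}(i) = max(f(i), sum_j p[i][j] * v_k(j)), starting from v_0 = f.
# The iterates increase to the minimal excessive function dominating f.

def value_function(p, f, sweeps=5000):
    v = list(f)
    for _ in range(sweeps):
        v = [max(fi, sum(pij * vj for pij, vj in zip(row, v)))
             for fi, row in zip(f, p)]
    return v

def symmetric_walk(r):
    # Absorbing barriers at 0 and r; from 1 <= i <= r-1 move +-1 with prob 1/2.
    p = [[0.0] * (r + 1) for _ in range(r + 1)]
    p[0][0] = p[r][r] = 1.0
    for i in range(1, r):
        p[i][i - 1] = p[i][i + 1] = 0.5
    return p

r = 10
p = symmetric_walk(r)
f = [0.0] * r + [1.0]                  # payoff 1 at r, 0 elsewhere
v = value_function(p, f)
print([round(x, 4) for x in v])        # v(i) = i/r, the absorption probabilities
```

For this payoff the computed v is linear, v(i) = i/r, the support set {i: v(i) = f(i)} is {0, r}, and the optimal strategy is the hitting time of the barriers, as in the example that follows.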
Example 8.15. For the symmetric random walk with absorbing barriers at 0 and r (Example 8.2) a function φ on S = {0, 1, ..., r} is excessive if φ(i) ≥ ½φ(i − 1) + ½φ(i + 1) for 1 ≤ i ≤ r − 1. The requirement is that φ give a concave function when extended by linear interpolation from S to the entire interval [0, r]. Hence v thus extended is the minimal concave function dominating f. The figure shows the geometry: the ordinates of the dots are the values of f and the polygonal line describes v. The optimal strategy is to stop at a state for which the dot lies on the polygon.

[Figure: dots at heights f(i) over the states 0, 1, ..., r, with the concave polygonal line giving v.]
If f(r) = 1 and f(i) = 0 for i < r, v is a straight line: v(i) = i/r. The optimal Markov time τ_0 is the hitting time for M = {0, r}, and v(i) = E_i[f(X_{τ_0})] is the probability of absorption in the state r. This gives another solution of the gambler's ruin problem for the symmetric case. ∎

Example 8.16. For the selection problem in Example 8.5, the p_ij are given by (8.5) and (8.6) for 1 ≤ i ≤ r, while p_{r+1, r+1} = 1. The payoff is f(i) = i/r for i ≤ r and f(r + 1) = 0. Thus v(r + 1) = 0, and since v is excessive,
(8.47)    v(i) ≥ g(i) = Σ_{j=i+1}^{r} [i/(j(j − 1))] v(j),    1 ≤ i < r.

By Theorem 8.10, v is the smallest function satisfying (8.47) and v(i) ≥ f(i) = i/r, 1 ≤ i ≤ r. Since (8.47) puts no lower limit on v(r), v(r) = f(r) = 1, and r lies in the support set M. By minimality,

(8.48)    v(i) = max{f(i), g(i)},    1 ≤ i < r.
If i ∈ M, then f(i) = v(i) ≥ g(i) ≥ Σ_{j=i+1}^{r} i j^{−1}(j − 1)^{−1} f(j) = f(i) Σ_{j=i+1}^{r} (j − 1)^{−1}, and hence Σ_{j=i+1}^{r} (j − 1)^{−1} ≤ 1. On the other hand, if this inequality holds and i + 1, ..., r all lie in M, then g(i) = Σ_{j=i+1}^{r} i j^{−1}(j − 1)^{−1} f(j) = f(i) Σ_{j=i+1}^{r} (j − 1)^{−1} ≤ f(i), so that i ∈ M by (8.48). Therefore, M = {i_r, i_r + 1, ..., r, r + 1},
where i_r is determined by

(8.49)    1/i_r + ⋯ + 1/(r − 1) ≤ 1 < 1/(i_r − 1) + 1/i_r + ⋯ + 1/(r − 1).

If i < i_r, then 1/i + ⋯ + 1/(r − 1) > 1, so that g(i) > f(i) and, by (8.48),

    v(i) = g(i) = Σ_{j=i+1}^{i_r} [i/(j(j − 1))] v(j) + (i/r)(1/i_r + ⋯ + 1/(r − 1)).

It follows by backward induction starting with i = i_r − 1 that

(8.50)    v(i) = p_r = [(i_r − 1)/r](1/(i_r − 1) + 1/i_r + ⋯ + 1/(r − 1))

is constant for 1 ≤ i < i_r.

In the selection problem as originally posed, X_1 = 1. The optimal strategy is to stop with the first X_n that lies in M. The princess should therefore reject the first i_r − 1 suitors and accept the next one who is preferable to all his predecessors (is dominant). The probability of success is p_r as given by (8.50). Failure can happen two ways. Perhaps the first dominant suitor after i_r is not the best of all suitors; in this case the princess will be unaware of failure. Perhaps no dominant suitor comes after i_r; in this case the princess is obliged to take the last suitor of all and may be well aware of failure. Recall that the problem was to maximize the chance of getting the best suitor of all rather than, say, the chance of getting a suitor in the top half. If r is large, (8.49) essentially requires that log r − log i_r be near 1, so that i_r ≈ r/e. In this case, p_r ≈ 1/e. Note that although the system starts in state 1 in the original problem, its resolution by means of the preceding theory requires consideration of all possible initial states. ∎
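The cutoff i_r and the success probability p_r are easy to compute. The sketch below (illustrative code, not from the text) runs the backward induction (8.48) and compares v(1) with the closed form (8.50); the function name and the numerical examples are invented:

```python
# Secretary problem: backward induction on (8.48), v(i) = max{f(i), g(i)},
# with f(i) = i/r, v(r) = 1, v(r+1) = 0, and
# g(i) = sum_{j=i+1}^{r} [i/(j(j-1))] v(j), as in (8.47).

def secretary(r):
    v = [0.0] * (r + 2)            # v[1..r+1]; v[r+1] = 0
    v[r] = 1.0                     # f(r) = 1, and r lies in M
    for i in range(r - 1, 0, -1):
        g = sum(i / (j * (j - 1)) * v[j] for j in range(i + 1, r + 1))
        v[i] = max(i / r, g)
    # cutoff i_r from (8.49): smallest i with 1/i + ... + 1/(r-1) <= 1
    i_r = next(i for i in range(1, r + 1)
               if sum(1.0 / j for j in range(i, r)) <= 1.0)
    # success probability p_r from (8.50)
    p_r = v[1] if i_r == 1 else (i_r - 1) / r * sum(1.0 / j for j in range(i_r - 1, r))
    return v, i_r, p_r
```

For r = 4 this gives i_r = 2 and p_r = 11/24, and for large r the cutoff sits near r/e with p_r near 1/e, matching the limiting statement above.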
This theory carries over in part to the case of infinite S, although this requires the general theory of expected values, since f(X_τ) may not be a simple random variable. Theorem 8.10 holds for infinite S if the payoff function is nonnegative and the value function is finite.† But then problems

† The only essential change in the argument is that Fatou's lemma (Theorem 16.3) must be used in place of Theorem 5.3 in the proof of Lemma 5.
arise: Optimal strategies may not exist, and the probability of hitting the support set M may be less than 1. Even if this probability is 1, the strategy of stopping on first entering M may be the worst one of all.†

PROBLEMS

8.1. Show that (8.2) holds if the first and third terms are equal for all n, i_0, ..., i_n, and j.

8.2. Let Y_0, Y_1, ... be independent and identically distributed with P[Y_n = 1] = p, P[Y_n = 0] = q = 1 − p, p ≠ q. Put X_n = Y_n + Y_{n+1} (mod 2). Show that X_0, X_1, ... is not a Markov chain even though P[X_{n+1} = j | X_n = i] = P[X_{n+1} = j]. Does this last relation hold for all Markov chains? Why?

8.3. Show by example that a function f(X_0), f(X_1), ... of a Markov chain need not be a Markov chain.

8.4. 4.18 ↑ Given a 2 × 2 stochastic matrix and initial probabilities w_0 and w_1, define P for cylinders in sequence space by P[ω: a_i(ω) = u_i, 1 ≤ i ≤ n] = w_{u_1} p_{u_1 u_2} ⋯ p_{u_{n−1} u_n} for sequences u_1, ..., u_n of 0's and 1's (Problem 4.18 is the case p_{0j} = p_{1j} = p_j). Extend P as a probability measure to ℱ_0 and then to ℱ. Show that a_1(ω), a_2(ω), ... is a Markov chain with transition probabilities p_ij.

8.5. Show that

    f_ij Σ_{k=0}^{∞} p_jj^{(k)} = Σ_{n=1}^{∞} Σ_{m=1}^{n} f_ij^{(m)} p_jj^{(n−m)} = Σ_{n=1}^{∞} p_ij^{(n)},

and prove that if j is transient, then Σ_n p_ij^{(n)} < ∞.

8.6. Show that if j is transient, then

8.7.
(b) Show that f_ij > 0 if and only if P_i[X_1 ≠ i, ..., X_{n−1} ≠ i, X_n = j] > 0 for some n, and conclude that i is transient if and only if f_ji < 1 for some j ≠ i such that f_ij > 0.
(c) Show that an irreducible chain is transient if and only if for each i there is a j ≠ i such that f_ji < 1.

8.10. Suppose that S = {0, 1, 2, ...}, p_00 = 1, and f_i0 > 0 for all i.
(a) Show that P_i(∪_{j≥1} [X_n = j i.o.]) = 0 for all i.
(b) Regard the state as the size of a population and interpret the conditions p_00 = 1 and f_i0 > 0 and the conclusion in part (a).

8.11. 8.6 ↑ Show for an irreducible chain that (8.27) has a nontrivial solution if and only if there exists a nontrivial, bounded sequence {x_i} (not necessarily nonnegative) satisfying x_i = Σ_j p_ij x_j, i ≠ i_0. (See the remark following the proof of Theorem 8.5.)
8.12. ↑ Show that an irreducible chain is transient if and only if (for arbitrary i_0) the system y_i = Σ_j p_ij y_j, i ≠ i_0 (sum over all j), has a bounded, nonconstant solution {y_i, i ∈ S}.
8.13. Show that the P_i-probabilities of ever leaving U, for i ∈ U, are the minimal solution of the system

(8.51)    z_i = Σ_{j∈U} p_ij z_j + Σ_{j∉U} p_ij,    i ∈ U,
          0 ≤ z_i ≤ 1,    i ∈ U.

The constraint z_i ≤ 1 can be dropped: the minimal solution automatically satisfies it, since z_i ≡ 1 is a solution.

8.14. Show that sup_{ij} n_0(i, j) = ∞ is possible in Lemma 2.

8.15. Suppose that {π_i} solves (8.30), where it is assumed that Σ_i |π_i| < ∞, so that the left side is well defined. Show in the irreducible case that the π_i are either all positive or all negative or all 0. Stationary probabilities thus exist in the irreducible case if and only if (8.30) has a nontrivial solution {π_i} (Σ_i π_i absolutely convergent).

8.16. Show by example that the coupled chain in the proof of Theorem 8.6 need not be irreducible if the original chain is not aperiodic.
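The minimality in (8.51) can be exhibited by iteration: starting from z ≡ 0 and repeatedly applying the right side gives an increasing sequence converging to the minimal solution. A sketch (the small chain below is an invented illustration, not from the text):

```python
# Minimal solution of (8.51): z_i = sum_{j in U} p_ij z_j + sum_{j not in U} p_ij.
# Iterating from z = 0 yields the probabilities of ever leaving U.
# Illustration: U = {1, 2} with p[1][0] = p[1][2] = 1/2 and state 2 absorbing.
# Then z = (1/2, 0) is the minimal solution, although z = (1, 1) also solves
# the equation: the equation alone does not pin z down, minimality does.

def min_exit_probs(p_U, b, sweeps=1000):
    # p_U: transition probabilities restricted to U; b[i] = sum_{j not in U} p_ij
    z = [0.0] * len(b)
    for _ in range(sweeps):
        z = [bi + sum(pij * zj for pij, zj in zip(row, z))
             for bi, row in zip(b, p_U)]
    return z

p_U = [[0.0, 0.5],   # from state 1: stay-in-U part (to state 2 w.p. 1/2)
       [0.0, 1.0]]   # state 2 is absorbing inside U
b = [0.5, 0.0]       # one-step exit probabilities
print(min_exit_probs(p_U, b))   # [0.5, 0.0]
```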
8.17. Suppose that S consists of all the integers and

    p_{0,−1} = p_{0,0} = p_{0,+1} = 1/3,
    p_{k,k−1} = q,  p_{k,k+1} = p,    k ≤ −1,
    p_{k,k−1} = p,  p_{k,k+1} = q,    k ≥ 1.

Show that the chain is irreducible and aperiodic. For which p's is the chain persistent? For which p's are there stationary probabilities?

8.18. Show that the period of j is the greatest common divisor of the set

(8.52)    [n: n ≥ 1, p_jj^{(n)} > 0].

8.19. ↑ Recurrent events. Let f_1, f_2, ... be nonnegative numbers with f = Σ_{n=1}^∞ f_n ≤ 1. Define u_1, u_2, ... recursively by u_1 = f_1 and

(8.53)    u_n = f_n + Σ_{m=1}^{n−1} f_m u_{n−m}.

(a) Show that f < 1 if and only if Σ_n u_n < ∞.
(b) Assume that f = 1, set μ = Σ_{n=1}^∞ n f_n, and assume that

(8.54)    gcd[n: n ≥ 1, f_n > 0] = 1.

Prove the renewal theorem: Under these assumptions, the limit u = lim_n u_n exists, and u > 0 if and only if μ < ∞, in which case u = 1/μ.

Although these definitions and facts are stated in purely analytical terms, they have a probabilistic interpretation: Imagine an event E that may occur at times 1, 2, .... Suppose f_n is the probability that E occurs first at time n. Suppose further that at each occurrence of E the system starts anew, so that f_n is the probability that E next occurs n steps later. Such an E is called a recurrent event. If u_n is the probability that E occurs at time n, then (8.53) holds. The recurrent event E is called transient or persistent according as f < 1 or f = 1; it is called aperiodic if (8.54) holds, and if f = 1, μ is interpreted as the mean recurrence time.

8.20. Show for a two-state chain with S = {0, 1} that p_00^{(n)} = p_10 + (p_00 − p_10) p_00^{(n−1)}. Derive an explicit formula for p_00^{(n)} and show that it converges to p_10/(p_01 + p_10) if all the p_ij are positive. Calculate the other p_ij^{(n)} and verify that the limits satisfy (8.30). Verify that the rate of convergence is exponential.

8.21. A thinker who owns r umbrellas travels back and forth between home and office, taking along an umbrella (if there is one at hand) in rain (probability p) but not in shine (probability q). Let the state be the number of umbrellas at hand, irrespective of whether the thinker is at home or at work. Set up the transition matrix and find the stationary probabilities. Find the steady-state probability of his getting wet, and show that five umbrellas will protect him at the 5% level against any climate (any p).
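The recursion of Problem 8.20 can be checked directly; the sketch below (illustrative code with invented numbers) iterates it and compares with the stationary limit p_10/(p_01 + p_10), verifying the exponential rate as well:

```python
# Two-state chain on S = {0, 1}: iterate p00(n) = p10 + (p00 - p10) * p00(n-1),
# starting from p00(0) = 1, and check convergence to pi_0 = p10/(p01 + p10).
# In exact arithmetic |p00(n) - pi_0| = |p00 - p10|^n * (1 - pi_0).

def p00_sequence(p01, p10, nmax):
    p00 = 1.0 - p01
    seq, x = [], 1.0            # p00(0) = 1
    for _ in range(nmax):
        x = p10 + (p00 - p10) * x
        seq.append(x)
    return seq

p01, p10 = 0.3, 0.4
pi0 = p10 / (p01 + p10)
seq = p00_sequence(p01, p10, 50)
lam = (1.0 - p01) - p10         # second eigenvalue of the transition matrix
```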
8.22. (a) A transition matrix is doubly stochastic if Σ_i p_ij = 1 for each j. For a finite, irreducible, aperiodic chain with doubly stochastic transition matrix, show that the stationary probabilities are all equal.
(b) Generalize Example 8.14: Let S be a finite group, let p(i) be probabilities, and put p_ij = p(j · i^{−1}), where product and inverse refer to the group operation. Show that, if all p(i) are positive, the states are all equally likely in the limit.
(c) Let S be the symmetric group on 52 elements. What has (b) to say about card shuffling?

8.23. A set C in S is closed if Σ_{j∈C} p_ij = 1 for i ∈ C: once the system enters C it cannot leave. Show that a chain is irreducible if and only if S has no proper closed subset.

8.24. ↑ Let T be the set of transient states and define persistent states i and j (if there are any) to be equivalent if f_ij > 0. Show that this is an equivalence relation on S − T and decomposes it into equivalence classes C_1, C_2, ..., so that S = T ∪ C_1 ∪ C_2 ∪ ⋯. Show that each C_m is closed and that f_ij = 1 for i and j in the same C_m.

8.25. 8.13 8.23 ↑ Let T be the set of transient states and let C be any closed set of persistent states. Show that the P_i-probabilities of eventual absorption in C for i ∈ T are the minimal solution of

(8.55)    y_i = Σ_{j∈T} p_ij y_j + Σ_{j∈C} p_ij,    i ∈ T,
          0 ≤ y_i ≤ 1,    i ∈ T.
8.26. Suppose that an irreducible chain has period t > 1. Show that S decomposes into sets S_0, ..., S_{t−1} such that p_ij > 0 only if i ∈ S_ν and j ∈ S_{ν+1} for some ν (ν + 1 reduced modulo t). Thus the system passes through the S_ν in cyclic succession.

8.27. ↑ Suppose that an irreducible chain of period t > 1 has a stationary distribution {π_j}. Show that, if i ∈ S_ν and j ∈ S_{ν+α} (ν + α reduced modulo t), then lim_n p_ij^{(nt+α)} = tπ_j. Show that lim_n n^{−1} Σ_{m=1}^{n} p_ij^{(m)} = π_j for all i and j.

8.28. Suppose that {X_n} is a Markov chain with state space S and put Y_n = (X_n, X_{n+1}). Let T be the set of pairs (i, j) such that p_ij > 0 and show that {Y_n} is a Markov chain with state space T. Write down the transition probabilities. Show that, if {X_n} is irreducible and aperiodic, so is {Y_n}. Show that, if π_i are stationary probabilities for {X_n}, then π_i p_ij are stationary probabilities for {Y_n}.

8.29. 6.10 8.28 ↑ Suppose that the chain is finite, irreducible, and aperiodic and that the initial probabilities are the stationary ones. Fix a state i, let A_n = [X_n = i], and let N_n be the number of passages through i in the first n steps. Calculate α_n and β_n as defined by (5.37). Show that β_n − α_n² = O(1/n), so that n^{−1}N_n → π_i with probability 1. Show for a function f on the state space that n^{−1} Σ_{k=1}^{n} f(X_k) → Σ_i π_i f(i) with probability 1. Show that n^{−1} Σ_{k=1}^{n} g(X_k, X_{k+1}) → Σ_{ij} π_i p_ij g(i, j) for functions g on S × S.
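The stationarity claim in Problem 8.28 is easy to verify numerically; the sketch below (invented two-state example, not from the text) forms the pair chain Y_n = (X_n, X_{n+1}) and checks that the weights π_i p_ij are left fixed by one step:

```python
# Problem 8.28 check: if pi is stationary for p, then the weights
# w(i, j) = pi_i * p_ij are stationary for the pair chain, whose transition
# probabilities are (i, j) -> (j, k) with probability p[j][k].

p = [[0.9, 0.1],
     [0.5, 0.5]]
pi = [5 / 6, 1 / 6]          # solves pi P = pi: pi_0 = p10 / (p01 + p10)

pairs = [(i, j) for i in range(2) for j in range(2) if p[i][j] > 0]
w = {(i, j): pi[i] * p[i][j] for (i, j) in pairs}

def step(w):
    new = {pair: 0.0 for pair in pairs}
    for (i, j), mass in w.items():
        for k in range(2):
            if p[j][k] > 0:
                new[(j, k)] += mass * p[j][k]
    return new

w2 = step(w)                 # should reproduce w
```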
8.30. 8.29 ↑ If X_0(ω) = i_0, ..., X_n(ω) = i_n, put p_n(ω) = π_{i_0} p_{i_0 i_1} ⋯ p_{i_{n−1} i_n}, so that p_n(ω) is the probability of the observed sequence. Show that −n^{−1} log p_n(ω) → h = −Σ_{ij} π_i p_ij log p_ij with probability 1 if the chain is finite, irreducible, and aperiodic. Extend to this case the notions of source, entropy, and asymptotic equipartition.

8.31. A sequence {X_n} is a Markov chain of second order if P[X_{n+1} = j | X_0 = i_0, ..., X_n = i_n] = P[X_{n+1} = j | X_{n−1} = i_{n−1}, X_n = i_n] = p_{i_{n−1} i_n : j}. Show that nothing really new is involved, because the sequence of pairs (X_n, X_{n+1}) is an ordinary Markov chain (of first order). Compare Problem 8.28. Generalize this idea to chains of order r.

8.32. Consider a chain on S = {0, 1, ..., r}, where 0 and r are absorbing states and p_{i,i+1} = p_i > 0, p_{i,i−1} = q_i = 1 − p_i > 0 for 0 < i < r. Identify state i with a point z_i on the line, where 0 = z_0 < ⋯ < z_r and the distance from z_i to z_{i+1} is q_i/p_i times that from z_{i−1} to z_i. Given a function φ on S, consider the associated function φ̄ on [0, z_r] defined at the z_i by φ̄(z_i) = φ(i) and in between by linear interpolation. Show that φ is excessive if and only if φ̄ is concave. Show that the probability of absorption in r for initial state i is t_{i−1}/t_{r−1}, where t_i = Σ_{k=0}^{i} (q_1 ⋯ q_k)/(p_1 ⋯ p_k). Deduce (7.1). Show that in the new scale the expected distance moved on each step is 0.

8.33. Suppose that a finite chain is irreducible and aperiodic. Show by Theorem 8.9 that an excessive function must be constant.

8.34.† Alter the chain in Example 8.13 so that q_0 = 1 − p_0 = 1 (the other p_i and q_i still positive). Let β = lim_n p_1 ⋯ p_n, and assume that β > 0. Define a payoff function by f(0) = 1 and f(i) = 1 − f_i0 for i > 0. If X_0, ..., X_n are positive, put σ_n = n; otherwise let σ_n be the smallest k such that X_k = 0. Show that E_i[f(X_{σ_n})] → 1 as n → ∞, so that v(i) ≡ 1. Thus the support set is M = {0}, and for an initial state i > 0 the probability of ever hitting M is f_i0 < 1.

For an arbitrary finite stopping time τ, choose n so that P_i[τ < n = σ_n] > 0. Then E_i[f(X_τ)] ≤ 1 − f_{i+n,0} P_i[τ < n = σ_n] < 1. Thus no strategy achieves the value v(i) (except of course for i = 0).

8.35. 6.14 ↑ Let the chain be as in the preceding problem, but assume that β = 0, so that f_i0 = 1 for all i. Suppose that λ_1, λ_2, ... exceed 1 and that λ_1 ⋯ λ_n → λ < ∞; put f(0) = 0 and f(i) = λ_1 ⋯ λ_{i−1}/(p_1 ⋯ p_{i−1}). For an arbitrary (finite) stopping time τ, the event [τ = n] must have the form [(X_0, ..., X_n) ∈ I_n] for some set I_n of (n + 1)-long sequences of states. Show that for each i there is at most one n ≥ 0 such that (i, i + 1, ..., i + n) ∈ I_n. If there is no such n, then E_i[f(X_τ)] = 0. If there is one, then E_i[f(X_τ)] = p_i p_{i+1} ⋯ p_{i+n−1} f(i + n), and hence the only possible values of E_i[f(X_τ)] are

    0,  f(i),  p_i f(i + 1) = f(i)λ_i,  p_i p_{i+1} f(i + 2) = f(i)λ_i λ_{i+1},  ....

Thus v(i) = f(i)λ/(λ_1 ⋯ λ_{i−1}) for i ≥ 1; no strategy achieves this value. The support set is M = {0}, and the hitting time τ_0 for M is finite, but E_i[f(X_{τ_0})] = 0.

† The last two problems involve expected values for random variables with infinite range.

SECTION 9. LARGE DEVIATIONS AND THE LAW OF THE ITERATED LOGARITHM*
It is interesting in connection with the strong law of large numbers to estimate the rate at which S_n/n converges to the mean m. The proof of the strong law used upper bounds for the probabilities P[|S_n − m| ≥ a] for large a. Accurate upper and lower bounds for these probabilities will lead to the law of the iterated logarithm, a theorem giving very precise rates for S_n/n → m. The first concern will be to estimate the probability of large deviations from the mean, which will require the method of moment generating functions. The estimates will be applied first to a problem in statistics and then to the law of the iterated logarithm.

Moment Generating Functions
Let X be a simple random variable assuming values x_1, ..., x_l with respective probabilities p_1, ..., p_l. Its moment generating function is

(9.1)    M(t) = E[e^{tX}] = Σ_{i=1}^{l} p_i e^{tx_i}.

(See (5.16) for expected values of functions of random variables.) This function, defined for all real t, can be regarded as associated with X itself or as associated with its distribution, that is, with the measure on the line having mass p_i at x_i (see (5.9)). If c = max_i |x_i|, the partial sums of the series e^{tX} = Σ_{k=0}^∞ t^k X^k/k! are bounded by e^{|t|c}, and so the corollary to Theorem 5.3 applies:

(9.2)    M(t) = Σ_{k=0}^{∞} (t^k/k!) E[X^k].

Thus M(t) has a Taylor expansion, and as follows from the general theory [A29], the coefficient of t^k must be M^{(k)}(0)/k!:

(9.3)    E[X^k] = M^{(k)}(0).

* This section may be omitted.
Furthermore, term-by-term differentiation in (9.1) gives

    M^{(k)}(t) = Σ_i p_i x_i^k e^{tx_i} = E[X^k e^{tX}];

in particular, M''(t) = E[X² e^{tX}] ≥ 0. If C(t) = log M(t) is the cumulant generating function, then, since (M'(t))² = E²[Xe^{tX}] ≤ E[e^{tX}] E[X²e^{tX}] = M(t)M''(t) by Schwarz's inequality (5.32), C''(t) ≥ 0. Thus the moment generating function and the cumulant generating function are both convex.
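The expansion (9.2) and the convexity claim can be checked on a concrete simple random variable. A minimal sketch (the two-point distribution is an arbitrary invented example):

```python
import math

# For a simple X, check (9.2): M(t) = sum_k t^k E[X^k] / k!,
# and check numerically that M is convex.

xs, ps = [-1.0, 2.0], [0.75, 0.25]

def M(t):
    return sum(p * math.exp(t * x) for x, p in zip(xs, ps))

def moment(k):
    return sum(p * x ** k for x, p in zip(xs, ps))

t = 0.7
series = sum(t ** k * moment(k) / math.factorial(k) for k in range(60))
print(series, M(t))     # the two values agree
```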
Large Deviations

Let Y be a simple random variable assuming values y_j with probabilities p_j. The problem is to estimate P[Y ≥ a] when Y has mean 0 and a is positive. It is notationally convenient to subtract a away from Y and instead estimate P[Y ≥ 0] when Y has negative mean. Assume then that

(9.11)    E[Y] < 0,    P[Y > 0] > 0,

the second assumption to avoid trivialities. Let M(t) = Σ_j p_j e^{ty_j} be the moment generating function of Y. Then M'(0) < 0 by the first assumption in (9.11), and M(t) → ∞ as t → ∞ by the second. Since M(t) is convex, it has its minimum ρ at a positive argument τ:

(9.12)    inf_t M(t) = M(τ) = ρ,    0 < ρ < 1,    τ > 0.

Construct (on an entirely irrelevant probability space) an auxiliary random variable Z such that

(9.13)    P[Z = y_j] = e^{τy_j} p_j / ρ

for each y_j in the range of Y. Note that the probabilities on the right do add to 1. The moment generating function of Z is

(9.14)    E[e^{tZ}] = Σ_j e^{ty_j} e^{τy_j} p_j / ρ = M(τ + t)/ρ,
and therefore

(9.15)    E[Z] = M'(τ)/ρ = 0,    s² = E[Z²] = M''(τ)/ρ > 0.

For all positive t, P[Y ≥ 0] = P[e^{tY} ≥ 1] ≤ M(t) by Markov's inequality (5.27), and hence

(9.16)    P[Y ≥ 0] ≤ ρ.

If Σ' denotes summation over those j for which y_j ≥ 0, then

(9.17)    P[Y ≥ 0] = Σ' p_j = ρ Σ' e^{−τy_j} P[Z = y_j].

Put the final sum here in the form e^{−θ}, and let p = P[Z ≥ 0]. Since log x is concave, Jensen's inequality (5.29) gives

    −θ = log Σ' e^{−τy_j} p^{−1} P[Z = y_j] + log p
       ≥ Σ' (−τy_j) p^{−1} P[Z = y_j] + log p
       = −τs p^{−1} Σ' (y_j/s) P[Z = y_j] + log p.

By (9.15) and Lyapounov's inequality (5.33),

    Σ' (y_j/s) P[Z = y_j] ≤ (1/s) E[|Z|] ≤ (1/s) E^{1/2}[Z²] = 1.

The last two inequalities, together with (9.16), give

(9.18)    0 ≤ θ ≤ τs/P[Z ≥ 0] − log P[Z ≥ 0].

This proves the following result.

Theorem 9.1. Suppose that Y satisfies (9.11). Define ρ and τ by (9.12), let Z be a random variable with distribution (9.13), and define s² by (9.15). Then P[Y ≥ 0] = ρe^{−θ}, where θ satisfies (9.18).
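The construction in (9.12)-(9.15) is easy to carry out numerically. The sketch below (illustrative code with an invented two-point distribution) locates τ by bisection on M'(t) = 0, forms the tilted variable Z of (9.13), and checks that E[Z] = 0 and that P[Y ≥ 0] ≤ ρ as in (9.16):

```python
import math

# Exponential tilting for a simple Y with E[Y] < 0 and P[Y > 0] > 0:
# rho = min_t M(t) is attained at tau > 0 where M'(tau) = 0, and the tilted
# variable Z with P[Z = y] = e^{tau y} P[Y = y] / rho has mean zero.

ys = [-1.0, 1.0]
ps = [0.7, 0.3]

def M(t):
    return sum(p * math.exp(t * y) for y, p in zip(ys, ps))

def Mp(t):
    return sum(p * y * math.exp(t * y) for y, p in zip(ys, ps))

lo, hi = 0.0, 10.0          # M'(0) < 0 and M'(t) -> infinity: a root exists
for _ in range(100):
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if Mp(mid) < 0 else (lo, mid)
tau = (lo + hi) / 2
rho = M(tau)

z_probs = [p * math.exp(tau * y) / rho for y, p in zip(ys, ps)]
mean_Z = sum(y * q for y, q in zip(ys, z_probs))
p_upper = sum(p for y, p in zip(ys, ps) if y >= 0)   # P[Y >= 0]
```

For this Y one can also check the closed forms τ = ½ log(0.7/0.3) and ρ = 2√0.21, so the tilted distribution is symmetric and its mean is exactly 0.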
To use (9.18) requires a lower bound for P[Z ≥ 0].
Theorem 9.2. If E[Z] = 0, E[Z²] = s², and E[Z⁴] = ξ⁴ > 0, then P[Z ≥ 0] ≥ s⁴/4ξ⁴.†

PROOF. Let Z⁺ = Z·I_[Z≥0] and Z⁻ = −Z·I_[Z<0]. Then Z⁺ and Z⁻ are nonnegative, Z = Z⁺ − Z⁻, Z² = (Z⁺)² + (Z⁻)², and

(9.19)    s² = E[(Z⁺)²] + E[(Z⁻)²].

Let p = P[Z ≥ 0]. By Schwarz's inequality (5.32),

    E[(Z⁺)²] = E[Z² I_[Z≥0]] ≤ E^{1/2}[Z⁴] E^{1/2}[I_[Z≥0]] = ξ² p^{1/2}.

By Hölder's inequality (5.31) (for p = 3/2 and q = 3),

    E[(Z⁻)²] ≤ E^{2/3}[Z⁻] E^{1/3}[(Z⁻)⁴] ≤ E^{2/3}[Z⁻] ξ^{4/3}.

Since E[Z] = 0, another application of Hölder's inequality (for p = 4 and q = 4/3) gives

    E[Z⁻] = E[Z⁺] = E[Z I_[Z≥0]] ≤ E^{1/4}[Z⁴] E^{3/4}[I_[Z≥0]] = ξ p^{3/4}.

Combining these three inequalities with (9.19) gives

    s² ≤ ξ² p^{1/2} + (ξ p^{3/4})^{2/3} ξ^{4/3} = 2 p^{1/2} ξ²,

so that p ≥ s⁴/4ξ⁴. ∎
Chernoff's Theorem*

Theorem 9.3. Let X_1, X_2, ... be independent, identically distributed simple random variables satisfying E[X_n] < 0 and P[X_n > 0] > 0, let M(t) be their common moment generating function, and put ρ = inf_t M(t). Then

(9.20)    lim_{n→∞} (1/n) log P[X_1 + ⋯ + X_n ≥ 0] = log ρ.

† For a related result, see Problem 25.21.
* This theorem is not needed for the law of the iterated logarithm, Theorem 9.5.
PROOF. Put Y_n = X_1 + ⋯ + X_n. Then E[Y_n] < 0 and P[Y_n > 0] ≥ Pⁿ[X_1 > 0] > 0, and so the hypotheses of Theorem 9.1 are satisfied. Define ρ_n and τ_n by inf_t M_n(t) = M_n(τ_n) = ρ_n, where M_n(t) is the moment generating function of Y_n. Since M_n(t) = Mⁿ(t), it follows that ρ_n = ρⁿ and τ_n = τ, where M(τ) = ρ.

Let Z_n be the analogue for Y_n of the Z described by (9.13). Its moment generating function (see (9.14)) is M_n(τ + t)/ρⁿ = (M(τ + t)/ρ)ⁿ. This is also the moment generating function of V_1 + ⋯ + V_n for independent random variables V_1, ..., V_n, each having moment generating function M(τ + t)/ρ. Now each V_i has (see (9.15)) mean 0 and some positive variance σ² and fourth moment ξ⁴ independent of i. Since Z_n must have the same moments as V_1 + ⋯ + V_n, it has mean 0, variance s_n² = nσ², and fourth moment ξ_n⁴ = nξ⁴ + 3n(n − 1)σ⁴ = O(n²) (see (6.2)). By Theorem 9.2, P[Z_n ≥ 0] ≥ s_n⁴/4ξ_n⁴ ≥ α for some positive α independent of n. By Theorem 9.1 then, P[Y_n ≥ 0] = ρⁿe^{−θ_n}, where 0 ≤ θ_n ≤ τ_n s_n α^{−1} − log α = τα^{−1}σ√n − log α. This gives (9.20), and shows, in fact, that the rate of convergence is O(n^{−1/2}). ∎
This result is important in the theory of statistical hypothesis testing. An informal treatment of the Bernoulli case will illustrate the connection.

Suppose S_n = X_1 + ⋯ + X_n, where the X_i are independent and assume the values 1 and 0 with probabilities p and q. Now P[S_n ≥ na] = P[Σ_{k=1}^{n} (X_k − a) ≥ 0], and Chernoff's theorem applies if p < a < 1. In this case M(t) = E[e^{t(X_1 − a)}] = e^{−ta}(pe^t + q). Minimizing this shows that the ρ of Chernoff's theorem satisfies

    −log ρ = K(a, p) = a log(a/p) + b log(b/q),

where b = 1 − a. By (9.20), n^{−1} log P[S_n ≥ na] → −K(a, p); express this as

(9.21)    P[S_n ≥ na] ≈ e^{−nK(a, p)}.

Suppose now that p is unknown and that there are two competing hypotheses concerning its value, the hypothesis H_1 that p = p_1 and the hypothesis H_2 that p = p_2, where p_1 < p_2. Given the observed results x_1, ..., x_n of n Bernoulli trials, one decides in favor of H_2 if S_n ≥ na and in favor of H_1 if S_n < na, where a is some number satisfying p_1 < a < p_2. The problem is to find an advantageous value for the threshold a. By (9.21),

(9.22)    P_{p_1}[S_n ≥ na] ≈ e^{−nK(a, p_1)},

where the notation indicates that the probability is calculated for p = p_1, that is, under the assumption of H_1. By symmetry,

(9.23)    P_{p_2}[S_n < na] ≈ e^{−nK(a, p_2)}.
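The exponent K(a, p) can be compared with the exact binomial tail. The sketch below (illustrative code; n, a, p are arbitrarily chosen) checks the rigorous Chernoff bound P[S_n ≥ na] ≤ e^{−nK(a,p)} and that −n^{−1} log P[S_n ≥ na] is already close to K(a, p) at moderate n:

```python
import math

# K(a, p) = a log(a/p) + b log(b/q): the exponent appearing in (9.21).
def K(a, p):
    b, q = 1.0 - a, 1.0 - p
    return a * math.log(a / p) + b * math.log(b / q)

def log_tail(n, p, a):
    # log P[S_n >= n a] for S_n binomial(n, p), summed exactly over the tail
    k0 = math.ceil(n * a)
    total = sum(math.comb(n, k) * p ** k * (1 - p) ** (n - k)
                for k in range(k0, n + 1))
    return math.log(total)

n, p, a = 500, 0.5, 0.7
rate = -log_tail(n, p, a) / n    # should exceed K(a, p), but only slightly
```

The gap rate − K(a, p) is of order (log n)/n, reflecting the subexponential factor hidden in the ≈ of (9.21).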
The left sides of (9.22) and (9.23) are the probabilities of erroneously deciding in favor of H_2 when H_1 is, in fact, true and of erroneously deciding in favor of H_1 when H_2 is, in fact, true: the probabilities describing the level and power of the test. Suppose a is chosen so that K(a, p_1) = K(a, p_2), which makes the two error probabilities approximately equal. This constraint gives for a a linear equation with solution

(9.24)    a(p_1, p_2) = log(q_1/q_2) / [log(p_2/p_1) + log(q_1/q_2)],

where q_i = 1 − p_i. The common error probability is approximately e^{−nK} for this value of a, and so the larger K(a, p_1) is, the easier it is to distinguish statistically between p_1 and p_2.

Although K(a(p_1, p_2), p_1) is a complicated function, it has a simple approximation for p_1 near p_2. As x → 0, log(1 + x) = x − x²/2 + O(x³). Using this in the definition of K and collecting terms gives

(9.25)    K(p + x, p) = x²/(2pq) + O(x³),    x → 0.
Fix p_1 = p, and let p_2 = p + t; (9.24) becomes a function ψ(t) of t, and expanding the logarithms gives

(9.26)    ψ(t) = p + t/2 + O(t²),    t → 0,

after some reductions. Finally, (9.25) and (9.26) together imply that

(9.27)    K(ψ(t), p) = t²/(8pq) + O(t³),    t → 0.
analysis of the rate at which S,jn approaches the mean depends on the following variant of the theorem on large deviations.
The
Let sn = xl + . . . xn , where the xn are independent and identically distributed simple random variables with mean 0 and variance 1. If
Theorem 9.4.
150
PROBABILITY
a n are constants satisfying (9 .28 ) then ( 9. 2 9) for a sequence rn going to 0. PROOF. Put yn = sn - a nrn = E�- l ( Xk - a n! ..fi). Then E[Yn1 < 0. Since X1 has mean 0 and variance 1, P[X1 > 0] > 0, and it follows by (9.28) that P[nX1 > a,/ {i] > 0 for n sufficiently large, in which case P [Yn > 0] > p [ X1 - a n/..fn > 0] > 0. Thus Theorem 9.1 applies to Yn for all large enough n. Let Mn (t), Pn ' Tn , and zn be associated with yn as in the theorem. If m(t) and c( t ) are the moment and cumulant generating functions of the Xn , a then Mn( t ) is the n th power of the moment generating function e - t ,/�m(t) of xl - a nl rn ' and so yn has cumulant generating function (9 . 30 ) Since Tn is the unique minimum of Cn( t ), and since c;(t) = - a nrn + nc'(t ), Tn is determined by the equation c'( Tn ) = a n! rn . Since xl has mean 0 and variance 1, it follows by (9.6) that c (O ) = c' (O ) = 0, c ( O) = 1 . ( 9 . 31 ) Now c'( t ) is nondecreasing because c(t) is convex, and since c'( Tn ) = a n! rn goes tO 0, Tn must therefore go tO 0 as Well and must in fact be 0( a n/ rn ) . By the second-order mean-value theorem for c'( t ) , a nl rn = c'( Tn ) = Tn + 0( Tn2 ), from which follows "
( 9.32 )
'�'n =
an 7n
+
(0 a; ) --;;
·
c( t ), log pn = Cn( Tn ) = - Tn a nrn + nc ( Tn ) = - Tn a nrn + n 'l'n2 + o( ;
By the third-order mean-value theorem for
Applying (9.32) gives
( 9.33 )
[I
.,.
)] .
SECTION
9.
151
LARGE DEVIATIONS
Now (see (9 .14)) zn has moment generating function Mn ( Tn + t)/ Pn and (see (9.30)) cumulant generating function Dn (t) = Cn (Tn + t) - log pn = - (Tn + t)a nln + nc(t + Tn ) - log pn. The mean of zn is D�(O) = 0. Its variance s; is D�'(O); by (9.31), this is
(9 . 3 4 )
s; = nc"( Tn ) = n ( c"(O) + 0 ( Tn ) ) = n ( 1
+
o (1)) .
The fourth cumulant of zn is D�" ' (O) = nc""( Tn ) = 0( n ) . By formula (9.9) relating moments and cumulants (applicable because E[Zn1 = 0), E[Z:] = 3s: + D�" ' (O). Therefore, E[Z:];s: 3, and it follows by Theorem 9.2 that there exists an a such that P [Zn > 0] > a > 0 for all sufficiently large n. By Theorem 9. 1 , P[Yn � 0] = Pne - B, with 0 < (Jn < Tn sna - 1 + log a. By (Jn = O(a n ) = o(a;), and it follows by (9.33) that (9.28), (9.32), and (9.34), a�( l + o( 1 ))/2 . y = e � 0 ] n • [ P -+
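The exponential rate in (9.29) can also be sampled numerically. The sketch below is not part of the text: it estimates P[S_n ≥ a√n] by Monte Carlo for the symmetric ±1 walk (mean 0, variance 1); for that walk Hoeffding's inequality gives the non-asymptotic bound e^{−a²/2}, the same leading exponential as (9.29). The function name and parameters are illustrative choices of ours.

```python
import math
import random

def tail_estimate(n, a, trials=10_000, seed=0):
    """Monte Carlo estimate of P[S_n >= a*sqrt(n)] for the symmetric
    +-1 walk (mean 0, variance 1)."""
    rng = random.Random(seed)
    threshold = a * math.sqrt(n)
    hits = 0
    for _ in range(trials):
        s = 0
        for _ in range(n):
            s += 1 if rng.random() < 0.5 else -1
        if s >= threshold:
            hits += 1
    return hits / trials

if __name__ == "__main__":
    n, a = 400, 1.5
    est = tail_estimate(n, a)
    # (9.29) says the tail exponent is -a^2/2 up to a (1 + zeta_n) factor;
    # for the +-1 walk, Hoeffding gives the hard bound exp(-a^2/2).
    print(est, math.exp(-a * a / 2))
```

For moderate a the (1 + ζ_n) correction in the exponent is still large, so the estimate should only be compared with e^{−a²/2} as an upper bound, not as an approximation.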
The law of the iterated logarithm is this:

Theorem 9.5. Let S_n = X_1 + ⋯ + X_n, where the X_n are independent, identically distributed simple random variables with mean 0 and variance 1. Then

(9.35)    P[limsup_n S_n/√(2n log log n) = 1] = 1.
Equivalent to (9.35) is the assertion that for positive ε

(9.36)    P[S_n > (1 + ε)√(2n log log n) i.o.] = 0

and

(9.37)    P[S_n ≥ (1 − ε)√(2n log log n) i.o.] = 1.

The set in (9.35) is, in fact, the intersection over positive rational ε of the sets in (9.37) minus the union over positive rational ε of the sets in (9.36). The idea of the proof is this. Write

(9.38)    φ(n) = √(2n log log n).
If A_n^± = [S_n ≥ (1 ± ε)φ(n)], then by (9.29), P(A_n^±) is near (log n)^{−(1±ε)²}. If n_k increases exponentially, say n_k ∼ θ^k for θ > 1, then P(A_{n_k}^±) is of the order k^{−(1±ε)²}. Now Σ_k k^{−(1±ε)²} converges if the sign is + and diverges if
the sign is −. It will follow by the first Borel-Cantelli lemma that there is probability 0 that A_{n_k}^+ occurs for infinitely many k. In proving (9.36), an extra argument is required to get around the fact that the A_n^+ for n ≠ n_k must also be accounted for (this involves choosing θ near 1). If the A_{n_k}^− were independent, it would follow by the second Borel-Cantelli lemma that with probability 1, A_{n_k}^− occurs for infinitely many k, which would in turn imply (9.37). An extra argument is required to get around the fact that the A_{n_k}^− are dependent (this involves choosing θ large).

For the proof of (9.36) a preliminary result is needed. Put M_n = max{S_0, S_1, ..., S_n}, where S_0 = 0.
Theorem 9.6. If the X_k are independent with mean 0 and variance 1, then for α ≥ √2,

(9.39)    P[M_n/√n ≥ α] ≤ 2P[S_n/√n ≥ α − √2].

PROOF. If A_j = [M_{j−1} < α√n ≤ M_j], then

    P[M_n/√n ≥ α] ≤ P[S_n/√n ≥ α − √2] + Σ_{j<n} P(A_j ∩ [S_n − S_j < −√(2n)]).

Since S_n − S_j has variance n − j, it follows by independence and Chebyshev's inequality that the probability in the sum is at most P(A_j)(n − j)/2n ≤ ½P(A_j); since the A_j are disjoint, the sum is at most ½P[M_n/√n ≥ α], and (9.39) follows.  •
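Because Theorem 9.6 holds for any independent summands with mean 0 and variance 1, it can be verified exactly for the ±1 walk by enumerating all 2^n paths. This brute-force check is our addition, not part of the text:

```python
import math
from itertools import product

def check_max_inequality(n, alpha):
    """Exact P[M_n/sqrt(n) >= alpha] and 2 P[S_n/sqrt(n) >= alpha - sqrt(2)]
    for the symmetric +-1 walk, by enumerating all 2^n sign sequences."""
    thr_max = alpha * math.sqrt(n)
    thr_sum = (alpha - math.sqrt(2)) * math.sqrt(n)
    lhs = rhs = 0
    for signs in product((1, -1), repeat=n):
        s = m = 0                      # M_n = max(S_0, ..., S_n), S_0 = 0
        for x in signs:
            s += x
            if s > m:
                m = s
        lhs += m >= thr_max
        rhs += s >= thr_sum
    total = 2 ** n
    return lhs / total, 2 * rhs / total

if __name__ == "__main__":
    left, right = check_max_inequality(12, 2.0)
    print(left, right)   # (9.39) asserts left <= right for alpha >= sqrt(2)
```

Keep n small (around 20 or less); the enumeration grows as 2^n.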
PROOF OF (9.36). Given ε, choose θ so that θ > 1 but θ² < 1 + ε. Let n_k = [θ^k] and x_k = θ(2 log log n_k)^{1/2}. By (9.29) and (9.39),

    P[M_{n_k}/√n_k ≥ x_k] ≤ 2P[S_{n_k}/√n_k ≥ x_k − √2] = 2 exp[−½(x_k − √2)²(1 + ζ_k)],
where ζ_k → 0. The negative of the exponent is asymptotically θ² log k and hence for large k exceeds θ log k, so that

    P[M_{n_k}/√n_k ≥ x_k] ≤ 2k^{−θ}.

Since θ > 1, it follows by the first Borel-Cantelli lemma that there is probability 0 that (see (9.38))

(9.40)    M_{n_k} ≥ θφ(n_k)

for infinitely many k. Suppose that n_{k−1} < n ≤ n_k and that
(9.41)    S_n > (1 + ε)φ(n).
Now φ(n) ≥ φ(n_{k−1}) ∼ θ^{−1/2}φ(n_k); hence, by the choice of θ, (1 + ε)φ(n) > θφ(n_k) if k is large enough. Thus for sufficiently large k, (9.41) implies (9.40) (if n_{k−1} < n ≤ n_k), and there is therefore probability 0 that (9.41) holds for infinitely many n.  •

PROOF OF (9.37). Given ε, choose an integer θ so large that 3θ^{−1/2} < ε. Take n_k = θ^k. Now n_k − n_{k−1} → ∞, and (9.29) applies with n = n_k − n_{k−1} and a_n = x_k/√(n_k − n_{k−1}), where x_k = (1 − θ^{−1})φ(n_k). It follows that
    P[S_{n_k} − S_{n_{k−1}} ≥ x_k] = P[S_{n_k − n_{k−1}} ≥ x_k] = exp[−(x_k²/2(n_k − n_{k−1}))(1 + ζ_k)],

where ζ_k → 0. The negative of the exponent is asymptotically (1 − θ^{−1}) log k and so for large k is less than log k, in which case P[S_{n_k} − S_{n_{k−1}} ≥ x_k] ≥ k^{−1}. The events here being independent, it follows by the second Borel-Cantelli lemma that with probability 1, S_{n_k} − S_{n_{k−1}} ≥ x_k for infinitely many k. On the other hand, by (9.36) applied to {−X_n}, there is probability 1 that −S_{n_{k−1}} ≤ 2φ(n_{k−1}) ≤ 2θ^{−1/2}φ(n_k) for all but finitely many k. These two inequalities give

    S_{n_k} ≥ x_k − 2θ^{−1/2}φ(n_k) ≥ (1 − ε)φ(n_k),

the last inequality because of the choice of θ.  •
That completes the proof of Theorem 9.5.
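The fluctuation described by Theorem 9.5 can be watched on a single simulated path. The sketch below is ours (the helper names are not from the text); it computes the ratios S_n/√(2n log log n) along a walk, whose lim sup is 1 with probability 1.

```python
import math
import random

def lil_ratios(steps):
    """Given a sequence of +-1 steps, return S_n / sqrt(2 n log log n)
    for each n >= 3 (log log n is positive from n = 3 on)."""
    s = 0
    out = []
    for n, x in enumerate(steps, start=1):
        s += x
        if n >= 3:
            out.append(s / math.sqrt(2.0 * n * math.log(math.log(n))))
    return out

def random_steps(n, seed=0):
    rng = random.Random(seed)
    return [1 if rng.random() < 0.5 else -1 for _ in range(n)]

if __name__ == "__main__":
    ratios = lil_ratios(random_steps(100_000, seed=7))
    # Theorem 9.5: the lim sup of these ratios is 1 almost surely, so the
    # running maximum over a long walk is typically close to 1.
    print(max(ratios))
```

Since convergence in (9.35) is at the slow log log scale, even very long walks show the running maximum only roughly near 1.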
PROBLEMS

9.1. Prove (6.2) by using (9.9) and the fact that cumulants add in the presence of independence.
9.2. Let Y be a simple random variable with moment generating function M(t). Show that P[Y > 0] = 0 implies that P[Y ≥ 0] = inf_t M(t). Is the infimum achieved in this case?
9.3. In the Bernoulli case, (9.21) gives one expression for the tail probability, where p < a < 1 and x_n = n(a − p); Theorem 9.4 gives another, where x_n = a_n√(npq). Resolve the apparent discrepancy. Use (9.25) to compare the two expressions in case x_n/√n is small. See Problem 27.17.
9.4. Relabel the binomial parameter p as θ = f(p), where f is increasing and continuously differentiable. Show by (9.27) that the distinguishability of θ from θ + Δθ, as measured by K, is (Δθ)²/8p(1 − p)(f′(p))² + O(Δθ)³. The leading coefficient is independent of θ if f(p) = arcsin √p.

9.5. From (9.35) and the same result for {−X_n}, together with the uniform boundedness of the X_n, deduce that with probability 1 the set of limit points of the sequence {S_n(2n log log n)^{−1/2}} is the closed interval from −1 to +1.

9.6. Suppose X_n takes the values ±1 with probability ½ each, and show that P[S_n = 0 i.o.] = 1. (This gives still another proof of the persistence of symmetric random walk on the line (Example 8.3).) Show more generally that, if the X_n are bounded by M, then P[|S_n| ≤ M i.o.] = 1.
9.7. Weakened versions of (9.36) are quite easy to prove. By a fourth-moment argument (see (6.2)), show that P[S_n > n^{3/4}(log n)^{(1+ε)/4} i.o.] = 0. Use (9.29) to give a simple proof that P[S_n > (3n log n)^{1/2} i.o.] = 0.

9.8. What happens in Theorem 9.6 if √2 is replaced by θ (0 < θ < α)?

9.9. Show that (9.35) is true if S_n is replaced by |S_n| or max_{k≤n} S_k or max_{k≤n} |S_k|.
CHAPTER 2

Measure
SECTION 10. GENERAL MEASURES
Lebesgue measure on the unit interval was central to the ideas in Chapter 1. Lebesgue measure on the entire real line is important in probability as well as in analysis generally, and a uniform treatment of this and other examples requires a notion of measure for which infinite values are possible. The present chapter extends the ideas of Sections 2 and 3 to this more general
case.
Classes of Sets
The σ-field of Borel sets in (0, 1] played an essential role in Chapter 1, and it is necessary to construct the analogous classes for the entire real line and for k-dimensional Euclidean space.

Example 10.1. Let x = (x_1, ..., x_k) be the generic point of Euclidean k-space R^k. The bounded rectangles

(10.1)    [x = (x_1, ..., x_k): a_i < x_i ≤ b_i, i = 1, ..., k]

will play in R^k the role intervals (a, b] played in (0, 1]. Let ℛ^k be the σ-field generated by these rectangles. This is the analogue of the class 𝓑 of Borel sets in (0, 1]; see Example 2.6. The elements of ℛ^k are the k-dimensional Borel sets. For k = 1 they are also called the linear Borel sets.

Call the rectangle (10.1) rational if the a_i and b_i are all rational. If G is an open set in R^k and y ∈ G, then there is a rational rectangle A_y such that y ∈ A_y ⊂ G. But then G = ⋃_{y∈G} A_y, and since there are only countably
many rational rectangles, this is a countable union. Thus ℛ^k contains the open sets. Since a closed set has open complement, ℛ^k also contains the closed sets. Just as 𝓑 contains all the sets in (0, 1] that actually come up in ordinary analysis and probability theory, ℛ^k contains all the sets in R^k that actually come up.

The σ-field ℛ^k is generated by subclasses other than the class of rectangles. If A_n is the x-set where a_i < x_i < b_i + n^{−1}, i = 1, ..., k, then A_n is open and (10.1) is ⋂_n A_n. Thus ℛ^k is generated by the open sets. Similarly, it is generated by the closed sets. Now an open set is a countable union of rational rectangles. Therefore, the (countable) class of rational rectangles generates ℛ^k.  •
The σ-field ℛ¹ on the line R¹ is by definition generated by the finite intervals. The σ-field 𝓑 in (0, 1] is generated by the subintervals of (0, 1]. The question naturally arises whether the elements of 𝓑 are the elements of ℛ¹ that happen to lie inside (0, 1]. The answer is yes, as shown by the following technical but frequently useful result.

Theorem 10.1. Suppose that Ω₀ ⊂ Ω.
(i) If ℱ is a σ-field in Ω, then ℱ₀ = [A ∩ Ω₀: A ∈ ℱ] is a σ-field in Ω₀.
(ii) If 𝒜 generates the σ-field ℱ in Ω, then 𝒜₀ = [A ∩ Ω₀: A ∈ 𝒜] generates the σ-field ℱ₀ = [A ∩ Ω₀: A ∈ ℱ] in Ω₀.

PROOF. Of course, Ω₀ = Ω ∩ Ω₀ ∈ ℱ₀. If B = A ∩ Ω₀ and A ∈ ℱ, then Ω₀ − B = (Ω − A) ∩ Ω₀ and Ω − A ∈ ℱ. If B_n = A_n ∩ Ω₀ and A_n ∈ ℱ, then ⋃_n B_n = (⋃_n A_n) ∩ Ω₀ and ⋃_n A_n ∈ ℱ. Thus ℱ₀ is a σ-field in Ω₀.

Let σ₀(𝒜₀) be the σ-field 𝒜₀ generates in Ω₀. Since ℱ₀ is a σ-field in Ω₀, and since 𝒜₀ ⊂ ℱ₀, certainly σ₀(𝒜₀) ⊂ ℱ₀. Now ℱ₀ ⊂ σ₀(𝒜₀) will follow if it is shown that A ∈ ℱ implies A ∩ Ω₀ ∈ σ₀(𝒜₀), or, to put it another way, if it is shown that ℱ is contained in 𝒢 = [A: A ∩ Ω₀ ∈ σ₀(𝒜₀)]. Since A ∈ 𝒜 implies that A ∩ Ω₀ ∈ 𝒜₀ ⊂ σ₀(𝒜₀), it follows that 𝒜 ⊂ 𝒢. It is therefore enough to show that 𝒢 is a σ-field in Ω. But if A ∈ 𝒢, then (Ω − A) ∩ Ω₀ = Ω₀ − (A ∩ Ω₀) ∈ σ₀(𝒜₀) and hence Ω − A ∈ 𝒢. If A_n ∈ 𝒢 for all n, then (⋃_n A_n) ∩ Ω₀ = ⋃_n (A_n ∩ Ω₀) ∈ σ₀(𝒜₀) and hence ⋃_n A_n ∈ 𝒢.  •

If Ω₀ ∈ ℱ, then ℱ₀ = [A: A ⊂ Ω₀, A ∈ ℱ]. If Ω = R¹, Ω₀ = (0, 1], and ℱ = ℛ¹, and if 𝒜 is the class of finite intervals, then 𝒜₀ is the class of subintervals of (0, 1] and 𝓑 = σ₀(𝒜₀) is given by

(10.2)    𝓑 = [A: A ⊂ (0, 1], A ∈ ℛ¹].
A subset of (0, 1] is thus a Borel set (lies in 𝓑) if and only if it is a linear Borel set (lies in ℛ¹), and the distinction in terminology can be dropped.

Conventions Involving ∞
Measures assume values in the set [0, ∞] consisting of the ordinary nonnegative reals and the special value ∞, and some arithmetic conventions are called for. For x, y ∈ [0, ∞], x ≤ y means that y = ∞ or else x and y are finite (that is, are ordinary real numbers) and x ≤ y holds in the usual sense. Similarly, x < y means that y = ∞ and x is finite or else x and y are both finite and x < y holds in the usual sense. For a finite or infinite sequence x, x₁, x₂, ... in [0, ∞],

(10.3)    x = Σ_k x_k

means that either (i) x = ∞ and x_k = ∞ for some k, or (ii) x = ∞ and x_k < ∞ for all k and Σ_k x_k is an ordinary divergent infinite series, or (iii) x < ∞ and x_k < ∞ for all k and (10.3) holds in the usual sense for Σ_k x_k an ordinary finite sum or convergent infinite series. By these conventions and Dirichlet's theorem [A26], the order of summation in (10.3) has no effect on the sum. For an infinite sequence x, x₁, x₂, ... in [0, ∞],

(10.4)    x_k ↑ x

means in the first place that x_k ≤ x_{k+1} ≤ x and in the second place that either (i) x < ∞ and there is convergence in the usual sense, or (ii) x_k = ∞ for some k, or (iii) x = ∞ and the x_k are finite reals converging to infinity in the usual sense.
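Python's float arithmetic with math.inf already follows these conventions for finite sequences, which makes them easy to illustrate. This small sketch is an illustration of ours, not part of the text:

```python
import math

def ext_sum(xs):
    """Sum of a finite sequence in [0, inf] under the conventions for (10.3):
    any infinite term makes the sum infinite; otherwise sum as usual."""
    total = 0.0
    for x in xs:
        if x < 0:
            raise ValueError("terms must lie in [0, inf]")
        total += x          # inf is absorbing, matching convention (i)
    return total

if __name__ == "__main__":
    print(ext_sum([1.0, 2.5]))             # ordinary finite sum: 3.5
    print(ext_sum([1.0, math.inf, 2.0]))   # inf absorbs everything: inf
```

Divergent infinite series (convention (ii)) of course cannot be detected by any finite computation; the sketch covers finite sequences only.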
Measures
A set function μ on a field ℱ in Ω is a measure if it satisfies these conditions:

(i) μ(A) ∈ [0, ∞] for A ∈ ℱ;
(ii) μ(∅) = 0;
(iii) if A₁, A₂, ... is a disjoint sequence of ℱ-sets and if ⋃_{k=1}^∞ A_k ∈ ℱ, then (see (10.3))

    μ(⋃_{k=1}^∞ A_k) = Σ_{k=1}^∞ μ(A_k).
The measure μ is finite or infinite as μ(Ω) < ∞ or μ(Ω) = ∞; it is a probability measure if μ(Ω) = 1, as in Chapter 1. If Ω = A₁ ∪ A₂ ∪ ⋯ for some finite or countable sequence of ℱ-sets satisfying μ(A_k) < ∞, then μ is σ-finite. The significance of this concept
will be seen later. A finite measure is by definition σ-finite; a σ-finite measure may be finite or infinite. If 𝒜 is a subclass of ℱ, μ is σ-finite on 𝒜 if Ω = ⋃_k A_k for some finite or infinite sequence of 𝒜-sets satisfying μ(A_k) < ∞. It is not required that these sets A_k be disjoint. Note that if Ω is not a finite or countable union of 𝒜-sets, then no measure can be σ-finite on 𝒜.

If μ is a measure on a σ-field ℱ in Ω, the triple (Ω, ℱ, μ) is a measure space. (This term is not used if ℱ is merely a field.) It is an infinite, a σ-finite, a finite, or a probability measure space according as μ has the corresponding properties. If μ(Aᶜ) = 0 for an ℱ-set A, then A is a support of μ, and μ is concentrated on A. For a finite measure, A is a support if and only if μ(A) = μ(Ω). The pair (Ω, ℱ) itself is a measurable space if ℱ is a σ-field in Ω. To say that μ is a measure on (Ω, ℱ) indicates clearly both the space and the class of sets involved.

As in the case of probability measures, (iii) above is the condition of countable additivity, and it implies finite additivity: if A₁, ..., A_n are disjoint ℱ-sets, then

    μ(⋃_{k=1}^n A_k) = Σ_{k=1}^n μ(A_k).

As in the case of probability measures, if this holds for n = 2, then it extends inductively to all n.
Example 10.2. A measure μ on (Ω, ℱ) is discrete if there are finitely or countably many points ω_i in Ω and masses m_i in [0, ∞] such that μ(A) = Σ_{ω_i∈A} m_i for A ∈ ℱ. It is an infinite, a finite, or a probability measure as Σ_i m_i diverges, or converges, or converges to 1; the last case was treated in Example 2.9. If ℱ contains each singleton {ω_i}, then μ is σ-finite if and only if m_i < ∞ for all i.  •
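A discrete measure as in Example 10.2 is easy to realize on a computer. The sketch below is ours (the names are not the book's); it returns μ(A) = Σ_{ω_i∈A} m_i for a finite set A:

```python
import math

def discrete_measure(points, masses):
    """Build mu(A) = sum of m_i over omega_i in A (Example 10.2).
    `A` may be any Python set; masses lie in [0, inf]."""
    def mu(A):
        total = 0.0
        for p, m in zip(points, masses):
            if p in A:
                total += m      # an infinite mass makes mu(A) = inf
        return total
    return mu

if __name__ == "__main__":
    mu = discrete_measure(["a", "b", "c"], [0.5, 0.25, math.inf])
    print(mu({"a", "b"}), mu({"c"}), mu(set()))
```

Finite additivity on disjoint sets holds automatically, since the defining sum splits.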
Example 10.3. Let ℱ be the σ-field of all subsets of an arbitrary Ω, and let μ(A) be the number of points in A, where μ(A) = ∞ if A is not finite. This μ is counting measure; it is finite if and only if Ω is finite and is σ-finite if and only if Ω is countable. Even if ℱ does not contain every subset of Ω, counting measure is well defined on ℱ.  •
Example 10.4. Specifying a measure includes specifying its domain. If μ is a measure on a field ℱ and ℱ₀ is a field contained in ℱ, then the restriction μ₀ of μ to ℱ₀ is also a measure. Although often denoted by the
same symbol, μ₀ is really a different measure from μ unless ℱ₀ = ℱ. Its properties may be different: if μ is counting measure on the σ-field ℱ of all subsets of a countably infinite Ω, μ is σ-finite, but its restriction to the σ-field ℱ₀ = {∅, Ω} is not σ-finite.  •

Certain properties of probability measures carry over immediately to the general case. First, μ is monotone: μ(A) ≤ μ(B) if A ⊂ B. This is derived, just as its special case (2.4), from μ(A) + μ(B − A) = μ(B). But it is possible to go on and write μ(B − A) = μ(B) − μ(A) only if μ(B) < ∞. If μ(B) = ∞ and μ(A) < ∞, then μ(B − A) = ∞; but for every a ∈ (0, ∞] there are cases where μ(A) = μ(B) = ∞ and μ(B − A) = a. The inclusion-exclusion formula (2.7) also carries over without change to ℱ-sets of finite measure:
(10.5)    μ(⋃_{k=1}^n A_k) = Σ_i μ(A_i) − Σ_{i<j} μ(A_i ∩ A_j) + ⋯ + (−1)^{n+1} μ(A_1 ∩ ⋯ ∩ A_n).

The proof of finite subadditivity also goes through just as before:

    μ(⋃_{k=1}^n A_k) ≤ Σ_{k=1}^n μ(A_k);
here the A_k need not have finite measure.

Theorem 10.2. Let μ be a measure on a field ℱ.
(i) Continuity from below: If A_n and A lie in ℱ and A_n ↑ A, then† μ(A_n) ↑ μ(A).
(ii) Continuity from above: If A_n and A lie in ℱ and A_n ↓ A, and if μ(A₁) < ∞, then μ(A_n) ↓ μ(A).
(iii) Countable subadditivity: If A₁, A₂, ... and ⋃_{k=1}^∞ A_k lie in ℱ, then

    μ(⋃_{k=1}^∞ A_k) ≤ Σ_{k=1}^∞ μ(A_k).

(iv) If μ is σ-finite on ℱ, then ℱ cannot contain an uncountable, disjoint collection of sets of positive μ-measure.

†See (10.4).
PROOF. The proofs of (i) and (iii) are exactly as for the corresponding parts of Theorem 2.1. The same is essentially true of (ii): if μ(A₁) < ∞, subtraction is possible, and A₁ − A_n ↑ A₁ − A implies that μ(A₁) − μ(A_n) = μ(A₁ − A_n) ↑ μ(A₁ − A) = μ(A₁) − μ(A).

There remains (iv). Let [B_θ: θ ∈ Θ] be a disjoint collection of ℱ-sets satisfying μ(B_θ) > 0. Consider an ℱ-set A for which μ(A) < ∞. If θ₁, ..., θ_n are distinct indices satisfying μ(A ∩ B_{θ_i}) ≥ ε > 0, then nε ≤ Σ_{i=1}^n μ(A ∩ B_{θ_i}) ≤ μ(A), and so n ≤ μ(A)/ε. Thus the index set [θ: μ(A ∩ B_θ) ≥ ε] is finite, and hence (take the union over positive rational ε) [θ: μ(A ∩ B_θ) > 0] is countable. Since μ is σ-finite, Ω = ⋃_k A_k for some finite or countable sequence of ℱ-sets A_k satisfying μ(A_k) < ∞. But then Θ_k = [θ: μ(A_k ∩ B_θ) > 0] is countable for each k. If μ(A_k ∩ B_θ) = 0 for all k, then μ(B_θ) ≤ Σ_k μ(A_k ∩ B_θ) = 0. Therefore, Θ = ⋃_k Θ_k, and so Θ is countable.  •
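For counting measure on a finite space, the inclusion-exclusion formula (10.5) can be checked mechanically. This sanity check is our addition, not part of the text:

```python
from itertools import combinations

def inclusion_exclusion_counts(sets):
    """Both sides of (10.5) for counting measure mu(A) = |A|:
    returns (|union|, alternating sum over nonempty subfamilies)."""
    rhs = 0
    for r in range(1, len(sets) + 1):
        for fam in combinations(sets, r):
            inter = set(fam[0]).intersection(*fam[1:])
            rhs += (-1) ** (r + 1) * len(inter)
    return len(set().union(*sets)), rhs

if __name__ == "__main__":
    print(inclusion_exclusion_counts([{1, 2, 3}, {2, 3, 4}, {3, 4, 5}]))
```

The two returned numbers agree for every finite family, which is exactly what (10.5) asserts for this measure.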
Uniqueness

According to Theorem 3.3, probability measures agreeing on a π-system 𝒫 agree on σ(𝒫). There is an extension to the general case.

Theorem 10.3. Suppose that μ₁ and μ₂ are measures on σ(𝒫), where 𝒫 is a π-system, and suppose they are σ-finite on 𝒫. If μ₁ and μ₂ agree on 𝒫, then they agree on σ(𝒫).

PROOF. Suppose B ∈ 𝒫 and μ₁(B) = μ₂(B) < ∞, and let ℒ_B be the class of sets A in σ(𝒫) for which μ₁(B ∩ A) = μ₂(B ∩ A). Then ℒ_B is a λ-system containing 𝒫 and hence (Theorem 3.2) containing σ(𝒫). By σ-finiteness there exist 𝒫-sets B_k satisfying Ω = ⋃_k B_k and μ₁(B_k) = μ₂(B_k) < ∞. For A ∈ σ(𝒫), it then follows by inclusion-exclusion and the fact that 𝒫 is a π-system that μ₁((B₁ ∪ ⋯ ∪ B_n) ∩ A) = μ₂((B₁ ∪ ⋯ ∪ B_n) ∩ A); letting n → ∞ and using continuity from below gives μ₁(A) = μ₂(A).  •
PROBLEMS

· Show that (Ω, ℱ, μ) is a measure space. Under what conditions is it σ-finite? Finite?

On the σ-field of all subsets of Ω = {1, 2, ...}, put μ(A) = Σ_{k∈A} 2^{−k} if A is finite and μ(A) = ∞ otherwise. Is μ finitely additive? Countably additive?

(a) In connection with Theorem 10.2(ii), show that if A_n ↓ A and μ(A_k) < ∞ for some k, then μ(A_n) ↓ μ(A).
(b) Find an example in which A_n ↓ A, μ(A_n) = ∞, and A = ∅.
10.5.
The natural generalization of (4.10) is

(10.6)    μ(liminf_n A_n) ≤ liminf_n μ(A_n) ≤ limsup_n μ(A_n) ≤ μ(limsup_n A_n).
Show that the left-hand inequality always holds. Show that the right-hand inequality holds if μ(⋃_{k≥n} A_k) < ∞ for some n but can fail otherwise.

10.6. 3.5↑ A measure space (Ω, ℱ, μ) is complete if A ⊂ B, B ∈ ℱ, and μ(B) = 0 together imply that A ∈ ℱ; the definition is just as in the probability case. Use the ideas of Problem 3.5 to construct a complete measure space (Ω, ℱ⁺, μ⁺) such that ℱ ⊂ ℱ⁺ and μ and μ⁺ agree on ℱ.
10.7. The condition in Theorem 10.2(iv) essentially characterizes σ-finiteness.
(a) Suppose that (Ω, ℱ, μ) has no "infinite atoms," in the sense that for every A in ℱ, if μ(A) = ∞, then there is in ℱ a B such that B ⊂ A and 0 < μ(B) < ∞. Show that if ℱ does not contain an uncountable, disjoint collection of sets of positive measure, then μ is σ-finite. (Use Zorn's lemma.)
(b) Show by example that this is false without the condition that there are no "infinite atoms."

10.8. Deduce Theorem 3.3 from Theorem 10.3 and from Theorem 10.4.

10.9. Example 10.5 shows that Theorem 10.3 fails without the σ-finiteness condition. Construct other examples of this kind.

10.10. Suppose that μ₁ and μ₂ are measures on a σ-field ℱ and are σ-finite on a field ℱ₀ generating ℱ.
(a) Show that, if μ₁(A) ≤ μ₂(A) holds for all A in ℱ₀, then it holds for all A in ℱ.
(b) Show that this fails if ℱ₀ is only a π-system.
SECTION 11. OUTER MEASURE

Outer Measure
An outer measure is a set function μ* that is defined for all subsets of a space Ω and has these four properties:

(i) μ*(A) ∈ [0, ∞] for every A ⊂ Ω;
(ii) μ*(∅) = 0;
(iii) μ* is monotone: A ⊂ B implies μ*(A) ≤ μ*(B);
(iv) μ* is countably subadditive: μ*(⋃_n A_n) ≤ Σ_n μ*(A_n).
The set function P* defined by (3.1) is an example, one that generalizes:

Example 11.1. Let ρ be a set function on a class 𝒜 in Ω. Assume that ∅ ∈ 𝒜 and ρ(∅) = 0, and that ρ(A) ∈ [0, ∞] for A ∈ 𝒜; ρ and 𝒜 are otherwise arbitrary. Put

(11.1)    μ*(A) = inf Σ_n ρ(A_n),

where the infimum extends over all finite and countable coverings of A by 𝒜-sets A_n. If no such covering exists, take μ*(A) = ∞ in accordance with the convention that the infimum over an empty set is ∞.

That μ* satisfies (i), (ii), and (iii) is clear. If μ*(A_n) = ∞ for some n, then obviously μ*(⋃_n A_n) ≤ Σ_n μ*(A_n). Otherwise, cover each A_n by 𝒜-sets B_{nk} satisfying Σ_k ρ(B_{nk}) < μ*(A_n) + ε/2ⁿ; then μ*(⋃_n A_n) ≤ Σ_{n,k} ρ(B_{nk}) < Σ_n μ*(A_n) + ε. Thus μ* is an outer measure.  •
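When the class 𝒜 is finite, the infimum in (11.1) can be computed by brute force, which makes the definition concrete. This is a small sketch of ours (the name `rho` and the dict representation are our choices):

```python
import math
from itertools import chain, combinations

def outer_measure(rho, A):
    """mu*(A) as in (11.1): the infimum of sum rho(A_n) over finite coverings
    of A by sets in rho's domain (a dict frozenset -> mass in [0, inf]).
    Returns math.inf when no covering exists; the empty covering gives
    mu*(empty set) = 0."""
    best = math.inf
    cls = list(rho)
    for r in range(len(cls) + 1):
        for fam in combinations(cls, r):
            if set(A) <= set(chain.from_iterable(fam)):
                best = min(best, sum(rho[s] for s in fam))
    return best

if __name__ == "__main__":
    rho = {frozenset({1, 2}): 1.0, frozenset({2, 3}): 1.0,
           frozenset({1, 2, 3}): 5.0}
    print(outer_measure(rho, {1, 3}))   # cheapest covering: {1,2} with {2,3}
```

Properties (i)-(iv) can be spot-checked on such examples; monotonicity, for instance, holds because any covering of a larger set also covers a smaller one.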
Define A to be μ*-measurable if

(11.2)    μ*(A ∩ E) + μ*(Aᶜ ∩ E) = μ*(E)

for every E. This is the general version of the definition (3.4) used in Section 3. It is by subadditivity equivalent to

(11.3)    μ*(A ∩ E) + μ*(Aᶜ ∩ E) ≤ μ*(E).

Denote by ℳ(μ*) the class of μ*-measurable sets. The extension property for probability measures in Theorem 3.1 was proved by means of a sequence of four lemmas. These lemmas carry over directly to the case of the general outer measure: if P* is replaced by μ* and ℳ by ℳ(μ*) at each occurrence, the proofs hold word for word, symbol for symbol. In particular, an examination of the arguments shows that ∞ as a possible value for μ* does not necessitate any changes. The fourth lemma in Section 3 becomes this:

Theorem 11.1. If μ* is an outer measure, then ℳ(μ*) is a σ-field, and μ* restricted to ℳ(μ*) is a measure.
This will be used to prove an extension theorem, but it has other applications as well; see Sections 19 and 29.
Extension

Theorem 11.2. A measure on a field has an extension to the generated σ-field.
If the original measure on the field is σ-finite, then it follows by Theorem 10.3 that the extension is unique. Theorem 11.2 can be deduced from Theorem 11.1 by the arguments used in the proof of Theorem 3.1.† It is unnecessary to retrace the steps, however, because the ideas will appear in stronger form in the proof of the next result, which generalizes Theorem 11.2.

Define a class 𝒜 of subsets of Ω to be a semiring if (i) ∅ ∈ 𝒜; (ii) A, B ∈ 𝒜 implies A ∩ B ∈ 𝒜; (iii) if A, B ∈ 𝒜 and A ⊂ B, then there exist disjoint 𝒜-sets C₁, ..., C_n such that B − A = ⋃_{k=1}^n C_k.

Example 11.2. The class of finite intervals in Ω = R¹ and the class of subintervals of Ω = (0, 1] are the simplest examples of semirings. Note that a semiring need not contain Ω.  •

If A and B lie in a semiring 𝒜, then since A ∩ B ∈ 𝒜 by (ii), A ∩ Bᶜ = A − A ∩ B has by (iii) the form ⋃_{k=1}^n C_k for disjoint 𝒜-sets C_k. If A, B₁, ..., B_{j+1} are in 𝒜 and A ∩ B₁ᶜ ∩ ⋯ ∩ B_jᶜ has this form, then A ∩ B₁ᶜ ∩ ⋯ ∩ B_{j+1}ᶜ = ⋃_{k=1}^n (C_k ∩ B_{j+1}ᶜ), and each C_k ∩ B_{j+1}ᶜ is a finite disjoint union of 𝒜-sets. It follows by induction that for any sets A, B₁, ..., B_j in 𝒜, there are in 𝒜 disjoint sets C_k such that A ∩ B₁ᶜ ∩ ⋯ ∩ B_jᶜ = ⋃_{k=1}^n C_k.
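Condition (iii) is easy to see for the interval semiring of Example 11.2: removing A = (a, b] from B = (c, d] leaves at most two half-open pieces. A small sketch of ours, not the book's:

```python
def semiring_diff(a, b, c, d):
    """For half-open intervals A = (a, b] contained in B = (c, d], return
    disjoint half-open intervals whose union is B - A (condition (iii))."""
    if not (c <= a <= b <= d):
        raise ValueError("A = (a, b] must be contained in B = (c, d]")
    pieces = []
    if c < a:
        pieces.append((c, a))   # the part of B to the left of A
    if b < d:
        pieces.append((b, d))   # the part of B to the right of A
    return pieces

if __name__ == "__main__":
    print(semiring_diff(1, 2, 0, 5))   # (0,5] minus (1,2]: [(0, 1), (2, 5)]
```

The lengths of the pieces add to (d − c) − (b − a), which is the finite-additivity identity used at the start of the proof of Theorem 11.3.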
Theorem 11.3. Suppose that μ is a set function on a semiring 𝒜. Suppose that μ has values in [0, ∞], that μ(∅) = 0, and that μ is finitely additive and countably subadditive. Then μ extends to a measure on σ(𝒜).‡

This contains Theorem 11.2, because the conditions are all satisfied if 𝒜 is a field and μ is a measure on it. If Ω = ⋃_k A_k for a sequence of 𝒜-sets satisfying μ(A_k) < ∞, then it follows by Theorem 10.3 that the extension is unique.

†See also Problem 11.1.
‡And so μ must have been countably additive on 𝒜 in the first place; see Problem 11.4.
PROOF. If A, B, and the C_k are related as in condition (iii) in the definition of semiring, then by finite additivity μ(B) = μ(A) + Σ_{k=1}^n μ(C_k) ≥ μ(A). Thus μ is monotone. Define an outer measure μ* by (11.1) for ρ = μ:

(11.4)    μ*(A) = inf Σ_n μ(A_n),

the infimum extending over coverings of A by 𝒜-sets. The first step is to show that 𝒜 ⊂ ℳ(μ*). Suppose that A ∈ 𝒜. If μ*(E) = ∞, then (11.3) holds trivially. If μ*(E) < ∞, for given ε choose 𝒜-sets A_n such that E ⊂ ⋃_n A_n and Σ_n μ(A_n) < μ*(E) + ε. Since 𝒜 is a semiring, B_n = A ∩ A_n lies in 𝒜 and Aᶜ ∩ A_n = A_n − B_n has the form ⋃_{k=1}^{m_n} C_{nk} for disjoint 𝒜-sets C_{nk}. Note that A_n = B_n ∪ ⋃_{k=1}^{m_n} C_{nk}, where the union is disjoint, and that A ∩ E ⊂ ⋃_n B_n and Aᶜ ∩ E ⊂ ⋃_n ⋃_{k=1}^{m_n} C_{nk}. By the definition of μ* and the assumed finite additivity of μ,

    μ*(A ∩ E) + μ*(Aᶜ ∩ E) ≤ Σ_n μ(B_n) + Σ_n Σ_{k=1}^{m_n} μ(C_{nk}) = Σ_n μ(A_n) < μ*(E) + ε.

Since ε is arbitrary, (11.3) follows. Thus 𝒜 ⊂ ℳ(μ*).

The next step is to show that μ* and μ agree on 𝒜. If A ⊂ ⋃_n A_n for 𝒜-sets A and A_n, then by the assumed countable subadditivity of μ and the monotonicity established above, μ(A) ≤ Σ_n μ(A ∩ A_n) ≤ Σ_n μ(A_n). Therefore, A ∈ 𝒜 implies that μ(A) ≤ μ*(A) and hence, since the reverse inequality is an immediate consequence of (11.4), μ(A) = μ*(A). Thus μ* agrees with μ on 𝒜.

Since 𝒜 ⊂ ℳ(μ*) and ℳ(μ*) is a σ-field (Theorem 11.1), σ(𝒜) ⊂ ℳ(μ*). Since μ* is countably additive on ℳ(μ*) (Theorem 11.1 again), μ* restricted to σ(𝒜) is an extension of μ on 𝒜, as required.  •
t On a field, countable additivity implies countable subadditivity, but � is merely a semiring. Hence the necessity of this observation; but see Problem 1 1 .4.
166
MEASURE
This gives a second construction of Lebesgue measure in the unit interval. In the first construction A was extended first from the class of intervals to the field 910 of finite disjoint unions of intervals (see (2.1 1)) and then by Theorem 1 1 .2 (in its special form Theorem 3.1) from !110 to !11 a(aJ0). Using Theorem 11.3 instead of Theorem 11.2 effects a slight economy, since the extension then goes from JJI directly to !11 without the intermediate stop at 910, and the arguments involving (2.12) and (2.13) become unnecessary. ==
In Theorem 11.3 take for JJI the serrunng of finite intervals on the real line R t , and consider A t (a, b) = b - a. The arguments for Theorem 2.2 and its two lemmas in no way require that the (finite) intervals in question be contained in (0, 1 ], and so A t is finitely additive and countably subadditive on this class d. Hence A t extends to the a-field gJ t of linear Borel sets, which is by definition generated by JJI. This defines • Lebesgue measure A t over the whole real line. Example 11.4.
A subset of (0, 1] lies in !11 if and only if it lies in gJ t (see (10.2)). Now A t (A) = A( A ) for subintervals A of (0, 1 ], and it follows by uniqueness (Theorem 3.3) that A t (A) = A ( A ) for all A in !11 . Thus there is no inconsistency in dropping A t and using A to denote Lebesgue measure on 91 t as well as on !JI. k The class of bounded rectangles in R is a semiring, a fact needed in the next section. Suppose A = [x: X; e I;, i � k] and B = [x: X; e .f;, i � k] are nonempty rectangles, the I; and 1; being finite intervals. If A c B, then I; c 1;, so that 1; - I; is a disjoint union I;' u /;'' k of intervals (possibly empty). Consider the 3 disjoint rectangles [x: X; e lf;, i < k], where for each i, lf; is I; or /;' or /;''. One of these rectangles is A itself, and B - A is the union of the others. The rectangles thus form a semmng. • Exllllple l 11.5.
•
•
An Approximation 1beorem
If JJ1 is a semiring, then by Theorem 10.3 a measure on a( JJ/) is determined by its values on JJI if it is a-finite there. The following theorem shows more explicitly how the measure of a a( JJ/)-set can be approximated by the measures of �sets.
SECTION
11.
OUTER MEASURE
167
Theorem 11.4. Suppose that 𝒜 is a semiring, μ is a measure on ℱ = σ(𝒜), and μ is σ-finite on 𝒜.
(i) If B ∈ ℱ and ε > 0, there exists a finite or infinite disjoint sequence A₁, A₂, ... of 𝒜-sets such that B ⊂ ⋃_k A_k and μ((⋃_k A_k) − B) < ε.
(ii) If B ∈ ℱ and ε > 0, and if μ(B) < ∞, then there exists a finite disjoint sequence A₁, ..., A_n of 𝒜-sets such that μ(B △ (⋃_{k=1}^n A_k)) < ε.
·
·
·
•
•
•
If, for example, B is a linear Borel set of finite Lebesgue measure, then A ( BLl. (U Z _ 1 A k )) < £ for some disjoint collection of finite intervals
A t , . . . ' A n.
If p. is a finite measure on a a-field !F generated by a field !F0, then for each �set A and each positive £ there is an !F0-set B such that
Corollary.
p.(ALl.B) < £.
This is of course an immediate consequence of part (ii) of the theorem, but there is a simple direct argument. Let fl be the class of jli:sets with the required property. Since A c Ll.B c = ALl.B, f§ is closed under com plementation. If A = U n A n , where A n e f§, given £ choose n 0 so that p. ( A - U n � noAn ) < £, and then choose !F0-sets Bn , n � n 0, so that p ( A n Ll.Bn ) < £ jn 0 • Since (U n � noA n )A(U n � noBn ) C Un � n0( A n A Bn ), the .fF0set B = U n � noBn satisfies p.(AAB) < 2£. Of course .?F0 c f§; since f§ is a • a- field, .fF c l§, as required. PROOF.
Caratheodory's Condition*
Suppose that μ* is an outer measure on the class of all subsets of R^k. It satisfies Caratheodory's condition if

(11.5)    μ*(A ∪ B) = μ*(A) + μ*(B)
Ifk p.* is an outer measure on R k satisfying Caratheodory 's
condition, then !Ji
c
.l(p.*).
Since .l(p.* ) is a a-field (Theorem 11.1) and the closed sets generate !Jt k (Example 10.1), it is enough to show that each closed set is p.*-measureable. The problem is thus to prove (11.3) for A closed and E arbitrary. Let B = A n E and C = A c n E. Let Cn = [x e C: dist(x, A ) � n - 1 ], so that Cn t C and dist(Cn , A) � n - 1 . By Caratheodory' s condition, p.*( E) = p.*( B U C ) � p.*( B U Cn ) = p.*( B) + p.*( Cn ). Hence it suffices to prove that p.*( Cn ) .-. p.*( C), or, since Cn t C, that p.*( C) � lim n p.*( Cn ). Let Dn = cn + 1 - en ; if X E Dn + 1 and l x - Yl < n - 1 ( n + 1) - 1 , then dist( y , A ) s I Y - x l + dist(x, A) < n - 1 , so that y � en . Thus l x - Y l > n - 1 ( n + 1) - 1 if X E Dn + 1 and y E en , and so Dn + 1 and en are at positive distance if they are nonempty. By the Caratheodory condition extended inductively, PROOF.
(11 .6) and ( 1 1 .7) By subadditivity,
( 1 1 .8)
00
00
p * ( C ) � p.*( C2 n ) + En p.* ( D2k ) + E p.*( D2k - l ) . k
k-n+ l
If Ep.*( D2 k ) or Ep.*( D2 k _ 1 ) diverges, p.*(C) < limnp.*(Cn ) follows from • (11 .6) or (11.7); otherwise, it follows from (11.8). * This topic may be omitted; it is needed only for the construction of Hausdorff measure in
Section 19.
Example 11.6. For A ⊂ R¹ define λ*(A) = inf Σ_n (b_n − a_n), where the infimum extends over countable coverings of A by intervals (a_n, b_n]. Then λ* is an outer measure (Example 11.1). Suppose that A and B are at positive distance and consider any covering of A ∪ B by intervals (a_n, b_n]; (11.5) will follow if Σ(b_n − a_n) ≥ λ*(A) + λ*(B) necessarily holds. The sum here is unchanged if (a_n, b_n] is replaced by a disjoint union of subintervals each of length less than dist(A, B). But if b_n − a_n < dist(A, B), then (a_n, b_n] cannot meet both A and B. Let M and N be the sets of n for which (a_n, b_n] meets A and B, respectively. Then M and N are disjoint, A ⊂ ⋃_{n∈M} (a_n, b_n], and B ⊂ ⋃_{n∈N} (a_n, b_n]; hence Σ(b_n − a_n) ≥ Σ_{n∈M} (b_n − a_n) + Σ_{n∈N} (b_n − a_n) ≥ λ*(A) + λ*(B). Thus λ* satisfies Caratheodory's condition. By Theorem 11.5, ℛ¹ ⊂ ℳ(λ*), and so λ* restricted to ℛ¹ is a measure by Theorem 11.1. By Lemma 2 in Section 2 (p. 24), λ*(a, b] = b − a. Thus λ* restricted to ℛ¹ is λ, which gives still another construction of Lebesgue measure.  •
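For a set that is itself a finite union of half-open intervals, the infimum defining λ* is achieved by the merged intervals, so λ* reduces to total length after merging overlaps. A small sketch of ours:

```python
def lambda_star_union(intervals):
    """Lebesgue outer measure of a finite union of half-open intervals
    (a, b]: sort, merge overlapping or touching intervals, add lengths."""
    ivs = sorted((a, b) for a, b in intervals if a < b)
    total = 0.0
    cur = None
    for a, b in ivs:
        if cur is None or a > cur[1]:
            if cur is not None:
                total += cur[1] - cur[0]
            cur = [a, b]
        else:
            cur[1] = max(cur[1], b)
    if cur is not None:
        total += cur[1] - cur[0]
    return total

if __name__ == "__main__":
    print(lambda_star_union([(0, 1), (2, 3)]))   # disjoint pieces: 2.0
    print(lambda_star_union([(0, 2), (1, 3)]))   # overlap merged: 3.0
```

General Borel sets, of course, require the full covering infimum; this shortcut is valid only because finite unions of intervals are their own optimal coverings.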
PROBLEMS

11.1. The proof of Theorem 3.1 obviously applies if the probability measure is replaced by a finite measure, since this is only a matter of rescaling. Take as a starting point then the fact that a finite measure on a field extends uniquely to the generated σ-field. By the following steps prove Theorem 11.2, that is, remove the assumption of finiteness.
   (a) Let μ be a measure (not necessarily even σ-finite) on a field 𝓕₀ and let 𝓕 = σ(𝓕₀). If A ∈ 𝓕₀ and μ(A) < ∞, restrict μ to a finite measure μ_A on the field 𝓕₀(A) = [B: B ⊂ A, B ∈ 𝓕₀] in A, and extend μ_A to a finite measure (still denoted μ_A) on the σ-field 𝓕(A) = [B: B ⊂ A, B ∈ 𝓕] generated in A by 𝓕₀(A).
   (b) Suppose that E ∈ 𝓕. If there exist disjoint 𝓕₀-sets A_n such that μ(A_n) < ∞ and E ⊂ ⋃_n A_n, put μ(E) = Σ_n μ_{A_n}(E ∩ A_n) and prove consistency; otherwise, put μ(E) = ∞.
   (c) Show that μ is a measure on 𝓕 and agrees with the original μ on 𝓕₀.

11.2. (a) If μ* is an outer measure and ν*(E) = μ*(E ∩ A), then ν* is an outer measure.
   (b) If μ_n* are outer measures and μ*(E) = Σ_n μ_n*(E) or μ*(E) = sup_n μ_n*(E), then μ* is an outer measure.

11.3. 2.5 10.10 ↑ Suppose 𝒜 is a semiring that contains ∅.
   (a) Show that the field generated by 𝒜 consists of the finite disjoint unions of elements of 𝒜.
   (b) Suppose that measures μ₁ and μ₂ on σ(𝒜) are σ-finite on 𝒜 and that μ₁(A) ≤ μ₂(A) for all A in 𝒜. Show that μ₁(A) ≤ μ₂(A) for all A in σ(𝒜).
11.4. Suppose that 𝒜 is a semiring and μ is a finitely additive set function on 𝒜 having values in [0, ∞] and satisfying μ(∅) = 0.
   (a) Suppose that A, A₁, ..., A_m are 𝒜-sets. Show that Σ_{k=1}^m μ(A_k) ≤ μ(A) if ⋃_{k=1}^m A_k ⊂ A and the A_k are disjoint. Show that μ(A) ≤ Σ_{k=1}^m μ(A_k) if A ⊂ ⋃_{k=1}^m A_k (here the A_k need not be disjoint).
   (b) Show that μ is countably additive on 𝒜 if and only if it is countably subadditive.

11.5. Measure theory is sometimes based on the notion of a ring. A class 𝒜 is a ring if it contains the empty set and is closed under the formation of differences and finite unions; it is a σ-ring if it is also closed under the formation of countable unions.
   (a) Show that a ring (σ-ring) containing Ω is a field (σ-field). Find a ring that is not a field, a σ-ring that is not a σ-field, a σ-ring that is not a field.
   (b) Show that a ring is closed under the formation of symmetric differences and finite intersections. Show that a σ-ring is closed under the formation of countable intersections.
   (c) Show that the field f(𝒜) generated (see Problem 2.5) by a ring 𝒜 consists of the sets A for which either A ∈ 𝒜 or A^c ∈ 𝒜. Show that f(𝒜) = σ(𝒜) if 𝒜 is a σ-ring.
   (d) Show that a (countably additive) measure on a ring 𝒜 extends to f(𝒜). Of course this follows by Theorem 11.3, but prove it directly.
   (e) Let 𝒜 be a ring and let 𝒜′ be the σ-ring generated (define) by 𝒜. Show that a measure on 𝒜 extends to a measure on 𝒜′ and that the extension is unique if the original measure is σ-finite on 𝒜. Hint: To prove uniqueness, first show by adapting the proof of Theorem 3.4 that a monotone class containing 𝒜 must contain 𝒜′.

11.6. Call 𝒜 a strong semiring if it contains the empty set and is closed under the formation of finite intersections and if A, B ∈ 𝒜 and A ⊂ B imply that there exist 𝒜-sets A₀, ..., A_n such that A = A₀ ⊂ A₁ ⊂ ··· ⊂ A_n = B and A_k − A_{k−1} ∈ 𝒜, k = 1, ..., n.
   (a) Show that the class of bounded rectangles in R^k is a strong semiring.
   (b) Show that a semiring need not be a strong semiring.
   (c) Call μ on 𝒜 pairwise additive if A, B, A ∪ B ∈ 𝒜 and A ∩ B = ∅ imply that μ(A ∪ B) = μ(A) + μ(B). It can be shown that, if μ is pairwise additive on a strong semiring, then it is finitely additive there. Show that this is false if 𝒜 is only a semiring.

11.7. Suppose that μ is a finite measure on the σ-field 𝓕 generated by the field 𝓕₀. Suppose that A ∈ 𝓕 and that A₁, ..., A_n is a decomposition of A into 𝓕-sets. Show that there exist an 𝓕₀-set B and a decomposition B₁, ..., B_n of B into 𝓕₀-sets such that μ(A_k △ B_k) < ε, k = 1, ..., n. Show that B may be taken as A if A ∈ 𝓕₀.

11.8. Suppose that μ is σ-finite on a field 𝓕₀ generating the σ-field 𝓕. Show that for A ∈ 𝓕 there exist 𝓕₀-sets C_{ni} and D_{ni} such that, if C = ⋃_n ⋂_i C_{ni} and D = ⋂_n ⋃_i D_{ni}, then C ⊂ A ⊂ D and μ(D − C) = 0.

11.9. Show that σ-finiteness is essential in Theorem 11.4. Show that part (ii) can fail if μ(B) = ∞.

11.10. Here is another approach to Theorem 11.4(i). Suppose for simplicity that μ is a finite measure on the σ-field 𝓕 generated by a field 𝓕₀. Let 𝒢 be the class of 𝓕-sets G such that for each ε there exist 𝓕₀-sets A_k and B_k satisfying ⋂_k A_k ⊂ G ⊂ ⋃_k B_k and μ((⋃_k B_k) − (⋂_k A_k)) < ε. Show that 𝒢 is a σ-field containing 𝓕₀.

11.11. This and Problems 11.12, 16.25, 17.16, 17.17, and 17.18 lead to proofs of the Daniell-Stone and Riesz representation theorems. Let Λ be a real linear functional on a vector lattice 𝓛 of (finite) real functions on a space Ω. This means that if f and g lie in 𝓛, then so do f ∨ g and f ∧ g (with values max{f(ω), g(ω)} and min{f(ω), g(ω)}), as well as αf + βg, and Λ(αf + βg) = αΛ(f) + βΛ(g). Assume further of 𝓛 that f ∈ 𝓛 implies f ∧ 1 ∈ 𝓛 (where 1 denotes the function identically equal to 1). Assume further of Λ that it is positive in the sense that f ≥ 0 (pointwise) implies Λ(f) ≥ 0 and continuous from above at 0 in the sense that f_n ↓ 0 (pointwise) implies Λ(f_n) → 0.
   (a) If f ≤ g (f, g ∈ 𝓛), define in Ω × R^1 an "interval"

   (11.9)  (f, g] = [(ω, t): f(ω) < t ≤ g(ω)].

   Show that these sets form a semiring 𝒜₀.
   (b) Define a set function μ₀ on 𝒜₀ by

   (11.10)  μ₀(f, g] = Λ(g − f).

   Show that μ₀ is finitely additive and countably subadditive on 𝒜₀.

11.12. (a) Assume f ∈ 𝓛 and let f_n = (n(f − f ∧ 1)) ∧ 1. Show that f(ω) ≤ 1 implies f_n(ω) = 0 for all n and f(ω) > 1 implies f_n(ω) = 1 for all sufficiently large n. Conclude that for x > 0,

   (11.11)  (0, xf_n] ↑ [ω: f(ω) > 1] × (0, x].

   (b) Let 𝓕 be the smallest σ-field with respect to which every f in 𝓛 is measurable: 𝓕 = σ[f^{-1}H: f ∈ 𝓛, H ∈ ℛ^1]. Let 𝓕₀ be the class of A in 𝓕 for which A × (0, 1] ∈ σ(𝒜₀). Show that 𝓕₀ is a semiring and that 𝓕 = σ(𝓕₀).
   (c) Let μ be the extension of μ₀ (see (11.10)) to σ(𝒜₀), and for A ∈ 𝓕₀ define ν₀(A) = μ(A × (0, 1]). Show that ν₀ is finitely additive and countably subadditive on the semiring 𝓕₀.
SECTION 12. MEASURES IN EUCLIDEAN SPACE

Lebesgue Measure
In Example 11.4 Lebesgue measure λ was constructed on the class ℛ^1 of linear Borel sets. By Theorem 10.3, λ is the only measure on ℛ^1 satisfying λ(a, b] = b − a for all intervals. There is in k-space an analogous k-dimensional Lebesgue measure λ_k on the class ℛ^k of k-dimensional Borel sets (Example 10.1). It is specified by the requirement that bounded rectangles have measure

(12.1)  λ_k[x: a_i < x_i ≤ b_i, i = 1, ..., k] = ∏_{i=1}^k (b_i − a_i).

This is ordinary volume, that is, length (k = 1), area (k = 2), volume (k = 3), or hypervolume (k ≥ 4). Since an intersection of rectangles is again a rectangle, the uniqueness theorem shows that (12.1) completely determines λ_k.

That there does exist such a measure on ℛ^k can be proved in several ways. One is to use the ideas involved in the case k = 1. A second construction is given in Theorem 12.5. A third, independent, construction uses the general theory of product measures; this is carried out in Section 18.† For the moment, assume the existence on ℛ^k of a measure λ_k satisfying (12.1). Of course, λ_k is σ-finite.

A basic property of λ_k is translation invariance.*
Theorem 12.1. If A ∈ ℛ^k, then A + x = [a + x: a ∈ A] ∈ ℛ^k and λ_k(A) = λ_k(A + x) for all x.
PROOF. If 𝒢 is the class of A such that A + x is in ℛ^k for all x, then 𝒢 is a σ-field containing the bounded rectangles, and so 𝒢 ⊃ ℛ^k. Thus A + x ∈ ℛ^k for A ∈ ℛ^k. For fixed x define a measure μ on ℛ^k by μ(A) = λ_k(A + x). Then μ and λ_k agree on the π-system of bounded rectangles and so agree for all Borel sets. ∎
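Translation invariance can be illustrated numerically: a Monte Carlo estimate of the area of a planar Borel set is unchanged, up to sampling error, when the set is translated. (The disc, the translation vector, and the sample size below are illustrative choices, not part of the text.)

```python
import random

random.seed(7)

def mc_area(indicator, lo, hi, n=200_000):
    """Monte Carlo estimate of the area of a set inside the box [lo, hi]^2."""
    hits = sum(indicator(random.uniform(lo, hi), random.uniform(lo, hi))
               for _ in range(n))
    return (hi - lo) ** 2 * hits / n

# A = unit disc at the origin, and its translate A + x with x = (2.5, -1.0)
disc = lambda u, v: u * u + v * v <= 1.0
shifted = lambda u, v: (u - 2.5) ** 2 + (v + 1.0) ** 2 <= 1.0

a1 = mc_area(disc, -4.0, 4.0)
a2 = mc_area(shifted, -4.0, 4.0)
print(a1, a2)  # both near pi, equal up to sampling error
```

The two estimates agree within Monte Carlo noise, as Theorem 12.1 requires of the exact measures.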
If A is a (k − 1)-dimensional subspace and x lies outside A, the hyperplanes A + tx for real t are disjoint and, by Theorem 12.1, all have the same measure. Since only countably many disjoint sets can have positive measure (Theorem 10.2(iv)), the measure common to the A + tx must be 0: Every (k − 1)-dimensional hyperplane has k-dimensional Lebesgue measure 0.

The Lebesgue measure of a rectangle is its ordinary volume. The following theorem makes it possible to calculate the measures of simple figures.
Theorem 12.2. If T: R^k → R^k is linear and nonsingular, then A ∈ ℛ^k implies that TA ∈ ℛ^k and

(12.2)  λ_k(TA) = |det T| · λ_k(A).
† See also Problems 12.15, 17.18, and 20.4.
*An analogous fact was used in the construction of a nonmeasurable set at the end of Section 3.
As a parallelepiped is the image of a rectangle under a linear transformation, (12.2) can be used to compute its volume. If T is a rotation or a reflection (an orthogonal or a unitary transformation), then det T = ±1, and so λ_k(TA) = λ_k(A). Hence every rigid transformation (a rotation followed by a translation) preserves Lebesgue measure.

PROOF OF THE THEOREM. Since T⋃_n A_n = ⋃_n TA_n and TA^c = (TA)^c because of the assumed nonsingularity of T, the class 𝒢 = [A: TA ∈ ℛ^k] is a σ-field. Since TA is open for open A, again by the assumed nonsingularity of T, 𝒢 contains all the open sets and hence (Example 10.1) all the Borel sets. Therefore, A ∈ ℛ^k implies that TA ∈ ℛ^k.

For A ∈ ℛ^k, set μ₁(A) = λ_k(TA) and μ₂(A) = |det T| · λ_k(A). Then μ₁ and μ₂ are measures, and by Theorem 10.3 they will agree on ℛ^k (which is the assertion (12.2)) if they agree on the π-system consisting of the rectangles [x: a_i < x_i ≤ b_i, i = 1, ..., k] for which the a_i and the b_i are all rational (Example 10.1). It suffices therefore to prove (12.2) for rectangles with sides of rational length. Since such a rectangle is a finite disjoint union of cubes and λ_k is translation-invariant, it is enough to check (12.2) for cubes

(12.3)  A = [x: 0 < x_i ≤ c, i = 1, ..., k].
PROBLEMS

12.1. ... Extend to R*.

12.2. Suppose that A ∈ ℛ^1, λ(A) > 0, and 0 < θ < 1. Show that there is a bounded open interval I such that λ(A ∩ I) > θλ(I). Hint: Show that λ(A) may be assumed finite, and choose an open G such that A ⊂ G and λ(A) > θλ(G). Now G = ⋃_n I_n for disjoint open intervals I_n [A12], and Σ_n λ(A ∩ I_n) > θ Σ_n λ(I_n); use an I_n.

12.3. 12.2 ↑ If A ∈ ℛ^1 and λ(A) > 0, then the origin is interior to the difference set D(A) = [x − y: x, y ∈ A]. Hint: Choose a bounded open interval I as in Problem 12.2 for θ = 3/4. Suppose that |z| < λ(I)/2; since A ∩ I and (A ∩ I) + z are contained in an interval of length less than 3λ(I)/2 and hence cannot be disjoint, z ∈ D(A).

12.4. 4.13 12.3 ↑ The following construction leads to a subset H of the unit interval that is nonmeasurable in the extreme sense that its inner and outer Lebesgue measures are 0 and 1: λ_*(H) = 0 and λ*(H) = 1 (see (3.9) and (3.10)). Complete the details. The ideas are those in the construction of a nonmeasurable set at the end of Section 3. It will be convenient to work in G = [0, 1); let ⊕ and ⊖ denote addition and subtraction modulo 1 in G, which is a group with identity 0.
   (a) Fix an irrational θ in G and for n = 0, ±1, ±2, ... let θ_n be nθ reduced modulo 1. Show that θ_n ⊕ θ_m = θ_{n+m}, θ_n ⊖ θ_m = θ_{n−m}, and the θ_n are distinct. Show that {θ_{2n}: n = 0, ±1, ...} and {θ_{2n+1}: n = 0, ±1, ...} are dense in G.
   (b) Take x and y to be equivalent if x ⊖ y lies in {θ_n: n = 0, ±1, ...}, which is a subgroup. Let S contain one representative from each equivalence class (each coset). Show that G = ⋃_n (S ⊕ θ_n), where the union is disjoint. Put H = ⋃_n (S ⊕ θ_{2n}) and show that G − H = H ⊕ θ.
   (c) Suppose that A is a Borel set contained in H. If λ(A) > 0, then D(A) contains an interval (0, ε); but then some θ_{2k+1} lies in (0, ε) ⊂ D(A) ⊂ D(H), and so θ_{2k+1} = h₁ ⊖ h₂ = (s₁ ⊕ θ_{2m}) ⊖ (s₂ ⊕ θ_{2n}) for some h₁, h₂ in H and some s₁, s₂ in S. Deduce that s₁ = s₂ and obtain a contradiction. Conclude that λ_*(H) = 0.
   (d) Show that λ_*(H ⊕ θ) = 0 and λ*(H) = 1.

12.5. 12.4 ↑ The construction here gives sets H_n such that H_n ↑ G and λ_*(H_n) = 0. If J_n = G − H_n, then J_n ↓ ∅ and λ*(J_n) = 1.
   (a) Let H_n = ⋃_{|k|≤n} (S ⊕ θ_k), so that H_n ↑ G. Show that the sets H_n ⊕ θ_{(2n+1)v} are disjoint for different v.
   (b) Suppose that A is a Borel set contained in H_n. Show that A and indeed all the A ⊕ θ_{(2n+1)v} have Lebesgue measure 0.
12.6. Suppose that μ is nonnegative and finitely additive on ℛ^k and μ(R^k) < ∞. Suppose further that μ(A) = sup μ(K), where K ranges over the compact subsets of A. Show that μ is countably additive. (Compare Theorem 12.3(ii).)

12.7. Suppose μ is a measure on ℛ^k such that bounded sets have finite measure. Given A, show that there exist an F_σ set U (a countable union of closed sets) and a G_δ set V (a countable intersection of open sets) such that U ⊂ A ⊂ V and μ(V − U) = 0.

12.8. 2.17 ↑ Suppose that μ is a nonatomic probability measure on (R^k, ℛ^k) and that μ(A) > 0. Show that there is an uncountable compact set K such that K ⊂ A and μ(K) = 0.

12.9. The closed support of a measure μ on ℛ^k is a closed set C₀ such that C₀ ⊂ C for closed C if and only if C supports μ. Prove its existence and uniqueness. Characterize the points of C₀ as those x such that μ(U) > 0 for every neighborhood U of x. If k = 1 and if μ and the function F(x) are related by (12.5), the condition is F(x − ε) < F(x + ε) for all ε; x is in this case called a point of increase of F.

12.10. Let 𝒢 be the class of sets in ℛ² of the form [(x₁, x₂): x₁ ∈ H] for some H in ℛ¹. Show that 𝒢 is a σ-field and that two-dimensional Lebesgue measure is not σ-finite on 𝒢 even though it is σ-finite on ℛ².

12.11. Of minor interest is the k-dimensional analogue of (12.4). Let I_t be (0, t] for t > 0 and (t, 0] for t < 0, and let A_x = I_{x₁} × ··· × I_{x_k}.

12.12. ...

12.13. ...

12.14. ...

13.6. ... there exists a map φ: Ω′ → R¹ such that φ is measurable 𝓕′ and f = φT. Hint: First consider simple functions and then use Theorem 13.5.

13.7. ↑ Relate the result in Problem 13.6 to Theorem 5.1(ii).

13.8. Show of real functions f and g that f(ω) + g(ω) < x if and only if there exist rationals r and s such that r + s < x, f(ω) < r, and g(ω) < s. Prove directly that f + g is measurable 𝓕 if f and g are.

13.9. Let 𝓕 be a σ-field in R¹. Show that ℛ¹ ⊂ 𝓕 if and only if every continuous function is measurable 𝓕. Thus ℛ¹ is the smallest σ-field with respect to which all the continuous functions are measurable.
13.10. Consider on R¹ the smallest class (that is, the intersection of all classes) of real functions containing all the continuous functions and closed under pointwise passages to the limit. The elements of this class are called Baire functions. Show that Baire functions and Borel functions are the same thing.

13.11. A real function f on the line is upper semicontinuous at x if for each ε there is a δ such that |x − y| < δ implies that f(y) < f(x) + ε. Show that, if f is everywhere upper semicontinuous, then it is measurable.

13.12. If f is bounded and measurable 𝓕, then there exist simple functions f_n, each measurable 𝓕, such that f_n(ω) → f(ω) uniformly in ω.

13.13. Suppose that f_n and f are finite-valued, 𝓕-measurable functions such that f_n(ω) → f(ω) for ω ∈ A, where μ(A) < ∞ (μ a measure on 𝓕). Prove Egoroff's theorem: For each ε there exists a subset B of A such that μ(B) < ε and f_n(ω) → f(ω) uniformly on A − B. Hint: Let B_n^{(k)} be the set of ω in A such that |f(ω) − f_i(ω)| > k^{-1} for some i ≥ n. Show that B_n^{(k)} ↓ ∅ as n ↑ ∞, and choose n_k so that μ(B_{n_k}^{(k)}) < ε2^{-k}.

13.14-13.16. ...

13.17. 11.7 ↑ ... Hint: First approximate f by a simple function measurable 𝓕, and then use Problem 11.7.

13.18. 2.8 ↑ Show that, if f is measurable σ(𝒜), then there exists a countable subclass 𝒜₀ of 𝒜 such that f is measurable σ(𝒜₀).

13.19. Circular Lebesgue measure. Let C be the unit circle in the complex plane and define T: [0, 1) → C by Tω = e^{2πiω}. Let 𝓑 consist of the Borel subsets of [0, 1) and let λ be Lebesgue measure on 𝓑. Show that 𝒞 = [A: T^{-1}A ∈ 𝓑] consists of the sets in ℛ² (identify R² with the complex plane) that are contained in C. Show that 𝒞 is generated by the arcs of C. Circular Lebesgue measure is defined as μ = λT^{-1}. Show that μ is invariant under rotations: μ[θz: z ∈ A] = μ(A) for A ∈ 𝒞 and θ ∈ C.

13.20. ↑ Suppose that the circular Lebesgue measure of A satisfies μ(A) > 1 − n^{-1} and that B contains at most n points. Show that some rotation carries B into A: θB ⊂ A for some θ in C.
13.21. In connection with (13.7), show that μ(TT′)^{-1} = (μT^{-1})(T′)^{-1}.

13.22. Show by example that μ σ-finite does not imply μT^{-1} σ-finite.
13.23. Consider Lebesgue measure λ restricted to the class 𝓑 of Borel sets in (0, 1]. For a fixed permutation n₁, n₂, ... of the positive integers, if x has dyadic expansion .x₁x₂..., take Tx = .x_{n₁}x_{n₂}.... Show that T is measurable 𝓑/𝓑 and that λT^{-1} = λ.

SECTION 14. DISTRIBUTION FUNCTIONS
Distribution Functions
A random variable as defined in Section 13 is a measurable real function X on a probability measure space (Ω, 𝓕, P). The distribution or law of the random variable is the probability measure μ on (R¹, ℛ¹) defined by

(14.1)  μ(A) = P[X ∈ A],  A ∈ ℛ¹.
As in the case of the simple random variables in Chapter 1, the argument ω is usually omitted: P[X ∈ A] is short for P[ω: X(ω) ∈ A]. In the notation (13.7), the distribution is PX^{-1}. For simple random variables the distribution was defined in Section 5; see (5.9). There μ was defined for every subset of the line, however; from now on μ will be defined only for Borel sets, because unless X is simple, one cannot in general be sure that [X ∈ A] has a probability for A outside ℛ¹.

The distribution function of X is defined by

(14.2)  F(x) = μ(−∞, x] = P[X ≤ x]

for real x. By continuity from above (Theorem 10.2(ii)) for μ, F is right-continuous. Since F is nondecreasing, the left-hand limit F(x−) = lim_{y↑x} F(y) exists, and by continuity from below (Theorem 10.2(i)) for μ,

(14.3)  F(x−) = μ(−∞, x) = P[X < x].
Thus the jump or saltus in F at x is

F(x) − F(x−) = μ{x} = P[X = x].

Therefore (Theorem 10.2(iv)) F can have at most countably many points of discontinuity. Clearly,

(14.4)  lim_{x→−∞} F(x) = 0,  lim_{x→∞} F(x) = 1.
function of some random variable:
a nondecreasing, right-continuous function satisfying (14.4), then there exists on some probability space a random variable X for which F(x) = P[ X < x]. FIRST PROOF. By Theorem 12.4, if F is nondecreasing and right-continu ous, there is on ( R 1 , 91 1 ) a measure p. for which p.(a, b] = F(b) - F( a). But lim x _. _ 00 F( x ) = 0 implies that p.( - oo, x] = F(x) , and lim x _. 00F(x) = 1 implies that p. ( R 1 ) = 1. For the probability space take (0, � ' P) = ( R 1 , !1/ 1 , p. ) and for X take the identity function: X(w) = w. Then P[ X < • x ] = p. [w e R 1 : w < x] = F(x). Theorem 14.1.
If F is
There is a proof that uses only the existence of Lebesgue measure on the unit interval and does not require Theorem 12.4. For the probability space take the open unit interval: 0 is (0, 1), � consists of the Borel subsets of (0, 1), and P(A) is the Lebesgue measure of A. To understand the method, suppose at first that F is continuous and strictly increasing. Then F is a one- to-one mapping of R 1 onto (0, 1); let cp : (0, 1) --+ R 1 be the inverse mapping. For 0 < w < 1, let X( w) = cp ( w ) . Since cp is increasing, certainly X is measurable �- If 0 < u < 1, then cp ( u) < x if and only if u < F(x) . Since P is Lebesgue measure, P[ X < x] = P[w e (0, 1): cp ( w ) < x] = P[w e (0, 1): w < F(x)] = F(x), as required. SECOND PROOF.
F (x)
.P (u) .P (p)
If F has discontinuities or is not strictly increasing, detinet
(14.5)
cp ( u ) = inf [ X :
t This is called the quantile function.
u < F( X ) ]
SECTION
1 4.
DISTRIBUTION FUNCTION S
191
for 0 < u < 1 . Since F is nondecreasing, [ x : u < F(x)] is an interval stretching to oo ; since F is right-continuous, this interval is closed on the left. For 0 < u < 1 , therefore, [ x : u < F( x )] = [cp( u), oo), and so cp(u) < x if and only if u < F(x). If X(w) = cp(w) for 0 < w < 1, then by the same • reasoning as before, X is a random variable with P[ X < x] = F(x). This second argument actually provides a simple proof of Theorem 12.4 for a probability distributiont F: the distribution p. (as defined by (14.1)) of the random variable just constructed satisfies p.( - oo, x] = F(x) and hence
p.(a, b] = F(b) - F(a).
Exponential Distributions
There are a numbet: of results which for their interpretation require random variables, independence, and other probabilistic concepts, but which can be discussed technically in terms of distribution functions alone and do not require the apparatus of measure theory. Suppose as an example that F is the distribution function of the waiting time to the occurrence of some event-say the arrival of the next customer at a queue or the next call at a telephone exchange. As the waiting time must be positive, assume that F(O) = 0. Suppose that F(x) < 1 for all x and furthermore suppose that
(14.6 )
1 - F( X + y ) 1 - F( x )
=
1 - F( y ) '
X,
y > 0.
The right side of this equation is the probability that the waiting time exceeds y; by the definition (4.1) of conditional probability, the left side is the probability that the waiting time exceeds x + y given that it exceeds x. Thus (14.6) attributes to the waiting-time mechanism a kind of lack of memory or aftereffect: If after a lapse of x units of time the event has not yet occurred, the waiting time still remaining is conditionally distributed just as the entire waiting time from the beginning. For reasons that will emerge later (see Section 23), waiting times often have this property. The condition (14.6) completely determines the form of F. If U(x) = 1 - F(x), (14.6) is U( x + y) = U(x)U(y). This is a form of Cauchy's equation [A20], and since U is bounded, U(x) = e - ax for some a. Since lim 00U( x) = 0, a must be positive. Thus (14.6) implies that F has the x
.....
t For the general case, see Problem 14.2.
192
MEASURE
exponential form if X < if X �
( 14.7)
0, 0,
and conversely. Weak Convergence
Random variables X1 , , Xn are defined to be independent if the events ( X1 e A 1 ], ( Xn e An] are independent for all Borel sets A 1 , , An, so that P[ X; e A;, i = 1, . . . , n ] = n?_ 1 P[ X; e A;]. To find the distribution function of the maximum Mn = max{ X1 , , Xn }, take A 1 = · · · = A n = ( - oo, x]. This gives P[Mn � x] = n?_ 1 P[ X; < x]. If the X; are indepen dent and have common distribution function G and Mn has distribution function Fn , then • • •
. . •
• • • ,
• • .
( 14.8) It is possible without any appeal to measure theory to study the real function Fn solely by means of the relation (14.8), which can indeed be taken as defining Fn . It is possible in particular to study the asymptotic properties of Fn :
EX111p11 le
Consider a stream or sequence of events, say arrivals of calls at a telephone exchange. Suppose that the times between successive events, the interarrival times, are independent and that each has the exponential form (14.7) with a common value of a. By (14.8) the maximum Mn among the first n interarrival times has distribution function Fn(x) = (1 - e - ax ) n , x � 0. For each x, limnFn (x) = 0, which means that Mn1 tends to be large for n large. But P[Mn - a - 1 log n � x] = Fn(x + a - log n). This is the distribution function of Mn - a - 1 log n, and it satisfies
( 14 .9) as n --+
14.1.
oo ; the equality here holds if log n �
- x , and so the limit holds a
for all x. This gives for large n the approximate distribution of the • normalized random variable Mn - a- 1 log n.
and F are distribution functions, then by definition, weakly to F, written Fn => F, if If
Fn
( 14.10)
Fn converges
SECTION
14.
193
DISTRIBUTION FUNCTIONS
for each x at which F is continuous.t To study the approximate distribu tion of a random variable Yn it is often necessary to study instead the normalized or rescaled random variable ( Yn - bn )/a n for appropriate con stants a n and bn . If Yn has distribution function Fn and if a n > 0, then P[( Yn - bn )/a n < x] = P[Yn < a n x + bn ], and therefore (Yn - bn )/a n has distribution function Fn (a n x + bn ). For this reason weak convergence often appears in the form*
( 14.11 ) An example of this is
e
- e- ax
.
(14.9):
there
a n = 1, bn = a - 1 log n,
Consider again the distribution function maximum, but suppose that G has the form Example 14. 2.
G(x) = where a > fore
0.
{�
Here Fn(n 1 1°x) =
_
(1
and F(x) =
(14.8)
of the
if X < 1, xif X > 1, n - n - 1x - a) for x > n - 1 /a, and there "
0, 0. This is an example of (14.11) in which a n = n 1 /a and bn = 0. if X < if X >
Example 14.3.
•
Consider (14.8) once more, but for
0 if X < 0, G ( x ) = 1 - (1 - x ) a if 0 < X < 1 , 1 if X > 1 , 0. This time Fn (n - llax + 1) = (1 - n - 1 ( - x) a ) n
where a > � 0. Therefore,
• ( ) li� Fn ( n - 1 /"x + 1 ) = { �- - x
a case of (14.11) in which an = n - l/a and
bn = 1.
if X < if X >
if - n 1 /a < x
0, 0, •
t For the role of continuity, see Example 14.4. * To write F,. ( an x + bn ) ::$ F( x ) ignores the distinction between a function and its value at an unspecified value of its argument, but the meaning of course is that F,. ( an x + bn ) --. F( x ) at continuity points x of F.
194
MEASURE
& be the distribution function with a unit jump at the origin:
Let
ll ( x ) =
( 14.12 ) If
{�
if X < if X >
0, 0.
X( w) = 0, then X has distribution function &. X1 , X2 , . . . be independent random variables for P[ Xk = 1] = P[ Xk = - 1] = � and put Sn = X1 + · · · + Xn . By the
Example 14.4.
Let
which weak law of large numbers,
( 14.13 ) for £ > 0. Let Fn be the distribution function of n - tsn . If x > 0, then Fn (x) = 1 - P[n - 1Sn > x] --+ 1; if x < 0, then Fn (x) < P[ l n - 1Snl > l x l ] 0. As this accounts for all the continuity points of &, Fn => &. It is easy to tum the argument around and deduce (14.13) from Fn => &. Thus the weak law of large numbers is equivalent to the assertion that the distribu tion function of n - 1sn converges weakly to &. If n is odd, so that Sn = 0 is impossible, then by symmetry the events [Sn s 0] and [Sn > 0] each have probability � and hence Fn (O) = � - Thus Fn(O) does not converge to &(0) = 1, but because & is discontinuous at 0, • the definition of weak convergence does not require this. -+
Allowing (14.10) to fail at discontinuity points x of F thus makes it possible to bring the weak law of large numbers under the theory of weak convergence. But if (14.10) need hold only for certain values of x, there arises the question of whether weak limits are unique. Suppose that Fn => F and Fn => G. Then F(x) = lim n Fn (x) = G(x) if F and G are both continu ous at x. Since F and G each have only countably many points of discontinuity,t the set of common continuity points is dense, and it follows by right continuity that F and G are identical. A sequence can thus have at most one weak limit. Convergence of distribution functions is studied in detail in Chapter 5. The remainder of this section is devoted to some weak-convergence theo rems which are interesting both for themselves and for the reason that they require so little technical machinery. t The proof following
(14.3) uses measure theory, but this is not necessary: If the saltus
a( x ) - F( x) - F( x - ) exceeds t: at x1 < · · · < xn , then F( x; ) - F( x;_ 1 ) > t: (take x0 < x1 ), and so n t: < F( xn ) - F( x0 ) < 1 ; hence [ x : a( x ) > t: ) is finite and [ x : a( x ) > 0] is countable.
SECTION
14.
195
DISTRIBUTION FUNCTI ONS
Convergence of Types *
Distribution functions F and G are of the same type if there exist constants a and b, a > 0, such that F(ax + b) = G(x) for all x. A distribution function is degenerate if it has the form �(x - x0) (see (14.12)) for some x0 ; otherwise, it is nondegenerate.
Suppose that Fn (u n x + vn ) => F(x) and Fn(a n x + bn ) => G (x), where u n > 0, a n > 0, and F and G are nondegenerate. Then there exist a and b, a > 0, such that a nfu n --+ a, (bn - vn )/u n --+ b, and F(ax + b ) = G(x). 1beorem 14.2.
Thus there can be only one possible limit type and essentially only one possible sequence of norming constants. The proof of the theorem is for clarity set out in a sequence of lemmas. In all of them, a and the a n are assumed to be positive. Lemma 1.
+ b).
If Fn => F, a n --+ a, and bn --+ b, then Fn (a n x + bn ) => F(ax
If x is a continuity point of F( ax + b) and £ > 0, choose continu ity points u and v of F so that u < ax + b < v and F( v) - F(u) < £; this is possible because F has only countably many discontinuities. For large enough n, u < a n x + bn < v, IFn (u) - F(u) l < £, and IFn (v) - F(v) l < £; but then F(ax + b) � 2£ < F( u) - £ < Fn (u) < Fn (a n x + bn ) < Fn (v) < • F( v) + £ < F( ax + b) + 2£. PROOF.
then Fn (a n x) => �(x). PROOF. Given £ choose a continuity point u of F so large that F( u) > 1 If x > 0, then for all large enough n, a n x > u and IFn (u) - F( u) l < £, SO that Fn (a n x) > Fn (u) > F(u) - £ > 1 - 2£. Thus limn Fn (a n x) = 1 for • x > 0; similarly, limn Fn (a n x) = 0 for x < 0. Lemma 2.
If Fn => F and a n --+
oo ,
£.
If Fn => F and bn is unbounded, then Fn (x + bn ) cannot con verge weakly. PROOF . Suppose that bn is unbounded and that bn --+ oo along some subsequence (the case bn --+ - oo is similar). Suppose that Fn (x + bn ) => G(x). Given £ choose a continuity point u of F so that F( u) > 1 - £. Whatever x may be, for n far enough out in the subsequence, + bn > u Lemma 3.
x
* This topic may be omitted.
196
MEASURE
Fn( u ) > 1 - 2£, so that Fn( x + bn ) > 1 - 2£. Thus G( x ) = lim n F,,(x • + bn ) = 1 for all continuity points x of G. which is impossible.
and
If Fn (x ) =$ F(x) and Fn (a n x + bn ) =$ G ( x), where F and G are nondegenerate, then Lemma 4.
( 14.14 )
0 < inf a n s sup a n < n
n
00 ,
n
Suppose that a n is not bounded above. Arrange by passing to a subsequence that a n oo . Then by Lemma 2, Fn (a n x) =$ �(x), and by Lemma 3 and the hypothesis, bn is bounded along this subsequence. Arrange by passing to a further subsequence that bn converges to some b. Then by Lemma 1, Fn (a n x + bn ) =$ �(x + b), and so by hypothesis G(x) = �(x + b), contrary to the assumption that G is nondegenerate. Thus a n is bounded above. If Gn (x) = Fn (a n x + bn ), then Gn (x) => G(x) and Gn (a n 1x - a; 1bn ) = Fn (x) =$ F(x). The result just proved shows that a; 1 is bounded. Thus a n is bounded away from 0 and oo . If bn is not bounded, pass to a subsequence along which bn --+ ± oo and an converges to some a. Since Fn (a n x) =$ F(ax) along the subsequence by Lemma 1, bn --+ ± oo is by Lemma 3 incompatible with the hypothesis that Fn (a n x + bn ) =$ G(x). • PROOF.
--+
If F(x) = F(ax + b) for all x and F is nondegenerate, then a = 1 and b = 0. n n PROOF. Since F(x) = F(a x + (a - 1 + · · · + a + 1)b), it follows by n Lemma 4 that a is bounded away from 0 and oo , so that a = 1, and it then follows that nb is bounded, so that b = 0. • PROOF OF THEOREM 14 . 2 . Suppose first that U n = 1 and Vn = 0. Then (14.14) holds. Fix any subsequence along which a n converges to some positive a and bn converges to some b. By Lemma 1, Fn (a n x + bn ) => F(ax + b) along this subsequence, and the hypothesis gives F(ax + b) = G(x). Suppose that along some other sequence, a n --+ u > 0 and bn --+ v. Then F( ux + v) = G(x) and F(ax + b) = G(x) both hold, so that u = a and v = b by Lemma 5. Every convergent subsequence of {(an, bn )} thus converges to (a, b), and so the entire sequence does. For the general case, let Hn (x) = Fn (u n x + vn ). Then Hn (x) =$ F(x) and Hn (a n u; 1x + (bn - vn )u n 1 ) =$ G(x), and so by the case already treated, a n u; 1 converges to some positive a and (bn - vn)u; 1 to some b, • and as before, F(ax + b) = G(x). Lemma 5.
SECTION 14. DISTRIBUTION FUNCTIONS
Extremal Distributions*
A distribution function F is extremal if it is nondegenerate and if, for some distribution function G and constants a_n (a_n > 0) and b_n,

(14.15)    G^n(a_n x + b_n) ⇒ F(x).

These are the possible limiting distributions of normalized maxima (see (14.8)), and Examples 14.1, 14.2, and 14.3 give three specimens. The following analysis shows that these three examples exhaust the possible types.*

Assume that F is extremal. From (14.15) follow G^{nk}(a_n x + b_n) ⇒ F^k(x) and G^{nk}(a_{nk} x + b_{nk}) ⇒ F(x), and so by Theorem 14.2 there exist constants c_k and d_k such that c_k is positive and

(14.16)    F^k(x) = F(c_k x + d_k).

From F(c_{jk} x + d_{jk}) = F^{jk}(x) = F^j(c_k x + d_k) = F(c_j(c_k x + d_k) + d_j) follow (Lemma 5) the relations

(14.17)    c_{jk} = c_j c_k,    d_{jk} = c_j d_k + d_j = c_k d_j + d_k.

Of course, c_1 = 1 and d_1 = 0. There are three cases to be considered separately.

CASE 1. Suppose that c_k = 1 for all k. Then

(14.18)    F^k(x) = F(x + d_k).

This implies that F^{j/k}(x) = F(x + d_j − d_k). For positive rational r = j/k, put δ_r = d_j − d_k; (14.17) implies that the definition is consistent, and F^r(x) = F(x + δ_r). Since F is nondegenerate, there is an x such that 0 < F(x) < 1, and it follows by (14.18) that d_k is decreasing in k, so that δ_r is strictly decreasing in r. For positive real t let φ(t) = inf_{0<r<t} δ_r (r rational in the infimum). Then φ(t) is decreasing in t and

(14.19)    F^t(x) = F(x + φ(t))

for all x and all positive t. Further, (14.17) implies that φ(st) = φ(s) + φ(t), so that by the theorem on Cauchy's equation [A20] applied to φ(e^x), φ(t) = −β log t, where β > 0 because φ(t) is strictly decreasing. Now (14.19) with t = e^{−x/β} gives F(x) = exp{e^{−x/β} log F(0)}, and so F must be of the same type as

(14.20)    F_1(x) = exp(−e^{−x}).

Example 14.1 shows that this distribution function can arise as a limit of distributions of maxima; that is, F_1 is indeed extremal.

CASE 2. Suppose that c_{k_0} ≠ 1 for some k_0, which necessarily exceeds 1. Then there exists an x′ such that c_{k_0}x′ + d_{k_0} = x′; but (14.16) then gives F^{k_0}(x′) = F(x′), so that F(x′) is 0 or 1. (In Case 1, F has the type (14.20) and so never assumes the values 0 and 1.)

Now suppose further that, in fact, F(x′) = 0. Let x_0 be the supremum of those x for which F(x) = 0. By passing to a new F of the same type one can arrange that x_0 = 0; then F(x) = 0 for x ≤ 0 and F(x) > 0 for x > 0. The new F will satisfy (14.16), but with new constants d_k. If a (new) d_k is distinct from 0, then there is an x near 0 for which the arguments on the two sides of (14.16) have opposite signs. Therefore, d_k = 0 for all k, and

(14.21)    F^k(x) = F(c_k x)

for all k and x. This implies that F^{j/k}(x) = F(x c_j/c_k). For positive rational r = j/k, put γ_r = c_j/c_k. The definition is by (14.17) again consistent, and F^r(x) = F(γ_r x). Since 0 < F(x) < 1 for some x, necessarily positive, it follows by (14.21) that c_k is decreasing in k, so that γ_r is strictly decreasing in r. Put ψ(t) = inf_{0<r<t} γ_r for positive real t. From (14.17) follows ψ(st) = ψ(s)ψ(t), and by the corollary to the theorem on Cauchy's equation [A20] applied to ψ(e^x), ψ(t) = t^{−ξ} for some ξ > 0. Since F^t(x) = F(ψ(t)x) for all x and for t positive, F(x) = exp{x^{−1/ξ} log F(1)} for x > 0. Thus (take α = 1/ξ) F is of the same type as

(14.22)    F_{2,α}(x) = 0 if x ≤ 0,    F_{2,α}(x) = exp(−x^{−α}) if x > 0.

*This topic may be omitted.
Example 14.2 shows that this case can arise.

CASE 3. Suppose as in Case 2 that c_{k_0} ≠ 1 for some k_0, so that F(x′) is 0 or 1 for some x′, but this time suppose that F(x′) = 1. Let x_1 be the infimum of those x for which F(x) = 1. By passing to a new F of the same type, arrange that x_1 = 0; then F(x) < 1 for x < 0 and F(x) = 1 for x ≥ 0. If d_k ≠ 0, then for some x near 0, one side of (14.16) is 1 and the other is not. Thus d_k = 0 for all k, and (14.21) again holds. And again γ_{j/k} = c_j/c_k consistently defines a function satisfying F^r(x) = F(γ_r x). Since F is nondegenerate, 0 < F(x) < 1 for some x, but this time x is necessarily negative, so that c_k is increasing. The same analysis as before shows that there is a positive ξ such that F^t(x) = F(t^ξ x) for all x and for t positive. Thus F(x) = exp{(−x)^{1/ξ} log F(−1)} for x < 0, and (take α = 1/ξ) F is of the type

(14.23)    F_{3,α}(x) = exp(−(−x)^α) if x ≤ 0,    F_{3,α}(x) = 1 if x > 0.

Example 14.3 shows that this distribution function is indeed extremal.
This completely characterizes the class of extremal distributions:
Theorem 14.3. The class of extremal distribution functions consists exactly of the distribution functions of the types (14.20), (14.22), and (14.23).
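Not part of the original text, but the convergence behind Example 14.1 and the type (14.20) is easy to check numerically. The sketch below assumes the standard exponential normalization (a_n = 1, b_n = log n, i.i.d. exponential(1) variables) and compares the exact distribution function of the normalized maximum with F_1(x) = exp(−e^{−x}).

```python
import math

# Illustration only: for i.i.d. exponential(1) variables,
# P[max_{i<=n} X_i - log n <= x] = (1 - e^{-(x + log n)})^n -> exp(-e^{-x}),
# which is F_1, the type (14.20).

def max_minus_log_n_cdf(x, n):
    """Exact distribution function of M_n - log n for n exponential(1) samples."""
    arg = x + math.log(n)
    return (1.0 - math.exp(-arg)) ** n if arg > 0 else 0.0

def gumbel_cdf(x):
    """The extremal type F_1(x) = exp(-e^{-x}) of (14.20)."""
    return math.exp(-math.exp(-x))

for x in (-1.0, 0.0, 1.0, 2.0):
    exact = max_minus_log_n_cdf(x, 10_000)
    limit = gumbel_cdf(x)
    assert abs(exact - limit) < 1e-3
    print(f"x={x:+.1f}  exact={exact:.5f}  F1(x)={limit:.5f}")
```

Already at n = 10,000 the two agree to three decimal places at every test point.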
It is possible to go on and characterize the domains of attraction. That is, it is possible for each extremal distribution function F to describe the class of G satisfying (14.15) for some constants a_n and b_n, the class of G attracted to F.†

†This theory is associated with the names of Fisher, Fréchet, Gnedenko, and Tippett. For further information, see GALAMBOS.

PROBLEMS

14.1. The general nondecreasing function F has at most countably many discontinuities. Prove this by considering the open intervals (sup_{u<x} F(u), inf_{v>x} F(v)); each nonempty one contains a rational.

14.2. For distribution functions F, the second proof of Theorem 14.1 shows how to construct a measure μ on (R¹, ℛ¹) such that μ(a, b] = F(b) − F(a). (a) Extend to the case of bounded F. (b) Extend to the general case. Hint: Let F_n(x) be −n or F(x) or n as F(x) < −n or −n ≤ F(x) ≤ n or n ≤ F(x). Construct the corresponding μ_n and define μ(A) = lim_n μ_n(A).

14.3. Suppose that X assumes positive integers as values. Show that P[X > n + m | X > n] = P[X > m] if and only if X has a Pascal distribution: P[X = n] = q^{n−1}p, n = 1, 2, .... This is the discrete analogue of the exponential distribution.
14.4. (a) Suppose that X has a continuous, strictly increasing distribution function F. Show that the random variable F(X) is uniformly distributed over the unit interval in the sense that P[F(X) ≤ u] = u for 0 ≤ u ≤ 1. Passing from X to F(X) is called the probability transformation. (b) Show that the function φ(u) defined by (14.5) satisfies F(φ(u)−) ≤ u ≤ F(φ(u)) and that, if F is continuous (but not necessarily strictly increasing), then F(φ(u)) = u for 0 < u < 1. (c) Show that P[F(X) < u] = F(φ(u)−) and hence that the result in part (a) holds as long as F is continuous.
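The probability transformation of Problem 14.4(a) can be checked by simulation; the following sketch (an illustration, standard library only) applies the exponential distribution function F(x) = 1 − e^{−x} to exponential samples and checks that the result is uniform on the unit interval.

```python
import math
import random

random.seed(7)

def F(x):
    """Exponential(1) distribution function: continuous and strictly increasing."""
    return 1.0 - math.exp(-x)

# Draw exponential variables and apply the probability transformation.
u = [F(random.expovariate(1.0)) for _ in range(100_000)]

# Per Problem 14.4(a), P[F(X) <= t] should be close to t for each t in (0, 1).
for t in (0.1, 0.25, 0.5, 0.9):
    frac = sum(1 for v in u if v <= t) / len(u)
    assert abs(frac - t) < 0.01
    print(f"t={t:4.2f}  empirical P[F(X)<=t]={frac:.4f}")
```

With 100,000 samples the empirical fractions match t to within about 0.005, as the central limit theorem predicts.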
14.5. ↑ Let C be the set of continuity points of F. (a) Show that for every Borel set A, P[F(X) ∈ A, X ∈ C] is at most the Lebesgue measure of A. (b) Show that if F is continuous at each point of F⁻¹A, then P[F(X) ∈ A] is at most the Lebesgue measure of A.

14.6. A distribution function F can be described by its derivative or density f = F′, provided that f exists and can be integrated back to F. One thinks in terms of the approximation

(14.24)    P[x < X ≤ x + dx] = f(x) dx.

(a) Show that, if X has distribution function with density f and g is increasing and differentiable, then the distribution function of g(X) has density f(g⁻¹(x))/g′(g⁻¹(x)). (b) Check part (a) of Problem 14.4 under the assumption that F is differentiable.

14.7. (a) Suppose that X_1, ..., X_n are independent and that each has distribution function F. Let X_(k) be the kth order statistic, the kth smallest among the X_i. Show that X_(k) has distribution function

G_k(x) = Σ_{u=k}^{n} C(n, u) (F(x))^u (1 − F(x))^{n−u}.

(b) Assume that F has density f and show that G_k has density

[n!/((k−1)!(n−k)!)] (F(x))^{k−1} f(x) (1 − F(x))^{n−k}.

Interpret this in terms of (14.24). Hint: Do not differentiate directly. Let A be the event that k − 1 of the X_i lie in (−∞, x], one lies in (x, x + h], and n − k lie in (x + h, ∞). Show that P[x < X_(k) ≤ x + h] − P(A) is nonnegative and does not exceed the probability that two of the X_i lie in (x, x + h]. Estimate the latter probability, calculate P(A) exactly, and let h go to 0.

14.8. Show for every distribution function F that there exist distribution functions F_n such that F_n ⇒ F and (a) F_n is continuous; (b) F_n is constant over each interval ((k − 1)n⁻¹, kn⁻¹], k = 0, ±1, ±2, ....
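As an illustrative check of the order-statistic formula in Problem 14.7(a) (not part of the text), one can compare G_k with a direct simulation for uniform samples, where F(x) = x on [0, 1].

```python
import math
import random

random.seed(1)

def G(k, n, Fx):
    """G_k evaluated where F takes the value Fx, per Problem 14.7(a)."""
    return sum(math.comb(n, u) * Fx**u * (1 - Fx)**(n - u) for u in range(k, n + 1))

# Simulate the kth smallest of n uniform samples; here F(x) = x.
n, k, x = 5, 2, 0.3
trials = 200_000
hits = sum(
    1 for _ in range(trials)
    if sorted(random.random() for _ in range(n))[k - 1] <= x
)

exact = G(k, n, x)
assert abs(hits / trials - exact) < 0.01
print(f"simulated {hits / trials:.4f}  vs  G_k(x) = {exact:.4f}")
```

For n = 5, k = 2, x = 0.3 the exact value is 1 − 0.7⁵ − 5(0.3)(0.7⁴) ≈ 0.4718, and the simulation lands within about 0.003 of it.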
14.9. The Lévy distance d(F, G) between two distribution functions is the infimum of those ε such that G(x − ε) − ε ≤ F(x) ≤ G(x + ε) + ε for all x. Verify that this is a metric on the set of distribution functions. Show that a necessary and sufficient condition for F_n ⇒ F is that d(F_n, F) → 0.

14.10. 12.3 ↑ A Borel function satisfying Cauchy's equation [A20] is automatically bounded in some interval and hence satisfies f(x) = xf(1). Hint: Take K large enough that λ[x: x > 0, |f(x)| ≤ K] > 0. Apply Problem 12.3 and conclude that f is bounded in some interval to the right of 0.

14.11. ↑ Consider sets S of reals that are linearly independent over the field of rationals in the sense that n_1x_1 + ··· + n_kx_k = 0 for distinct points x_i in S and integers n_i (positive or negative) is impossible unless n_1 = ··· = n_k = 0. (a) By Zorn's lemma find a maximal such S. Show that it is a Hamel basis. That is, show that each real x can be written uniquely as x = n_1x_1 + ··· + n_kx_k for distinct points x_i in S and integers n_i. (b) Define f arbitrarily on S, and define it elsewhere by f(n_1x_1 + ··· + n_kx_k) = n_1f(x_1) + ··· + n_kf(x_k). Show that f satisfies Cauchy's equation but need not satisfy f(x) = xf(1). (c) By means of Problem 14.10 give a new construction of a nonmeasurable function and a nonmeasurable set.
14.12. 14.9 ↑ (a) Show that if a distribution function F is everywhere continuous, then it is uniformly continuous. (b) Let δ_F(ε) = sup[F(x) − F(y): |x − y| ≤ ε] be the modulus of continuity of F. Show that d(F, G) ≤ ε implies that sup_x |F(x) − G(x)| ≤ ε + δ_F(ε). (c) Show that, if F_n ⇒ F and F is everywhere continuous, then F_n(x) → F(x) uniformly in x. What if F is continuous over a closed interval?

14.13. Show that (14.22) and (14.23) are everywhere infinitely differentiable, although not analytic.
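Not in the original, but the Lévy distance of Problems 14.9 and 14.12 can be approximated numerically. The sketch below bisects for the smallest ε satisfying the defining inequalities on a finite grid; this is a grid approximation for illustration, not an exact computation.

```python
import math

def levy_distance(F, G, lo=-10.0, hi=10.0, steps=2000):
    """Approximate the smallest eps with G(x-eps)-eps <= F(x) <= G(x+eps)+eps
    for all x, checking the condition only on a grid over [lo, hi]."""
    xs = [lo + (hi - lo) * i / steps for i in range(steps + 1)]

    def ok(eps):
        return all(G(x - eps) - eps <= F(x) <= G(x + eps) + eps for x in xs)

    a, b = 0.0, 1.0          # the distance is at most 1
    for _ in range(40):      # bisection on eps
        m = (a + b) / 2
        if ok(m):
            b = m
        else:
            a = m
    return b

Phi = lambda x: 0.5 * (1 + math.erf(x / math.sqrt(2)))   # standard normal df
Phi_shift = lambda x: Phi(x - 0.5)                        # same df shifted by 0.5

print(levy_distance(Phi, Phi))          # ~ 0 (a metric: d(F, F) = 0)
print(levy_distance(Phi, Phi_shift))    # strictly between 0 and the shift 0.5
```

Shifting a distribution by c moves it a Lévy distance strictly less than c, since the metric also allows a vertical allowance of ε.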
CHAPTER 3

Integration

SECTION 15. THE INTEGRAL
Expected values of simple random variables and Riemann integrals of continuous functions can be brought together with other related concepts under a general theory of integration, and this theory is the subject of the present chapter.

Definition
Throughout this section, f, g, and so on will denote real measurable functions, the values ±∞ allowed, on a measure space (Ω, ℱ, μ). The object is to define and study the definite integral

∫ f dμ = ∫_Ω f(ω) μ(dω).

Suppose first that f is nonnegative. For each finite decomposition {A_i} of Ω into ℱ-sets, consider the sum

(15.1)    Σ_i [ inf_{ω ∈ A_i} f(ω) ] μ(A_i).

In computing the products here, the conventions about infinity are

(15.2)    0 · ∞ = ∞ · 0 = 0,    x · ∞ = ∞ · x = ∞ if 0 < x < ∞,    ∞ · ∞ = ∞.

The reasons for these conventions will become clear later. Also in force are the conventions of Section 10 for sums and limits involving infinity; see (10.3) and (10.4). If A_i is empty, the infimum in (15.1) is by the standard convention ∞; but then μ(A_i) = 0, so that by the convention (15.2), this term makes no contribution to the sum (15.1). The integral of f is defined as the supremum of the sums (15.1):

(15.3)    ∫ f dμ = sup Σ_i [ inf_{ω ∈ A_i} f(ω) ] μ(A_i).

The supremum here extends over all finite decompositions {A_i} of Ω into ℱ-sets.

For general f, consider its positive part,

(15.4)    f⁺(ω) = f(ω) if 0 ≤ f(ω) ≤ ∞,    f⁺(ω) = 0 if −∞ ≤ f(ω) < 0,

and its negative part,

(15.5)    f⁻(ω) = −f(ω) if −∞ ≤ f(ω) < 0,    f⁻(ω) = 0 if 0 ≤ f(ω) ≤ ∞.

These functions are nonnegative and measurable, and f = f⁺ − f⁻. The general integral is defined by

(15.6)    ∫ f dμ = ∫ f⁺ dμ − ∫ f⁻ dμ

unless ∫ f⁺ dμ = ∫ f⁻ dμ = ∞, in which case f has no integral. If ∫ f⁺ dμ and ∫ f⁻ dμ are both finite, f is integrable, or integrable μ, or summable, and has (15.6) as its definite integral. If ∫ f⁺ dμ = ∞ and ∫ f⁻ dμ < ∞, f is not integrable but is in accordance with (15.6) assigned +∞ as its definite integral. Similarly, if ∫ f⁺ dμ < ∞ and ∫ f⁻ dμ = ∞, f is not integrable but has definite integral −∞. Note that f can have a definite integral without being integrable; it fails to have a definite integral if and only if its positive and negative parts both have infinite integrals.

The really important case of (15.6) is that in which ∫ f⁺ dμ and ∫ f⁻ dμ are both finite. Allowing infinite integrals is a convention that simplifies the statements of various theorems, especially theorems involving nonnegative functions. Note that (15.6) is defined unless it involves "∞ − ∞"; if one term on the right is ∞ and the other is a finite real x, the difference is defined by the conventions ∞ − x = ∞ and x − ∞ = −∞.
The extension of the integral from the nonnegative case to the general case is consistent: (15.6) agrees with (15.3) if f is nonnegative, because then f⁻ = 0.

Nonnegative Functions

It is convenient first to analyze nonnegative functions.

Theorem 15.1. (i) If f = Σ_i x_i I_{A_i} is a nonnegative simple function, {A_i} being a finite decomposition of Ω into ℱ-sets, then ∫ f dμ = Σ_i x_i μ(A_i).
(ii) If 0 ≤ f(ω) ≤ g(ω) for all ω, then ∫ f dμ ≤ ∫ g dμ.
(iii) If 0 ≤ f_n(ω) ↑ f(ω) for all ω, then 0 ≤ ∫ f_n dμ ↑ ∫ f dμ.
(iv) For nonnegative functions f and g and nonnegative constants α and β, ∫ (αf + βg) dμ = α ∫ f dμ + β ∫ g dμ.

In part (iii) the essential point is that ∫ f dμ = lim_n ∫ f_n dμ, and it is important to understand that both sides of this equation may be ∞. If f_n = I_{A_n} and f = I_A, where A_n ↑ A, the conclusion is that μ is continuous from below (Theorem 10.2(i)): lim_n μ(A_n) = μ(A); this equation often takes the form ∞ = ∞.

PROOF OF (i). Let {B_j} be a finite decomposition of Ω, and let β_j be the infimum of f over B_j. If A_i ∩ B_j ≠ ∅, then β_j ≤ x_i; therefore, Σ_j β_j μ(B_j) = Σ_{ij} β_j μ(A_i ∩ B_j) ≤ Σ_{ij} x_i μ(A_i ∩ B_j) = Σ_i x_i μ(A_i). On the other hand, there is equality here if {B_j} coincides with {A_i}. •

PROOF OF (ii). The sums (15.1) obviously do not decrease if f is replaced by g. •

PROOF OF (iii). By (ii) the sequence ∫ f_n dμ is nondecreasing and bounded above by ∫ f dμ. It therefore suffices to show that ∫ f dμ ≤ lim_n ∫ f_n dμ, or that

(15.7)    lim_n ∫ f_n dμ ≥ S = Σ_{i=1}^m ν_i μ(A_i)

if A_1, ..., A_m is any decomposition of Ω into ℱ-sets and ν_i = inf_{ω ∈ A_i} f(ω).

In order to see the essential idea of the proof, which is quite simple, suppose first that S is finite and all the ν_i and μ(A_i) are positive and finite. Fix an ε that is positive and less than each ν_i, and put A_{in} = [ω ∈ A_i: f_n(ω) > ν_i − ε]. Since f_n ↑ f, A_{in} ↑ A_i. Decompose Ω into A_{1n}, ..., A_{mn} and the complement of their union, and observe that

(15.8)    ∫ f_n dμ ≥ Σ_{i=1}^m (ν_i − ε) μ(A_{in}) → Σ_{i=1}^m (ν_i − ε) μ(A_i) = S − ε Σ_{i=1}^m μ(A_i).

Since the μ(A_i) are all finite, letting ε → 0 gives (15.7).

Now suppose only that S is finite. Each product ν_i μ(A_i) is then finite; suppose it is positive for i ≤ m_0 and 0 for i > m_0. (Here m_0 ≤ m; if the product is 0 for all i, then S = 0 and (15.7) is trivial.) Now ν_i and μ(A_i) are positive and finite for i ≤ m_0 (one or the other may be ∞ for i > m_0). Define A_{in} as before, but only for i ≤ m_0. This time decompose Ω into A_{1n}, ..., A_{m_0 n} and the complement of their union. Replace m by m_0 in (15.8) and complete the proof as before.

Finally, suppose that S = ∞. Then ν_{i_0} μ(A_{i_0}) = ∞ for some i_0, so that ν_{i_0} and μ(A_{i_0}) are both positive and at least one is ∞. Suppose 0 < x < ν_{i_0} ≤ ∞ and 0 < y < μ(A_{i_0}) ≤ ∞, and put A_{i_0 n} = [ω ∈ A_{i_0}: f_n(ω) > x]. Since f_n ↑ f, A_{i_0 n} ↑ A_{i_0}; hence μ(A_{i_0 n}) > y for n exceeding some n_0. But then (decompose Ω into A_{i_0 n} and its complement) ∫ f_n dμ ≥ x μ(A_{i_0 n}) > xy for n > n_0, and therefore lim_n ∫ f_n dμ ≥ xy. If ν_{i_0} = ∞, let x → ∞, and if μ(A_{i_0}) = ∞, let y → ∞. In either case (15.7) follows: lim_n ∫ f_n dμ = ∞. •

PROOF OF (iv). Suppose at first that f = Σ_i x_i I_{A_i} and g = Σ_j y_j I_{B_j} are simple. Then αf + βg = Σ_{ij} (αx_i + βy_j) I_{A_i ∩ B_j}, and so

∫ (αf + βg) dμ = Σ_{ij} (αx_i + βy_j) μ(A_i ∩ B_j) = α Σ_i x_i μ(A_i) + β Σ_j y_j μ(B_j) = α ∫ f dμ + β ∫ g dμ.

Note that the argument is valid if some of α, β, x_i, y_j are infinite. Apart from this possibility, the ideas are as in the proof of (5.18).

For general nonnegative f and g, there exist by Theorem 13.5 simple functions f_n and g_n such that 0 ≤ f_n ↑ f and 0 ≤ g_n ↑ g. But then 0 ≤ αf_n + βg_n ↑ αf + βg and ∫ (αf_n + βg_n) dμ = α ∫ f_n dμ + β ∫ g_n dμ, so that (iv) follows via (iii). •

By part (i) of Theorem 15.1, the expected values of simple random variables in Chapter 1 are integrals: E[X] = ∫ X(ω) P(dω). This also covers the step functions in Section 1 (see (1.5)). The relation between the Riemann integral and the integral as defined here will be studied in Section 17.
Example 15.1. Consider the line (R¹, ℛ¹, λ) with Lebesgue measure. Suppose that −∞ < a_0 ≤ a_1 ≤ ··· ≤ a_m < ∞, and let f be the function with nonnegative value x_i on (a_{i−1}, a_i], i = 1, ..., m, and value 0 on (−∞, a_0] and (a_m, ∞). By part (i) of Theorem 15.1, ∫ f dλ = Σ_{i=1}^m x_i (a_i − a_{i−1}) because of the convention 0 · ∞ = 0; see (15.2). If the "area under the curve" to the left of a_0 and to the right of a_m is to be 0, this convention is inevitable. It must then work the other way as well: ∞ · 0 = 0, so that ∫ f dλ = 0 if f is ∞ at a single point (say) and 0 elsewhere. If f = I_{(a, ∞)}, the area-under-the-curve point of view makes ∫ f dλ = ∞ natural. Hence the second convention in (15.2), which also requires that the integral be infinite if f is ∞ on a nonempty interval and 0 elsewhere. •

Recall that almost everywhere means outside a set of measure 0.
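The step-function computation of Example 15.1 can be mirrored in code (an illustration, not part of the text): for a step function, the partition by its own intervals attains the supremum in (15.3), by Theorem 15.1(i).

```python
# Integral of a step function against Lebesgue measure, per Example 15.1:
# f takes value values[i] on (breaks[i], breaks[i+1]] and 0 outside.
def step_integral(breaks, values):
    """Sum x_i * (a_i - a_{i-1}); by Theorem 15.1(i) this partition
    attains the supremum defining the integral in (15.3)."""
    assert len(values) == len(breaks) - 1
    return sum(x * (b - a) for x, a, b in zip(values, breaks, breaks[1:]))

# f = 2 on (0, 1], 5 on (1, 3], and 0 elsewhere:
assert step_integral([0, 1, 3], [2, 5]) == 2 * 1 + 5 * 2
print(step_integral([0, 1, 3], [2, 5]))
```

The regions to the left of a_0 and to the right of a_m contribute nothing, exactly the role the convention 0 · ∞ = 0 plays in the text.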
Theorem 15.2. Suppose that f and g are nonnegative.
(i) If f = 0 almost everywhere, then ∫ f dμ = 0.
(ii) If μ[ω: f(ω) > 0] > 0, then ∫ f dμ > 0.
(iii) If ∫ f dμ < ∞, then f < ∞ almost everywhere.
(iv) If f ≤ g almost everywhere, then ∫ f dμ ≤ ∫ g dμ.
(v) If f = g almost everywhere, then ∫ f dμ = ∫ g dμ.

PROOF. Suppose that f = 0 almost everywhere. If A_i meets [ω: f(ω) = 0], then the infimum in (15.1) is 0; otherwise, μ(A_i) = 0. Hence each sum (15.1) is 0, and (i) follows.

If A_ε = [ω: f(ω) ≥ ε], then A_ε ↑ [ω: f(ω) > 0] as ε ↓ 0, so that under the hypothesis of (ii) there is a positive ε for which μ(A_ε) > 0. Decomposing Ω into A_ε and its complement shows that ∫ f dμ ≥ ε μ(A_ε) > 0.

If μ[f = ∞] > 0, decompose Ω into [f = ∞] and its complement: ∫ f dμ ≥ ∞ · μ[f = ∞] = ∞ by the conventions. Hence (iii).

To prove (iv), let G = [f ≤ g]. For any finite decomposition {A_1, ..., A_m} of Ω,

Σ_i [ inf_{A_i} f ] μ(A_i) = Σ_i [ inf_{A_i} f ] μ(A_i ∩ G) ≤ Σ_i [ inf_{A_i ∩ G} g ] μ(A_i ∩ G) ≤ ∫ g dμ,

where the last inequality comes from a consideration of the decomposition A_1 ∩ G, ..., A_m ∩ G, Gᶜ. This proves (iv), and (v) follows immediately. •

Suppose that f = g almost everywhere, where f and g may not be nonnegative. If f has a definite integral, then since f⁺ = g⁺ and f⁻ = g⁻ almost everywhere, it follows by Theorem 15.2(v) that g also has a definite integral and ∫ f dμ = ∫ g dμ.

Uniqueness

Although there are various ways to frame the definition of the integral, they are all equivalent: they all assign the same value to ∫ f dμ. This is because the integral is uniquely determined by certain simple properties it is natural to require of it.

It is natural to want the integral to have properties (i) and (iii) of Theorem 15.1. But these uniquely determine the integral for nonnegative functions: For f nonnegative, there exist by Theorem 13.5 simple functions f_n such that 0 ≤ f_n ↑ f; by (iii), ∫ f dμ must be lim_n ∫ f_n dμ, and (i) determines the value of each ∫ f_n dμ. Property (i) can itself be derived from (iv) (linearity) together with the assumption that ∫ I_A dμ = μ(A) for indicators I_A: ∫ (Σ_i x_i I_{A_i}) dμ = Σ_i x_i ∫ I_{A_i} dμ = Σ_i x_i μ(A_i). If (iv) of Theorem 15.1 is to persist when the integral is extended beyond the class of nonnegative functions, ∫ f dμ must be ∫ f⁺ dμ − ∫ f⁻ dμ, which makes the definition (15.6) inevitable.

PROBLEMS

These problems outline alternative definitions of the integral. In all of them f is assumed measurable and nonnegative. Call (15.3) the lower integral and write it as

(15.9)    ∫_* f dμ = sup Σ_i [ inf_{ω ∈ A_i} f(ω) ] μ(A_i),

to distinguish it from the upper integral

(15.10)    ∫* f dμ = inf Σ_i [ sup_{ω ∈ A_i} f(ω) ] μ(A_i).

The infimum in (15.10), like the supremum in (15.9), extends over all finite partitions {A_i} of Ω into ℱ-sets.

15.1. Show that ∫* f dμ = ∞ if μ[ω: f(ω) > 0] = ∞ or if μ[ω: f(ω) > x] > 0 for all (finite) x.

As there are many f's of this last kind that ought to be integrable, (15.10) is inappropriate as a definition of ∫ f dμ. The only trouble with (15.10), however, is that it treats infinity incorrectly. To see this, and to focus on essentials, assume in the following problems that μ(Ω) < ∞ and that f is bounded and nonnegative.

15.2. ↑ (a) Show that

Σ_i [ inf_{ω ∈ A_i} f(ω) ] μ(A_i) ≤ Σ_j [ inf_{ω ∈ B_j} f(ω) ] μ(B_j)

if {B_j} refines {A_i}. Prove a dual relation for the sums in (15.10), and conclude that ∫_* f dμ ≤ ∫* f dμ.
(b) Suppose that M bounds f, and consider the partition A_i = [ω: iε ≤ f(ω) < (i + 1)ε], i = 0, ..., N, where N is large enough that Nε > M. Show that the corresponding upper and lower sums differ by at most ε μ(Ω). Conclude that

(15.11)    ∫_* f dμ = ∫* f dμ.

To define the integral as the common value in (15.11) is the Darboux-Young approach. The advantage of (15.3) as a definition is that it applies at once to unbounded f and infinite μ.

15.3. 3.2 15.2 ↑ For A ⊂ Ω, define μ*(A) and μ_*(A) by (3.9) and (3.10) with μ in place of P. Suspend for the moment the assumption that f is measurable, and show that ∫* I_A dμ = μ*(A) and ∫_* I_A dμ = μ_*(A) for every A. Hence (15.11) can fail if f is not measurable. (Where was measurability used in the proof of (15.11)?) Even though the definition (15.3) makes formal sense for all nonnegative functions, it is thus idle to apply it to nonmeasurable ones. Assume again that f is measurable (as well as nonnegative and bounded, and that μ is finite).

15.4. ↑ Show that for positive ε there exists a finite partition {A_i} such that, if {B_j} is any finer partition and ω_j ∈ B_j, then

| ∫ f dμ − Σ_j f(ω_j) μ(B_j) | < ε.

15.5. ↑ Show that

∫ f dμ = lim_n Σ_{k=1}^{n2^n} [(k − 1)/2^n] μ[ω: (k − 1)/2^n ≤ f(ω) < k/2^n].

The limit on the right here is Lebesgue's definition of the integral.

15.6. ↑ Suppose that the integral is defined for simple functions by ∫ (Σ_i x_i I_{A_i}) dμ = Σ_i x_i μ(A_i). Suppose that f_n and g_n are simple and nondecreasing and have a common limit: 0 ≤ f_n ↑ f and 0 ≤ g_n ↑ f. Adapt the arguments used to prove Theorem 15.1(iii) and show that lim_n ∫ f_n dμ = lim_n ∫ g_n dμ. Thus ∫ f dμ can (Theorem 13.5) consistently be defined as lim_n ∫ f_n dμ for simple functions for which 0 ≤ f_n ↑ f.

SECTION 16. PROPERTIES OF THE INTEGRAL
Equalities and Inequalities

By definition, the requirement for integrability of f is that ∫ f⁺ dμ and ∫ f⁻ dμ both be finite, which is the same as the requirement that ∫ f⁺ dμ + ∫ f⁻ dμ < ∞ and hence is the same as the requirement that ∫ (f⁺ + f⁻) dμ < ∞ (Theorem 15.1(iv)). Since f⁺ + f⁻ = |f|, f is integrable if and only if

(16.1)    ∫ |f| dμ < ∞.

It follows that if |f| ≤ |g| almost everywhere and g is integrable, then f is integrable as well. If μ(Ω) < ∞, a bounded f is integrable.

Theorem 16.1. (i) If f ≤ g almost everywhere and the integrals exist, then

(16.2)    ∫ f dμ ≤ ∫ g dμ.

(ii) If f and g are integrable and α and β are finite real numbers, then αf + βg is integrable and

(16.3)    ∫ (αf + βg) dμ = α ∫ f dμ + β ∫ g dμ.

PROOF OF (i). If f ≤ g almost everywhere, then f⁺ ≤ g⁺ and f⁻ ≥ g⁻ almost everywhere, and so (16.2) follows by the definition (15.6). •

PROOF OF (ii). First, αf + βg is integrable because, by Theorem 15.1,

∫ |αf + βg| dμ ≤ ∫ (|α| · |f| + |β| · |g|) dμ = |α| ∫ |f| dμ + |β| ∫ |g| dμ < ∞.

By Theorem 15.1(iv) and the definition (15.6), ∫ (αf) dμ = α ∫ f dμ; consider separately the cases α ≥ 0 and α < 0. Therefore, it is enough to check (16.3) for the case α = β = 1. By definition, (f + g)⁺ − (f + g)⁻ = f + g = f⁺ − f⁻ + g⁺ − g⁻, and therefore (f + g)⁺ + f⁻ + g⁻ = (f + g)⁻ + f⁺ + g⁺. All these functions being nonnegative, ∫ (f + g)⁺ dμ + ∫ f⁻ dμ + ∫ g⁻ dμ = ∫ (f + g)⁻ dμ + ∫ f⁺ dμ + ∫ g⁺ dμ, which can be rearranged to give ∫ (f + g)⁺ dμ − ∫ (f + g)⁻ dμ = ∫ f⁺ dμ − ∫ f⁻ dμ + ∫ g⁺ dμ − ∫ g⁻ dμ. But this reduces to (16.3). •

Since −|f| ≤ f ≤ |f|, it follows from (16.2) that

(16.4)    | ∫ f dμ | ≤ ∫ |f| dμ.

Example 16.1. Consider the space Ω = {1, 2, ...} together with counting measure μ, so that a real function on Ω is a sequence {x_m}. If x_{nm} = x_m for m ≤ n and x_{nm} = 0 for m > n, the function corresponding to x_{n1}, x_{n2}, ... has integral Σ_{m=1}^n x_m by Theorem 15.1(i) (consider the decomposition {1}, ..., {n}, {n + 1, n + 2, ...}). It follows by Theorem 15.1(iii) that in the nonnegative case the integral of the function given by {x_m} is the sum Σ_m x_m (finite or infinite) of the corresponding infinite series. In the general case the function is integrable if and only if Σ_{m=1}^∞ |x_m| is a convergent infinite series, in which case the integral is Σ_{m=1}^∞ x_m⁺ − Σ_{m=1}^∞ x_m⁻.

The function x_m = (−1)^{m+1} m^{−1} is not integrable by this definition and even fails to have a definite integral, since Σ_{m=1}^∞ x_m⁺ = Σ_{m=1}^∞ x_m⁻ = ∞. This invites comparison with the ordinary theory of infinite series, according to which the alternating harmonic series does converge in the sense that lim_M Σ_{m=1}^M (−1)^{m+1} m^{−1} = log 2. But since this says that the sum of the first M terms has a limit, it requires that the elements of the space Ω be ordered. If Ω consists not of the positive integers but, say, of the integer lattice points in 3-space, it has no canonical linear ordering. And if Σ_m x_m is to have the same finite value no matter what the order of summation, the series must be absolutely convergent.† This helps to explain why f is defined to be integrable only if ∫ f⁺ dμ and ∫ f⁻ dμ are both finite. •

†RUDIN, p. 76.
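Example 16.1's identification of integrals with sums can be sketched numerically (an illustration only): under counting measure the integral of a nonnegative sequence is its sum, and the truncations increase to it as in Theorem 15.1(iii).

```python
import math

# Counting measure on {1, 2, ...}: a nonnegative function is a sequence x_m,
# and its integral is the (finite or infinite) sum of the series.
x = [1.0 / m**2 for m in range(1, 200_000)]   # x_m = 1/m^2, nonnegative

# Truncations x_{nm} = x_m for m <= n, 0 for m > n have integral sum(x[:n]),
# and these nondecreasing values converge to the integral of x.
partial = [sum(x[:n]) for n in (10, 100, 1000, len(x))]
assert all(a <= b for a, b in zip(partial, partial[1:]))   # monotone in n
assert abs(partial[-1] - math.pi**2 / 6) < 1e-4            # integral = sum = pi^2/6
print(partial[-1])
```

For the alternating sequence x_m = (−1)^{m+1}/m of the example, the same computation would give divergent positive and negative parts, which is exactly why that function has no integral.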
Example 16.2. In connection with Example 15.1, consider the function f = 3I_{(a, ∞)} − 2I_{(−∞, a)}. There is no natural value for ∫ f dλ (it is "∞ − ∞"), and none is assigned by the definition.

If a function f is bounded on bounded intervals, then each function f_n = f I_{(−n, n)} is integrable with respect to λ. Since f = lim_n f_n, the limit of ∫ f_n dλ, if it exists, is sometimes called the "principal value" of the integral of f. Although it is natural for some purposes to integrate symmetrically about the origin, this is not the right definition of the integral in the context of general measure theory. The functions g_n = f I_{(−n, n+1)} for example also converge to f, and ∫ g_n dλ may have some other limit, or none at all; f(x) = x is a case in point. There is no general reason why f_n should take precedence over g_n. As in the preceding example, f = Σ_{k=1}^∞ (−1)^k k^{−1} I_{(k, k+1]} has no integral, even though the ∫ f_n dλ above converge. •

Integration to the Limit

The first result, the monotone convergence theorem, essentially restates Theorem 15.1(iii).

Theorem 16.2. If 0 ≤ f_n ↑ f almost everywhere, then ∫ f_n dμ ↑ ∫ f dμ.

PROOF. If 0 ≤ f_n ↑ f on a set A with μ(Aᶜ) = 0, then 0 ≤ f_n I_A ↑ f I_A holds everywhere, and it follows by Theorem 15.1(iii) and the remark following Theorem 15.2 that ∫ f_n dμ = ∫ f_n I_A dμ ↑ ∫ f I_A dμ = ∫ f dμ. •

As the functions in Theorem 16.2 are nonnegative almost everywhere, all the integrals exist. The conclusion of the theorem is that lim_n ∫ f_n dμ and ∫ f dμ are both infinite or both finite and in the latter case are equal.

Example 16.3. Consider the space {1, 2, ...} together with counting measure, as in Example 16.1. If for each m, 0 ≤ x_{nm} ↑ x_m as n → ∞, then lim_n Σ_m x_{nm} = Σ_m x_m, a standard result about infinite series. •

Example 16.4. If μ is a measure on ℱ and ℱ_0 is a σ-field contained in ℱ, then the restriction μ_0 of μ to ℱ_0 is another measure (Example 10.4). If f = I_A and A ∈ ℱ_0, then ∫ f dμ = ∫ f dμ_0, the common value being μ(A) = μ_0(A). The same is by linearity true for nonnegative simple functions measurable ℱ_0. It holds by Theorem 16.2 for all nonnegative f measurable ℱ_0 because (Theorem 13.5) 0 ≤ f_n ↑ f for simple functions f_n measurable ℱ_0. For functions measurable ℱ_0, integration with respect to μ is thus the same thing as integration with respect to μ_0. •

In this example a property was extended by linearity from indicators to nonnegative simple functions and thence to the general nonnegative function by a monotone passage to the limit. This is a technique of very frequent application.

Example 16.5. For a finite or infinite sequence of measures μ_n on ℱ, μ(A) = Σ_n μ_n(A) defines another measure (countably additive because [A27] sums can be reversed in a nonnegative double series). For indicators f,

(16.5)    ∫ f dμ = Σ_n ∫ f dμ_n,

and by linearity the same holds for simple f ≥ 0. If 0 ≤ f_k ↑ f for simple f_k, then by Theorem 16.2 and Example 16.3, ∫ f dμ = lim_k ∫ f_k dμ = lim_k Σ_n ∫ f_k dμ_n = Σ_n lim_k ∫ f_k dμ_n = Σ_n ∫ f dμ_n. The relation in question thus holds for all nonnegative f. •

An important consequence of the monotone convergence theorem is Fatou's lemma:

Theorem 16.3. For nonnegative f_n,

(16.6)    ∫ liminf_n f_n dμ ≤ liminf_n ∫ f_n dμ.

PROOF. If g_n = inf_{k ≥ n} f_k, then 0 ≤ g_n ↑ g = liminf_n f_n, and the preceding two theorems give ∫ f_n dμ ≥ ∫ g_n dμ → ∫ g dμ. •

Example 16.6. On (R¹, ℛ¹, λ), f_n = n² I_{(0, n^{−1})} and f = 0 satisfy f_n(x) → f(x) for each x, but ∫ f dλ = 0 and ∫ f_n dλ = n → ∞. This shows that the inequality in (16.6) can be strict and that it is not always possible to integrate to the limit. This phenomenon has been encountered before; see Examples 5.6 and 7.7. •
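Example 16.6 can be checked numerically (an illustrative sketch): integrating f_n = n² on (0, 1/n) against Lebesgue measure gives n, while the pointwise limit is identically 0 and integrates to 0, so Fatou's inequality (16.6) is strict here.

```python
def integral_fn(n):
    """Lebesgue integral of f_n = n^2 on (0, 1/n), 0 elsewhere: n^2 * (1/n) = n."""
    return n**2 / n

# liminf_n f_n(x) = 0 for every x, so the left side of (16.6) is 0,
# while the right side is lim_n (integral of f_n) = infinity.
values = [integral_fn(n) for n in (1, 10, 100, 1000)]
assert values == [1.0, 10.0, 100.0, 1000.0]
print(values)   # the integrals diverge although the pointwise limit is 0
```

The mass escapes into ever-taller, ever-narrower spikes near 0, which is precisely what a dominating integrable g would rule out in Theorem 16.4 below.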
16. PROPERTIES OF THE INTEGRAL 213 Fatou' s lemma leads to Lebesgue 's dominated convergence theorem: SECTION
f
If Ifn i s g almost everywhere, where g is integrable, and if In f almost everywhere, then f and the In are integrable and fIn dp. f fdp.. PROOF. From the hypothesis 1/n l s g it follows that all the In as well as I * = lim supn In and f * = lim inf In are integrable. Since g + In and g - In neorem 16.4. -+
-+
are nonnegative, Fatou' s lemma gives
and
.S:
j
j
j
lim inf In dp. . n ( g - In ) dp. = g dp. - lim sup n
Therefore (16 . 7) .S:
j
j
lim sup In dp. .S: lim sup In dp. . n n
The only assumption used thus far is that the In are dominated by an integrable g. If In f almost everywhere, the extreme terms in (16.7) agree • with f fdp.. -+
Example 16.6 shows that this theorem can fail if no dominating EXIIIple II 16. 7.
g exists.
The Weierstrass M-test for series. Consider the space
{ 1, 2, . . . } together with counting measure, as in Example 16.1. If l x n m l < M, and E m Mm < oo, and if lim n x n m = x m for each m, then lim n E m x n m = E,.. x m . This follows by an application of Theorem 16.4 with the function given by the sequence M1 , M2 , in the role of g. This is another standard • result on infinite series [A28]. •
•
•
214
INTEGRATION
The next result, the bounded convergence theorem, is a special case of Theorem 16.4. It contains Theorem 5.3 as a further special case.
If p.(D) < oo and the In are uniformly bounded, then In --+ f almost everywhere implies fIn dp. --+ f Idp.. Theorem 16.5.
The next two theorems are simply the series versions of the monotone and the dominated convergence theorems. Theorem 16.6.
If fn > 0, then J Enfn dp. = En / fn dp..
The members of this last equation are both equal either to oo or to the same finite, nonnegative real number.
If En fn converges almost everywhere and IE� == tfk l < g almost everywhere, where g is integrable, then En fn and the In are integrable and f En ln dp. = En fIn dp.. Theorem 16.7.
If Enf Ifni dp. < oo , then Enfn converges absolutely almost ev erywhere and is integrable, and f En fn dp. = En fIn dp.. PROOF. The function g = En Ifni is integrable by Theorem 16.6 and is finite almost everywhere by Theorem 15.2(iii). Hence Enl/nl and En/n converge • almost everywhere and Theorem 16.7 applies. Corollary.
In place of a sequence {f_n} of real measurable functions on (Ω, ℱ, μ), consider a family [f_t: t > 0] indexed by a continuous parameter t. Suppose of a measurable f that

(16.8) lim_{t→∞} f_t(ω) = f(ω)

on a set A, where

(16.9) A ∈ ℱ, μ(Ω − A) = 0.

A technical point arises here, since ℱ need not contain the ω-set where (16.8) holds:

Example 16.8. Let ℱ consist of the Borel subsets of Ω = (0, 1], and let H be a nonmeasurable set, a subset of Ω that does not lie in ℱ (see the end of Section 3). Define f_t(ω) = 1 if ω equals the fractional part t − [t] of t and their common value lies in H^c; define f_t(ω) = 0 otherwise. Each f_t is measurable ℱ, but if f(ω) ≡ 0, then the ω-set where (16.8) holds is exactly H. ∎

Because of such examples, the set A above must be assumed to lie in ℱ. (Because of Theorem 13.4, no such assumption is necessary in the case of sequences.)
The assumption that (16.8) holds on a set A satisfying (16.9) can still be expressed as the assumption that (16.8) holds almost everywhere.

If I_t = ∫ f_t dμ converges to I = ∫ f dμ as t → ∞, then certainly I_{t_n} → I for each sequence {t_n} going to infinity. But the converse holds as well: if I_t does not converge to I, then there is a positive ε such that |I_{t_n} − I| ≥ ε for a sequence {t_n} going to infinity. So to the question of whether I_{t_n} converges to I the previous theorems apply. Suppose that (16.8) holds almost everywhere and that |f_t| ≤ g almost everywhere, where g is integrable. By the dominated convergence theorem, f must then be integrable and I_{t_n} → I for each sequence {t_n} going to infinity. It follows that ∫ f_t dμ → ∫ f dμ as t → ∞. In this result t could go continuously to 0 or to some other value instead of to infinity.
Theorem 16.8. Suppose that f(ω, t) is a measurable function of ω for each t in (a, b).

(i) Suppose that f(ω, t) is almost everywhere continuous in t at t_0; suppose further that, for each t in some neighborhood of t_0, |f(ω, t)| ≤ g(ω) almost everywhere, where g is integrable. Then ∫ f(ω, t) μ(dω) is continuous in t at t_0.

(ii) Suppose that for ω ∈ A, where A satisfies (16.9), f(ω, t) has in (a, b) a derivative f′(ω, t) with respect to t; suppose further that |f′(ω, t)| ≤ g(ω) for ω ∈ A and t ∈ (a, b), where g is integrable. Then ∫ f(ω, t) μ(dω) has derivative ∫ f′(ω, t) μ(dω) on (a, b).

PROOF. Part (i) is an immediate consequence of the preceding discussion. To prove part (ii), consider a fixed t. By the mean value theorem,

(f(ω, t + h) − f(ω, t))/h = f′(ω, s),

where s lies between t and t + h. The left side goes to f′(ω, t) as h → 0 and is by the hypothesis dominated by the integrable function g(ω). Therefore,

(1/h)[∫ f(ω, t + h) μ(dω) − ∫ f(ω, t) μ(dω)] → ∫ f′(ω, t) μ(dω). ∎
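Theorem 16.8(ii) can be checked numerically; the particular choices below (μ Lebesgue measure on (0, 1), f(ω, t) = cos(tω), so f′(ω, t) = −ω sin(tω), dominated by g(ω) = 1) are invented for the sketch:

```python
import math

# Differentiation under the integral sign: with f(w, t) = cos(t*w) for w
# in (0,1), the derivative of F(t) = \int f(w,t) dw should equal
# \int f'(w,t) dw = \int -w*sin(t*w) dw.

N = 20_000
grid = [(k + 0.5) / N for k in range(N)]   # midpoint rule on (0,1)

def F(t):
    return sum(math.cos(t * w) for w in grid) / N

t, h = 1.3, 1e-5
numeric_derivative = (F(t + h) - F(t - h)) / (2 * h)
integral_of_derivative = sum(-w * math.sin(t * w) for w in grid) / N
print(numeric_derivative, integral_of_derivative)
```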
The condition involving g can be localized. It suffices to assume that for each t there is an integrable g(ω, t) such that |f′(ω, s)| ≤ g(ω, t) for all s in some neighborhood of t.

Integration over Sets

The integral of f over a set A in ℱ is defined by

(16.10) ∫_A f dμ = ∫ I_A f dμ.

The definition applies if f is defined only on A in the first place (set f = 0 outside A). Notice that ∫_A f dμ = 0 if μ(A) = 0.
All the concepts and theorems above carry over in an obvious way to integrals over A. Theorems 16.6 and 16.7 yield this result:

Theorem 16.9. If A_1, A_2, ... are disjoint, and if f is either nonnegative or integrable, then ∫_{∪_n A_n} f dμ = Σ_n ∫_{A_n} f dμ.

Example 16.9. The integrals (16.10) suffice to determine f, provided μ is σ-finite. Suppose that f and g are nonnegative and that ∫_A f dμ ≤ ∫_A g dμ for all A in ℱ. If μ is σ-finite, there are ℱ-sets A_n such that A_n ↑ Ω and μ(A_n) < ∞. If B_n = [0 ≤ g < f, g ≤ n], then the hypothesized inequality applied to A_n ∩ B_n implies ∫_{A_n ∩ B_n} f dμ ≤ ∫_{A_n ∩ B_n} g dμ < ∞ (since A_n ∩ B_n has finite measure and g is bounded there) and hence ∫ I_{A_n ∩ B_n}(f − g) dμ ≤ 0. But then by Theorem 15.2(ii), the integrand is 0 almost everywhere, and so μ(A_n ∩ B_n) = 0. Therefore, μ[0 ≤ g < f, g < ∞] = 0, so that f ≤ g almost everywhere:

(i) If f and g are nonnegative and ∫_A f dμ = ∫_A g dμ for all A in ℱ, and if μ is σ-finite, then f = g almost everywhere.

There is a simpler argument for integrable functions; it does not require that the functions be nonnegative or that μ be σ-finite. Suppose that f and g are integrable and that ∫_A f dμ ≤ ∫_A g dμ for all A in ℱ. Then ∫_{[g<f]}(f − g) dμ ≤ 0 and hence μ[g < f] = 0 by Theorem 15.2(ii):
If h_n^{(a)} = h_n I_{[h_n < a]}, then for each a, lim_n h_n^{(a)} = 0 almost everywhere, and so lim_n ∫ h_n^{(a)} dμ = 0 by the bounded convergence theorem. Now

∫ h_n dμ = ∫_{[h_n ≥ a]} h_n dμ + ∫ h_n^{(a)} dμ.

Given ε, choose a so that the first term on the right is less than ε for all n; then choose n_0 so that the second term on the right is less than ε for all n ≥ n_0. This shows that ∫ h_n dμ → 0, which completes the proof of Theorem 16.13. ∎

Suppose that
(16.23) sup_n ∫ |f_n|^{1+ε} dμ < ∞

for a positive ε. If K is the supremum here, then

∫_{[|f_n| ≥ a]} |f_n| dμ ≤ (1/a^ε) ∫ |f_n|^{1+ε} dμ ≤ K/a^ε,

so {f_n} is uniformly integrable.
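A small numerical companion to (16.23), with invented g_n = √n·I_(0,1/n): here ∫ g_n² dx = 1 for every n, so (16.23) holds with K = 1 and ε = 1, and the tail bound K/a holds uniformly in n:

```python
import math

# g_n = sqrt(n) * I_(0,1/n):  \int g_n**2 dx = 1 for all n, so (16.23)
# holds with epsilon = 1 and K = 1, and the tail \int_{[g_n >= a]} g_n dx
# (computable exactly here) is at most K/a uniformly in n.

def tail(n, a):
    # exact value of the tail integral for this g_n
    return math.sqrt(n) / n if math.sqrt(n) >= a else 0.0

a = 10.0
worst_tail = max(tail(n, a) for n in range(1, 100_001))
print(worst_tail)   # the bound K/a = 0.1 is attained at n = 100
```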
Complex Functions

A complex-valued function on Ω has the form f(ω) = g(ω) + ih(ω), where g and h are ordinary finite-valued real functions on Ω. It is, by definition, measurable ℱ if g and h are. If g and h are integrable, then f is by definition integrable, and its integral is of course taken as

(16.24) ∫ (g + ih) dμ = ∫ g dμ + i ∫ h dμ.

Since max{|g|, |h|} ≤ |f| ≤ |g| + |h|, f is integrable if and only if ∫ |f| dμ < ∞, just as in the real case. The linearity equation (16.3) extends to complex functions and coefficients; the proof requires only that everything be decomposed into real and imaginary parts. Consider the inequality (16.4) for the complex case:

(16.25) |∫ f dμ| ≤ ∫ |f| dμ.

If f = g + ih and g and h are simple, the corresponding partitions can be taken to be the same (g = Σ_k x_k I_{A_k} and h = Σ_k y_k I_{A_k}), and (16.25) follows by the triangle inequality. For the general integrable f, represent g and h as limits of simple functions dominated by |f| and pass to the limit.

The results on integration to the limit extend as well. Suppose that f_k = g_k + ih_k are complex functions satisfying Σ_k ∫ |f_k| dμ < ∞. Then Σ_k ∫ |g_k| dμ < ∞, and so by the corollary to Theorem 16.7, Σ_k g_k is integrable and integrates to Σ_k ∫ g_k dμ. The same is true of the imaginary parts, and hence Σ_k f_k is integrable and

(16.26) ∫ Σ_k f_k dμ = Σ_k ∫ f_k dμ.
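A quick numerical check of (16.25) (the complex integrand f(x) = x e^{2πix} on (0, 1] with Lebesgue measure is invented for the illustration):

```python
import cmath, math

# (16.25): |\int f dmu| <= \int |f| dmu for the complex f(x) = x*e^{2 pi i x}
# on (0,1] with Lebesgue measure, via a midpoint Riemann sum.  Here
# \int |f| dx = 1/2 while |\int f dx| = 1/(2 pi), so the inequality is strict.

N = 100_000
grid = [(k + 0.5) / N for k in range(N)]

integral_f = sum(w * cmath.exp(2j * math.pi * w) for w in grid) / N
integral_abs_f = sum(abs(w * cmath.exp(2j * math.pi * w)) for w in grid) / N
print(abs(integral_f), integral_abs_f)
```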
PROBLEMS

16.1. Suppose that μ(A) < ∞ and α ≤ f ≤ β almost everywhere on A. Show that αμ(A) ≤ ∫_A f dμ ≤ βμ(A).

16.2. (a) Deduce parts (i) and (ii) of Theorem 10.2 from Theorems 16.2 and 16.4.
(b) Compare (16.7), (10.6), and (4.10).
16.3. 13.13 ↑ Suppose that μ(Ω) < ∞ and the f_n are uniformly bounded.
(a) Assume f_n → f uniformly and deduce ∫ f_n dμ → ∫ f dμ from (16.5).
(b) Use part (a) and Egoroff's theorem to give another proof of Theorem 16.5.

16.4. Prove that if 0 ≤ f_n → f almost everywhere and ∫ f_n dμ ≤ A < ∞, then f is integrable and ∫ f dμ ≤ A. (This is essentially the same as Fatou's lemma and is sometimes called by that name.)

16.5. Suppose that μ is Lebesgue measure on the unit interval Ω and that (a, b) = (0, 1) in Theorem 16.8. If f(ω, t) is 1 or 0 according as ω < t or ω ≥ t, then for each t, f′(ω, t) = 0 almost everywhere. But ∫ f(ω, t) μ(dω) does not differentiate to 0. Why does this not contradict the theorem? Examine the proof for this case.

16.6. (a) Suppose that functions a_n, b_n, f_n converge almost everywhere to functions a, b, f, respectively. Suppose that the first two sequences may be integrated to the limit, that is, the functions are all integrable and ∫ a_n dμ → ∫ a dμ, ∫ b_n dμ → ∫ b dμ. Suppose, finally, that the first two sequences enclose the third: a_n ≤ f_n ≤ b_n almost everywhere. Show that the third may be integrated to the limit.
(b) Deduce Lebesgue's dominated convergence theorem from part (a).

16.7. Show that, for f integrable and ε positive, there exists an integrable simple g such that ∫ |f − g| dμ < ε. If 𝒜 is a field generating ℱ, g can be taken as Σ_i x_i I_{A_i} with A_i ∈ 𝒜. Hint: Use Theorems 13.5, 16.4, and 11.4.

16.8. Suppose that f(ω, ·) is, for each ω, a function on an open set W in the complex plane and that f(·, z) is for z in W measurable ℱ. Suppose that A satisfies (16.9), that f(ω, ·) is analytic in W for ω in A, and that for each z_0 in W there is an integrable g(·, z_0) such that |f(ω, z)| ≤ g(ω, z_0) for all ω ∈ A and all z in some neighborhood of z_0. Show that ∫ f(ω, z) μ(dω) is analytic in W.

16.9. Suppose that f_n are integrable and sup_n ∫ f_n dμ < ∞. Show that, if f_n ↑ f, then f is integrable and ∫ f_n dμ → ∫ f dμ. This is Beppo Levi's theorem.

16.10. Suppose of integrable f_n that α_mn = ∫ |f_m − f_n| dμ → 0 as m, n → ∞. Then there exists an integrable f such that β_n = ∫ |f − f_n| dμ → 0. Prove this by the following steps:
(a) Choose n_k so that m, n ≥ n_k implies that α_mn < 2^{−k}. Use the corollary to Theorem 16.7 to show that f = lim_k f_{n_k} exists and is integrable and lim_k ∫ |f − f_{n_k}| dμ = 0.
(b) For n ≥ n_k, write ∫ |f − f_n| dμ ≤ ∫ |f − f_{n_k}| dμ + ∫ |f_{n_k} − f_n| dμ.

16.11. (a) If μ is not σ-finite, (i) in Example 16.9 can fail. Find an example.
(b) Modifying the argument in Example 16.9, show that if f, g ≥ 0, if ∫_A f dμ ≤ ∫_A g dμ for all A ∈ ℱ, and if [g < ∞, f > 0] is a countable union of sets of finite μ-measure, then f ≤ g almost everywhere.
(c) Show that the condition on [g < ∞, f > 0] in part (b) is satisfied if the measure ν(A) = ∫_A f dμ is σ-finite.

16.12. ↑ (a) Suppose that (16.11) holds (δ ≥ 0). Show that ν determines δ up to a set of μ-measure 0 if μ or ν is σ-finite.
(b) Show that, if μ is σ-finite and μ[δ = ∞] = 0, then ν is σ-finite.
(c) Show that if neither μ nor ν is σ-finite, then δ may not be unique even up to sets of μ-measure 0.

16.13. Show that the supremum in (16.15) is half the integral on the right.

16.14. Show that if (16.14) holds and if lim inf_n δ_n ≥ δ almost everywhere, then (16.15) holds and lim inf_n δ_n = δ almost everywhere.

16.15. Show that Scheffé's theorem still holds if (16.14) is weakened to ν_n(Ω) → ν(Ω).

16.16. ↑ Suppose that f_n and f are integrable and that f_n → f almost everywhere. Show that ∫ |f_n| dμ → ∫ |f| dμ if and only if ∫ |f_n − f| dμ → 0.

16.17. (a) Show that if |f_n| ≤ g and g is integrable, then {f_n} is uniformly integrable. Compare the hypotheses of Theorems 16.4 and 16.13.
(b) On the unit interval with Lebesgue measure, let f_n = (n/log n) I_{(0, 1/n)} for n ≥ 3. Show that the f_n are uniformly integrable (and ∫ f_n dμ → 0) although they are not dominated by any integrable g.
(c) Show for f_n = n I_{(n^{-1}, 2n^{-1})} that the f_n can be integrated to the limit (Lebesgue measure) even though the f_n are not uniformly integrable.

16.18. Show that if f is integrable, then for each ε there is a δ such that μ(A) < δ implies ∫_A |f| dμ < ε.

16.19. ↑ Suppose that μ(Ω) < ∞. Show that {f_n} is uniformly integrable if and only if (i) ∫ |f_n| dμ is bounded and (ii) for each ε there is a δ such that μ(A) < δ implies ∫_A |f_n| dμ < ε for all n.

16.20. 2.17 16.19 ↑ Assume μ(Ω) < ∞.
(a) Show by examples that neither of conditions (i) and (ii) in the preceding problem implies the other.
(b) Show that (ii) implies (i) for all sequences {f_n} if and only if μ is nonatomic.

16.21. 16.16 16.19 ↑ Suppose that μ(Ω) < ∞, f_n and f are integrable, and f_n → f almost everywhere. Show that these conditions are equivalent: (i) {f_n} is uniformly integrable; (ii) ∫ |f_n| dμ → ∫ |f| dμ; (iii) ∫ |f_n − f| dμ → 0.

16.22. 13.15 16.19 ↑ (a) Suppose that μ(Ω) < ∞ and f_n ≥ 0. Show that if the f_n are uniformly integrable, then lim sup_n ∫ f_n dμ ≤ ∫ lim sup_n f_n dμ.
(b) Use this fact to give another proof of Theorem 16.13 in the nonnegative case.

16.23. (a) Suppose that f_1, f_2, ... are nonnegative and put g_n = max{f_1, ..., f_n}. Show that ∫_{[g_n ≥ α]} g_n dμ ≤ Σ_{k=1}^n ∫_{[f_k ≥ α]} f_k dμ.
(b) Suppose further that {f_n} is uniformly integrable and μ(Ω) < ∞. Show that ∫ g_n dμ = o(n).

16.24. Let f be a complex-valued function integrating to re^{iθ}, r ≥ 0. From ∫ (|f(ω)| − e^{−iθ} f(ω)) μ(dω) = ∫ |f| dμ − r, deduce (16.25).
16.25. 11.12 ↑ Consider the vector lattice ℒ and the functional Λ of Problems 11.11 and 11.12. Let μ be the extension (Theorem 11.3) to ℱ = σ(ℒ) of the set function μ_0 on ℱ_0.
(a) Show by (11.11) that, for positive x, y_1, y_2, ν([f > 1] × (0, x]) = x μ_0[f > 1] = x μ[f > 1] and ν([y_1 < f ≤ y_2] × (0, x]) = x μ[y_1 < f ≤ y_2].
(b) Show that if f ∈ ℒ, then f is integrable and

Λ(f) = ∫ f dμ.

(Consider first the case f ≥ 0.) This is the Daniell-Stone representation theorem.
SECTION 17. INTEGRAL WITH RESPECT TO LEBESGUE MEASURE

The Lebesgue Integral on the Line

A real measurable function on the line is Lebesgue integrable if it is integrable with respect to Lebesgue measure λ, and its Lebesgue integral ∫ f dλ is denoted by ∫ f(x) dx or, in case of integration over an interval, by ∫_a^b f(x) dx. It is instructive to compare it with the Riemann integral.
The Riemann Integral

A real function f on an interval (a, b] is by definition† Riemann integrable, with integral r, if this condition holds: For each ε there exists a δ with the property that

(17.1) |r − Σ_j f(x_j) λ(J_j)| < ε

whenever {J_j} is a finite decomposition of (a, b] into subintervals satisfying λ(J_j) < δ and x_j ∈ J_j for each j.

Example 17.2. Take f = I_A, where A is a Borel set in (0, 1] with λ(A) < 1 but λ(A ∩ J) > 0 for each subinterval J of (0, 1]. Suppose that f = g almost everywhere and that {J_j} is a decomposition of (0, 1] into subintervals. Since λ(J_j ∩ A ∩ [f = g]) = λ(J_j ∩ A) > 0, g(y_j) = f(y_j) = 1 for some y_j in J_j ∩ A, and

(17.3) Σ_j g(y_j) λ(J_j) = 1 > λ(A).

If g were Riemann integrable, its Riemann integral would coincide with the Lebesgue integrals ∫ g dx = ∫ f dx = λ(A), in contradiction to (17.3). ∎

It is because of their extreme oscillations that the functions in Examples 17.1 and 17.2 fail to be Riemann integrable. (It can be shown that a bounded function on a bounded interval is Riemann integrable if and only if the set of its discontinuities has Lebesgue measure 0.†)

†See Problems 17.5 and 17.6.

This cannot happen in the case of the Lebesgue integral of a measurable function: if f
fails to be Lebesgue integrable, it is because its positive part or its negative part is too large, not because one or the other is too irregular.
Example 17.3. It is an important analytic fact that

(17.4) lim_{t→∞} ∫_0^t (sin x / x) dx = π/2.

The existence of the limit is simple to prove: the integrals of x^{−1} sin x over the successive intervals ((n − 1)π, nπ] alternate in sign and decrease in absolute value, so the partial integrals converge as t → ∞.
Stieltjes Integrals

Suppose that F is a function on R^k satisfying the hypotheses of Theorem 12.5, so that there exists a measure μ such that μ(A) = Δ_A F for bounded rectangles A. In integrals with respect to μ, μ(dx) is often replaced by dF(x):

(17.16) ∫_A f(x) dF(x) = ∫_A f(x) μ(dx).

The left side of this equation is the Stieltjes integral of f with respect to F; since it is defined by the right side of the equation, nothing new is involved.

Suppose that f is uniformly continuous on a rectangle A, and suppose that A is decomposed into rectangles A_m small enough that |f(x) − f(y)| < ε/μ(A) for x, y ∈ A_m. Then

|∫_A f(x) dF(x) − Σ_m f(x_m) Δ_{A_m}F| < ε

for x_m ∈ A_m. In this case the left side of (17.16) can be defined as the limit of these approximating sums without any reference to the general theory of measure, and for historical reasons it is sometimes called the Riemann-Stieltjes integral; (17.16) for the general f is then called the Lebesgue-Stieltjes integral. Since these distinctions are unimportant in the context of general measure theory, ∫ f(x) dF(x) and ∫ f dF are best regarded as merely notational variants for ∫ f(x) μ(dx) and ∫ f dμ.

PROBLEMS
17.1. Extend Theorem 17.1 to R^k.

17.2. ↑ Use Theorem 17.1 to show that, if f is integrable, then

lim_{t→0} ∫ |f(x + t) − f(x)| dx = 0.
17.3. Suppose that μ is a finite measure on 𝓡¹ and A is closed. Show that μ(x + A) is upper semicontinuous in x and hence measurable.

17.4. Suppose that ∫_0^∞ |f(x)| dx < ∞. Show that for each ε, λ[x: x ≥ a, |f(x)| ≥ ε] → 0 as a → ∞. Show by example that f(x) need not go to 0 as x → ∞.

17.5. A Riemann integrable function on (0, 1] is continuous almost everywhere. Let the function be f, and let A_ε be the set of x such that for every δ there are points y and z satisfying |y − x| < δ, |z − x| < δ, and |f(y) − f(z)| ≥ ε. Show that A_ε is closed and that the set D of discontinuities of f is the union of the A_ε for positive, rational ε. Suppose that [0, 1] is decomposed into intervals I_i. If the interior of I_i meets A_ε, choose y_i and z_i in I_i in such a way that f(y_i) − f(z_i) ≥ ε; otherwise take y_i = z_i ∈ I_i. Show that Σ_i (f(y_i) − f(z_i)) λ(I_i) ≥ ε λ(A_ε). Finally, show that if f is Riemann integrable, then λ(D) = 0.
17.6. A bounded Borel function on (0, 1] that is continuous almost everywhere is Riemann integrable. Let M bound |f| and suppose that λ(D) = 0. Given ε, find an open G such that D ⊂ G and λ(G) < ε/M, and take C = [0, 1] − G. For each x in C find a δ_x such that |f(y) − f(x)| < ε if |y − x| < δ_x. Cover C by finitely many intervals (x_k − δ_{x_k}/2, x_k + δ_{x_k}/2) and take 2δ less than the minimum δ_{x_k}. Show that |f(y) − f(x)| < 2ε if |y − x| < δ and x ∈ C. If [0, 1] is decomposed into intervals I_i with λ(I_i) < δ, and if y_i ∈ I_i, let g be the function with value f(y_i) on I_i. Let Σ′ denote summation over those i for which I_i meets C, and let Σ″ denote summation over the other i, and bound the two corresponding contributions to ∫ |f − g| dλ.

17.7. Show that if f is integrable, there exist continuous, integrable functions g_n such that g_n(x) → f(x) except on a set of Lebesgue measure 0. (Use Theorem 17.1(ii) with ε = n^{−2}.)

17.8. 13.13 17.7 ↑ Let f be a finite-valued Borel function over (0, 1). By the following steps, prove Lusin's theorem: For each ε there exists a continuous function g such that λ[x ∈ (0, 1): f(x) ≠ g(x)] < ε.
(a) Show that f may be assumed integrable, or even bounded.
(b) Let g_n be continuous functions converging to f almost everywhere. Combine Egoroff's theorem and Theorem 12.3 to show that convergence is uniform on a compact set K such that λ((0, 1) − K) < ε. The limit lim_n g_n(x) = f(x) must be continuous when restricted to K.
(c) Exhibit (0, 1) − K as a disjoint union of open intervals I_k [A12]; define g as f on K and define it by linear interpolation on each I_k.

17.9. Let f_n(x) = x^{n−1} − 2x^{2n−1}. Calculate and compare ∫_0^1 Σ_{n=1}^∞ f_n(x) dx and Σ_{n=1}^∞ ∫_0^1 f_n(x) dx. Relate this to Theorem 16.6 and to the corollary to Theorem 16.7.
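Problem 17.9 can be explored numerically; the computation below carries out the comparison it asks for (the closed forms 1/(1+x) and log 2 can be verified by summing the two geometric series):

```python
import math

# Problem 17.9: f_n(x) = x**(n-1) - 2*x**(2*n-1).  Each f_n integrates
# over (0,1) to 1/n - 2/(2n) = 0, so the sum of the integrals is 0.  But
# sum_n f_n(x) = 1/(1+x) pointwise on (0,1), and \int_0^1 dx/(1+x) = log 2.
# The interchange fails: the f_n are not nonnegative, and sum_n \int |f_n| dx
# diverges, so neither Theorem 16.6 nor the corollary to 16.7 applies.

def partial_sum(x, T):
    return sum(x ** (n - 1) - 2 * x ** (2 * n - 1) for n in range(1, T + 1))

x = 0.5
pointwise = partial_sum(x, 200)
print(pointwise, 1 / (1 + x))         # partial sums approach 1/(1+x)

sum_of_integrals = sum(1 / n - 2 / (2 * n) for n in range(1, 201))
print(sum_of_integrals, math.log(2))  # 0.0 versus log 2
```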
17.10. Show that (1 + y²)^{−1} has equal integrals over (−∞, −1), (−1, 0), (0, 1), (1, ∞). Conclude from (17.10) that ∫_0^1 (1 + y²)^{−1} dy = π/4. Expand the integrand in a geometric series and deduce Leibniz's formula

π/4 = 1 − 1/3 + 1/5 − 1/7 + ⋯

by Theorem 16.7 (note that its corollary does not apply).

17.11. Take T(x) = sin x and f(y) = (1 − y²)^{−1/2} in (17.7), and show that t = ∫_0^{sin t} (1 − y²)^{−1/2} dy for 0 ≤ t ≤ π/2. From Newton's binomial formula and two applications of Theorem 16.6, conclude that

(17.17) π/2 = Σ_{k=0}^∞ (2k choose k) 4^{−k} / (2k + 1).

17.12. 17.7 ↑ Suppose that T maps [a, b] into [u, v] and has continuous derivative T′; T need not be monotonic, but assume for notational convenience that Ta ≤ Tb. Suppose that f is a Borel function on [u, v] for which ∫_a^b |f(T(x)) T′(x)| dx < ∞ and ∫_{Ta}^{Tb} |f(y)| dy < ∞. Then (17.7) holds.
(a) Prove this for continuous f by replacing b by a variable upper limit and differentiating.
(b) Prove it for bounded f by exhibiting f as the limit almost everywhere of uniformly bounded continuous functions.
(c) Prove it in general.
17.13. A function f on [0, 1] has Kurzweil integral I if for each ε there exists a positive function δ(·) over [0, 1] such that, if x_0, ..., x_n and ξ_1, ..., ξ_n satisfy

(17.18) 0 = x_0 < x_1 < ⋯ < x_n = 1, x_{i−1} ≤ ξ_i ≤ x_i, x_i − x_{i−1} < δ(ξ_i), i = 1, ..., n,

then

(17.19) |Σ_{i=1}^n f(ξ_i)(x_i − x_{i−1}) − I| < ε.

Riemann's definition is narrower because it requires the function δ(·) to be constant. Show in fact by the following steps that, if a Borel function over [0, 1] is Lebesgue integrable, then it has a Kurzweil integral, namely its Lebesgue integral.
(a) The Kurzweil integral being obviously additive, assume that f ≥ 0. Choose k_0 so that ∫_{[f ≥ k_0 ε]} f dx < ε, and for k = 1, 2, ... choose open sets G_k such that G_k ⊃ A_k = [(k − 1)ε ≤ f < kε] and λ(G_k − A_k) < 1/(k k_0² 2^k). Define δ(ξ) = dist(ξ, G_k^c) for ξ ∈ A_k.
(b) Suppose that (17.18) holds. Note that ξ_i ∈ A_k implies that [x_{i−1}, x_i] ⊂ G_k, and show that Σ_{i=1}^n f(ξ_i)(x_i − x_{i−1}) ≤ ∫ f dx + 2ε by splitting the sum according to which A_k it is that ξ_i lies in.
(c) Let Σ^{(k)} denote summation over the i for which ξ_i ∈ A_k, and show that Σ^{(k)}(x_i − x_{i−1}) ≥ λ(A_k) − 1/(k_0² 2^k). Conclude that Σ_{i=1}^n f(ξ_i)(x_i − x_{i−1}) ≥ ∫ f dx − 3ε. Thus (17.18) implies (17.19) for I = ∫ f dx (with 3ε in place of ε).

17.14. ↑ (a) Suppose that f is continuous on (0, 1] and

(17.20) lim_{t↓0} ∫_t^1 f(x) dx = I.

Choose η so that t ≤ η implies that |I − ∫_t^1 f dx| < ε; define δ_0(·) over (0, 1] in such a way that 0 < x, y ≤ 1 and |x − y| < δ_0(x) imply that |f(x) − f(y)| < ε, and put

δ(x) = min{x/2, δ_0(x)} if 0 < x ≤ 1, and δ(0) = η/(1 + |f(0)|).

From (17.18) deduce that ξ_1 = 0 and then that (17.19) holds with 3ε in place of ε.
(b) Let F be the function in Example 17.4, and put f = F′ on [0, 1]. Show that f has a Kurzweil integral even though it is not Lebesgue integrable.

17.15. 12.16 ↑ Identifying π. Deduce from (12.19) that

(17.21) λ₂(A) = (π/π₀) ∬_B r dr dθ,

where B is the set of (r, θ) with r > 0, 0 < θ ≤ 2π, and (r cos θ, r sin θ) ∈ A, and then that

(17.22) ∬_{R²} f(x, y) dx dy = (π/π₀) ∬_{r>0, 0<θ<2π} f(r cos θ, r sin θ) r dr dθ.
When π₀ = π has been established in Problem 18.24, this will give (17.15).

17.16. 16.25 ↑ Let ℒ consist of the continuous functions on R¹ with compact support. Show that ℒ is a vector lattice in the sense of Problem 11.11 and has the property that f ∈ ℒ implies f ∧ 1 ∈ ℒ (note that 1 ∉ ℒ). Show that the σ-field ℱ generated by ℒ is 𝓡¹. Suppose Λ is a positive linear functional on ℒ; show that Λ has the required continuity property if and only if f_n(x) → 0 uniformly in x implies Λ(f_n) → 0. Show under this assumption on Λ that there is a measure μ on 𝓡¹ such that

(17.23) Λ(f) = ∫ f dμ, f ∈ ℒ.

Show that μ is σ-finite and unique. This is a version of the Riesz representation theorem.
17.17. ↑ Let Λ(f) be the Riemann integral of f, which does exist for f in ℒ. Using the most elementary facts about Riemann integration, show that the μ determined by (17.23) is Lebesgue measure. This gives still another way of constructing Lebesgue measure.

17.18. ↑ Extend the ideas in the preceding two problems to R^k.

SECTION 18. PRODUCT MEASURE AND FUBINI'S THEOREM
Let (X, 𝒳) and (Y, 𝒴) be measurable spaces. For given measures μ and ν on these spaces, the problem is to construct on the cartesian product X × Y a product measure π such that π(A × B) = μ(A)ν(B) for A ∈ 𝒳 and B ∈ 𝒴. In the case where μ and ν are Lebesgue measure on the line, π will be Lebesgue measure in the plane. The main result is Fubini's theorem, according to which double integrals can be calculated as iterated integrals.

Product Spaces

It is notationally convenient in this section to change from (Ω, ℱ) to (X, 𝒳) and (Y, 𝒴). In the product space X × Y a measurable rectangle is a product A × B for which A ∈ 𝒳 and B ∈ 𝒴. The natural class of sets in X × Y to consider is the σ-field 𝒳 × 𝒴 generated by the measurable rectangles. (Of course, 𝒳 × 𝒴 is not a cartesian product in the usual sense.)

Example 18.1. Suppose that X = Y = R¹ and 𝒳 = 𝒴 = 𝓡¹. Then a measurable rectangle is a cartesian product A × B in which A and B are linear Borel sets. The term rectangle has up to this point been reserved for cartesian products of intervals, and so a measurable rectangle is more general. As the measurable rectangles do include the ordinary ones and the latter generate 𝓡², 𝓡² ⊂ 𝓡¹ × 𝓡¹. On the other hand, if A is an interval, [B: A × B ∈ 𝓡²] contains R¹ (since A × R¹ = ∪_n (A × (−n, n]) ∈ 𝓡²) and is closed under the formation of proper differences and countable unions; thus it is a σ-field containing the intervals and hence the Borel sets. Therefore, if B is a Borel set, [A: A × B ∈ 𝓡²] contains the intervals and hence, being a
σ-field, contains the Borel sets. Thus all the measurable rectangles are in 𝓡², and so 𝓡¹ × 𝓡¹ = 𝓡² consists exactly of the two-dimensional Borel sets. ∎
As this example shows, 𝒳 × 𝒴 is in general much larger than the class of measurable rectangles.
Theorem 18.1. (i) If E ∈ 𝒳 × 𝒴, then for each x the set [y: (x, y) ∈ E] lies in 𝒴, and for each y the set [x: (x, y) ∈ E] lies in 𝒳.
(ii) If f is measurable 𝒳 × 𝒴, then for each fixed x the function f(x, y) with y varying is measurable 𝒴, and for each fixed y the function f(x, y) with x varying is measurable 𝒳.

The set [y: (x, y) ∈ E] is the section of E determined by x, and f(x, y) as a function of y is the section of f determined by x.

PROOF. Fix x and consider the mapping T_x: Y → X × Y defined by T_x y = (x, y). If E = A × B is a measurable rectangle, T_x^{−1}E is B or ∅ according as A contains x or not, and in either case T_x^{−1}E ∈ 𝒴. By Theorem 13.1(i), T_x is measurable 𝒴/𝒳 × 𝒴. Hence [y: (x, y) ∈ E] = T_x^{−1}E ∈ 𝒴 for E ∈ 𝒳 × 𝒴. By Theorem 13.1(ii), if f is measurable 𝒳 × 𝒴/𝓡¹, then fT_x is measurable 𝒴/𝓡¹. Hence f(x, y) = fT_x(y) as a function of y is measurable 𝒴. The symmetric statements for fixed y are proved the same way. ∎
Product Measure

Now suppose that (X, 𝒳, μ) and (Y, 𝒴, ν) are measure spaces, and suppose for the moment that μ and ν are finite. By the theorem just proved, ν[y: (x, y) ∈ E] is a well-defined function of x. If ℒ is the class of E in 𝒳 × 𝒴 for which this function is measurable 𝒳, it is not hard to show that ℒ is a λ-system. Since the function is I_A(x)ν(B) for E = A × B, ℒ contains the π-system consisting of the measurable rectangles. Hence ℒ coincides with 𝒳 × 𝒴 by the π-λ theorem. It follows without difficulty that

(18.1) π′(E) = ∫_X ν[y: (x, y) ∈ E] μ(dx), E ∈ 𝒳 × 𝒴,

is a finite measure on 𝒳 × 𝒴, and similarly for

(18.2) π″(E) = ∫_Y μ[x: (x, y) ∈ E] ν(dy), E ∈ 𝒳 × 𝒴.

For measurable rectangles,

(18.3) π′(A × B) = π″(A × B) = μ(A)·ν(B).

The class of E in 𝒳 × 𝒴 for which π′(E) = π″(E) thus contains the measurable rectangles; since this class is a λ-system, it contains 𝒳 × 𝒴. The common value π′(E) = π″(E) is the product measure sought.

To show that (18.1) and (18.2) also agree for σ-finite μ and ν, let {A_m} and {B_n} be decompositions of X and Y into sets of finite measure, and put μ_m(A) = μ(A ∩ A_m) and ν_n(B) = ν(B ∩ B_n). Since ν(B) = Σ_n ν_n(B), the integrand in (18.1) is measurable 𝒳 in the σ-finite as well as in the finite case; hence π′ is a well-defined measure on 𝒳 × 𝒴, and so is π″. If π′_mn and π″_mn are (18.1) and (18.2) for μ_m and ν_n, then by the finite case, already treated, π′(E) = Σ_{m,n} π′_mn(E) = Σ_{m,n} π″_mn(E) = π″(E). Thus (18.1) and (18.2) coincide in the σ-finite case as well. Moreover, π′(A × B) = Σ_{m,n} μ_m(A)ν_n(B) = μ(A)ν(B).

Theorem 18.2. If (X, 𝒳, μ) and (Y, 𝒴, ν) are σ-finite measure spaces, π(E) = π′(E) = π″(E) defines a σ-finite measure on 𝒳 × 𝒴; it is the only measure such that π(A × B) = μ(A)·ν(B) for measurable rectangles.

PROOF. Only σ-finiteness and uniqueness remain to be proved. The products A_m × B_n for {A_m} and {B_n} as above decompose X × Y into measurable rectangles of finite π-measure. This proves both σ-finiteness and uniqueness, since the measurable rectangles form a π-system generating 𝒳 × 𝒴 (Theorem 10.3). ∎

The π thus defined is called product measure; it is usually denoted μ × ν. Note that the integrands in (18.1) and (18.2) may be infinite for certain x and y, which is one reason for introducing functions with infinite values. Note also that (18.3) in some cases requires the conventions (15.2).
Fubini's Theorem

Integrals with respect to π are usually computed via the formulas

(18.4) ∫_{X×Y} f(x, y) π(d(x, y)) = ∫_X [ ∫_Y f(x, y) ν(dy) ] μ(dx)

and

(18.5) ∫_{X×Y} f(x, y) π(d(x, y)) = ∫_Y [ ∫_X f(x, y) μ(dx) ] ν(dy).
The left side here is a double integral, and the right sides are iterated integrals. The formulas hold very generally, as the following argument shows.

Consider (18.4). The inner integral on the right is

(18.6) ∫_Y f(x, y) ν(dy).

Because of Theorem 18.1(ii), for f measurable 𝒳 × 𝒴 the integrand here is measurable 𝒴; the question is whether the integral exists, whether (18.6) is measurable 𝒳 as a function of x, and whether it integrates to the left side of (18.4).

First consider nonnegative f. If f = I_E, everything follows from Theorem 18.2: (18.6) is ν[y: (x, y) ∈ E], and (18.4) reduces to π(E) = π′(E). Because of linearity (Theorem 15.1(iv)), if f is a nonnegative simple function, then (18.6) is a linear combination of functions measurable 𝒳 and hence is itself measurable 𝒳; further application of linearity to the two sides of (18.4) shows that (18.4) again holds. The general nonnegative f is the monotone limit of nonnegative simple functions; applying the monotone convergence theorem to (18.6) and then to each side of (18.4) shows that again f has the properties required. Thus for nonnegative f, (18.6) is a well-defined function of x (the value ∞ is not excluded), measurable 𝒳, whose integral satisfies (18.4). If one side of (18.4) is infinite, so is the other; if both are finite, they have the same finite value.

Now suppose that f, not necessarily nonnegative, is integrable with respect to π. Then the two sides of (18.4) are finite if f is replaced by |f|. Now make the further assumption that

(18.7) ∫_Y |f(x, y)| ν(dy) < ∞

for all x. Then

(18.8) ∫_Y f(x, y) ν(dy) = ∫_Y f⁺(x, y) ν(dy) − ∫_Y f⁻(x, y) ν(dy).
238
INTEGRATION
The set A 0 of x satisfying (18.7) need not coincide with X, but p.( X A0) = 0 if f is integrable with respect to 'IT because the function in (18.7) integrates to f 1/1 d'IT (Theorem 15.2(iii)). Now (18.8) holds on A0, (18.6) is measurable !I on A0, and (18.4) again follows if the inner integral on the right is given some arbitrary constant value on X - A0• The same analysis applies to (18.5): Theorem 18.3.
Theorem 18.3. Under the hypotheses of Theorem 18.2, for nonnegative f the functions

(18.9) \int_Y f(x,y)\,\nu(dy), \qquad \int_X f(x,y)\,\mu(dx)

are measurable \mathscr{X} and \mathscr{Y}, respectively, and (18.4) and (18.5) hold. If f (not necessarily nonnegative) is integrable with respect to \pi, then the two functions (18.9) are finite and measurable on A_0 and on B_0, respectively, where \mu(X - A_0) = \nu(Y - B_0) = 0, and again (18.4) and (18.5) hold.

It is understood here that the inner integrals on the right in (18.4) and (18.5) are set equal to 0 (say) outside A_0 and B_0.†

This is Fubini's theorem; the part concerning nonnegative f is sometimes called Tonelli's theorem. Application of the theorem usually follows a two-step procedure that parallels its proof. First, one of the iterated integrals is computed (or estimated above) with |f| in place of f. If the result is finite, then the double integral (the integral with respect to \pi) of |f| must be finite, so that f is integrable with respect to \pi; then the value of the double integral of f is found by computing one of the iterated integrals of f. If the iterated integral of |f| is infinite, f is not integrable \pi.

Example 18.2. Let I = \int_{-\infty}^{\infty} e^{-x^2}\,dx. By Fubini's theorem applied in the plane and by the polar-coordinate formula (see (17.15)),

I^2 = \iint_{R^2} e^{-(x^2+y^2)}\,dx\,dy = \iint_{r>0,\ 0\le\theta<2\pi} e^{-r^2}\,r\,dr\,d\theta.
The double integral on the right can be evaluated as an iterated integral by another application of Fubini's theorem, which leads to the famous formula

(18.10) \int_{-\infty}^{\infty} e^{-x^2}\,dx = \sqrt{\pi}.

As the integrand in this example is nonnegative, the question of integrability does not arise. •

†Since two functions that are equal almost everywhere have the same integral, the theory of integration could be extended to functions that are only defined almost everywhere; then A_0 and B_0 would disappear from Theorem 18.3.
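The value in (18.10) can be confirmed numerically. The following sketch (an illustration of ours, not part of the text) compares a midpoint Riemann sum for the integral with \sqrt{\pi}:

```python
import math

def gauss_integral(a=-10.0, b=10.0, n=200000):
    # composite midpoint rule for exp(-x^2); the tails beyond |x| = 10
    # contribute less than 1e-40 and are ignored
    h = (b - a) / n
    return h * sum(math.exp(-(a + (i + 0.5) * h) ** 2) for i in range(n))

approx = gauss_integral()
exact = math.sqrt(math.pi)
```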
Example 18.3. It is possible by means of Fubini's theorem to identify the limit in (17.4). First,

\int_0^t e^{-ux}\sin x\,dx = \frac{1}{1+u^2}\bigl[1 - e^{-ut}(u\sin t + \cos t)\bigr],

as follows by differentiation with respect to t. Since e^{-ux}|\sin x| \le e^{-ux} is integrable over the product space, Fubini's theorem applies to the integration of e^{-ux}\sin x over (0,t) \times (0,\infty):

\int_0^t \frac{\sin x}{x}\,dx = \int_0^t \sin x \Bigl[\int_0^{\infty} e^{-ux}\,du\Bigr]dx = \int_0^{\infty} \frac{du}{1+u^2} - \int_0^{\infty} \frac{e^{-ut}(u\sin t + \cos t)}{1+u^2}\,du.

The next-to-last integral is \pi/2 (see (17.10)), and a change of variable ut = s shows that the final integral goes to 0 as t \to \infty. Therefore,

(18.11) \lim_{t\to\infty} \int_0^t \frac{\sin x}{x}\,dx = \frac{\pi}{2}. •

Integration by Parts
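Identity (18.11) can be checked numerically for moderate t by comparing \int_0^t (\sin x)/x\,dx with \pi/2 minus the remainder integral of the example; this sketch (ours, with the arbitrary choice t = 50) does so:

```python
import math

def mid(f, a, b, n):
    # composite midpoint rule
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

t = 50.0
lhs = mid(lambda x: math.sin(x) / x, 0.0, t, 500000)
# remainder integral from the example; e^{-50u} makes u > 2 negligible
rem = mid(lambda u: math.exp(-u * t) * (u * math.sin(t) + math.cos(t)) / (1 + u * u),
          0.0, 2.0, 200000)
```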
Let F and G be two nondecreasing, right-continuous functions on an interval [a,b], and let \mu and \nu be the corresponding measures:

\mu(x,y] = F(y) - F(x), \qquad \nu(x,y] = G(y) - G(x), \qquad a \le x \le y \le b.

In accordance with the convention (17.16), write dF(x) and dG(x) in place of \mu(dx) and \nu(dx).
Theorem 18.4. If F and G have no common points of discontinuity in (a,b], then

(18.12) \int_{(a,b]} G(x)\,dF(x) = F(b)G(b) - F(a)G(a) - \int_{(a,b]} F(x)\,dG(x).

In brief: \int G\,dF = \Delta FG - \int F\,dG. This is one version of the partial integration formula.
PROOF. Note first that replacing F(x) by F(x) - C leaves (18.12) unchanged; it merely adds and subtracts C\nu(a,b] on the right. Hence (take C = F(a)) it is no restriction to assume that F(x) = \mu(a,x], and similarly it is no restriction to assume that G(x) = \nu(a,x]. If \pi = \mu \times \nu is product measure in the plane, then by Fubini's theorem,

(18.13) \pi[(x,y)\colon a < y \le x \le b] = \int_{(a,b]} \nu(a,x]\,\mu(dx) = \int_{(a,b]} G(x)\,dF(x)

and

(18.14) \pi[(x,y)\colon a < x \le y \le b] = \int_{(a,b]} \mu(a,y]\,\nu(dy) = \int_{(a,b]} F(y)\,dG(y).

The two sets on the left have as their union the square S = (a,b] \times (a,b]. The diagonal of S has \pi-measure

\pi[(x,y)\colon a < x = y \le b] = \int_{(a,b]} \nu\{x\}\,\mu(dx) = 0

because of the assumption that \mu and \nu share no points of positive measure. Thus the left sides of (18.13) and (18.14) add to \pi(S) = \mu(a,b]\nu(a,b] = F(b)G(b). •
Suppose that \nu has a density g with respect to Lebesgue measure, and let G(x) = c + \int_a^x g(t)\,dt. Transform the right side of (18.12) by the formula (16.13) for integration with respect to a density; the result is

(18.15) \int_{(a,b]} G(x)\,dF(x) = F(b)G(b) - F(a)G(a) - \int_a^b F(x)g(x)\,dx.

A consideration of positive and negative parts shows that this holds for any g integrable over (a,b]. Suppose further that \mu has a density f with respect to Lebesgue measure, and let F(x) = c' + \int_a^x f(t)\,dt. Then (18.15) further reduces to

(18.16) \int_a^b G(x)f(x)\,dx = F(b)G(b) - F(a)G(a) - \int_a^b F(x)g(x)\,dx.

Again, f can be any integrable function. This is the classical formula for integration by parts. Under the appropriate integrability conditions, (a,b] can be replaced by an unbounded interval.
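Formula (18.16) is easy to test numerically; in the sketch below (our own illustrative choice of functions, not the text's) F(x) = x^2, f(x) = 2x, G(x) = \sin x, g(x) = \cos x on [0,1]:

```python
import math

def mid(f, a, b, n=100000):
    # composite midpoint rule
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

a, b = 0.0, 1.0
F, f = (lambda x: x * x), (lambda x: 2 * x)  # F' = f
G, g = math.sin, math.cos                    # G' = g

lhs = mid(lambda x: G(x) * f(x), a, b)
rhs = F(b) * G(b) - F(a) * G(a) - mid(lambda x: F(x) * g(x), a, b)
```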
Products of Higher Order
Suppose that (X, \mathscr{X}, \mu), (Y, \mathscr{Y}, \nu), and (Z, \mathscr{Z}, \eta) are three \sigma-finite measure spaces. In the usual way, identify the products X \times Y \times Z and (X \times Y) \times Z. Let \mathscr{X} \times \mathscr{Y} \times \mathscr{Z} be the \sigma-field in X \times Y \times Z generated by the A \times B \times C with A, B, C in \mathscr{X}, \mathscr{Y}, \mathscr{Z}, respectively. For C in \mathscr{Z}, let \mathscr{G}_C be the class of E \in \mathscr{X} \times \mathscr{Y} for which E \times C \in \mathscr{X} \times \mathscr{Y} \times \mathscr{Z}. Then \mathscr{G}_C is a \sigma-field containing the measurable rectangles in X \times Y, and so \mathscr{G}_C = \mathscr{X} \times \mathscr{Y}. Therefore, (\mathscr{X} \times \mathscr{Y}) \times \mathscr{Z} \subset \mathscr{X} \times \mathscr{Y} \times \mathscr{Z}. But the reverse relation is obvious, and so (\mathscr{X} \times \mathscr{Y}) \times \mathscr{Z} = \mathscr{X} \times \mathscr{Y} \times \mathscr{Z}.

Define the product \mu \times \nu \times \eta on \mathscr{X} \times \mathscr{Y} \times \mathscr{Z} as (\mu \times \nu) \times \eta. It gives to A \times B \times C the value (\mu \times \nu)(A \times B) \cdot \eta(C) = \mu(A)\nu(B)\eta(C), and it is the only measure that does. The formulas (18.4) and (18.5) extend in the obvious way. Products of four or more components can clearly be treated in the same way. This leads in particular to another construction of Lebesgue measure \lambda_k in R^k = R^1 \times \cdots \times R^1 (see Example 18.1) as the product \lambda \times \cdots \times \lambda (k factors) on \mathscr{R}^k = \mathscr{R}^1 \times \cdots \times \mathscr{R}^1.

Example 18.4. Fubini's theorem of course gives a practical way to calculate volumes. Let V_k be the volume of the sphere of radius 1 in R^k; by Theorem 12.2, a sphere in R^k with radius r has volume r^k V_k. Let A be the unit sphere in R^k, let B = [(x_1,x_2)\colon x_1^2 + x_2^2 \le 1], and let C(x_1,x_2) = [(x_3,\dots,x_k)\colon \sum_{i=3}^k x_i^2 \le 1 - x_1^2 - x_2^2]. By Fubini's theorem,

V_k = \iint_B \lambda_{k-2}(C(x_1,x_2))\,dx_1\,dx_2 = V_{k-2} \iint_{0\le\theta<2\pi,\ 0<\rho\le 1} (1-\rho^2)^{(k-2)/2}\,\rho\,d\rho\,d\theta = \frac{2\pi}{k}\,V_{k-2}

for k \ge 3. Since V_1 = 2 and V_2 = \pi, it follows by induction that

V_{2i-1} = \frac{2(2\pi)^{i-1}}{1\cdot 3\cdot 5\cdots(2i-1)}, \qquad V_{2i} = \frac{(2\pi)^i}{2\cdot 4\cdots(2i)}, \qquad i = 1, 2, \dots. •
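The recursion V_k = 2\pi V_{k-2}/k and the closed forms for V_{2i-1} and V_{2i} can be cross-checked numerically (a sketch of ours, not part of the text):

```python
import math

V = {1: 2.0, 2: math.pi}
for k in range(3, 11):
    V[k] = 2 * math.pi * V[k - 2] / k  # the recursion derived in Example 18.4

def v_odd(i):
    # V_{2i-1} = 2 (2 pi)^(i-1) / (1 * 3 * ... * (2i-1))
    den = 1.0
    for j in range(1, 2 * i, 2):
        den *= j
    return 2 * (2 * math.pi) ** (i - 1) / den

def v_even(i):
    # V_{2i} = (2 pi)^i / (2 * 4 * ... * (2i))
    den = 1.0
    for j in range(2, 2 * i + 1, 2):
        den *= j
    return (2 * math.pi) ** i / den
```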
PROBLEMS

18.1. Show by Theorem 18.1 that if A \times B is nonempty and lies in \mathscr{X} \times \mathscr{Y}, then A \in \mathscr{X} and B \in \mathscr{Y}.

18.2. Suppose that X = Y is uncountable and \mathscr{X} = \mathscr{Y} consists of the countable and the cocountable sets. Show that the diagonal E = [(x,y)\colon x = y] does not lie in \mathscr{X} \times \mathscr{Y}, even though [y\colon (x,y) \in E] \in \mathscr{Y} and [x\colon (x,y) \in E] \in \mathscr{X} for all x and y.

18.3. 10.6 18.1 ↑ Let (X, \mathscr{X}, \mu) = (Y, \mathscr{Y}, \nu) be the completion of (R^1, \mathscr{R}^1, \lambda). Show that (X \times Y, \mathscr{X} \times \mathscr{Y}, \mu \times \nu) is not complete.

18.4. 2.8 ↑ The assumption of \sigma-finiteness in Theorem 18.2 is essential: Let \mu be Lebesgue measure on the line, let \nu be counting measure on the line, and take E = [(x,y)\colon x = y]. Then (18.1) and (18.2) do not agree.
18.5.
Suppose Y is the real line and \nu is Lebesgue measure. Suppose that, for each x in X, f(x,y) has with respect to y a derivative f'(x,y). Formally,

\int_a^y \Bigl[\int_X f'(x,t)\,\mu(dx)\Bigr]dt = \int_X [f(x,y) - f(x,a)]\,\mu(dx),

and hence \int_X f(x,y)\,\mu(dx) has derivative \int_X f'(x,y)\,\mu(dx). Use this idea to prove Theorem 16.8(ii).
18.6. Suppose that f is nonnegative on a \sigma-finite measure space (\Omega, \mathscr{F}, \mu). Show that

\int_\Omega f\,d\mu = (\mu \times \lambda)[(\omega,y)\colon 0 \le y < f(\omega)].

Prove that the set on the right is measurable. This gives the "area under the curve." Given the existence of \mu \times \lambda on \Omega \times R^1, one can use the right side of this equation as an alternative definition of the integral.

18.7.
Reconsider Problem 12.13.
18.8.
Suppose that \nu[y\colon (x,y) \in E] = \nu[y\colon (x,y) \in F] for all x, and show that (\mu \times \nu)(E) = (\mu \times \nu)(F). This is a general version of Cavalieri's principle.
18.9. (a) Suppose that \mu is \sigma-finite and prove the corollary to Theorem 16.7 by Fubini's theorem in the product of (\Omega, \mathscr{F}, \mu) and \{1, 2, \dots\} with counting measure.
(b) Relate the series in Problem 17.9 to Fubini's theorem.

18.10.
(a) Let \mu = \nu be counting measure on X = Y = \{1, 2, \dots\}. If

f(x,y) = \begin{cases} 2 - 2^{-x} & \text{if } x = y, \\ -2 + 2^{-x} & \text{if } x = y + 1, \\ 0 & \text{otherwise,} \end{cases}

then the iterated integrals exist but are unequal. Why does this not contradict Fubini's theorem?
(b) Show that xy/(x^2 + y^2)^2 is not integrable over the square [(x,y)\colon |x|, |y| \le 1] even though the iterated integrals exist and are equal.
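The two iterated sums in part (a) can be computed directly; this sketch (ours; the truncation level N is arbitrary) shows that they converge to 3/2 and -1/2, respectively:

```python
def f(x, y):
    if x == y:
        return 2 - 2.0 ** (-x)
    if x == y + 1:
        return -2 + 2.0 ** (-x)
    return 0.0

N = 60  # truncation level; neglected terms are of order 2**-N
# sum over y first, then x: row x = 1 gives 3/2, every other row gives 0
rows = sum(sum(f(x, y) for y in range(1, x + 2)) for x in range(1, N + 1))
# sum over x first, then y: column y sums to -2**-(y + 1)
cols = sum(sum(f(x, y) for x in range(1, y + 2)) for y in range(1, N + 1))
```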
18.11. For an integrable function f on (0,1], consider a Riemann sum in the form R = \sum_{j=1}^n f(x_j)(x_j - x_{j-1}), where 0 = x_0 < \cdots < x_n = 1. Extend f to (0,2] by setting f(x) = f(x-1) for 1 < x \le 2, and define

R(t) = \sum_{j=1}^n f(x_j + t)(x_j - x_{j-1}).

For 0 \le t \le 1, the points x_j + t reduced modulo 1 give a partition of (0,1], and R(t) is essentially the corresponding Riemann sum, differing from it by at most three terms. If \max_j (x_j - x_{j-1}) is small, then R(t) is for most values of t a good approximation to \int_0^1 f(x)\,dx, even though R = R(0) may be a poor one. Show in fact that

(18.17) \int_0^1 \Bigl| R(t) - \int_0^1 f(x)\,dx \Bigr|\,dt = \int_0^1 \Bigl| \sum_{j=1}^n \int_{x_{j-1}}^{x_j} [f(x_j + t) - f(x + t)]\,dx \Bigr|\,dt \le \sum_{j=1}^n \int_{x_{j-1}}^{x_j} \int_0^1 |f(x_j + t) - f(x + t)|\,dt\,dx.
Given \epsilon, choose a continuous g such that \int_0^1 |f(x) - g(x)|\,dx < \epsilon^2/3, and then choose \delta so that |x - y| \le \delta implies that |g(x) - g(y)| < \epsilon^2/3. Show that \max_j(x_j - x_{j-1}) \le \delta implies that the first integral in (18.17) is at most \epsilon^2 and hence that |R(t) - \int_0^1 f(x)\,dx| < \epsilon on a set of t (in (0,1]) of Lebesgue measure exceeding 1 - \epsilon.
18.12. Exhibit a case in which (18.12) fails because F and G have a common point of discontinuity.

18.13. Prove (18.16) for the case in which all the functions are continuous by differentiating with respect to the upper limit of integration.

18.14. Prove for distribution functions F that \int_{-\infty}^{\infty} (F(x+c) - F(x))\,dx = c.
18.15. Prove for continuous distribution functions F that \int_{-\infty}^{\infty} F(x)\,dF(x) = \tfrac{1}{2}.

18.16. Suppose that a number f_n is defined for each n \ge n_0, and put F(x) = \sum_{n_0 \le n \le x} f_n. Deduce from (18.15) that

(18.18) \sum_{n_0 \le n \le x} G(n) f_n = F(x)G(x) - \int_{n_0}^x F(t)g(t)\,dt

if G(y) - G(x) = \int_x^y g(t)\,dt, which will hold if G has continuous derivative g. First assume that the f_n are nonnegative.

18.17. ↑ Take n_0 = 1, f_n = 1, and G(x) = 1/x, and derive \sum_{n \le x} n^{-1} = \log x + \gamma + O(1/x), where \gamma = 1 - \int_1^{\infty} (t - [t])t^{-2}\,dt is Euler's constant.
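The asymptotics in Problem 18.17 are easy to observe numerically (a sketch of ours): \sum_{k\le n} 1/k - \log n decreases to \gamma \approx 0.5772 at rate O(1/n):

```python
import math

def gamma_approx(n):
    # partial harmonic sum minus log n
    return sum(1.0 / k for k in range(1, n + 1)) - math.log(n)

g_small, g_large = gamma_approx(1000), gamma_approx(100000)
```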
18.18. 5.17 18.16 ↑ Use (18.18) and (5.47) to prove that there exists a constant c such that

(18.19) \sum_{p \le x} \frac{1}{p} = \log\log x + c + O\Bigl(\frac{1}{\log x}\Bigr).

18.19. If A_k = \sum_{i=1}^k a_i and B_k = \sum_{i=1}^k b_i, then

\sum_{k=1}^n a_k B_k = A_n B_n - \sum_{k=2}^n A_{k-1} b_k.

This is Abel's partial summation formula. Derive it by arguing as in the proof of Theorem 18.4.
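Abel's formula is a finite identity and so can be verified exactly on any pair of sequences; a quick sketch (ours, with arbitrary sample data):

```python
a = [3, -1, 4, 1, -5, 9, 2, -6]
b = [2, 7, -1, 8, 2, -8, 1, 8]
n = len(a)
A = [sum(a[:k]) for k in range(1, n + 1)]  # A_k stored at A[k-1]
B = [sum(b[:k]) for k in range(1, n + 1)]  # B_k stored at B[k-1]

lhs = sum(a[k] * B[k] for k in range(n))                         # sum_{k=1}^n a_k B_k
rhs = A[-1] * B[-1] - sum(A[k - 1] * b[k] for k in range(1, n))  # A_n B_n - sum_{k=2}^n A_{k-1} b_k
```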
18.20. (a) Let I_n = \int_0^{\pi/2} (\sin x)^n\,dx. Show by partial integration that I_n = (n-1)(I_{n-2} - I_n) and hence that I_n = (n-1)n^{-1}I_{n-2}.
(b) Show by induction that

\frac{\pi}{4} = \sum_{k=0}^{n} \frac{(-1)^k}{2k+1} + (-1)^{n+1}\int_0^1 \frac{x^{2n+2}}{1+x^2}\,dx.

From (17.17) deduce that

\frac{\pi^2}{8} = 1 + \frac{1}{3^2} + \frac{1}{5^2} + \frac{1}{7^2} + \cdots

and hence that

\frac{\pi^2}{6} = 1 + \frac{1}{2^2} + \frac{1}{3^2} + \frac{1}{4^2} + \cdots.
(c) Show by induction that

I_{2n} = \frac{2n-1}{2n}\cdot\frac{2n-3}{2n-2}\cdots\frac{3}{4}\cdot\frac{1}{2}\cdot\frac{\pi}{2}, \qquad I_{2n+1} = \frac{2n}{2n+1}\cdot\frac{2n-2}{2n-1}\cdots\frac{4}{5}\cdot\frac{2}{3}.

From I_{2n-1} \ge I_{2n} \ge I_{2n+1} deduce that

\frac{\pi}{2}\cdot\frac{2n}{2n+1} \le \frac{2^2\cdot 4^2\cdots(2n)^2}{3^2\cdot 5^2\cdots(2n-1)^2(2n+1)} \le \frac{\pi}{2},

and from this derive Wallis's formula,

(18.20) \frac{\pi}{2} = \frac{2}{1}\cdot\frac{2}{3}\cdot\frac{4}{3}\cdot\frac{4}{5}\cdot\frac{6}{5}\cdot\frac{6}{7}\cdots.
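The partial products in (18.20) converge to \pi/2 from below at rate O(1/n); a numerical sketch (ours):

```python
import math

def wallis(n):
    # product of the first n paired factors (2k/(2k-1)) * (2k/(2k+1))
    p = 1.0
    for k in range(1, n + 1):
        p *= (2.0 * k) / (2 * k - 1) * (2.0 * k) / (2 * k + 1)
    return p

w = wallis(20000)
```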
18.21. Stirling's formula. The area under the curve y = \log x between the abscissas 1 and n is A_n = n\log n - n + 1. The area under the corresponding inscribed polygon is B_n = \log n! - \tfrac{1}{2}\log n.

[Figure: the curve y = \log x, together with the inscribed polygon with vertices at the abscissas 1, 2, 3, 4.]
(a) Let S_k be the sliver-shaped region bounded by curve and polygon between the abscissas k and k+1, and let T_k be this region translated so that its right-hand vertex is at (2, \log 2). Show that the T_k do not overlap and that T_n \cup T_{n+1} \cup \cdots is contained in a triangle of area \tfrac{1}{2}\log(1 + 1/n). Conclude that

(18.21) \log n! = \bigl(n + \tfrac{1}{2}\bigr)\log n - n + c + a_n,

where c is 1 minus the area of T_1 \cup T_2 \cup \cdots and

0 \le a_n \le \tfrac{1}{2}\log\Bigl(1 + \frac{1}{n}\Bigr) \le \frac{1}{n}.

(b) By Wallis's formula (18.20) show that

\log\sqrt{\frac{\pi}{2}} = \lim_n \Bigl[2n\log 2 + 2\log n! - \log(2n)! - \tfrac{1}{2}\log(2n+1)\Bigr].

Substitute (18.21) and show that c = \log\sqrt{2\pi}. This gives Stirling's formula

n! \sim \sqrt{2\pi n}\,\Bigl(\frac{n}{e}\Bigr)^n.
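Stirling's formula can be checked against the exact log-factorial (via math.lgamma); the gap \log n! - [(n+\tfrac12)\log n - n + \log\sqrt{2\pi}] is positive and of order 1/(12n). A sketch of ours, with the arbitrary choice n = 50:

```python
import math

n = 50
stirling_log = (n + 0.5) * math.log(n) - n + 0.5 * math.log(2 * math.pi)
diff = math.lgamma(n + 1) - stirling_log  # log n! minus the Stirling approximation
```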
18.22. Euler's gamma function is defined for positive t by \Gamma(t) = \int_0^{\infty} x^{t-1}e^{-x}\,dx.
(a) Prove that \Gamma^{(k)}(t) = \int_0^{\infty} x^{t-1}(\log x)^k e^{-x}\,dx.
(b) Show by partial integration that \Gamma(t+1) = t\Gamma(t) and hence that \Gamma(n+1) = n! for integral n.
(c) From (18.10) deduce \Gamma(\tfrac{1}{2}) = \sqrt{\pi}.
(d) Show that the unit sphere in R^k has volume (see Example 18.4)

(18.22) V_k = \frac{\pi^{k/2}}{\Gamma(\tfrac{k}{2}+1)}.

18.23. By partial integration prove that \int_0^{\infty} ((\sin x)/x)^2\,dx = \pi/2 and \int_{-\infty}^{\infty} (1 - \cos x)x^{-2}\,dx = \pi.
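Parts (b), (c), and (d) of Problem 18.22 can be spot-checked with the standard library's gamma function (a sketch of ours):

```python
import math

g_half = math.gamma(0.5)  # Gamma(1/2) = sqrt(pi), part (c)
fact5 = math.gamma(6)     # Gamma(n + 1) = n!, part (b), with n = 5

def unit_ball_volume(k):
    # (18.22): V_k = pi^(k/2) / Gamma(k/2 + 1)
    return math.pi ** (k / 2) / math.gamma(k / 2 + 1)
```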
18.24. 17.15 ↑ Identifying …
SECTION 19. HAUSDORFF MEASURE

Lemma 1. Suppose that G is a bounded open set in R^k and that \epsilon > 0. Then there exists in G a disjoint sequence S_1, S_2, \dots of closed spheres such that \lambda_k(G - \bigcup_n S_n) = 0 and 0 < \operatorname{diam} S_n < \epsilon.

PROOF. Let S_1 be any closed sphere in G satisfying 0 < \operatorname{diam} S_1 < \epsilon. Suppose that S_1, \dots, S_n have already been constructed. Let \mathscr{S}_n be the class of closed spheres in G that have diameter less than \epsilon and do not meet S_1 \cup \cdots \cup S_n. Let s_n be the supremum of the radii of the spheres in \mathscr{S}_n, and let S_{n+1} be an element of \mathscr{S}_n whose radius r_{n+1} exceeds s_n/2. Since the S_n are disjoint subsets of G, \sum_n V_k r_n^k = \sum_n \lambda_k(S_n) \le \lambda_k(G) < \infty, and hence
(19.6) r_n \to 0.

Put B = G - \bigcup_n S_n. It is to be shown that \lambda_k(B) = 0. Assume the opposite. Let S_n' be the sphere having the same center as S_n and having radius 4r_n. Then \sum_n \lambda_k(S_n') = 4^k \sum_n \lambda_k(S_n) \le 4^k \lambda_k(G). Choose N so that \sum_{n>N} \lambda_k(S_n') < \lambda_k(B). Then there is an x_0 such that

(19.7) x_0 \in B, \qquad x_0 \notin \bigcup_{n>N} S_n'.

As the S_n are closed, there is about x_0 a closed sphere S with radius r small enough that
(19.8) S \subset G, \qquad \operatorname{diam} S < \epsilon, \qquad S \cap (S_1 \cup \cdots \cup S_N) = \varnothing.

If S does not meet S_1 \cup \cdots \cup S_n, then r \le s_n < 2r_{n+1}; this cannot hold for all n because of (19.6). Thus S meets some S_n; let S_{n_0} be the first of these:

(19.9) S \cap (S_1 \cup \cdots \cup S_{n_0-1}) = \varnothing.

Because of (19.8), n_0 > N, and by (19.7), x_0 \notin S_{n_0}'. But since x_0 \notin S_{n_0}' and S \cap S_{n_0} \ne \varnothing, S has radius r > 3r_{n_0} > \tfrac{3}{2}s_{n_0-1}. By (19.9), r \le s_{n_0-1}. Thus s_{n_0-1} \ge r > \tfrac{3}{2}s_{n_0-1}, a contradiction. •

Lemma 2. If A \in \mathscr{R}^k, then \lambda_k(A) \le c_k(\operatorname{diam} A)^k. In other words, the volume of a set cannot exceed that of a sphere with the same diameter.
PROOF. Represent the general point of R^k as (t,y), where t \in R^1, y \in R^{k-1}. Let A_y = [t\colon (t,y) \in A]. Since A is bounded, each A_y is bounded and A_y = \varnothing for y outside some bounded set. By the results of Section 18, A_y \in \mathscr{R}^1 and \lambda(A_y) is measurable \mathscr{R}^{k-1}. Let I_y be the open (perhaps empty) interval (-\lambda(A_y)/2, +\lambda(A_y)/2), and set S_1A = \bigcup_y [(t,y)\colon t \in I_y]. If 0 \le f_n(y) \uparrow \lambda(A_y)/2 and f_n is simple, then [(t,y)\colon |t| < f_n(y)] is in \mathscr{R}^k and increases to S_1A; thus S_1A \in \mathscr{R}^k. By Fubini's theorem, \lambda_k(A) = \lambda_k(S_1A). The passage from A to S_1A is Steiner symmetrization with respect to the hyperplane H_1 = [x \in R^k\colon x_1 = 0].
Let J_y be the closed interval from \inf A_y to \sup A_y; it has length \lambda(J_y) \ge \lambda(A_y). Let m_y be the midpoint of J_y. If s \in I_y and s' \in I_{y'}, then |s - s'| \le \tfrac{1}{2}\lambda(A_y) + \tfrac{1}{2}\lambda(A_{y'}) \le |m_y - m_{y'}| + \tfrac{1}{2}\lambda(J_y) + \tfrac{1}{2}\lambda(J_{y'}), and this last sum is |t - t'| for appropriate endpoints t of J_y and t' of J_{y'}. Thus for any points (s,y) and (s',y') of S_1A, there are points (t,y) and (t',y') that are at least as far apart and are limits of points in A. Hence \operatorname{diam} S_1A \le \operatorname{diam} A.

For each i there is a Steiner symmetrization S_i with respect to the hyperplane H_i = [x\colon x_i = 0]; \lambda_k(S_iA) = \lambda_k(A) and \operatorname{diam} S_iA \le \operatorname{diam} A. It is obvious by the construction that S_iA is symmetric with respect to H_i: if x' is x with its ith coordinate reversed in sign, then x and x' both lie in S_iA or neither does. If A is symmetric with respect to H_i (i > 1) and y' is y with its (i-1)st coordinate reversed in sign, then, in the notation above, A_{y'} = A_y, and so S_1A is symmetric with respect to H_i. More generally, if A is symmetric with respect to H_i and j \ne i, then S_jA is also symmetric with respect to H_i.

Pass from A to SA = S_k \cdots S_1A. This is central symmetrization. Now \lambda_k(SA) = \lambda_k(A), \operatorname{diam} SA \le \operatorname{diam} A, and SA is symmetric with respect to
each H_i, so that x \in SA implies -x \in SA. If |x| > \tfrac{1}{2}\operatorname{diam} A, then |x - (-x)| > \operatorname{diam} A \ge \operatorname{diam} SA, so that SA cannot contain both x and -x. Thus SA is contained in the closed sphere with center 0 and diameter \operatorname{diam} A. •

PROOF OF THEOREM 19.1. It is only necessary to show that K = 1 in (19.4). By Lemma 1 the interior of the unit cube C in R^k contains disjoint spheres S_1, S_2, \dots such that \operatorname{diam} S_n < \epsilon and h_{k,\epsilon}(C - \bigcup_n S_n) \le h_k(C - \bigcup_n S_n) = K\lambda_k(C - \bigcup_n S_n) = 0. By (19.3), h_{k,\epsilon}(C) = h_{k,\epsilon}(\bigcup_n S_n) \le c_k\sum_n(\operatorname{diam} S_n)^k = c_k\sum_n c_k^{-1}\lambda_k(S_n) \le \lambda_k(C) = 1. Hence h_k(C) \le 1. If C \subset \bigcup_n B_n, then by Lemma 2, 1 \le \sum_n \lambda_k(B_n) \le \sum_n c_k(\operatorname{diam} B_n)^k, and hence h_k(C) \ge 1. •
Change of Variable
If x_1, \dots, x_m are points in R^k, let P(x_1, \dots, x_m) be the parallelepiped [\sum_{i=1}^m a_ix_i\colon 0 \le a_1, \dots, a_m \le 1]. Suppose that D is a k \times m matrix, and let x_1, \dots, x_m be its columns, viewed as points of R^k; define |D| = h_m(P(x_1, \dots, x_m)). If C is the unit cube in R^m and D is viewed as a linear map from R^m to R^k, then P(x_1, \dots, x_m) = DC, so that

(19.10) |D| = h_m(DC).

In the case m = k it follows by Theorem 12.2 that

(19.11) |D| = |\det D|, \qquad m = k.
Theorem 12.2 has an extension:

Theorem 19.2. Suppose that D is univalent.† If A \in \mathscr{R}^m, then DA \in \mathscr{R}^k and

(19.12) h_m(DA) = |D|\,\lambda_m(A).

PROOF. Since D is univalent, D\bigcup_n A_n = \bigcup_n DA_n and D(R^m - A) = DR^m - DA; since DR^m is closed, it is in \mathscr{R}^k. Thus [A\colon DA \in \mathscr{R}^k] is a \sigma-field. Since D is continuous, DA is compact if A is. By Theorem 13.1(i), D is measurable \mathscr{R}^m/\mathscr{R}^k. Each side of (19.12) is a measure as A varies over \mathscr{R}^m. They agree by definition if A is the unit cube in R^m, and so by (19.5) they agree if A is any cube. By the usual argument they agree for all A in \mathscr{R}^m. •

†That is, one-to-one: Du = 0 implies that u = 0. In this case m \le k.
If D is not univalent, A \in \mathscr{R}^m need not imply that DA \in \mathscr{R}^k, but (19.12) still holds because each side vanishes.

The Jacobian formula (17.12) for change of variable also generalizes. Suppose that V is an open set in R^m and T is a continuous, one-to-one mapping of V onto the set TV in R^k, where m \le k. Let t_i(x) be the ith coordinate of Tx, and suppose that t_{ij}(x) = \partial t_i(x)/\partial x_j is continuous on V. Put

(19.13) D_x = [t_{ij}(x)],

the k \times m matrix of these partial derivatives.
|D_x| as defined by (19.10) plays the role of the modulus of the Jacobian (17.11), and by (19.11) it coincides with it in case m = k.
Theorem 19.3. Suppose that T is a continuous, one-to-one map of an open set V in R^m into R^k. If T has continuous partial derivatives and A \subset V, then

(19.14) h_m(TA) = \int_A |D_x|\,h_m(dx)

and

(19.15) \int_{TA} f\,dh_m = \int_A f(Tx)\,|D_x|\,h_m(dx).

By Theorem 19.1, the h_m(dx) in the integrals over A in (19.14) and (19.15) can be replaced by \lambda_m(dx).
This generalizes Theorem 17.2. Several explanations are required. Since V is open, it is a countable union of compact sets K_n. If A is closed, T(A \cap K_n) is compact because T is continuous, and so T(A \cap V) = \bigcup_n T(A \cap K_n) \in \mathscr{R}^k. In particular, TV \in \mathscr{R}^k, and it follows by the usual argument that TA \in \mathscr{R}^k for Borel subsets A of V. During the course of the proof of the theorem it will be seen that |D_x| is continuous in x. Thus the formulas (19.14) and (19.15) make sense if A and f are measurable. As usual, (19.15) holds if f is nonnegative, and also if the two sides are finite when f is replaced by |f|. And (19.15) follows from (19.14) by the standard argument starting with indicators. Hence only (19.14) need be proved.

To see why (19.14) ought to hold, imagine that A is a rectangle split into many small rectangles A_n; choose x_n in A_n and approximate the integral in (19.14) by \sum_n |D_{x_n}|\,h_m(A_n). For y in A_n, Ty has the linear approximation Tx_n + D_{x_n}(y - x_n), and so TA_n is approximated by Tx_n + D_{x_n}(A_n - x_n), an m-dimensional parallelepiped for which h_m is |D_{x_n}|\,h_m(A_n - x_n) = |D_{x_n}|\,h_m(A_n) by (19.12). Hence \sum_n |D_{x_n}|\,h_m(A_n) \approx \sum_n h_m(TA_n) = h_m(TA).

The first step in a proof based on these ideas is to split the region of integration into small sets (not necessarily rectangles) on each of which the D_x are all approximately equal to a linear transformation that in turn gives a good local approximation to T. Let V_0 be the set of x in V for which D_x is univalent.
Lemma 3. If \theta > 1, then there exists a countable decomposition of V_0 into Borel sets B_j such that, for some linear maps M_j\colon R^m \to R^k,

(19.16) \theta^{-1}|M_jv| \le |D_xv| \le \theta|M_jv|, \qquad x \in B_j,\ v \in R^m,

and

(19.17) \theta^{-1}|M_j(x-y)| \le |Tx - Ty| \le \theta|M_j(x-y)|, \qquad x, y \in B_j.

PROOF. Choose \epsilon and \theta_0 in such a way that \theta^{-1} + \epsilon < \theta_0^{-1} < 1 < \theta_0 < \theta - \epsilon. Let \mathscr{M} be the set of k \times m matrices with rational coefficients. For M in \mathscr{M} and p a positive integer let E(M,p) be the set of x in V such that

(19.18) \theta_0^{-1}|Mv| \le |D_xv| \le \theta_0|Mv|, \qquad v \in R^m,

and

(19.19) |Ty - Tx - D_x(y-x)| \le \epsilon|M(y-x)| \qquad \text{if } |y-x| < p^{-1},
y \in V. Imposing these conditions for countable, dense sets of v's and y's shows that E(M,p) is a Borel set.

The essential step is to show that the E(M,p) cover V_0. If D_x is univalent, then \delta_0 = \inf[|D_xv|\colon |v| = 1] is positive. Take \delta so that \delta < (\theta_0 - 1)\delta_0 and \delta \le (1 - \theta_0^{-1})\delta_0, and then choose in \mathscr{M} an M such that |Mv - D_xv| \le \delta|v| for all v \in R^m. Then |Mv| \le |D_xv| + |Mv - D_xv| \le |D_xv| + \delta|v| \le \theta_0|D_xv|, and (19.18) follows from this and a similar inequality going the other way. As for (19.19), it follows from (19.18) that M is univalent, and so there is a positive \eta such that \eta|v| \le |Mv| for all v. Since T is differentiable at x, there exists a p such that |y - x| < p^{-1} implies that |Ty - Tx - D_x(y-x)| \le \epsilon\eta|y-x| \le \epsilon|M(y-x)|. This and (19.18) imply that |Ty - Tx| \le |D_x(x-y)| + |Tx - Ty - D_x(x-y)| \le \theta_0|M(x-y)| + \epsilon|M(x-y)| \le \theta|M(x-y)|. This inequality and a similar one in the other direction give (19.19). Thus the E(M,p) cover V_0.

Now E(M,p) is a countable union of sets B such that \operatorname{diam} B < p^{-1}. These sets B (for all M in \mathscr{M} and all p), made disjoint in the usual way, together with the corresponding maps M, satisfy the requirements of the lemma. •
Suppose that A c R k , cp: A --+ R ;, andm 1/J: A --+ Ri. If lcpx fi>Y I < 8 1 1/J x - 1/IY I for X, y E A, then h m (cpA) < (J h m (l/IA). PROOF. Given £ and 8, cover 1/JA by sets Bn such that 8 diam Bn < £ and cm En1 (diam Bn ) m < h m ( 1/IA) + B. The sets cpl/J - 1Bn cover cpA and m diamcpl/J - Bn < IJ diam Bn < £, whence follows h m ,((cpA) < (J (h m (l/IA) + Lemma 4.
3).
•
19.3:
Suppose first that A is contained in the set V0 where Dx is univalent. Given 8 > 1, choose sets B1 as in Lemma 3 and put A1 = A n B1 • If C is the unit cube in R m , it follows by (19.16) and Lemma 4 that 9 - mh m (M1C) < h m (DxC) < 8 mh m (M1C) for X E A1; by (19.10), 9 - m i M1I S I Dx l < 8 mmi M1 1 for x E A1 t By (19.17) and Lemma 4, I h m (A1) 9 - mh m ( M1A1) < h m (TA I ) < IJ hmm (MIA I ). Now h m ( MIA I ) = I M1 m by Theorem 19.2, and so 8 - 2 h m (TA1) < fA,I Dx l h m ( dx ) < 8 2 h m ( TA1) . Adding over A 1 and letting (J --+ 1 gives (19.14). SECOND CASE. Suppose that A is contained in the set V - V0 where Dx is not univalent. It is to be shown that each side of (19.14) vanishes. Since the entries of (19.13) are continuous in x and V is a countable union of compact sets, it is no restriction to assume that there exists a constant K such that I Dxv l < K l v l for x e A, v e R m . It is also no restriction to assume that h m (A) = A m (A) is finite. Suppose that 0 < £ < K. Define T' : R m --+ R k X R m by T'x = ( Tx, £x). For the corresponding matrix D; of derivatives, D;v = ( Dxv, £ V ), and so D; is univalent. Fixing x in A for the moment, choose in R m an orthonormal set v 1 , • • • , vm such that Dxv 1 = 0, which is possible because x e V - V0 • Then I D;v 1 1 = 1 (0, £V 1 ) 1 = £ and ID.:V;I < I Dxv;l + £ I V; I < 2K. Since h m ( P( v l , . . . ' vm )) = 1, it follows by Theorem 19.2 that I D; I = h m ( D;P( v 1 , • • • , vm )). But the parallelepiped D;P( v 1 , • • • , vm ) of dimension PROOF OF THEOREM
FIRST CASE.
.
tsince the entries of Dx are continuous in x, if y is close enough to x, then fJ - 1 1 Dxv I < I Dyv I S fJ I Dx v I for v e R m , so that by the same argument, 8 - m I Dx I < I Dy I � (J m I Dx I ; hence I Dx I is continuous in x .
256
INTEGRATION
m in R k + m has one edge of length £ and m - m1 edges of length at most 2K and hence is isometric with a subset of [x E R : l x 1 1 < £, l x; l < 2mK, i = (19.14) for the univalent case 2, . . . , m]; therefore, I D� I < £(2mK) m - •. Now already treated gives h m (TA) S fA £(2mK ) m - lh m ( dx ). If w: R k X R m R k is defined by w(z, v) = z, then TA = wT'A, and Lemma 4 gives h m (TA) < h m (TA) = 0. h m (T'A) < £(2mK ) m - 1h m (A). Hence m If Dx is not univalent, Dx R is contained in an (m - I)-dimensional subspace of R k and hence is isometric to a set in R m of Lebesgue measure 0. Thus I Dx l = 0 for x e V - V0 , and both sides of (19.14) do vanish. • -+
Calculations
The h m on the left in (19.14) and (19.15) can be replaced by A m . To evaluate the integrals, however, still requires calculating I D I as defined by (19.10). Here only the case m = k - 1 will be considered. t If D is a k X ( k - 1) matrix, let z be the point in R k whose ith component is ( - 1) ; + 1 times the determinant of D with its i th row removed. Then IDI = 1 z 1 . (19.20 ) To see this, note first that for each x in R k the inner product x · z is the determinant of D augmented by adding x as its first column (expand by cofactors). Hence z is orthogonal to the columns x 1 , , x k _ 1 of D. If z =I= 0, let z 0 = z/ l z l ; otherwise, let z 0 be any unit vector orthogonal to x 1 , , x k _ 1 • In either case a rotation and an application of Fubini 's theorem shows that I D I = h k _ 1 (P(x 1 , , x k _ 1 )) = A k (P(z0, x 1 , , x k _ 1 )); this last quantity is by Theorem 12.2 the absolute value of the determinant of the matrix with columns z 0 , x 1 , , x k _ 1 • But, as above, this is z 0 z = 1 z I · •
•
•
•
•
•
•
•
•
•
•
•
•
.
•
•
Example 19.3. Let V be the unit sphere in R^{k-1}. If t_i(x) = x_i for i < k and t_k(x) = [1 - \sum_{i=1}^{k-1}x_i^2]^{1/2}, then TV is the top half of the surface of the unit sphere in R^k. The determinants of the (k-1) \times (k-1) submatrices of D_x are easily calculated, and |D_x| comes out to t_k^{-1}(x). If A_k is the surface area of the unit sphere in R^k, then by (19.14), A_k = 2h_{k-1}(TV) = 2\int_V t_k^{-1}(x)\,dx. Integrating out one of the variables shows that A_k = 2\pi V_{k-2}, where V_{k-2} is the volume of the unit sphere in R^{k-2}. The calculation in Example 18.4 now gives A_2 = 2\pi and

A_{2i-1} = \frac{2(2\pi)^{i-1}}{1\cdot 3\cdot 5\cdots(2i-3)}, \qquad A_{2i} = \frac{(2\pi)^i}{2\cdot 4\cdot 6\cdots(2i-2)}

†For the general case, see SPIVAK.
for i = 2, 3, \dots. Note that A_k = kV_k for k = 1, 2, \dots. By (19.5) the sphere in R^k with radius r has surface area r^{k-1}A_k. •

PROBLEMS

19.1. Describe S_2S_1A for the general triangle A in the plane.

19.2. Show that in (19.1) finite coverings by open sets suffice if A is compact: for positive \epsilon and \delta there exist finitely many open sets B_n such that A \subset \bigcup_n B_n, \operatorname{diam} B_n < \epsilon, and c_m\sum_n(\operatorname{diam} B_n)^m …
For the exponential distribution with parameter \alpha > 0, the density is

(20.10) f(x) = \begin{cases} 0 & \text{if } x \le 0, \\ \alpha e^{-\alpha x} & \text{if } x > 0. \end{cases}

The corresponding distribution function

(20.11) F(x) = \begin{cases} 0 & \text{if } x \le 0, \\ 1 - e^{-\alpha x} & \text{if } x > 0 \end{cases}
was studied in Section 14. For the normal distribution with parameters m and \sigma, \sigma > 0,

(20.12) f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\Bigl[-\frac{(x-m)^2}{2\sigma^2}\Bigr], \qquad -\infty < x < \infty;

a change of variable together with (18.10) shows that f does integrate to 1. For the standard normal distribution, m = 0 and \sigma = 1. For the uniform distribution over an interval (a,b],

(20.13) f(x) = \begin{cases} \dfrac{1}{b-a} & \text{if } a < x \le b, \\ 0 & \text{otherwise.} \end{cases}
The distribution function F is useful if it has a simple expression, as in (20.11). It is ordinarily simpler to describe \mu by means of the density f(x) or the discrete probabilities \mu\{x_r\}. If F comes from a density, it is continuous. In the discrete case, F increases in jumps; the example (20.8), in which the points of discontinuity are dense, shows that it may nonetheless be very irregular. There exist distributions that are not discrete but are not continuous either. An example is \mu(A) = \tfrac{1}{2}\mu_1(A) + \tfrac{1}{2}\mu_2(A) for \mu_1 discrete and \mu_2 coming from a density; such mixed cases arise, but they are few. Section 31 has examples of a more interesting kind, namely functions F that are continuous but do not come from any density. These are the functions singular in the sense of Lebesgue; the Q(x) describing bold play in gambling (see (7.33)) turns out to be one of them.†

†The general question of the relation between differentiation and integration is taken up in Section 31.
If X has distribution \mu and g is a real function of a real variable, then the distribution of g(X) is

(20.14) \mu g^{-1}

in the notation (13.7). In the case where there is a density, f and F are related by

(20.15) F(x) = \int_{-\infty}^x f(t)\,dt.
Hence f at its continuity points must be the derivative of F. As noted above, if F has a continuous derivative, this derivative can serve as the density f.

Suppose that f is continuous and g is increasing, and let T = g^{-1}. The distribution function of g(X) is

P[g(X) \le x] = P[X \le T(x)] = F(T(x)).

If T is differentiable, this differentiates to

(20.16) \frac{d}{dx}P[g(X) \le x] = f(T(x))\,T'(x),
/ ( T( x )) T'( x ) ,
which is thus the density for (as follows also by (17.8)). has normal density (20.12) and > 0, (20.16) shows that b If b and has the normal density with parameters Calculating (20.16) is generally more complicated than calculating the distribution function of X) from first principles and then differentiating, which works even if is many-to-one:
X
g(
Example
20.1.
If
a am +
ao .
X has the standard normal distribution, then
aX + g
P[X^2 \le x] = P[-\sqrt{x} \le X \le \sqrt{x}]

for x > 0. Hence X^2 has density

f(x) = \begin{cases} 0 & \text{if } x \le 0, \\ \dfrac{1}{\sqrt{2\pi}}\,x^{-1/2}e^{-x/2} & \text{if } x > 0. \end{cases} •
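The density of Example 20.1 can be verified numerically by computing P[X^2 \le 1] two ways (a sketch of ours; the substitution t = s^2 removes the singularity of the density at 0):

```python
import math

# P[X^2 <= 1] = P[-1 <= X <= 1] = erf(1/sqrt(2)) for standard normal X
p_normal = math.erf(1 / math.sqrt(2))

# integrate f(t) = t^(-1/2) e^(-t/2) / sqrt(2 pi) over (0, 1];
# substituting t = s^2 turns the integrand into 2 e^(-s^2/2) / sqrt(2 pi) on (0, 1]
n = 200000
h = 1.0 / n
p_density = h * sum(2 * math.exp(-((i + 0.5) * h) ** 2 / 2) / math.sqrt(2 * math.pi)
                    for i in range(n))
```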
Multidimensional Distributions
For a k-dimensional random vector X = (X_1, \dots, X_k), the distribution \mu (a probability measure on \mathscr{R}^k) and the distribution function F (a real function on R^k) are defined by

(20.17) \mu(A) = P[(X_1, \dots, X_k) \in A], \quad A \in \mathscr{R}^k; \qquad F(x_1, \dots, x_k) = P[X_1 \le x_1, \dots, X_k \le x_k] = \mu(S_x),
where S_x = [y\colon y_i \le x_i,\ i = 1, \dots, k] consists of the points "southwest" of x. Often \mu and F are called the joint distribution and joint distribution function of X_1, \dots, X_k.

Now F is nondecreasing in each variable, and \Delta_A F \ge 0 for bounded rectangles A (see (12.12)). As h decreases to 0, the set S_{x,h} = [y\colon y_i \le x_i + h,\ i = 1, \dots, k] decreases to S_x, and therefore (Theorem 2.1(ii)) F is continuous from above in the sense that \lim_{h\downarrow 0}F(x_1 + h, \dots, x_k + h) = F(x_1, \dots, x_k). Further, F(x_1, \dots, x_k) \to 0 if x_i \to -\infty for some i (the other coordinates held fixed), and F(x_1, \dots, x_k) \to 1 if x_i \to \infty for each i. For any F with these properties there is by Theorem 12.5 a unique probability measure \mu on \mathscr{R}^k such that \mu(A) = \Delta_A F for bounded rectangles A, and \mu(S_x) = F(x) for all x.
sxo = [ y:
Y; < X ;,
(20.18) Since F is nondecreasing in each variable, it is continuous at x if and only if it is continuous from below there in the sense that this last limit coincides with F(x). Thus F is continuous at x if and only if F(x) = p.(Sx ) = p. ( Sx0 ), which holds if and only if the boundary asx = sx - sxo (the y-set where Y; s; X; for all i and Y; = X; for some i) satisfies p.( asx ) = 0. If k > 1 , F can have discontinuity points even if p. has no point masses: if p. corre-
RANDOM VARIABLES AND EXPECTED VALUES
sponds to a uniform distribution of mass over the segment B = [(x, 0): 0 < x < 1] in the plane (p.( B) = A[x : 0 < x < 1, (x, O) e B]) then F is discontinuous at each point of B. This also shows that F can be discontinu ous at uncountably many points. On the other hand, for fixed x the boundaries asx , h are disjoint for different values of h, and so (Theorem 1 0.2(iv)) only countably many of them can have positive p.-measure. Thus x is the limit of points (x 1 + h, . . . , x k + h) at which F is continuous: the continuity points of F are dense. There is always a random vector having a given distribution and distribu tion function : Take (O, S&", P) = ( R k , f1t k , p.) and X(w) = w. This is the obvious extension of the construction in the first proof of Theorem 14.1. The distribution may as for the line be discrete in the sense of having countable support. It may have density I with respect to k-dimensional Lebesgue measure: p.(A) = fA I(x) dx. As in the case k = 1, the distribu tion p. is more fundamental than the distribution function F, and usually p. is described not by F but by a density or by discrete probabilities. If X is a k-dimensional random vector and g: R k .-. R ,. is measurable, then g( X) is an i-dimensional random vector; if the distribution of X is p., the distribution of g( X) is p.g - 1 , just as in the case k = 1 see (20.14). If gj : R k .-. R 1 is defined1 by gj (x 1 , , x k ) = xj, then gj ( X) is Xj , and its distribution P.j = p.gj is given by p.j (A) = p.[(x 1 , , x k ): xj e A] = P[ � e A] for A e 91 1 • The P.j are the marginal distributions of p.. If p. has a density I in R k , then P.j has over the line the density ,
(20.19)  f_j(x) = ∫_{R^{k−1}} f(x_1, ..., x_{j−1}, x, x_{j+1}, ..., x_k) dx_1 ⋯ dx_{j−1} dx_{j+1} ⋯ dx_k,
since by Fubini's theorem the right side integrated over A comes to μ[(x_1, ..., x_k): x_j ∈ A].

Now suppose that g: U → V is one-to-one, where U and V are open subsets of R^k. Suppose that T = g^{−1} is continuously differentiable, and let J(x) be its Jacobian. If X has a density f that vanishes outside U, then P[g(X) ∈ A] = P[X ∈ TA] = ∫_{TA} f(y) dy, and so by (17.13),

(20.20)  P[g(X) ∈ A] = ∫_A f(Tx) |J(x)| dx

for A ⊂ V. Thus the density for g(X) is f(Tx)|J(x)| on V and 0 elsewhere.†

†Although (20.20) is indispensable in many investigations, it is used in this book only for linear T and in Example 20.2; and here different approaches are possible; see (17.14) and Problem 18.24.
SECTION 20. RANDOM VARIABLES AND DISTRIBUTIONS

Example 20.2. Suppose that (X_1, X_2) has density

f(x_1, x_2) = (2π)^{−1} e^{−(x_1² + x_2²)/2},

and let g be the transformation to polar coordinates. Then U = R², and V and T are as in Example 17.6. If R and Θ are the polar coordinates of (X_1, X_2), then (R, Θ) = g(X_1, X_2) has density (2π)^{−1} r e^{−r²/2} in V. By (20.19), R has density re^{−r²/2} on (0, ∞) and Θ is uniformly distributed over (0, 2π). ∎
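As a quick numerical illustration (a sketch, not part of the text), the conclusion of Example 20.2 can be checked by simulation: sampling (X_1, X_2) from the standard bivariate normal density, the polar angle should be uniform over [0, 2π), and R² = X_1² + X_2² should have mean 2, since R has density re^{−r²/2}. Only Python's standard library is used.

```python
import math
import random

random.seed(20)
n = 200_000
theta_below_pi = 0
r2_sum = 0.0
for _ in range(n):
    x1, x2 = random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)
    r2_sum += x1 * x1 + x2 * x2
    # atan2 returns the angle in (-pi, pi]; reduce mod 2*pi to get [0, 2*pi)
    theta = math.atan2(x2, x1) % (2 * math.pi)
    if theta < math.pi:
        theta_below_pi += 1

frac_half = theta_below_pi / n  # uniform Theta gives 1/2
mean_r2 = r2_sum / n            # R^2 is exponential with mean 2
```

With 200,000 samples both estimates should land within a fraction of a percent of their targets.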
Independence
Random variables X_1, ..., X_k are defined to be independent if the σ-fields σ(X_1), ..., σ(X_k) they generate are independent in the sense of Section 4. This concept for simple random variables was studied extensively in Chapter 1; the general case was touched on in Section 14.

Since σ(X_i) consists of the sets [X_i ∈ H] for H ∈ 𝓡^1, X_1, ..., X_k are independent if and only if

(20.21)  P[X_1 ∈ H_1, ..., X_k ∈ H_k] = P[X_1 ∈ H_1] ⋯ P[X_k ∈ H_k]

for all linear Borel sets H_1, ..., H_k. The definition (4.11) of independence requires that (20.21) hold also if some of the events [X_i ∈ H_i] are suppressed on each side, but this only means taking H_i = R^1. Suppose that

(20.22)  P[X_1 ≤ x_1, ..., X_k ≤ x_k] = P[X_1 ≤ x_1] ⋯ P[X_k ≤ x_k]

for all real x_1, ..., x_k; it then also holds if some of the events [X_i ≤ x_i] are suppressed on each side (let x_i → ∞). Since the intervals (−∞, x] form a π-system generating 𝓡^1, the sets [X_i ≤ x] form a π-system generating σ(X_i). Therefore, by Theorem 4.2, (20.22) implies that X_1, ..., X_k are independent. If, for example, the X_i are integer-valued, it is enough that P[X_1 = n_1, ..., X_k = n_k] = P[X_1 = n_1] ⋯ P[X_k = n_k] for integral n_1, ..., n_k (see (5.6)).

Let (X_1, ..., X_k) have distribution μ and distribution function F, and let the X_i have distributions μ_i and distribution functions F_i (the marginals). By (20.21), X_1, ..., X_k are independent if and only if μ is product measure in the sense of Section 18:

(20.23)  μ = μ_1 × ⋯ × μ_k.
By (20.22), X_1, ..., X_k are independent if and only if

(20.24)  F(x_1, ..., x_k) = F_1(x_1) ⋯ F_k(x_k).

Suppose that each μ_i has density f_i; by Fubini's theorem, f_1(y_1) ⋯ f_k(y_k) integrated over (−∞, x_1] × ⋯ × (−∞, x_k] is just F_1(x_1) ⋯ F_k(x_k), so that μ has density

(20.25)  f(x_1, ..., x_k) = f_1(x_1) ⋯ f_k(x_k)

in the case of independence.

If 𝓖_1, ..., 𝓖_k are independent σ-fields and X_i is measurable 𝓖_i, i = 1, ..., k, then certainly X_1, ..., X_k are independent.

If X_i is a d_i-dimensional random vector, i = 1, ..., k, then X_1, ..., X_k are by definition independent if the σ-fields σ(X_1), ..., σ(X_k) are independent. The theory is just as for random variables: X_1, ..., X_k are independent if and only if (20.21) holds for H_1 ∈ 𝓡^{d_1}, ..., H_k ∈ 𝓡^{d_k}. Now (X_1, ..., X_k) can be regarded as a random vector of dimension d = Σ_{i=1}^k d_i; if μ is its distribution in R^d = R^{d_1} × ⋯ × R^{d_k} and μ_i is the distribution of X_i in R^{d_i}, then, just as before, X_1, ..., X_k are independent if and only if μ = μ_1 × ⋯ × μ_k. In none of this need the d_i components of a single X_i be themselves independent random variables.

An infinite collection of random variables or random vectors is by definition independent if each finite subcollection is. The argument following (5.7) extends from collections of simple random variables to collections of random vectors:
Theorem 20.2. Suppose that

(20.26)  X_11, X_12, ...
         X_21, X_22, ...
         . . .

is an independent collection of random vectors. If 𝓕_i is the σ-field generated by the ith row, then 𝓕_1, 𝓕_2, ... are independent.

PROOF. Let 𝓐_i consist of the finite intersections of sets of the form [X_ij ∈ H] with H a Borel set in a space of the appropriate dimension, and apply Theorem 4.2. The σ-fields 𝓕_i = σ(𝓐_i), i = 1, ..., n, are independent for each n, and the result follows. ∎
Each row of (20.26) may be finite or infinite, and there may be finitely or infinitely many rows. As a matter of fact, rows may be uncountable and there may be uncountably many of them.
Suppose that X and Y are independent random vectors with distributions μ and ν in R^j and R^k. Then (X, Y) has distribution μ × ν in R^j × R^k = R^{j+k}. Let x range over R^j and y over R^k. By Fubini's theorem,

(20.27)  (μ × ν)(B) = ∫_{R^j} ν[y: (x, y) ∈ B] μ(dx),  B ∈ 𝓡^{j+k}.

Replace B by (A × R^k) ∩ B, where A ∈ 𝓡^j and B ∈ 𝓡^{j+k}. Then (20.27) reduces to

(20.28)  (μ × ν)((A × R^k) ∩ B) = ∫_A ν[y: (x, y) ∈ B] μ(dx).

If B_x = [y: (x, y) ∈ B] is the x-section of B, so that B_x ∈ 𝓡^k (Theorem 18.1), then P[(x, Y) ∈ B] = P[ω: (x, Y(ω)) ∈ B] = P[ω: Y(ω) ∈ B_x] = ν(B_x). Expressing the formulas in terms of the random vectors themselves
gives this result:

Theorem 20.3. If X and Y are independent random vectors with distributions μ and ν in R^j and R^k, then

(20.29)  P[(X, Y) ∈ B] = ∫_{R^j} P[(x, Y) ∈ B] μ(dx),  B ∈ 𝓡^{j+k},

and

(20.30)  P[X ∈ A, (X, Y) ∈ B] = ∫_A P[(x, Y) ∈ B] μ(dx),  A ∈ 𝓡^j, B ∈ 𝓡^{j+k}.
Example 20.3. Suppose that X and Y are independent, exponentially distributed random variables with density αe^{−αx}. By (20.29), P[Y/X > z] = ∫_0^∞ P[Y/x > z] αe^{−αx} dx = ∫_0^∞ e^{−αxz} αe^{−αx} dx = (1 + z)^{−1}. Thus Y/X has density (1 + z)^{−2} for z > 0. Since P[X > z_1, Y/X > z_2] = ∫_{z_1}^∞ P[Y/x > z_2] αe^{−αx} dx by (20.30), the joint distribution of X and Y/X can be calculated as well. ∎
Formulas (20.29) and (20.30) are constantly applied as in this example. There is no virtue in making an issue of each case, however, and the appeal to Theorem 20.3 is usually silent.
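As an illustration of how such computations can be corroborated (a sketch, not from the text): the distribution function of Y/X in Example 20.3 is z/(1 + z), so P[Y/X ≤ 1] = 1/2 and P[Y/X ≤ 3] = 3/4. The simulation below checks these values; the rate α = 1 is an arbitrary choice, since the law of Y/X does not depend on α.

```python
import random

random.seed(7)
n = 200_000
alpha = 1.0  # arbitrary rate; the law of Y/X does not depend on it
below_1 = below_3 = 0
for _ in range(n):
    x = random.expovariate(alpha)
    y = random.expovariate(alpha)
    r = y / x
    if r <= 1.0:
        below_1 += 1
    if r <= 3.0:
        below_3 += 1

p1 = below_1 / n  # target: 1/(1 + 1) = 0.50
p3 = below_3 / n  # target: 3/(1 + 3) = 0.75
```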
Example 20.4. Here is a more complicated argument of the same sort. Let X_1, ..., X_n be independent random variables, each uniformly distributed over (0, t]. Let Y_k be the kth smallest among the X_i, so that 0 < Y_1 < ⋯ < Y_n ≤ t. The X_i divide (0, t] into n + 1 subintervals of lengths Y_1, Y_2 − Y_1, ..., Y_n − Y_{n−1}, t − Y_n; let M be the maximum of these lengths. Define ψ_n(t, a) = P[M ≤ a]. The problem is to show that

(20.31)  ψ_n(t, a) = Σ_{k=0}^{n+1} (−1)^k C(n+1, k) ((1 − ka/t)^+)^n,

where x^+ = (x + |x|)/2 denotes positive part.

Separate consideration of the possibilities 0 ≤ a < t/2, t/2 ≤ a < t, and t ≤ a disposes of the case n = 1. Suppose it is shown that the probability ψ_n(t, a) satisfies the recursion

(20.32)  ψ_n(t, a) = n ∫_0^a ψ_{n−1}(t − x, a) (1 − x/t)^{n−1} t^{−1} dx.

Now (as follows by an integration together with Pascal's identity for binomial coefficients) the right side of (20.31) satisfies this same recursion, and so it will follow by induction that (20.31) holds for all n.

In intuitive form, the argument for (20.32) is this: If [M ≤ a] is to hold, the smallest of the X_i must have some value x in [0, a]. If X_1 is the smallest of the X_i, then X_2, ..., X_n must all lie in [x, t] and divide it into subintervals of length at most a; the probability of this is (1 − x/t)^{n−1} ψ_{n−1}(t − x, a), because X_2, ..., X_n have probability (1 − x/t)^{n−1} of all lying in [x, t], and if they do, they are independent and uniformly distributed there. Now (20.32) results from integrating with respect to the density for X_1 and multiplying by n to account for the fact that any of X_1, ..., X_n may be the smallest.

To make this argument rigorous, apply (20.30) for j = 1 and k = n − 1. Let A be the interval [0, a], and let B consist of the points (x_1, ..., x_n) for which 0 < x_i ≤ t, x_1 is the minimum of x_1, ..., x_n, and x_2, ..., x_n divide [x_1, t] into subintervals of length at most a. Then P[X_1 = min X_i, M ≤ a] = P[X_1 ∈ A, (X_1, ..., X_n) ∈ B]. Take X_1 for X and (X_2, ..., X_n) for Y in (20.30). Since X_1 has density 1/t,

(20.33)  P[X_1 = min X_i, M ≤ a] = ∫_0^a P[(x, X_2, ..., X_n) ∈ B] t^{−1} dx.

If C is the event that x ≤ X_i ≤ t for 2 ≤ i ≤ n, then P(C) = (1 − x/t)^{n−1}. A simple calculation shows that P[X_i − x ≤ s_i, 2 ≤ i ≤ n | C] = Π_{i=2}^n (s_i/(t − x)); in other words, given C, the differences X_2 − x, ..., X_n − x are conditionally independent and uniformly distributed over [0, t − x]. Now (X_2, ..., X_n) are random variables on some probability space (Ω, 𝓕, P); replacing P by P(·|C) shows that the integrand in (20.33) is the same as that in (20.32). The same argument holds with the index 1 replaced by any k (1 ≤ k ≤ n), which gives (20.32). (The events [X_k = min X_i, M ≤ a] are not disjoint, but any two intersect in a set of probability 0.) ∎
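The closed form (20.31) lends itself to a numerical cross-check (an illustrative sketch, not part of the text): for n = 3, t = 1, a = 0.4 it gives ψ_3(1, 0.4) = 1 − 4(0.6)³ + 6(0.2)³ = 0.184, and a Monte Carlo estimate of P[M ≤ a] should agree.

```python
import random
from math import comb

def psi(n, t, a):
    """Closed form (20.31): P[maximum spacing <= a], n uniform points on (0, t]."""
    return sum((-1) ** k * comb(n + 1, k) * max(0.0, 1.0 - k * a / t) ** n
               for k in range(n + 2))

random.seed(31)
n, t, a, trials = 3, 1.0, 0.4, 200_000
hits = 0
for _ in range(trials):
    pts = sorted(random.uniform(0.0, t) for _ in range(n))
    # the n + 1 spacings, including the two end intervals
    gaps = [pts[0]] + [q - p for p, q in zip(pts, pts[1:])] + [t - pts[-1]]
    if max(gaps) <= a:
        hits += 1

exact = psi(n, t, a)       # 1 - 4(0.6)^3 + 6(0.2)^3 = 0.184
estimate = hits / trials
```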
Sequences of Random Variables

Theorem 5.2 extends to general distributions μ_n.
Theorem 20.4. If {μ_n} is a finite or infinite sequence of probability measures on 𝓡^1, there exists on some probability space (Ω, 𝓕, P) an independent sequence {X_n} of random variables such that X_n has distribution μ_n.
PROOF. By Theorem 5.2 there exists on some probability space an independent sequence Z_1, Z_2, ... of random variables assuming the values 0 and 1 with probabilities P[Z_n = 0] = P[Z_n = 1] = ½. As a matter of fact, Theorem 5.2 is not needed: take the space to be the unit interval and the Z_n(ω) to be the digits of the dyadic expansion of ω, the functions d_n(ω) of Sections 1 and 4. Relabel the countably many random variables Z_n so that they form a double array,

Z_11, Z_12, ...
Z_21, Z_22, ...
. . .

All the Z_nk are independent. Put U_n = Σ_{k=1}^∞ Z_nk 2^{−k}. The series certainly converges, and U_n is a random variable by Theorem 13.4. Further, U_1, U_2, ... is, by Theorem 20.2, an independent sequence.

Now P[Z_ni = z_i, 1 ≤ i ≤ k] = 2^{−k} for each sequence z_1, ..., z_k of 0's and 1's; hence the 2^k possible values j2^{−k}, 0 ≤ j < 2^k, of S_nk = Σ_{i=1}^k Z_ni 2^{−i} all have probability 2^{−k}. If 0 ≤ x < 1, the number of the j2^{−k} that lie in [0, x] is [2^k x] + 1, the brackets indicating integral part, and therefore P[S_nk ≤ x] = ([2^k x] + 1)/2^k. Since S_nk(ω) ↑ U_n(ω) as k ↑ ∞, [S_nk ≤ x] ↓ [U_n ≤ x] as k ↑ ∞, and so P[U_n ≤ x] = lim_k P[S_nk ≤ x] = lim_k ([2^k x] + 1)/2^k = x for 0 ≤ x < 1. Thus U_n is uniformly distributed over the unit interval.

The construction thus far establishes the existence of an independent sequence of random variables U_n, each uniformly distributed over [0, 1]. Let F_n be the distribution function corresponding to μ_n, and put φ_n(u) = inf[x: u ≤ F_n(x)] for 0 < u < 1. This is the inverse used in Section 14; see (14.5). Set φ_n(u) = 0, say, for u outside (0, 1), and put X_n(ω) = φ_n(U_n(ω)). Since φ_n(u) ≤ x if and only if u ≤ F_n(x) (see the argument following (14.5)), P[X_n ≤ x] = P[U_n ≤ F_n(x)] = F_n(x). Thus X_n has distribution function F_n. And by Theorem 20.2, X_1, X_2, ... are independent. ∎
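The map φ_n(u) = inf[x: u ≤ F_n(x)] in the proof is the usual inverse-transform recipe for simulation. As a hedged illustration (the exponential law and rate 1 are arbitrary choices, not from the text), for F(x) = 1 − e^{−x} the inverse is φ(u) = −log(1 − u), and X = φ(U) should then have distribution function F, in particular mean 1.

```python
import math
import random

def phi(u):
    # inverse of F(x) = 1 - e^{-x}:  phi(u) = inf[x: u <= F(x)] = -log(1 - u)
    return -math.log(1.0 - u)

random.seed(14)
n = 200_000
# U uniform on (0, 1)  =>  X = phi(U) exponential with mean 1
sample_mean = sum(phi(random.random()) for _ in range(n)) / n
```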
This theorem of course includes Theorem 5.2 as a special case, and its proof does not depend on the earlier result. Theorem 20.4 is a special case of Kolmogorov's existence theorem in Section 36.

Convolution
Let X and Y be independent random variables with distributions μ and ν. Apply (20.27) and (20.29) to the planar set B = [(x, y): x + y ∈ H] with H ∈ 𝓡^1:

(20.34)  P[X + Y ∈ H] = ∫_{−∞}^∞ ν(H − x) μ(dx) = ∫_{−∞}^∞ P[Y ∈ H − x] μ(dx).

The convolution of μ and ν is the measure μ∗ν defined by

(20.35)  (μ∗ν)(H) = ∫_{−∞}^∞ ν(H − x) μ(dx),  H ∈ 𝓡^1.
If X and Y are independent and have distributions μ and ν, (20.34) shows that X + Y has distribution μ∗ν. Since addition of random variables is commutative and associative, the same is true of convolution: μ∗ν = ν∗μ and μ∗(ν∗η) = (μ∗ν)∗η.

If F and G are the distribution functions corresponding to μ and ν, the distribution function corresponding to μ∗ν is denoted F∗G. Taking H = (−∞, y] in (20.35) shows that

(20.36)  (F∗G)(y) = ∫_{−∞}^∞ G(y − x) dF(x).

(See (17.16) for the notation dF(x).) If G has density g, then G(y − x) = ∫_{−∞}^{y−x} g(s) ds = ∫_{−∞}^y g(t − x) dt, and so the right side of (20.36) is ∫_{−∞}^y [∫_{−∞}^∞ g(t − x) dF(x)] dt by Fubini's theorem. Thus F∗G has density F∗g, where

(20.37)  (F∗g)(y) = ∫_{−∞}^∞ g(y − x) dF(x);

this holds if G has density g. If, in addition, F has density f, (20.37) is
denoted f∗g and reduces by (16.12) to

(20.38)  (f∗g)(y) = ∫_{−∞}^∞ g(y − x) f(x) dx.

This defines convolution for densities, and μ∗ν has density f∗g if μ and ν have densities f and g. The formula (20.38) can be used for many explicit calculations.

Example 20.5. Let X_1, ..., X_k be independent random variables, each with the exponential density (20.10). Define g_k by

(20.39)  g_k(x) = α (αx)^{k−1} e^{−αx}/(k − 1)!,  x > 0,  k = 1, 2, ...;

put g_k(x) = 0 for x ≤ 0. Now

(g_{k−1}∗g_1)(y) = ∫_0^y g_{k−1}(y − x) g_1(x) dx,

which reduces to g_k(y). Thus g_k = g_{k−1}∗g_1, and since g_1 coincides with (20.10), it follows by induction that the sum X_1 + ⋯ + X_k has density g_k. The corresponding distribution function is

(20.40)  G_k(x) = 1 − Σ_{i=0}^{k−1} e^{−αx} (αx)^i/i!,  x > 0,

as follows by differentiation. ∎
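A numerical cross-check of Example 20.5 (an illustration, not part of the text): summing k independent exponentials and comparing the empirical frequency of [sum ≤ x] with the distribution function (20.40). With k = 3, α = 1, x = 2 the exact value is 1 − 5e^{−2} ≈ 0.3233.

```python
import math
import random

def erlang_cdf(x, k, alpha):
    """(20.40): distribution function of the sum of k exponential variables."""
    return 1.0 - sum(math.exp(-alpha * x) * (alpha * x) ** i / math.factorial(i)
                     for i in range(k))

random.seed(5)
k, alpha, x0, n = 3, 1.0, 2.0, 200_000
hits = sum(1 for _ in range(n)
           if sum(random.expovariate(alpha) for _ in range(k)) <= x0)
estimate = hits / n
exact = erlang_cdf(x0, k, alpha)  # 1 - 5e^{-2}, about 0.3233
```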
Example 20.6. Suppose that X has the normal density (20.12) with m = 0 and σ, and that Y has the same density with τ in place of σ. If X and Y are independent, then X + Y has density

(1/(2πστ)) ∫_{−∞}^∞ exp[−(y − x)²/(2σ²) − x²/(2τ²)] dx.

The change of variable u = x(σ² + τ²)^{1/2}/στ, after the square in x is completed, reduces this to

(2π(σ² + τ²))^{−1/2} exp[−y²/(2(σ² + τ²))].

Thus X + Y has the normal density with m = 0 and with σ² + τ² in place of σ². ∎
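Example 20.6 is likewise easy to corroborate by simulation (a sketch, not from the text): the sum of independent centered normals with parameters σ and τ should have mean 0 and variance σ² + τ². The values σ = 1, τ = 2 below are arbitrary choices.

```python
import random

random.seed(6)
sigma, tau, n = 1.0, 2.0, 200_000
s = s2 = 0.0
for _ in range(n):
    z = random.gauss(0.0, sigma) + random.gauss(0.0, tau)
    s += z
    s2 += z * z
mean = s / n
var = s2 / n - mean * mean  # should be near sigma^2 + tau^2 = 5
```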
If μ and ν are arbitrary finite measures on the line, their convolution is defined by (20.35) even if they are not probability measures.
Convergence in Probability

Random variables X_n converge in probability to X, written X_n →_P X, if

(20.41)  lim_n P[|X_n − X| ≥ ε] = 0

for each positive ε.† If X_n → X with probability 1, then lim sup_n [|X_n − X| ≥ ε] has probability 0, and it follows by Theorem 4.1 that X_n →_P X. But the converse does not hold: there exist sets A_n such that P(A_n) → 0 and P(lim sup_n A_n) = 1. For example, in the unit interval, let A_1, A_2 be the dyadic intervals of order one, A_3, ..., A_6 those of order two, and so on (see (4.30) or (1.31)). If X_n = I_{A_n} and X = 0, then X_n →_P X but P[lim_n X_n = X] = 0.
Theorem 20.5. (i) If X_n → X with probability 1, then X_n →_P X.
(ii) A necessary and sufficient condition for X_n →_P X is that each subsequence {X_{n_k}} contain a further subsequence {X_{n_{k(i)}}} such that X_{n_{k(i)}} → X with probability 1 as i → ∞.

PROOF. Part (i) was proved above. For the necessity in (ii), given a subsequence {X_{n_k}}, choose an increasing sequence {k(i)} so that P[|X_{n_{k(i)}} − X| ≥ i^{−1}] < 2^{−i}. By the first Borel-Cantelli lemma there is probability 1 that |X_{n_{k(i)}} − X| < i^{−1} for all but finitely many i, and so X_{n_{k(i)}} → X with probability 1.

For the sufficiency, suppose that X_n →_P X fails. Then there is some positive ε such that P[|X_{n_k} − X| ≥ ε] ≥ ε holds along some sequence {n_k}. No subsequence of {X_{n_k}} can converge in probability to X, and hence none can converge to X with probability 1. ∎

It follows from (ii) that if X_n →_P X and X_n →_P Y, then X = Y with probability 1. It follows further that if f is continuous and X_n →_P X, then f(X_n) →_P f(X).
--+ p X,
In nonprobabilistic contexts, convergence in probability becomes
gence in measure:
If In and
then
conver
f are real measurable functions on a measure
t This is often expressed p lim, X, - X.
SECTION
20.
275
RANDOM VARIABLES AND DISTRIBUTIONS
space ( 0, �' p.), and if p.[w: /, converges in measure to f.
lf(w) - fn (w) l
�
£] -+
0 for each £ > 0,
then
The Glivenko-Cantelli Theorem*
The empirical distribution function for random variables X_1, ..., X_n is the distribution function F_n(x, ω) with a jump of n^{−1} at each X_k(ω):

(20.42)  F_n(x, ω) = (1/n) Σ_{k=1}^n I_{(−∞, x]}(X_k(ω)).

If the X_k have a common unknown distribution function F(x), F_n(x, ω) is its natural estimate. The estimate has the right limiting behavior, according to the Glivenko-Cantelli theorem:
Theorem 20.6. Suppose that X_1, X_2, ... are independent and have a common distribution function F; put D_n(ω) = sup_x |F_n(x, ω) − F(x)|. Then D_n → 0 with probability 1.
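Before the proof, a small simulation (illustrative only, not part of the text) shows the behavior the theorem asserts: for the uniform distribution on (0, 1), the sup distance D_n between the empirical and true distribution functions shrinks as n grows. For a continuous F the supremum over x is attained at the order statistics, just before or at each jump of F_n, which is what the helper below exploits.

```python
import random

def sup_distance(sample, cdf):
    """D_n = sup_x |F_n(x) - F(x)| for a continuous distribution function."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        # F_n equals i/n just left of xs[i] and (i + 1)/n at xs[i]
        d = max(d, abs((i + 1) / n - cdf(x)), abs(i / n - cdf(x)))
    return d

random.seed(3)
uniform_cdf = lambda x: x  # F(x) = x on [0, 1]
d_small = sup_distance([random.random() for _ in range(100)], uniform_cdf)
d_large = sup_distance([random.random() for _ in range(20_000)], uniform_cdf)
# Glivenko-Cantelli: D_n -> 0, so d_large should be far smaller than d_small
```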
For each x, F_n(x, ω) as a function of ω is a random variable. By right continuity, the supremum above is unchanged if x is restricted to the rationals, and therefore D_n is a random variable. The summands in (20.42) are independent, identically distributed simple random variables, and so by the strong law of large numbers (Theorem 6.1), for each x there is a set A_x of probability 0 such that

(20.43)  lim_n F_n(x, ω) = F(x)

for ω ∉ A_x. But Theorem 20.6 says more, namely that (20.43) holds for ω outside some set A of probability 0, where A does not depend on x; as there are uncountably many of the sets A_x, it is conceivable a priori that their union might necessarily have positive measure. Further, the convergence in (20.43) is uniform in x. Of course, the theorem implies that with probability 1 there is weak convergence F_n(x, ω) ⇒ F(x) in the sense of Section 14.

PROOF OF THE THEOREM. As already observed, the set A_x where (20.43) fails has probability 0. Another application of the strong law of large numbers, with I_{(−∞, x)} in place of I_{(−∞, x]} in (20.42), shows that (see (20.5)) lim_n F_n(x−, ω) = F(x−) except on a set B_x of probability 0. Let φ(u) = inf[x: u ≤ F(x)] for 0 < u < 1 (see (14.5)), and put x_{m,k} = φ(k/m), m ≥ 1, 1 ≤ k ≤ m. It is not hard to see that F(φ(u)−) ≤ u ≤ F(φ(u)); hence F(x_{m,k}−) − F(x_{m,k−1}) ≤ m^{−1}, F(x_{m,1}−) ≤ m^{−1}, and F(x_{m,m}) ≥ 1 − m^{−1}.

Let D_{m,n}(ω) be the maximum of the quantities |F_n(x_{m,k}, ω) − F(x_{m,k})| and |F_n(x_{m,k}−, ω) − F(x_{m,k}−)| for k = 1, ..., m. If x_{m,k−1} ≤ x < x_{m,k}, then F_n(x, ω) ≤ F_n(x_{m,k}−, ω) ≤ F(x_{m,k}−) + D_{m,n}(ω) ≤ F(x) + m^{−1} + D_{m,n}(ω) and F_n(x, ω) ≥ F_n(x_{m,k−1}, ω) ≥ F(x_{m,k−1}) − D_{m,n}(ω) ≥ F(x) − m^{−1} − D_{m,n}(ω). Together with similar arguments for the cases x < x_{m,1} and x ≥ x_{m,m}, this shows that

(20.44)  D_n(ω) ≤ D_{m,n}(ω) + m^{−1}.

If ω lies outside the union A of all the A_{x_{m,k}} and B_{x_{m,k}}, then lim_n D_{m,n}(ω) = 0 and hence lim_n D_n(ω) = 0 by (20.44). But A has probability 0. ∎

*This topic may be omitted.

PROBLEMS
20.1. 2.10↑ A necessary and sufficient condition for a σ-field 𝓖 to be countably generated is that 𝓖 = σ(X) for some random variable X. Hint: If 𝓖 = σ(A_1, A_2, ...), consider X = Σ_{k=1}^∞ f(I_{A_k})/10^k, where f(x) is 4 for x = 0 and 5 for x ≠ 0.

20.2. If X is a positive random variable with density f, then X^{−1} has density f(1/x)/x². Prove this by (20.16) and by a direct argument.

20.3. Suppose that a two-dimensional distribution function F has a continuous density f. Show that f(x, y) = ∂²F(x, y)/∂x ∂y.

20.4. The construction in Theorem 20.4 requires only Lebesgue measure on the unit interval. Use the theorem to prove the existence of Lebesgue measure on R^k. First construct λ_k restricted to (−n, n] × ⋯ × (−n, n], and then pass to the limit (n → ∞). The idea is to argue from first principles, and not to use previous constructions, such as those in Theorems 12.5 and 18.2.

20.5. 19.9↑ Let Θ and Φ be the longitude and latitude of a random point on the surface of the unit sphere in R³; probability is proportional to surface area. Show that Θ and Φ are independent, Θ is uniformly distributed over [0, 2π), and Φ is distributed over [−π/2, +π/2] with density ½ cos φ.

20.6. Suppose that A, B, and C are positive, independent random variables with distribution function F. Show that the quadratic Az² + Bz + C has real zeros with probability ∫_0^∞ ∫_0^∞ F(x²/4y) dF(x) dF(y).

20.7. Show that X_1, X_2, ... are independent if σ(X_1, ..., X_{n−1}) and σ(X_n) are independent for each n.

20.8. Let X_0, X_1, ... be a persistent, irreducible Markov chain, and for a fixed state j let T_1, T_2, ... be the times of the successive passages through j. Let Z_1 = T_1 and Z_n = T_n − T_{n−1}, n ≥ 2. Show that Z_1, Z_2, ... are independent and that P[Z_n = k] = f_{jj}^{(k)} for n ≥ 2.
20.9. Suppose that F_1, F_2, ... are distribution functions and p_1, p_2, ... are nonnegative and add to 1. Show that F(x) = Σ_{n=1}^∞ p_n F_n(x) is a distribution function. Show that, if F_n(x) has density f_n(x) for each n, then F(x) has density Σ_{n=1}^∞ p_n f_n(x).

20.10. Let X_1, ..., X_n be independent, identically distributed random variables, and let π be a permutation of 1, 2, ..., n. Show that P[(X_1, ..., X_n) ∈ H] = P[(X_{π_1}, ..., X_{π_n}) ∈ H] for H ∈ 𝓡^n.

20.11. ↑ Ranks and records. Let X_1, X_2, ... be independent random variables with a common continuous distribution function. Let B be the ω-set where X_m(ω) = X_n(ω) for some pair m, n of distinct integers, and show that P(B) = 0. Remove B from the space Ω on which the X_n are defined. This leaves the joint distributions of the X_n unchanged and makes ties impossible.

Let T^{(n)}(ω) = (T_1^{(n)}(ω), ..., T_n^{(n)}(ω)) be that permutation (t_1, ..., t_n) of (1, ..., n) for which X_{t_1}(ω) < X_{t_2}(ω) < ⋯ < X_{t_n}(ω). Let Y_n be the rank of X_n among X_1, ..., X_n: Y_n = r if and only if X_i < X_n for exactly r − 1 values of i preceding n.
(a) Show that T^{(n)} is uniformly distributed over the n! permutations.
(b) Show that P[Y_n = r] = 1/n, 1 ≤ r ≤ n.
(c) Show that Y_k is measurable σ(T^{(n)}) for k ≤ n.
(d) Show that Y_1, Y_2, ... are independent.

20.12. ↑ Record values. Let A_n be the event that a record occurs at time n: max_{k<n} X_k < X_n.
(a) Show that A_1, A_2, ... are independent and P(A_n) = 1/n.
(b) Show that no record stands forever.
(c) Let N_n be the time of the first record after time n. Show that P[N_n = n + k] = n(n + k − 1)^{−1}(n + k)^{−1}.
20.13. Use Fubini's theorem to prove that convolution of finite measures is commutative and associative.

20.14. In Example 20.6, remove the assumption that m is 0 for each of the normal densities.

20.15. Suppose that X and Y are independent and have densities. Use (20.20) to find the joint density for (X + Y, X), and then use (20.19) to find the density for X + Y. Check with (20.38).

20.16. If F(x − ε) < F(x + ε) for all positive ε, x is a point of increase of F (see Problem 12.9). If F(x−) < F(x), x is an atom of F.
(a) Show that, if x and y are points of increase of F and G, then x + y is a point of increase of F∗G.
(b) Show that, if x and y are atoms of F and G, then x + y is an atom of F∗G.

20.17. Suppose that μ and ν consist of masses α_n and β_n at n, n = 0, 1, 2, .... Show that μ∗ν consists of a mass of Σ_{k=0}^n α_k β_{n−k} at n, n = 0, 1, 2, ....

20.18. ↑ Show that two Poisson distributions (the parameters may differ) convolve to a Poisson distribution.

20.19. Show that, if X_1, ..., X_n are independent and have the normal distribution (20.12) with m = 0, then the same is true of (X_1 + ⋯ + X_n)/√n.
20.20. The Cauchy distribution has density

(20.45)  c_u(x) = (1/π) · u/(u² + x²),  −∞ < x < +∞,

for u > 0. (By (17.10), the density integrates to 1.)
(a) Show that c_u ∗ c_v = c_{u+v}. Hint: Expand the convolution integrand in partial fractions.
(b) Show that, if X_1, ..., X_n are independent and have density c_u, then (X_1 + ⋯ + X_n)/n has density c_u as well. Compare with the convolution law in Problem 20.19.

20.21. ↑ (a) Show that, if X and Y are independent and have the standard normal density, then X/Y has the Cauchy density with u = 1.
(b) Show that, if X has the uniform distribution over (−π/2, π/2), then tan X has the Cauchy distribution with u = 1.

20.22. 18.22↑ Let X_1, ..., X_n be independent, each having the standard normal distribution. Show that

χ_n² = X_1² + ⋯ + X_n²

has density

(20.46)  (2^{n/2} Γ(n/2))^{−1} x^{(n/2)−1} e^{−x/2}

over (0, ∞). This is called the chi-squared distribution with n degrees of freedom.

20.23. ↑ The gamma distribution has density

(20.47)  f(x; α, u) = (α^u/Γ(u)) x^{u−1} e^{−αx}

over (0, ∞) for positive parameters α and u. Check that (20.47) integrates to 1. Show that

(20.48)  f(·; α, u) ∗ f(·; α, v) = f(·; α, u + v).

Note that (20.46) is f(x; ½, n/2), and from (20.48) deduce again that (20.46) is the density of χ_n². Note that the exponential density (20.10) is f(x; α, 1), and from (20.48) deduce (20.39) once again.

20.24. ↑ The beta function is defined for positive u and v by

B(u, v) = ∫_0^1 x^{u−1} (1 − x)^{v−1} dx.

From (20.48) deduce that

B(u, v) = Γ(u)Γ(v)/Γ(u + v).
Show that

B(u, v) = (u − 1)!(v − 1)!/(u + v − 1)!

for integral u and v.

20.25. 20.23↑ Let N, X_1, X_2, ... be independent, where P[N = n] = q^{n−1} p, n ≥ 1, and each X_k has the exponential density f(x; α, 1). Show that X_1 + ⋯ + X_N has density f(x; αp, 1).

20.26. Let A_{nm}(ε) = [|Z_k − Z| ≤ ε, n ≤ k ≤ m]. Show that Z_n → Z with probability 1 if and only if lim_n lim_m P(A_{nm}(ε)) = 1 for all positive ε, whereas Z_n →_P Z if and only if lim_n P(A_{nn}(ε)) = 1 for all positive ε.

20.27. (a) Suppose that f: R² → R¹ is continuous. Show that X_n →_P X and Y_n →_P Y imply f(X_n, Y_n) →_P f(X, Y).
(b) Show that addition and multiplication preserve convergence in probability.

20.28. Suppose that the sequence {X_n} is fundamental in probability in the sense that for ε positive there exists an N_ε such that P[|X_m − X_n| > ε] < ε for
m, n ≥ N_ε.
Iim k Xn k = X with r,robability 1. Hint: Choose increasing n k such that k P[ kI Xm - Xn l > 2- ] < 2 - for m , n > n k . Analyze P ( I Xn k 1 - xn k I > 2- ]. (b) Show that Xn -+ P X. (a) Suppose that X1 < X2 s · · · and that Xn -+ pX. Show that Xn -+ X -
10.29.
10.30.
10.31.
with probability 1. (b) Show by example that in an infinite measure space functions can converge almost everywhere without converging in measure. If Xn -+ 0 with probability 1, then n - 1 E i: 1 Xk -+ 0 with probability 1 by the standard theorem on Cesaro means [A30]. Show by example that this is not so if convergence with probability 1 is replaced by convergence in probability. 2.1 7 f (a) Show that in a discrete probability space convergence in probability is equivalent to convergence with probability 1. (b) Show that discrete spaces are essentially the only ones where this equivalence holds: Suppose that P has a nonatomic part in the sense that there is a set A such that P( A) > 0 and P( · l A ) is nonatomic. Construct random variables Xn such that Xn -+ pO but Xn does not converge to 0 with probability 1. _
280 20.32.
RANDOM VARIABLES AND EXPECTED VALUES
20.28 20.31 f Let d( X, Y) be the infimum of those positive £ for which P[ I X - Yl > £] < £. (a) Show that d( X, Y) == 0 if and only if X == Y with probability 1. Identify random variables that are equal with probability 1 and show that d is a metric on the resulting space. (b) Show that Xn --+ pX if and only if d( Xn , X) --+ 0. (c) Show that the space is complete. (d) Show that in general there is no metric d0 on this space such that Xn --+ X with probability 1 if and only if d0( Xn , X) -+ 0. SECI'ION 21. EXPECI'ED VALVES
Expected Value as Integral
The expected value of a random variable X on (Ω, 𝓕, P) is the integral of X with respect to the measure P:

E[X] = ∫ X dP = ∫_Ω X(ω) P(dω).

All the definitions, conventions, and theorems of Chapter 3 apply. For nonnegative X, E[X] is always defined (it may be infinite); for the general X, E[X] is defined, or X has an expected value, if at least one of E[X⁺] and E[X⁻] is finite, in which case E[X] = E[X⁺] − E[X⁻]; and X is integrable if and only if E[|X|] < ∞. The integral ∫_A X dP over a set A is defined, as before, as E[I_A X]. In the case of simple random variables, the definition reduces to that used in Sections 5 through 9.

Expected Values and Distributions
Suppose that X has distribution μ. If g is a real function of a real variable, then by the change-of-variable formula (16.17),

(21.1)  E[g(X)] = ∫_{−∞}^∞ g(x) μ(dx).

(In applying (16.17), replace T: Ω → Ω′ by X: Ω → R¹, μ by P, μT^{−1} by μ, and f by g.) This formula holds in the sense explained in Theorem 16.12: It holds in the nonnegative case, so that

(21.2)  E[|g(X)|] = ∫_{−∞}^∞ |g(x)| μ(dx);

if one side is infinite, then so is the other. And if the two sides of (21.2) are finite, then (21.1) holds.

If μ is discrete and μ{x_1, x_2, ...} = 1, then (21.1) becomes (use Theorem 16.9)

(21.3)  E[g(X)] = Σ_r g(x_r) μ{x_r}.

If X has density f, then (21.1) becomes (use Theorem 16.10)

(21.4)  E[g(X)] = ∫_{−∞}^∞ g(x) f(x) dx.

If F is the distribution function of X and μ, (21.1) can be written E[g(X)] = ∫_{−∞}^∞ g(x) dF(x) in the notation (17.16).
Moments

By (21.2), μ and F determine all the absolute moments of X:

(21.5)  E[|X|^k] = ∫_{−∞}^∞ |x|^k μ(dx) = ∫_{−∞}^∞ |x|^k dF(x),  k = 1, 2, ....

Since j ≤ k implies that |x|^j ≤ 1 + |x|^k, if X has a finite absolute moment of order k, then it has finite absolute moments of orders 1, 2, ..., k − 1 as well. For each k for which (21.5) is finite, X has kth moment

(21.6)  E[X^k] = ∫_{−∞}^∞ x^k μ(dx) = ∫_{−∞}^∞ x^k dF(x).
These quantities are also referred to as the moments of μ and of F. They can be computed via (21.3) and (21.4) in the appropriate circumstances.

Example 21.1. Consider the normal density (20.12) with m = 0 and σ = 1. For each k, x^k e^{−x²/2} goes to 0 exponentially as x → ±∞, and so finite moments of all orders exist. Integration by parts shows that

(1/√(2π)) ∫_{−∞}^∞ x^k e^{−x²/2} dx = ((k − 1)/√(2π)) ∫_{−∞}^∞ x^{k−2} e^{−x²/2} dx,  k = 2, 3, ....

(Apply (18.16) to g(x) = x^{k−2} and f(x) = xe^{−x²/2}, and let a → −∞, b → ∞.) Of course, E[X] = 0 by symmetry and E[X⁰] = 1. It follows by induction that

(21.7)  E[X^{2k}] = 1 · 3 · 5 ⋯ (2k − 1),  k = 1, 2, ...,

and that the odd moments all vanish. ∎

If the first two moments of X are finite and E[X] = m, then just as in Section 5, the variance is

(21.8)  Var[X] = E[(X − m)²] = ∫_{−∞}^∞ (x − m)² μ(dx).
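The moment formula (21.7) lends itself to a quick Monte Carlo check (an illustration, not from the text): for a standard normal sample, the empirical second and fourth moments should be near 1 and 1 · 3 = 3.

```python
import random

random.seed(21)
n = 200_000
m2 = m4 = 0.0
for _ in range(n):
    x = random.gauss(0.0, 1.0)
    m2 += x * x
    m4 += x ** 4
m2 /= n  # (21.7), k = 1: E[X^2] = 1
m4 /= n  # (21.7), k = 2: E[X^4] = 1 * 3 = 3
```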
From Example 21.1 and a change of variable, it follows that a random variable with the normal density (20.12) has mean m and variance σ².

Consider for nonnegative X the relation

(21.9)  E[X] = ∫_0^∞ P[X > t] dt = ∫_0^∞ P[X ≥ t] dt.

Since P[X = t] can be positive for at most countably many values of t, the two integrands differ only on a set of Lebesgue measure 0 and hence the integrals are the same. For X simple and nonnegative, (21.9) was proved in Section 5; see (5.25). For the general nonnegative X, let X_n be simple random variables for which 0 ≤ X_n ↑ X (see (20.1)). By the monotone convergence theorem E[X_n] ↑ E[X]; moreover, P[X_n > t] ↑ P[X > t], and so ∫_0^∞ P[X_n > t] dt ↑ ∫_0^∞ P[X > t] dt, again by the monotone convergence theorem. Since (21.9) holds for each X_n, a passage to the limit establishes (21.9) for X itself. Note that both sides of (21.9) may be infinite. If the integral on the right is finite, then X is integrable.

Replacing X by XI_{[X > a]} leads from (21.9) to

(21.10)  ∫_{[X > a]} X dP = aP[X > a] + ∫_a^∞ P[X > t] dt,  a ≥ 0.

As long as a ≥ 0, this holds even if X is not nonnegative.
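The identity (21.9) can be verified numerically for a concrete distribution (a hypothetical example chosen for illustration, not from the text): for a discrete X with masses 0.5, 0.3, 0.2 at the points 1, 3, 4, both E[X] and the integral of the tail P[X > t] equal 2.2.

```python
# hypothetical discrete distribution: masses 0.5, 0.3, 0.2 at the points 1, 3, 4
masses = {1.0: 0.5, 3.0: 0.3, 4.0: 0.2}
expectation = sum(x * p for x, p in masses.items())  # 0.5 + 0.9 + 0.8 = 2.2

def tail(t):
    """P[X > t] for the distribution above."""
    return sum(p for x, p in masses.items() if x > t)

# midpoint Riemann sum for the right side of (21.9) over [0, 5] (tail is 0 past 4)
steps = 100_000
h = 5.0 / steps
integral = sum(tail((i + 0.5) * h) for i in range(steps)) * h
```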
Inequalities
Since the final term in (21.10) is nonnegative, aP[X > a] ≤ ∫_{[X > a]} X dP ≤ E[X]; the same holds with [X > a] replaced by [X ≥ a]. Thus

(21.11)  P[X ≥ a] ≤ (1/a) ∫_{[X ≥ a]} X dP ≤ (1/a) E[X],  a > 0,

for nonnegative X. As in Section 5, there follows the inequality
(21 .13 )
1
P ( I X - m l � a] s 2 Var( X ] a
(m = E[X]). Jensen's inequality

(21.14)    φ(E[X]) ≤ E[φ(X)]

holds if φ is convex on an interval containing the range of X and if X and φ(X) both have expected values. To prove it, let l(x) = ax + b be a supporting line through (E[X], φ(E[X])), a line lying entirely under the graph of φ [A33]. Then aX(ω) + b ≤ φ(X(ω)), so that aE[X] + b ≤ E[φ(X)]. But the left side of this inequality is φ(E[X]).

Hölder's inequality is

(21.15)    E[|XY|] ≤ E^{1/p}[|X|^p] · E^{1/q}[|Y|^q],    1/p + 1/q = 1.
For discrete random variables, this was proved in Section 5; see (5.31). For the general case, choose simple random variables X_n and Y_n satisfying 0 ≤ |X_n| ↑ |X| and 0 ≤ |Y_n| ↑ |Y|; see (20.2). Then (5.31) and the monotone convergence theorem give (21.15). Notice that (21.15) implies that if |X|^p and |Y|^q are integrable, then so is XY. Schwarz's inequality is the case p = q = 2:

(21.16)    E[|XY|] ≤ E^{1/2}[X^2] · E^{1/2}[Y^2].

If X and Y have second moments, then XY must have a first moment. The same reasoning shows that Lyapounov's inequality (5.33) carries over from the simple to the general case.
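On any finite sample, the empirical distribution is itself a probability measure, so the inequalities above must hold exactly for sample averages. The sketch below (arbitrary seed and parameters, illustrative only) exercises Markov, Chebyshev, Jensen, and Schwarz on simulated data.

```python
import math
import random

random.seed(7)
xs = [random.expovariate(1.0) for _ in range(50_000)]   # nonnegative sample
n = len(xs)
mean = sum(xs) / n
var = sum((x - mean) ** 2 for x in xs) / n

for a in (0.5, 1.0, 2.0):
    frac = sum(1 for x in xs if x >= a) / n
    assert frac <= mean / a                       # Markov (21.12), k = 1
    frac2 = sum(1 for x in xs if abs(x - mean) >= a) / n
    assert frac2 <= var / a ** 2 + 1e-12          # Chebyshev (21.13)

# Jensen (21.14) with the convex function exp
assert math.exp(mean) <= sum(math.exp(x) for x in xs) / n

# Schwarz (21.16): E[|XY|]^2 <= E[X^2] E[Y^2]
ys = [random.gauss(0.0, 1.0) for _ in range(n)]
lhs = (sum(abs(x * y) for x, y in zip(xs, ys)) / n) ** 2
rhs = (sum(x * x for x in xs) / n) * (sum(y * y for y in ys) / n)
assert lhs <= rhs
```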
RANDOM VARIABLES AND EXPECTED VALUES
Joint Integrals
The relation (21.1) extends to random vectors. Suppose that (X_1, ..., X_k) has distribution μ in k-space and g: R^k → R^1. By Theorem 16.12,

(21.17)    E[g(X_1, ..., X_k)] = ∫_{R^k} g(x) μ(dx),

with the usual provisos about infinite values. For example, E[X_iX_j] = ∫_{R^k} x_i x_j μ(dx). If E[X_i] = m_i, the covariance of X_i and X_j is

Cov[X_i, X_j] = E[(X_i − m_i)(X_j − m_j)] = E[X_iX_j] − m_i m_j.
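The covariance identity can be checked on simulated data; the construction below (X_2 = Z + W with Z, W independent standard normals, so Cov[X_1, X_2] = Var[Z] = 1) is an arbitrary illustrative choice.

```python
import random

random.seed(4)
n = 200_000
zs = [random.gauss(0.0, 1.0) for _ in range(n)]
ws = [random.gauss(0.0, 1.0) for _ in range(n)]
x1 = zs                               # X1 = Z
x2 = [z + w for z, w in zip(zs, ws)]  # X2 = Z + W, so Cov[X1, X2] = 1

E = lambda v: sum(v) / len(v)
m1, m2 = E(x1), E(x2)
cov = E([(a - m1) * (b - m2) for a, b in zip(x1, x2)])

assert abs(cov - 1.0) < 0.02
# the two expressions for the covariance agree exactly on the sample
assert abs(cov - (E([a * b for a, b in zip(x1, x2)]) - m1 * m2)) < 1e-9
```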
Independence and Expected Value
Suppose that X and Y are independent. If they are also simple, then E[XY] = E[X]E[Y], as proved in Section 5; see (5.21). Define X_n by (20.2) and similarly define Y_n = ψ_n(Y^+) − ψ_n(Y^−). Then X_n and Y_n are independent and simple, so that E[|X_nY_n|] = E[|X_n|]E[|Y_n|], and 0 ≤ |X_n| ↑ |X|, 0 ≤ |Y_n| ↑ |Y|. If X and Y are integrable, then E[|X_nY_n|] = E[|X_n|]E[|Y_n|] ↑ E[|X|]·E[|Y|], and it follows by the monotone convergence theorem that E[|XY|] < ∞; since X_nY_n → XY and |X_nY_n| ≤ |XY|, it follows further by the dominated convergence theorem that E[XY] = lim_n E[X_nY_n] = lim_n E[X_n]E[Y_n] = E[X]E[Y]. Therefore, XY is integrable if X and Y are (which is by no means true for dependent random variables) and E[XY] = E[X]E[Y].

This argument obviously extends inductively: If X_1, ..., X_k are independent and integrable, then the product X_1 ··· X_k is also integrable and

(21.18)    E[X_1 ··· X_k] = E[X_1] ··· E[X_k].

Suppose that 𝒢_1 and 𝒢_2 are independent σ-fields, A lies in 𝒢_1, X_1 is measurable 𝒢_1, and X_2 is measurable 𝒢_2. Then I_A X_1 and X_2 are independent, so that (21.18) gives

(21.19)    ∫_A X_1 X_2 dP = ∫_A X_1 dP · E[X_2]

if the random variables are integrable. In particular,

(21.20)    ∫_A X_2 dP = P(A) E[X_2].
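The product rule (21.18) is easy to probe by simulation; the distributions and tolerance below are arbitrary illustrative choices.

```python
import random

random.seed(2)
n = 400_000
xs = [random.gauss(1.0, 2.0) for _ in range(n)]   # E[X] = 1
ys = [random.expovariate(1.0) for _ in range(n)]  # E[Y] = 1, independent of X

E = lambda v: sum(v) / len(v)

# (21.18) with k = 2: E[XY] = E[X]E[Y] under independence
assert abs(E([x * y for x, y in zip(xs, ys)]) - E(xs) * E(ys)) < 0.05
```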
From (21.18) it follows just as for simple random variables (see (5.24)) that variances add for sums of independent random variables. It is even enough that the random variables be independent in pairs.

Moment Generating Functions
The moment generating function is defined as

(21.21)    M(s) = E[e^{sX}] = ∫_{−∞}^∞ e^{sx} μ(dx) = ∫_{−∞}^∞ e^{sx} dF(x)

for all s for which this is finite (note that the integrand is nonnegative). Section 9 shows in the case of simple random variables the power of moment generating function methods. This function is also called the Laplace transform of μ, especially in nonprobabilistic contexts.

Now ∫_0^∞ e^{sx} μ(dx) is finite for s ≤ 0, and if it is finite for a positive s, then it is finite for all smaller s. Together with the corresponding result for the left half-line, this shows that M(s) is defined on some interval containing 0. If X is nonnegative, this interval contains (−∞, 0] and perhaps part of (0, ∞); if X is nonpositive, it contains [0, ∞) and perhaps part of (−∞, 0). It is possible that the interval consists of 0 alone; this happens, for example, if μ is concentrated on the integers and μ{n} = μ{−n} = C/n^2 for n = 1, 2, ....

Suppose that M(s) is defined throughout an interval (−s_0, s_0), where s_0 > 0. Since e^{|sx|} ≤ e^{sx} + e^{−sx} and the latter function is integrable for |s| < s_0, so is Σ_{k=0}^∞ |sx|^k/k! = e^{|sx|}. By the corollary to Theorem 16.7, μ has finite moments of all orders and

(21.22)    M(s) = Σ_{k=0}^∞ (s^k/k!) E[X^k],    |s| < s_0.

Thus M(s) has a Taylor expansion about 0 with positive radius of convergence if it is defined in some (−s_0, s_0), s_0 > 0. If M(s) can somehow be calculated and expanded in a series Σ_k a_k s^k, and if the coefficients a_k can be identified, then, since a_k must coincide with E[X^k]/k!, the moments of X can be computed: E[X^k] = a_k k!. It also follows from the theory of Taylor expansions [A29] that a_k k! is the kth derivative M^{(k)}(s) evaluated at s = 0:

(21.23)    M^{(k)}(0) = E[X^k] = ∫_{−∞}^∞ x^k μ(dx).

This holds if M(s) exists in some neighborhood of 0.
Suppose now that M is defined in some neighborhood of s. If ν has density e^{sx}/M(s) with respect to μ (see (16.11)), then ν has moment generating function N(u) = M(s + u)/M(s) for u in some neighborhood of 0. But then by (21.23), N^{(k)}(0) = ∫_{−∞}^∞ x^k ν(dx) = ∫_{−∞}^∞ x^k e^{sx} μ(dx)/M(s), and since N^{(k)}(0) = M^{(k)}(s)/M(s),

(21.24)    M^{(k)}(s) = ∫_{−∞}^∞ x^k e^{sx} μ(dx).

This holds as long as the moment generating function exists in some neighborhood of s. If s = 0, this gives (21.23) again. Taking k = 2 shows that M(s) is convex in its interval of definition.
Example 21.2. For the standard normal density,

M(s) = (1/√(2π)) ∫_{−∞}^∞ e^{sx} e^{−x^2/2} dx = e^{s^2/2} (1/√(2π)) ∫_{−∞}^∞ e^{−(x−s)^2/2} dx,

and a change of variable gives

(21.25)    M(s) = e^{s^2/2}.

The moment generating function is in this case defined for all s. Since e^{s^2/2} has the expansion

e^{s^2/2} = Σ_{k=0}^∞ (1/k!) (s^2/2)^k = Σ_{k=0}^∞ (1 · 3 ··· (2k − 1)/(2k)!) s^{2k},

the moments can be read off from (21.22), which proves (21.7) once more. ■
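The closed form (21.25) can be confirmed against a direct numerical evaluation of E[e^{sX}]; the step size and window below are arbitrary illustrative choices.

```python
import math

def mgf_std_normal(s, h=1e-3, half_width=12.0):
    # midpoint Riemann sum for E[e^{sX}] under the standard normal density
    n = int(2 * half_width / h)
    total = 0.0
    for i in range(n):
        x = -half_width + (i + 0.5) * h
        total += math.exp(s * x - x * x / 2)
    return total * h / math.sqrt(2 * math.pi)

for s in (-1.0, 0.5, 2.0):
    assert abs(mgf_std_normal(s) - math.exp(s * s / 2)) < 1e-4
```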
Example 21.3. In the exponential case (20.10) the moment generating function

(21.26)    M(s) = ∫_0^∞ e^{sx} αe^{−αx} dx = α/(α − s)

is defined for s < α. By (21.22) the kth moment is k!α^{−k}. The mean and variance are thus α^{−1} and α^{−2}. ■
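A quick numerical check of (21.26) (with an arbitrary rate, illustrative only):

```python
import math

alpha = 1.5  # arbitrary rate

def mgf_exponential(s, h=1e-3, upper=80.0):
    # midpoint Riemann sum for the integral defining M(s), valid for s < alpha
    n = int(upper / h)
    return sum(alpha * math.exp((s - alpha) * (i + 0.5) * h) for i in range(n)) * h

for s in (-2.0, 0.0, 1.0):
    assert abs(mgf_exponential(s) - alpha / (alpha - s)) < 1e-4
```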
Example 21.4. For the Poisson distribution (20.7),

(21.27)    M(s) = Σ_{k=0}^∞ e^{−λ} (λ^k/k!) e^{sk} = exp[λ(e^s − 1)].

Since M′(s) = λe^s M(s) and M″(s) = (λ^2 e^{2s} + λe^s) M(s), the first two moments are M′(0) = λ and M″(0) = λ^2 + λ; the mean and variance are both λ. ■
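Evaluating the series in (21.27) directly confirms the closed form and the moments; the parameter values are arbitrary illustrative choices.

```python
import math

lam, s = 2.3, 0.4  # arbitrary parameter and argument

# direct evaluation of (21.27)
M = sum(math.exp(-lam) * lam ** k / math.factorial(k) * math.exp(s * k)
        for k in range(120))
assert abs(M - math.exp(lam * (math.exp(s) - 1))) < 1e-9

# mean and variance both equal lam
pmf = [math.exp(-lam) * lam ** k / math.factorial(k) for k in range(120)]
mean = sum(k * p for k, p in enumerate(pmf))
second = sum(k * k * p for k, p in enumerate(pmf))
assert abs(mean - lam) < 1e-9 and abs(second - mean ** 2 - lam) < 1e-9
```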
Let X_1, ..., X_k be independent random variables, and suppose that each X_i has a moment generating function M_i(s) = E[e^{sX_i}] in (−s_0, s_0). For |s| < s_0, each exp(sX_i) is integrable and, since they are independent, their product exp(s Σ_{i=1}^k X_i) is also integrable (see (21.18)). The moment generating function of X_1 + ··· + X_k is therefore

(21.28)    M(s) = M_1(s) ··· M_k(s)

in (−s_0, s_0). This relation for simple random variables was essential to the arguments in Section 9. For simple random variables it was shown in Section 9 that the moment generating function determines the distribution. This will later be proved for general random variables; see Theorem 22.2 for the nonnegative case and Section 30 for the general case.
PROBLEMS 21.1.
Prove
1 J2w
11.2. 11.3. 21.4. 21.5. 11.6.
21.7.
1 00 e- x 12 dx -
,
oo
2
== 1 -
1 12 ,
differentiate k times with respect to t inside the integral (justify), and derive (21. 7) again. 2 n + 1 ] == X Show that, if X has the standard normal distribution, then E[ 1 1 n
2 n h/2/w . 16.23 f Suppose that X1 , X2 , . are identically distributed (not neces sarily independent) and E( I X1 1 ] < oo . Show that E[max k n i Xk 1 1 == o( n ). 20.12 f Records. Consider the sequence of records in the sense of Problem 20.12. Show that the expected waiting time to the next record is infinite. 20.20 f Show that the Cauchy distribution has no mean. Prove the first Borel-Cantelli lemma by applying Theorem 16.6 to indicator random variables. Why is Theorem 16.6 not enough for the second • •
s
Borel-Cantelli lemma? Derive (21.11) by imitating the proof of (5.26). Derive Jensen's inequality (21.14) for bounded X and q> by passing to the limit from the case of simple random variables (see (5.29)).
288
21.8. 21.9.
RANDOM VARIABLES AND EXPECTED VALUES
Prove (21.9) by Fubini' s theorem. Prove for integrable X that
E [ X) -
{ P[ X > t) dt - t "
0
- oo
P[ X < t) dt.
21.10. (a) Suppose that X and Y have first moments and prove
E [ Y ] - E [ X)
...
/-cooo ( P [ X < t
o
X dP < a ( - log F( a )) +
a
and F( a ) are positive, then
l eo ( - log F( t ) ) dt < F(l ) f G
a
X> o
X dP.
21.13. Suppose that X and Y are nonnegative random variables, r > 1, and
P[ Y � t] < t - 1/y:?. , X dP for t > 0. Use (21.9), Fubini's theorem, and Holder's inequality to prove E( Y' ] < (rj( r - 1))'E[ X' ). 21. 14. Let X1 , X2 , . be identically distributed random variables with finite sec ond moment. Show that nP[ I X1 1 > £{,;' ] � 0 and n - 1 12 max k < n i X�c l • •
� P o.
21.15. Random variables X and Y with second moments are uncorre/ated if
E[ XY ] == E[ X] E( Y].
(a) Show that independent random variables are uncorrelated but that the
converse is false. (b) Show that, if X1 , , Xn are uncorrelated (in pairs), then Var[ X1 + · · · + Xn ] == Var[ X1 ] + · · · + Var[ Xn 1 · 21.16. f Let X, Y, and Z be independent random variables such that X and Y assume the values 0, 1, 2 with probability t each and Z assumes the values 0 and 1 with probabilities l and 1 . Let X' == X and Y' == X + Z (mod 3). (a) Show that X', Y', and X' + Y' have the same one-dimensional distri butions as X, Y, and X + Y, respectively, even though ( X', Y') and ( X, Y) have different distributions. • • •
21 . EXPECTED VALUES 289 (b) Show that X' and Y' are dependent but uncorrelated. (c) Show that, despite dependence, the moment generating function of X' + Y' is the product of the moment generating functions of X' and Y'. Suppose that X and Y are independent, nonnegative random variables and that E[ X] == oo and E[ Y] == 0. What is the value common to E( XY] and E[ X] E[ Y]? Use the conventions (15.2) for both the product of the random variables and the product of their expected values. What if E[ X] == oo and 0 < E[ Y] < oo? Deduce (21.18) for k == 2 from (21.17), (20.23), and Fubini' s theorem. Do not in the nonnegative case assume integrability. Suppose that X and Y are independent and that f( x , y) is nonnegative. Put g(x ) == E[f(x, Y)] and show that E[ g( X)] == E( f( X, Y)]. Show more gen erally that Ix A g( X) dP == Ix A /( X, Y) dP. Extend to / that may be negative. f The integrability of X + Y does not imply that of X and Y separately. Show that it·does if X and Y are independent. SECTION
21 .17.
21.18. 21.19.
e
21.20.
&
21.21. The dominated convergence theorem (Theorem 16.4) and the theorem on
uniformly integrable functions (Theorem 16.13) of course apply to random variables. Show that they hold if convergence with probability 1 is replaced by convergence in probability.
f Show that if E( l Xn - XI ) -+ 0, then Xn -+ p X. Show that the converse is false but holds under the added hypothesis that the xn are uniformly integrable. If E[ I xn - X I ] -+ 0, then xn is said to converge to X in the mean . 21.13. f Write d1 ( X, Y) == E[ I X - Yl /(1 + I x· - Yl )]. Show that this is a met ric equivalent to the one in Problem 20.32. 21.24. 5.10 f Extend Minkowski's inequality (5.36) to arbitrary random variables.
21.22.
Suppose that p > 1. For a fixed probability space ( 0 , S', P), consider the set of random variables satisfying E[ l X I P ] < oo and identify random variables that differ only on a set1 of probability 0. This is the space L P == L P (Q, .F, P). Put dp ( X, Y) == E 1P ( I X - Y I P ] for X, y E LP. (a) Show that L P is a metric space under dP . (b) Show that L P is complete. 2.10 16.7 21.25 f Show that L P ( Q, S', P) is separable if S' is sep arable, or if §' is the completion of a separable a-field. 21 .25 f For X and Y in L2 put ( X, Y) == E[ XY]. Show that ( · , · { is an inner product for which d2 ( X, Y) == ( X - Y, X - Y). Thus L is a Hilbert space. 5.11 f Extend Problem 5.11 concerning essential suprema to random variables that need not be simple. Consider on (1, oo) densities that are multiples of e - ax , e - x2, x - 2e - ax , x - 2 ; also consider reflections of these through 0 and linear combinations of
21.25. 20.28 21.21 21.24 f
21.26. 21.17. 21.28. 21.29.
290
21.30. 21.31.
21.32. 21.33.
21.34. 21.35.
RANDOM VARIABLES AND EXPECTED VALUES
them. Show that the convergence interval for a moment generating function can be any interval containing 0. There are 16 possibilities, depending on whether the endpoints are 0, finite and different from 0, or infinite, and whether they are contained in the interval or not. For the density Cexp( - l x l 1 12 ), - oo < x < oo , show that moments of all orders exist but that the moment generating function exists only at s == 0. 16.8 f Show that a moment generating function M( s ) defined in ( - s0 , s0 ) , s0 > 0, can be extended to a function analytic in the strip [ z: - s0 < Re z < s0 ] . If M ( s ) is defined in [0, s0 ), s0 > 0, show that it can be extended to a function continuous in [ z: 0 < Re z < s0 ] and analytic in [z: 0 < Re z < s0 ]. Use (21 .28) to find the generating function of (20.39). The joint moment generating function of ( X, Y) is defined as M( s, t ) == E( esX + t Y]. Assume that it exists for ( s, t) in some neighborhood of the origin, expand it in a double series, and conclude that E [ xm yn ] is a m + nMjas m at n evaluated at s .. t == 0. For independent random variables having moment generating functions, show by (21 .28) that the variances add. 20.23 f Show that the gamma density (20.47) has moment generating function (1 - sja ) - uk for s < a. Show that the k th moment is u( u + 1) · · · ( u + k - 1 )/a . Show that the chi-squared distribution with n de grees of freedom has mean n and variance 2 n . SECI10N 22. SUMS OF INDEPENDENT RANDOM VARIABLES
Let X1 , X2, be a sequence of independent random variables on some probability space. It is natural to ask whether the infinite series E�_ 1 Xn converges with probability 1, or as in Section 6 whether n - 1 Ek_1 Xk converges to some limit with probability 1. It is to questions of this sort that the present section is devoted. Throughout the section, Sn will denote the partial sum Ek _ 1 Xk (S0 = 0) . •
•
•
The Strong Law of Large Numbers
The central result is a general version of Theorem 6.1 .
are independent and identically distributed and have finite mean, then S,jn --+ E [ X1 ] with probability 1. Theorem 22. 1.
If X1 , X2,
•
•
.
Formerly this theorem stood at the end of a chain of results. following argument, due to Etemadi, proceeds from first principles.
The
SECTION
22.
SUMS OF INDEPENDENT RANDOM VARIABLES
291
pJtOO F.
If the theorem holds for nonnegative random variables, then ,.- 1sn = n - 1Ei:_ 1 Xt - n - 1Ei:_ 1 X; -+ E[ X(] - E[ X}] = E[ X1 ] with probability 1. Assume then that Xk > 0. Consider the truncated random variables Yk = Xk ll x. k J and their par n tial sums Sn* = Ei:_ 1 Yk . For a > 1, temporarily fixed, let un = [a ], where the brackets denote integer part. The first step is to prove s
>£
0,
u n + 1 /u n --+ a, and so it follows by (22.2) that ! E [ X1 ) � lim inf £k � lim sup £k < aE ( X1 ) a k k k k with probability 1. This is true for each a > 1 . Intersecting the correspond ing sets over rational a exceeding 1 gives Iim k Sk/k = E[X1 1 with probabil But
•
ity 1 .
Although the hypothesis that the xn all have the same distribution is used several times, independence is used only through the equation Var[ Sn* 1 = EZ - tVar[ Yk 1 , and for this it is enough that the xn be independent in pairs. The proof given for Theorem 6.1 of course extends beyond the case of simple random variables, but it requires E[Xt1 < oo .
Suppose that X1 , X2 , . . . are independent and identically distrib uted and E [ X1- 1 < oo, E[Xi1 = oo (so that E[ X1 1 = oo). Then n - 1EZ_ 1 Xk oo with probability 1 . By the theorem, n - 1 EZ_ 1 X; --+ E[ X1- 1 with probability 1, and so PROOF. it suffices to prove the corollary for the case X1 = X1+ � 0. If xn if 0 < xn � u, u xn< > = 0 if xn > u, then n - 1Ek_ 1 Xk � n - 1Ek_ 1 X�"> --+ E[Xf"> 1 by the theorem. Let u --+ oo . Corollary. -+
{
•
1be Weak Law and Moment Generating Functions
The statement and proof of Theorem 6.2, the weak law of large numbers, carry over word for word for the case of general random variables with second moments-only Chebyshev' s inequality is required. The idea can be
SECTION
22.
SUMS OF IN DEPENDENT RANDOM VARIABLES
293
used to prove in a very simple way that a distribution concentrated on [0 , oo ) is uniquely determined by its moment generating function or Laplace transform. For each A, let Y" be a random variable (on some probability space) having the Poisson distribution with parameter A. Since Y" has mean and variance A (Example 21 .4), Chebyshev' s inequality gives
Let
G"
Y"/A, so that [At] k A " E e - k ,. .
be the distribution function of
G"( t ) =
k -0
-
The result above can be restated as if t > 1 , if t < 1 .
( 22 .3)
In the notation of Section 14, G"(x) => &(x - 1) as A � oo . Now consider a probability distribution p. concentrated on [0, oo ). Let F be the corresponding distribution function. Define
s > 0;
(22 .4)
here 0 is included in the range of integration. This is the moment generating function (21 .21), but the argument has been reflected through the origin. It is a one-sided Laplace transform, defined for all nonnegative s. For positive s, (21.24) gives
.
00 k k > M< = (22 .5 ) ( s ) ( - 1 ) y ke - s vp. ( dy) . 0 Therefore, for positive x and s, k k ( 1 ( 1 sy ( ) ( ) kM< k ) ( s ) = - sy , fl ( dy ) e s (22.6) E E -� k. k.
1 0
k -0
= Fix
the
1 oo fooo GsA; ) fl( dy ) . SX
SX
k-0
x > 0. If t 0 < y < x, then Gsy (xjy) --. 1 as s � oo by (22.3); if y > x, limit is 0. If p. { x } = 0, the integrand on the right in (22.6) thus
t lf y = 0, the integrand in (22.5) is 1 for k = 0 and 0 for k in the middle term of (22.6) is 1.
>
1 ; hence for y = 0, the integrand
294
RANDOM VARIABLES AND EXPECTED VALUES
converges as s --+ oo to Iro. x 1 (y) except on a set of p.-measure bounded convergence theorem then gives [sx ]
(22 .7)
lim E
s -. 00
(
_
k 1) ,
k.
k -o
0.
The
s kM < k > ( s ) = p. (O, x ] = F( x ) .
Thus M( s) determines the value of F at x if x > 0 and p. { x } = 0, which covers all but countably many values of x in [0, oo ). Since F is right-con tinuous, F itself and hence p. are determined through (22. 7) by M( s ) . In fact p. is by (22. 7) determined by the values of M( s) for s beyond an arbitrary s0: Theorem 22.2.
Let p. and be probability measures on [0, oo ). If
where s 0 > 0, then p. Corollary.
Jl
= Jl.
Let ft and /2 be real functions on [0, oo ). If
where s0 > 0, then ft = /2 outside a set of Lebesgue measure 0. /; need not be nonnegative, and they need e - sx/;(x) must be integrable over [0, oo) for s > s0 • The
not be integrable, but
For the nonnegative case, apply the theorem to the probability densities g; (x) = e - so x/;(x)jm, where m = J:e - sox/;(x) dx, i = 1, 2. For • the general case, prove that It + /2 = fi + /1 almost everywhere. PROOF.
Example 22.1. If p. 1 p. 2 = p.3, then the corresponding transforms (22.4) satisfy Mt (s)M2 (s) = M3 (s) for s � 0. If P. ; is the Poisson distribution with mean A ; , then (see (21 .27)) M; (s) = exp[A ; (e - s - 1)]. It follows by Theo rem 22.2 that if two of the P.; are Poisson, so is the third, and A t + A 2 = A 3 •
•
•
Kolmogorov's Zero-One Law
Consider the set A of w for which n - t Ek- tXk (w) --+ 0 as n --+ oo . For each m , the values of Xt (w), . . . , Xm t (w) are irrelevant to the question of -
SECTION
22.
SUMS OF INDEPENDENT RANDOM VARIABLES
295
whether or not w lies in A, and so A ought to lie in the a-field o( Xm , X,. + 1 , ). In fact, limn n - 1EZ'-JXk (w) = 0 for fixed m, and hence w lies in l A if and only if lim n n - L� - m Xk ( w) = 0. Therefore, • • •
[
A = n U n w: n-1
(22 .8)
f
N�m n> N
Xk (w ) t k-m
0 Theorem 22.4.
a
( 22.9 ) Let A k be the set where are disjoint,
PROOF.
Ak
I Sk i
�
a
but
ISi l
a ] < a - 2 Var[ Sn 1· That this can be strengthened to (22.9) is an instance of a general phenomenon: For sums of independent variables, if max k s n i Sk l is large, then I Sn l is probably large as well. Theorem 9.6 is an instance of this, and so is the following result, due to Etemadi.
Suppose that X1 ,
Theorem 22.5.
(22 .10 )
P
• • •
[ 1 ms kaxs n I Sk l � 4a]
, Xn
are independent. For a > 0
< 4 max P [ I Sk l � a] . n
1 sks
Let Bk be the set where I Sk i > 4 a but I Si l < 4 a for j < k. Since the Bk are disjoint, n-1 max I Sk l � 4 a < P [ I Sn l > 2 a ] + L P ( Bk n [ I Sn l < 2 a] ) 1 sksn k-1 n-1 < P [ I Sn l > 2a ] + L P ( Bk n [ I Sn - Sk i > 2 a] ) k-1 n-1 = P ( I Sn l > 2 a ] + L P ( Bk ) P ( I Sn - Sk i > 2 a ] k-1
PROOF.
]
P[
< 2 max P [ I Sn - Sk i > 2 a]
Os k < n
Now replace 4 a by 2 a and apply the same argument to the partial sums of the random variables in reverse order: Xn , . . . , X1 ; it follows that
P
[ Oms kax< n I Sn - Sk i > 2 a ] � 2 I ms kaxs n P [I Sk l > a ] .
These two inequalities give (22.10).
•
298
RANDOM VARIABLES AND EXPECTED VALUES
If the Xk have mean 0 and Chebyshev' s inequality is applied to the right side of (22.10), and if a is replaced by a/4, the result is Kolmogorov' s inequality (22.9) with an extra factor of 64 on the right side. For this reason, the two inequalities are equally useful for the applications in this section. Convergence of Random Series
For independent Xn , the probability that EXn converges is either 0 or 1 . It is natural to try and characterize the two cases in terms of the distributions of the individual xn .
Suppose that { Xn } is an independent sequence and E[Xn ] = 0. If E Var[Xn 1 < oo , then EXn converges with probability 1. PROOF. By (22.9),
Theorem 12.6.
Since the sets on the left are nondecreasing in
r, letting r --+
oo
gives
Since E Var[ Xn ] converges, (22 .1 1 )
for each £. Let E( n, £) be the set where supj , k ni Si - Sk i > 2£ , and put E( £) = n n E(n, £). Then E (n , £) � E(£), and (22 . 11 ) implies P( E(£)) = 0. Now U E ( £ ) , where the union extends over positive rational £, contains the set where the sequence { Sn } is not fundamental (does not have the Cauchy • property), and this set therefore has probability 0. �
(
EX111ple 11
Let Xn ( w ) = rn ( w )a n , where the rn are the Rademacher functions on the unit interval-see ( 1 . 1 2) . Then xn has variance a;, and so Ea; < oo implies that Ern ( w )a n converges with probability 1. An interest ing special case is a n = n - 1 • If the signs in E ± n - 1 are chosen on the toss of a coin, then the series converges with probability 1. The alternating • harmonic series 1 - 2 - 1 + 3 - 1 + · · · is thus typical in this respect. 22.3.
SECTION
22.
299
SUMS OF INDEPENDENT RANDOM VARIABLES
If EXn converges with probability 1, then Sn converges with probability 1 to some finite random variable S. By Theorem 20.5, this implies that S,. _... p S. The reverse implication of course does not hold in general, but it does if the summands are independent.
For an independent sequence { Xn }, the Sn converge with probability 1 if and only if they converge in probability. It is enough to show that if Sn --+ p S, then { Sn } is fundamental pJtOOF. with probability 1. Since 1beorem
22.7.
P ( I Sn +j - Sn l �
t)
[
s P l sn +j - S l
i] + [
>
P l sn - S l
>
i ].
Sn -+p S implies
(22. 12) But by
sup P ( I Sn +J - Sn l lim n }� 1
> £)
=
0.
(22.10), P
[
m � I Sn +J - Sn l
1 Sj S k
>£
]
£ 4
],
and therefore
It now follows by before.
(22.12) that (22.11) holds, and the proof is completed as
•
The final result in this direction, the three series theorem, does provide necessary and sufficient conditions for the convergence of EXn in terms of the individual distributions of the xn . Let x� c) be xn truncated at c: X -
Theorem 22.8.
aeries (22.1 3 ) In
Suppose that { Xn } is independent and consider the three
E P [ I Xn l
> c] ,
order that EXn converge with probability 1 it is necessary that the three Series converge for all positive c and sufficient that they converge for some positive c.
300
RANDOM VARIABLES AND EXPECTED VALUES
Suppose that the series (22.13) converge and put m�c > = E [ X�c>]. By Theorem 22.6, E( X�c> - m�c > ) converges with probabil ity 1 , and since Em�c> converges, so does EX�c >. Since P[ Xn ¢ x�c > i.o.] = 0 by the first Borel-Cantelli lemma, it follows finally that EXn converges with probability 1 . • PROOF OF SUFFICIENCY.
Although it is possible to prove necessity in the three series theorem by the methods of the present section, the simplest and clearest argument uses the central limit theorem as treated in Section 27. This involves no circular ity of reasoning, since the three series theorem is nowhere used in what follows. Suppose that EXn converges with probability 1 and fix c > 0. Since Xn --+ 0 with probability 1, it follows that Ex�c > converges with probability 1 and, by the second Borel-Cantelli lemma, that EP[ I xn I > c] < oo . Let M�c > and s!c > be the mean and standard deviation of s�c > = Ek - t x�c > . If s!c> --+ oo , then since the x�c> - m�c > are uniformly bounded, it follows by the central limit theorem (see Example 27.4) that PROOF OF NECESSITY.
( 22.14)
sn< c> - Mn< c>
And since
EX�c>
s�c >;s!c> --+
Y
2
2 1 1 dt. e j V2 '1T x
converges with probability 1, s!c> --+ 0 with probability 1, so that (Theorem 20.5)
limP n [ IS�c>;s!c> l
( 22.15 ) But
=
1
also implies
> £ ] = 0.
(22.14) and (22.15) stand in contradiction: p
oo
Since
< c> < c> - Mn< c> s s n n X< sn< c >
is greater than or equal to the probability in (22.14) minus that in (22.15) , it is positive for all sufficiently large n (if x < y ) But then .
X-£
;sn< c> < y + £ '
and this cannot hold simultaneously for, say, (x - £, y + £) = ( - 1, 0) and (x - £, y + £ ) = (0, 1). Thus s!c> cannot go to oo , and the third series in (22.13) converges.
SECTION
22.
SUMS OF INDEPENDENT RAN DOM VARIABLES
301
And now it follows by Theorem 22.6 that E( X� ( ) - m �c) ) converges with probability 1 , so that the middle series in (22. 1 3) converges as well. • If Xn = rn a n , where rn are the Rademacher functions, then Ea; < oo implies that E Xn converges with probability 1 . If E Xn converges, then a n is bounded, and for large c the convergence of the series (22 . 1 3) implies E a; < oo : If the signs in E + a n are chosen on the toss of a coin, then the series converges with probability 1 or 0 according as Ea; converges or diverges. If Ea; converges but E l a nl diverges, then E + a n is • with probability 1 conditionally but not absolutely convergent. Example 22.4.
Theorems 22.6, 22.7, and 22.8 concern conditional convergence, and in the most interesting cases, E xn converges not because the xn go to 0 at a high rate but because they tend to cancel each other out.t Random Taylor Series *
Consider a power series E ± z n , where the signs are chosen on the toss of a coin. The radius of convergence being 1, the series represents an analytic function in the open unit disc D0 = [z: l z l < 1] in the complex plane. The question arises whether this function can be extended analytically beyond D0 • The answer is no: With probability 1 the unit circle is the natural boundary. 1beorem 22.9.
Let { xn } be an independent sequence such that n = 0, 1, . . . .
(22.16 )
There is probability 0 that F( w,
(22.17 )
Z
00
) = E Xn ( W ) z n n-O
coincides in D0 with a function analytic in an open set properly containing D0• It will be seen in the course of the proof that the w-set in question lies in a ( X0 , X 1 , ) and hence has a probability. It is intuitively clear that if the set is measurable at all, it must depend only on the Xn for large n and hence • • •
t See Problem 22.3 . *This topic, which requires complex variable theory, may be omitted.
302
RANDOM VARIABLES AND EXPECTED VALUES
must have probability either 0 or 1. PROOF.
Since n = 0, 1 , . . .
( 22.18 )
with probability 1, the series in (22.17) has radius of convergence 1 outside a set of measure 0. Consider an open disc D = [z: l z - r l < r ] , where r E Do and r > 0. Now (22.17) coincides in D0 with a function analytic in D0 u D if and only if its expansion
about r converges at least for this holds. The coefficient am ( "' ) =
�!
lz - rl
< r. Let
m
p< > ( "' · r ) =
AD be the set of w for which
E ( t':t)
X,. ( "' ) r n -
m
n-m
-
is a complex-valued random variable measurable o ( Xm, Xm + 1 , • • • ). By the m� �root test, w E AD if and only if lim supm Vl am ( w )I < r - 1 • For each m 0 , the condition for w E AD can thus be expressed in terms of a m 0( w ), a m o 1 ( w ), . . . alone, and so AD E (J ( Xm o' Xm o 1 ' • • • ). Thus AD has a prob ability, and in fact P(AD) is 0 or 1 by the zero-one law. Of course, P( A D) = 1 if D c D0• The central step in the proof is to show that P(AD) = 0 if D contains points not in D0 • Assume on the contrary that P(AD) = 1 for such a D. Consider that part of the cir cumference of the unit circle that lies in D, and let k be an integer large enough that this arc has length exceeding 2 '1Tjk. Define +
+
if n ¥= 0 ( mod k ) , if n = 0 ( mod k ) . Let BD be the w-set where the function
( 22.19 )
G(w, z ) = E Yn( w ) z n 00
n -O
coincides in D0 with a function analytic in D0 u D. The sequence { Y0 , Y1 , . . • } has the same structure as the original se quence: the Yn are independent and assume the values ± 1 with probabili ty t
SECTION
22.
303
SUMS OF INDEPENDENT RANDOM VARIABLES
each . Since BD is defined in terms of the Yn in the same way as AD is defined in terms of the Xn , it is intuitively clear that P( BD) and P(AD) must be the same. Assume for the moment the truth of this statement, which is somewhat more obvious than its proof. If for a particular w each of (22.17) and (22.19) coincides in D0 with a function analytic in D0 u D, the same must be true of
( 22 . 20 )
00
F( w, z ) - G ( w, z ) = 2 E xmk ( w ) z mk . m -0
Let D1 = [ze 2 "illk : z E D). Since replacing z by ze 2 "i /k leaves the function (22.20) unchanged, it can be extended analytically to each D0 u D1, I = 1 , 2, . . . . Because of the choice of k, it can therefore be extended analyti cally to [z: l z I < 1 + E] for some positive E; but this is impossible if (22.18) holds, since the radius of convergence must then be 1. Therefore, AD n BD cannot contain a point w satisfying (22.18). Since (22.18) holds with probability 1, this rules out the possibility P(AD) = P( BD) = 1 and by the zero-one law leaves only the possibility P(AD) = P( BD) = 0. Let A be the w-set where (22.17) extends to a function analytic in some open set larger than D0• Then w E A if and only if (22.17) extends to D0 U D for some D = [z: l z - r 1 < r] for which D - D0 ¢ 0, r is rational, and r has rational real and imaginary parts; in other words, A is the countable union of AD for such D. Therefore, A lies in o( X0, X1 , ) and has probability 0. It remains only to show that P(AD) = P(BD), and this is most easily done by comparing { xn } and { yn } with a canonical sequence having the same structure. Put Zn ( w ) = ( Xn ( w ) + 1)/2, and let Tw be E:- o zn ( w )2 - n - I on the w-set A* where this sum lies in (0, 1 ]; on 0 - A* let Tw be 1, say. Because of (22.16) P(A*) = 1. Let �= o( X0, X1 , ) and let � be the a-field of Borel subsets of (0, 1 ]; then T: 0 --+ (0, 1] is measurable �/!fl. Let rn(x) be the n th Rademacher function. If M = [x: r1(x ) = u ; , i = 1, . . . , n ], where u ; = ± 1 for each i, then P(T - 1M ) = P[ w : �((a,)) = u ; , i = 0, 1, . . . , n - 1] = 2 - n , which is the Lebesgue measure A(M) of M. Since these sets form a w-system generating ffl, P(T - 1M) = A(M ) for all M in fJI (Theorem 3.3). n Let MD be the set of x for which E�-o rn + 1 (x)z extends analytically to D0 U D. Then MD lies in ffl, this being a special case of the fact that AD lies in �. 
Moreover, if w E A*, then w E AD if and only if Tw E MD : 1 A * n AD = A* n T - MD. Since P(A*) = 1, it follows that P(AD) = A ( MD) . This argument only uses (22.16), and therefore it applies to { Yn } and BD • as well. Therefore, P(BD) = A( MD) = P(AD). • • •
304
RANDOM VARIABLES AND EXPECTED VALUES
The Hewitt-Savage Zero-One Law *
Kolmogorov's zero-one law does not apply to the events [S_n > a_n i.o.] and [S_n = 0 i.o.], which lie in each σ-field σ(S_n, S_{n+1}, ...) but not in each σ(X_n, X_{n+1}, ...). If the X_n are identically distributed as well as independent, however, these events come under a different zero-one law.

Let 𝒮^{n,k} consist of the sets H in ℛ^{n+k} that are symmetric in the first n coordinates in the sense that if (x_1, ..., x_{n+k}) lies in H, then so does (x_{π1}, ..., x_{πn}, x_{n+1}, ..., x_{n+k}) for every permutation π of 1, ..., n. Then 𝒮^{n,k} is a σ-field. Let 𝒮_{n,k} consist of the sets [(X_1, ..., X_{n+k}) ∈ H] for H ∈ 𝒮^{n,k}, let 𝒮_n be the σ-field generated by ⋃_{k=1}^∞ 𝒮_{n,k}, and let 𝒮 = ⋂_{n=1}^∞ 𝒮_n. The information in 𝒮_n corresponds to knowing the values X_{n+1}(ω), X_{n+2}(ω), ... exactly and knowing the values X_1(ω), ..., X_n(ω) to within a permutation. If n ≤ m, then [S_m ∈ H_m] ∈ 𝒮_{n,m−n} ⊂ 𝒮_n, and so [S_m ∈ H_m i.o.] comes under the Hewitt-Savage zero-one law:
Theorem 22.10. If X_1, X_2, ... are independent and identically distributed, and if A ∈ 𝒮, then either P(A) = 0 or P(A) = 1.
PROOF. Since ⋃_{n=1}^∞ σ(X_1, ..., X_n) is a field, by the corollary to Theorem 11.4 there is for given ε an n and a set U = [(X_1, ..., X_n) ∈ H] (H ∈ ℛ^n) such that P(A Δ U) < ε. Let V = [(X_{n+1}, ..., X_{2n}) ∈ H]. From A ∈ 𝒮_{2n} and the fact that the X_k are identically distributed as well as independent, it will be shown that

(22.21) P(A Δ U) = P(A Δ V).

This will imply P(A Δ V) < ε and P(A Δ (U ∩ V)) ≤ P(A Δ U) + P(A Δ V) < 2ε. Since U and V are independent and have the same probability, it will follow that P(A) is within ε of P(U), and hence P²(A) is within 2ε of P²(U) = P(U)P(V) = P(U ∩ V), which is in turn within 2ε of P(A). But then |P²(A) − P(A)| < 4ε for all ε, and so P(A) must be 0 or 1.

It remains to prove (22.21). If B ∈ 𝒮_{2n,k}, then B = [(X_1, ..., X_{2n+k}) ∈ J] for some J in 𝒮^{2n,k}, and since J is symmetric in its first 2n coordinates and the X_i are independent and identically distributed, interchanging the blocks (X_1, ..., X_n) and (X_{n+1}, ..., X_{2n}) changes neither the distribution of (X_1, ..., X_{2n+k}) nor membership in J; it follows that P(B ∩ U) = P(B ∩ V). But since ⋃_{k=1}^∞ 𝒮_{2n,k} is a field generating 𝒮_{2n}, P(B ∩ U) = P(B ∩ V) holds for all B in 𝒮_{2n} (Theorem 10.4). Therefore P(A^c ∩ U) = P(A^c ∩ V); the same argument shows that P(A ∩ U^c) = P(A ∩ V^c). Hence (22.21). ∎

*This topic may be omitted.
SECTION 22. SUMS OF INDEPENDENT RANDOM VARIABLES
PROBLEMS

22.1. Suppose that X_1, X_2, ... is an independent sequence and Y is measurable σ(X_n, X_{n+1}, ...) for each n. Show that there exists a constant a such that P[Y = a] = 1.

22.2. 21.12 ↑ Let X_1, X_2, ... be independent random variables with distribution functions F_1, F_2, ....
(a) Show that P[sup_n X_n < ∞] is 0 or 1 and is in fact 1 if and only if Σ_n (1 − F_n(x)) < ∞ for some x.
(b) Suppose that X = sup_n X_n is finite with probability 1, and let F be its distribution function. Show that F(x) = Π_{n=1}^∞ F_n(x).
(c) Show further that E[X^+] < ∞ if and only if Σ_n ∫_{[X_n > a]} X_n dP < ∞ for some a.
22.3. Assume {X_n} independent and define X_n^{(c)} as in Theorem 22.8. Prove that for Σ|X_n| to converge with probability 1 it is necessary that ΣP[|X_n| > c] and ΣE[|X_n^{(c)}|] converge for all positive c and sufficient that they converge for some positive c. If the three series (22.13) converge but ΣE[|X_n^{(c)}|] = ∞, then there is probability 1 that ΣX_n converges conditionally but not absolutely.
22.4. ↑ (a) Generalize the Borel-Cantelli lemmas: Suppose the X_n are nonnegative. If ΣE[X_n] < ∞, then ΣX_n converges with probability 1. If the X_n are independent and uniformly bounded, and if ΣE[X_n] = ∞, then ΣX_n diverges with probability 1.
(b) Construct independent, nonnegative X_n such that ΣX_n converges with probability 1 but ΣE[X_n] diverges. For an extreme example, arrange that P[X_n > 0 i.o.] = 0 but E[X_n] → ∞.
22.5. Show under the hypotheses of Theorem 22.6 that ΣX_n has finite variance, and extend Theorem 22.4 to infinite sums.
22.6. 20.20 ↑ Suppose that X_1, X_2, ... are independent, each with the Cauchy distribution (20.45) for a common value of u.
(a) Show that n^{−1}Σ_{k=1}^n X_k does not converge with probability 1 to a constant. Contrast with Theorem 22.1.
(b) Show that P[n^{−1} max_{k≤n} X_k ≤ x] → e^{−u/πx} for x > 0. Relate to Theorem 14.3.
22.7. If X_1, X_2, ... are independent and identically distributed, and if P[X_1 ≥ 0] = 1 and P[X_1 > 0] > 0, then Σ_n X_n = ∞ with probability 1. Deduce this from Theorem 22.1 and its corollary, and also directly: find a positive ε such that X_n > ε infinitely often with probability 1.
22.8. Suppose that X_1, X_2, ... are independent and identically distributed and E[|X_1|] = ∞. Use (21.9) to show that Σ_n P[|X_n| > an] = ∞ for each a, and conclude that sup_n n^{−1}|X_n| = ∞ with probability 1. Now show that sup_n n^{−1}|S_n| = ∞ with probability 1. Compare with the corollary to Theorem 22.1.
22.9. Wald's equation. Let X_1, X_2, ... be independent and identically distributed with finite mean, and put S_n = X_1 + ··· + X_n. Suppose that τ is a stopping time: τ has positive integers as values and [τ = n] ∈ σ(X_1, ..., X_n); see Section 7 for examples. Suppose also that E[τ] < ∞.
(a) Prove that

(22.22) E[S_τ] = E[X_1]E[τ].
(b) Suppose that X_n is ±1 with probabilities p and q, p ≠ q; let τ be the first n for which S_n is −a or b (a and b positive integers), and calculate E[τ]. This gives the expected duration of the game in the gambler's ruin problem for unequal p and q.

22.10. 20.12 ↑ Let Z_n be 1 or 0 according as at time n there is or is not a record in the sense of Problem 20.12. Let R_n = Z_1 + ··· + Z_n be the number of records up to time n. Show that R_n/log n →_P 1.

22.11. 22.1 ↑ (a) Show that for an independent sequence {X_n} the radius of convergence of the random Taylor series Σ_n X_n z^n is r with probability 1 for some nonrandom r.
(b) Suppose that the X_n have the same distribution and P[X_1 ≠ 0] > 0. Show that r is 1 or 0 according as log^+|X_1| has finite mean or not.

22.12. Suppose that X_0, X_1, ... are independent and each is uniformly distributed over [0, 2π]. Show that with probability 1 the series Σ_n e^{iX_n} z^n has the unit circle as its natural boundary.

22.13. In the proof of Kolmogorov's zero-one law, use the corollary to Theorem 11.4 in place of Theorem 4.2. Hint: See the first part of the proof of the Hewitt-Savage zero-one law.

22.14. Prove (what is essentially Kolmogorov's zero-one law) that if A is independent of a π-system 𝒫 and A ∈ σ(𝒫), then P(A) is either 0 or 1.

22.15. 11.3 ↑ Suppose that 𝒜 is a semiring containing Ω.
(a) Show that if P(A ∩ B) ≤ bP(B) for all B ∈ 𝒜, and if b < 1 and A ∈ σ(𝒜), then P(A) = 0.
(b) Show that if P(A ∩ B) ≤ P(A)P(B) for all B ∈ 𝒜, and if A ∈ σ(𝒜), then P(A) is 0 or 1.
(c) Show that if aP(B) ≤ P(A ∩ B) for all B ∈ 𝒜, and if a > 0 and A ∈ σ(𝒜), then P(A) = 1.
(d) Show that if P(A)P(B) ≤ P(A ∩ B) for all B ∈ 𝒜, and if A ∈ σ(𝒜), then P(A) is 0 or 1.
(e) Reconsider Problem 3.20.

22.16. 22.14 ↑ Burstin's theorem. Let f be a Borel function on [0, 1] with arbitrarily small periods: for each ε there is a p with 0 < p < ε such that f(x) = f(x + p) for 0 ≤ x ≤ 1 − p. Show that such an f is constant almost everywhere:
(a) Show that it is enough to prove that P(f^{−1}B) is 0 or 1 for every Borel set B, where P is Lebesgue measure on the unit interval.
(b) Show that f^{−1}B is independent of each interval [0, x], and conclude that P(f^{−1}B) is 0 or 1.
(c) Show by example that f need not be constant.
SECTION 23. THE POISSON PROCESS

Characterization of the Exponential Distribution
Suppose that X has the exponential distribution with parameter α:

(23.1) P[X > x] = e^{−αx}, x ≥ 0.

The definition (4.1) of conditional probability then gives

(23.2) P[X > x + y | X > x] = P[X > y], x, y ≥ 0.
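As a quick numerical illustration of (23.2), both sides can be computed directly from the tail (23.1); the following sketch and its function names are illustrative only, not part of the text:

```python
import math

def tail(x, alpha):
    # P[X > x] = e^{-alpha x} for x >= 0: the exponential tail (23.1)
    return math.exp(-alpha * x) if x >= 0 else 1.0

def conditional_tail(x, y, alpha):
    # P[X > x + y | X > x], computed from the definition of conditional probability
    return tail(x + y, alpha) / tail(x, alpha)

# Memorylessness (23.2): the conditional tail equals the unconditional one.
print(conditional_tail(2.0, 3.0, 0.5), tail(3.0, 0.5))
```

The two printed values agree up to floating-point error, for any choice of x, y ≥ 0 and α > 0.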
Imagine X as the waiting time for the occurrence of some event such as the arrival of the next customer at a queue or telephone call at an exchange. As observed in Section 14 (see (14.6)), (23.2) attributes to the waiting-time mechanism a lack of memory or aftereffect. And as shown in Section 14, the condition (23.2) implies that X has the distribution (23.1) for some positive α. Thus if in the sense of (23.2) there is no aftereffect in the waiting-time mechanism, then the waiting time itself necessarily follows the exponential law.

The Poisson Process
Consider next a stream or sequence of events, say arrivals of calls at an exchange. Let X_1 be the waiting time to the first event, let X_2 be the waiting time between the first and second events, and so on. The formal model consists of an infinite sequence X_1, X_2, ... of random variables on some probability space, and S_n = X_1 + ··· + X_n represents the time of occurrence of the nth event; it is convenient to write S_0 = 0. The stream of events itself remains intuitive and unformalized, and the mathematical definitions and arguments are framed exclusively in terms of the X_n. If no two of the events are to occur simultaneously, the S_n must be strictly increasing:

(23.3) 0 = S_0 < S_1 < S_2 < ··· .

And if only finitely many of the events are to occur in each finite interval of time, S_n must go to infinity; this condition is the same thing as

(23.4) lim_n S_n = ∞.
Throughout the section it will be assumed that these conditions hold everywhere, for every ω. If they hold only on a set A of probability 1, and if X_n(ω) is redefined as X_n(ω) = 1, say, for ω outside A, then the conditions hold everywhere and the joint distributions of the X_n and S_n are unaffected.

Condition 0°. For each ω, (23.3) and (23.4) hold.

The arguments go through under the weaker condition that (23.3) and (23.4) hold with probability 1, but they then involve some fussy and uninteresting details. There are at the outset no further restrictions on the X_n; they are not assumed independent, for example, or identically distributed.

The number N_t of events that occur in the time interval [0, t] is the largest integer n such that S_n ≤ t:

(23.5) N_t = max[n: S_n ≤ t].

Note that N_t = 0 if t < S_1 = X_1; in particular, N_0 = 0. The number of events in (s, t] is the increment N_t − N_s.
[Figure: a sample path of N_t, a right-continuous step function with N_t = 0 for t < S_1 = X_1, N_t = 1 for S_1 ≤ t < S_2 = X_1 + X_2, N_t = 2 for S_2 ≤ t < S_3, and so on.]
From (23.5) follows the basic relation connecting the N_t with the S_n:

(23.6) [N_t ≥ n] = [S_n ≤ t].

From this follows

(23.7) [N_t = n] = [S_n ≤ t] − [S_{n+1} ≤ t].

Each N_t is thus a random variable.
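The definition (23.5) and the relation (23.6) are easy to exercise numerically. In the following sketch (the helper name and sample arrival times are mine, not the text's), N_t is computed from a sorted list of arrival times by binary search:

```python
import bisect

def count_by_time(arrivals, t):
    # N_t = max[n: S_n <= t], as in (23.5); `arrivals` lists S_1 < S_2 < ...
    return bisect.bisect_right(arrivals, t)

arrivals = [0.7, 1.9, 2.5, 4.2]   # hypothetical arrival times S_1, ..., S_4

# (23.6): the events [N_t >= n] and [S_n <= t] coincide.
for n in (1, 2, 3, 4):
    for t in (0.5, 2.0, 2.5, 5.0):
        assert (count_by_time(arrivals, t) >= n) == (arrivals[n - 1] <= t)
```

Note that `bisect_right` makes the count right-continuous in t, matching the convention S_n ≤ t in (23.5).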
The collection [N_t: t ≥ 0] is a stochastic process, that is, a collection of random variables indexed by a parameter regarded as time. Condition 0° can be restated in terms of this process:

Condition 0°. For each ω, N_t(ω) is a nonnegative integer for t ≥ 0, N_0(ω) = 0, and lim_{t→∞} N_t(ω) = ∞; further, for each ω, N_t(ω) as a function of t is nondecreasing and right-continuous, and at the points of discontinuity the saltus N_t(ω) − sup_{s<t} N_s(ω) is exactly 1.

It is easy to see that (23.3) and (23.4) and the definition (23.5) give random variables N_t having these properties. On the other hand, if the stochastic process [N_t: t ≥ 0] is given and does have these properties, and if random variables are defined by S_n(ω) = inf[t: N_t(ω) ≥ n] and X_n(ω) = S_n(ω) − S_{n−1}(ω), then (23.3) and (23.4) hold, and the definition (23.5) gives back the original N_t. Therefore, anything that can be said about the X_n can be stated in terms of the N_t, and conversely. The points S_1(ω), S_2(ω), ... of (0, ∞) are exactly the discontinuities of N_t(ω) as a function of t; because of the queueing example, it is natural to call them arrival times.

The program is to study the joint distributions of the N_t under conditions on the waiting times X_n and vice versa. The most common model specifies the independence of the waiting times and the absence of aftereffect:
Condition 1°. The X_n are independent and each is exponentially distributed with parameter α.

In this case P[X_n > 0] = 1 for each n, and n^{−1}S_n → α^{−1} by the strong
law of large numbers (Theorem 22.1), and so (23.3) and (23.4) hold with probability 1; to assume they hold everywhere (Condition 0°) is simply a convenient normalization. Under Condition 1°, S_n has the distribution function specified by (20.40), so that P[N_t ≥ n] = Σ_{i=n}^∞ e^{−αt}(αt)^i/i! by (23.6), and

(23.8) P[N_t = n] = e^{−αt}(αt)^n/n!, n = 0, 1, ....

Thus N_t has the Poisson distribution with mean αt. More will be proved presently.
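A simulation makes (23.8) plausible: generate independent exponential waiting times, count how many partial sums fall in [0, t], and compare the empirical mean with αt. This sketch (function name and parameter values are mine) follows the construction (23.5) directly:

```python
import random

def poisson_count(alpha, t, rng):
    # N_t for one sample path: add exponential waiting times X_1, X_2, ...
    # until the partial sum S_n first exceeds t; return how many terms fit.
    s, n = 0.0, 0
    while True:
        s += rng.expovariate(alpha)
        if s > t:
            return n
        n += 1

rng = random.Random(0)
counts = [poisson_count(0.5, 4.0, rng) for _ in range(20000)]
mean = sum(counts) / len(counts)   # should be close to alpha * t = 2.0
```

With 20000 replications, the empirical mean is within a few standard errors of αt = 2, consistent with N_t being Poisson with that mean.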
Condition 2°. (i) For 0 < t_1 < ··· < t_k, the increments N_{t_1}, N_{t_2} − N_{t_1}, ..., N_{t_k} − N_{t_{k−1}} are independent. (ii) The individual increments have the Poisson distribution: for 0 ≤ s < t,

P[N_t − N_s = n] = e^{−α(t−s)}[α(t − s)]^n/n!, n = 0, 1, ....
Call t a fixed discontinuity of the process if the probability of an arrival at t is positive. The condition that there be no fixed discontinuities is therefore

(23.17) P[S_n = t] = 0, t > 0, n ≥ 1;
that is, each of S_1, S_2, ... has a continuous distribution function. Of course there is probability 1 (under Condition 0°) that N_t(ω) has a discontinuity somewhere (and indeed has infinitely many of them). But (23.17) ensures that a t specified in advance has probability 0 of being a discontinuity, or time of an arrival. The Poisson process satisfies this natural condition.
Theorem 23.3. If Condition 0° holds and [N_t: t ≥ 0] has independent increments and no fixed discontinuities, then each increment has a Poisson distribution.
This is Prekopa's theorem. The conclusion is not that [N_t: t ≥ 0] is a Poisson process, because the mean of N_t − N_s need not be proportional to t − s. If φ is an arbitrary nondecreasing, continuous function on [0, ∞) with φ(0) = 0, and [N_t: t ≥ 0] is a Poisson process, then N_{φ(t)} satisfies the conditions of the theorem.†

PROOF.* The problem is to show for t' < t'' that N_{t''} − N_{t'} has for some λ ≥ 0 a Poisson distribution with mean λ, a unit mass at 0 being regarded as a Poisson distribution with mean 0.

†This is in fact the general process satisfying them; see Problem 23.9.
*This proof carries over to general topological spaces; see Timothy C. Brown, Amer. Math. Monthly, 91 (1984), 116-123.
The procedure is to construct a sequence of partitions

(23.18) t' = t_{n0} < t_{n1} < ··· < t_{nr_n} = t''

of [t', t''] with three properties. First, each decomposition refines the preceding one: each t_{nk} is a t_{n+1,j}. Second,

(23.19) Σ_{k=1}^{r_n} P[N_{t_{nk}} − N_{t_{n,k−1}} ≥ 1] ↑ λ

for some finite λ, and

(23.20) max_{1≤k≤r_n} P[N_{t_{nk}} − N_{t_{n,k−1}} ≥ 1] → 0.

Third,

(23.21) P[⋃_{k=1}^{r_n} [N_{t_{nk}} − N_{t_{n,k−1}} ≥ 2]] → 0.

Once the partitions have been constructed, the rest of the proof is easy: Let Z_{nk} be 1 or 0 according as N_{t_{nk}} − N_{t_{n,k−1}} is positive or not. Since [N_t: t ≥ 0] has independent increments, the Z_{nk} are independent for each n. By Theorem 23.2, therefore, (23.19) and (23.20) imply that Z_n = Σ_{k=1}^{r_n} Z_{nk} satisfies P[Z_n = i] → e^{−λ}λ^i/i!. Now N_{t''} − N_{t'} ≥ Z_n, and there is strict inequality if and only if N_{t_{nk}} − N_{t_{n,k−1}} ≥ 2 for some k. Thus (23.21) implies P[N_{t''} − N_{t'} ≠ Z_n] → 0, and therefore P[N_{t''} − N_{t'} = i] = e^{−λ}λ^i/i!.

To construct the partitions, consider for each t the distance D_t = inf_{m≥1} |t − S_m| from t to the nearest arrival time. Since S_m → ∞, the infimum is achieved. Further, D_t = 0 if and only if S_m = t for some m, and since by hypothesis there are no fixed discontinuities, the probability of this is 0: P[D_t = 0] = 0. Choose δ_t so that 0 < δ_t < n^{−1} and P[D_t < δ_t] < n^{−1}. The intervals (t − δ_t, t + δ_t) for t' ≤ t ≤ t'' cover [t', t'']. Choose a finite subcover, and in (23.18) take the t_{nk} for 0 < k < r_n to be the endpoints (of intervals in the subcover) that are contained in (t', t''). By the construction,

(23.22) max_{1≤k≤r_n} (t_{nk} − t_{n,k−1}) < 2n^{−1},

and the probability that (t_{n,k−1}, t_{nk}] contains some S_m is less than n^{−1}. This gives a sequence of partitions satisfying (23.20). Inserting more points in a partition cannot increase the maxima in (23.20) and (23.22), and so it can be arranged that each partition refines the preceding one.
To prove (23.21) it is enough (Theorem 4.1) to show that the limit superior of the sets involved has probability 0. It is in fact empty: If for infinitely many n, N_{t_{nk}}(ω) − N_{t_{n,k−1}}(ω) ≥ 2 holds for some k ≤ r_n, then by (23.22), N_t(ω) as a function of t has in [t', t''] discontinuity points (arrival times) arbitrarily close together, which requires S_m(ω) ∈ [t', t''] for infinitely many m, in violation of Condition 0°.

It remains to prove (23.19). If Z_{nk} and Z_n are defined as above and p_{nk} = P[Z_{nk} = 1], then the sum in (23.19) is Σ_k p_{nk} = E[Z_n]. Since Z_{n+1} ≥ Z_n, Σ_k p_{nk} is nondecreasing in n. Now

P[N_{t''} − N_{t'} = 0] = P[Z_{nk} = 0, k ≤ r_n] = Π_{k=1}^{r_n} (1 − p_{nk}) ≤ exp[−Σ_{k=1}^{r_n} p_{nk}].
If the left-hand side here is positive, this puts an upper bound on Σ_k p_{nk}, and (23.19) follows. But suppose P[N_{t''} − N_{t'} = 0] = 0. If s is the midpoint of t' and t'', then since the increments are independent, one of P[N_s − N_{t'} = 0] and P[N_{t''} − N_s = 0] must vanish. It is therefore possible to find a nested sequence of intervals [u_m, v_m] such that v_m − u_m → 0 and the event A_m = [N_{v_m} − N_{u_m} ≥ 1] has probability 1. But then P(⋂_m A_m) = 1, and if t is the point common to the [u_m, v_m], there is an arrival at t with probability 1, contrary to the assumption that there are no fixed discontinuities. ∎
Theorem 23.3 in some cases makes the Poisson model quite plausible. The increments will be essentially independent if the arrivals to time s cannot seriously deplete the population of potential arrivals, so that N_s has for t > s negligible effect on N_t − N_s. And the condition that there are no fixed discontinuities is entirely natural. These conditions hold for arrivals of calls at a telephone exchange if the rate of calls is small in comparison with the population of subscribers and calls are not placed at fixed, predetermined times. If the arrival rate is essentially constant, this leads to the following condition.

Condition 3°. (i) For 0 < t_1 < ··· < t_k, the increments N_{t_1}, N_{t_2} − N_{t_1}, ..., N_{t_k} − N_{t_{k−1}} are independent. (ii) The distribution of N_t − N_s depends only on the difference t − s.

Theorem 23.4. Conditions 1°, 2°, and 3° are equivalent in the presence of Condition 0°.

PROOF. Obviously Condition 2° implies 3°. Suppose that Condition 3° holds. If J_t is the saltus at t (J_t = N_t − sup_{s<t} N_s), then [N_t − N_{t−n^{−1}} ≥ 1] ↓ [J_t ≥ 1], and it follows by (ii) of Condition 3° that P[J_t ≥ 1] is the same for all t. But if the value common to the P[J_t ≥ 1] is positive, then by the independence of the increments and the second Borel-Cantelli lemma there is probability 1 that J_t ≥ 1 for infinitely many rational t in (0, 1), for example, which contradicts Condition 0°. By Theorem 23.3, then, the increments have Poisson distributions. If f(t) is the mean of N_t, then N_t − N_s for s < t must have mean f(t) − f(s) and must by (ii) have mean f(t − s); thus f(t) = f(s) + f(t − s). Therefore, f satisfies Cauchy's functional equation [A20] and, being nondecreasing, must have the form f(t) = αt for some α ≥ 0. Condition 0° makes α = 0 impossible. ∎
One standard way of deriving the Poisson process is by differential equations.
Condition 4°. If 0 < t_1 < ··· < t_k and n_1, ..., n_k are nonnegative integers, then

(23.23) P[N_{t_k+h} − N_{t_k} = 1 | N_{t_i} = n_i, i ≤ k] = αh + o(h)

and

(23.24) P[N_{t_k+h} − N_{t_k} ≥ 2 | N_{t_i} = n_i, i ≤ k] = o(h)

as h ↓ 0; and [N_t: t ≥ 0] has no fixed discontinuities.

The occurrences of o(h) in (23.23) and (23.24) denote functions, say φ_1(h) and φ_2(h), such that h^{−1}φ_i(h) → 0 as h ↓ 0; the φ_i may a priori depend on k, t_1, ..., t_k, and n_1, ..., n_k, as well as on h. It is assumed in (23.23) and (23.24) that the conditioning events have positive probability, so that the conditional probabilities are well defined.
Theorem 23.5. Conditions 1° through 4° are all equivalent in the presence of Condition 0°.

PROOF OF 2° → 4°. For a Poisson process with rate α, the left-hand sides of (23.23) and (23.24) are e^{−αh}αh and 1 − e^{−αh} − e^{−αh}αh, and these are αh + o(h) and o(h), respectively, because e^{−αh} = 1 − αh + o(h). And by the argument in the preceding proof, the process has no fixed discontinuities. ∎

PROOF OF 4° → 2°. Fix k, the t_i, and the n_i; denote by A the event [N_{t_j} = n_j, j ≤ k]; and for t ≥ 0 put p_n(t) = P[N_{t_k+t} − N_{t_k} = n | A]. It will be shown that

(23.25) p_n(t) = e^{−αt}(αt)^n/n!, n = 0, 1, ....

This will also be proved for the case in which p_n(t) = P[N_t = n]; Condition 2° will then follow by induction. If t > 0 and |t − s| < n^{−1}, then |p_n(t) − p_n(s)| is at most the conditional probability, given A, that some S_m lies within n^{−1} of t_k + t. As n ↑ ∞, the right side here decreases to the probability of a discontinuity at t, which is 0 by hypothesis. Thus P[N_t = n] is continuous at t. The same kind of argument works for conditional probabilities and for t = 0, and so p_n(t) is continuous for t ≥ 0.

To simplify the notation, put D_t = N_{t_k+t} − N_{t_k}. If D_{t+h} = n, then D_t = m for some m ≤ n. If t > 0, then by the rules for conditional probabilities,

p_n(t + h) = p_n(t)P[D_{t+h} − D_t = 0 | A ∩ [D_t = n]]
+ p_{n−1}(t)P[D_{t+h} − D_t = 1 | A ∩ [D_t = n − 1]]
+ Σ_{m=0}^{n−2} p_m(t)P[D_{t+h} − D_t = n − m | A ∩ [D_t = m]].

For n ≤ 1, the final sum is absent, and for n = 0, the middle term is absent as well. This holds in the case p_n(t) = P[N_t = n] if D_t = N_t and A = Ω. (If t = 0, some of the conditioning events here are empty; hence the assumption t > 0.) By (23.24), the final sum is o(h) for each fixed n. Applying (23.23) and (23.24) now leads to

p_n(t + h) = p_n(t)(1 − αh) + p_{n−1}(t)αh + o(h),

and letting h ↓ 0 gives

(23.26) p_n'(t) = −αp_n(t) + αp_{n−1}(t).

In the case n = 0, take p_{−1}(t) to be identically 0. In (23.26), t ≥ 0 and p_n'(t) is a right-hand derivative. But since p_n(t) and the right side of the equation are continuous on [0, ∞), (23.26) holds also for t = 0, and p_n'(t) can be taken as a two-sided derivative for t > 0 [A22].

Now (23.26) gives [A23]

p_n(t) = e^{−αt}p_n(0) + α∫_0^t e^{−α(t−s)} p_{n−1}(s) ds.

Since p_n(0) is 1 or 0 as n = 0 or n > 0, (23.25) follows by induction on n. ∎
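The solution (23.25) can be checked against the differential equation (23.26) numerically; in this sketch (function names are mine), the derivative is approximated by a central difference:

```python
import math

def p(n, t, alpha):
    # p_n(t) = e^{-alpha t} (alpha t)^n / n!, as in (23.25); p_{-1} is identically 0
    if n < 0:
        return 0.0
    return math.exp(-alpha * t) * (alpha * t) ** n / math.factorial(n)

def ode_residual(n, t, alpha, h=1e-6):
    # Residual of (23.26): p_n'(t) + alpha p_n(t) - alpha p_{n-1}(t), which should
    # vanish; the derivative p_n'(t) is approximated by a central difference.
    deriv = (p(n, t + h, alpha) - p(n, t - h, alpha)) / (2 * h)
    return deriv + alpha * p(n, t, alpha) - alpha * p(n - 1, t, alpha)
```

For any n ≥ 0, t > 0, and α > 0, the residual is zero up to discretization error, and Σ_n p_n(t) = 1, as a probability distribution requires.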
Stochastic Processes
The Poisson process [N_t: t ≥ 0] is one example of a stochastic process, that is, a collection of random variables (on some probability space (Ω, ℱ, P)) indexed by a parameter regarded as representing time. In the Poisson case, time is continuous. In some cases the time is discrete: Section 7 concerns the sequence {F_n} of a gambler's fortunes; there n represents time, but time that increases in jumps.

Part of the structure of a stochastic process is specified by its finite-dimensional distributions. For any finite sequence t_1, ..., t_k of time points, the k-dimensional random vector (N_{t_1}, ..., N_{t_k}) has a distribution μ_{t_1···t_k} over R^k. These measures μ_{t_1···t_k} are the finite-dimensional distributions of the process. Condition 2° of this section in effect specifies them for the Poisson case:

(23.27) P[N_{t_i} = n_i, i ≤ k] = Π_{i=1}^k e^{−α(t_i−t_{i−1})} [α(t_i − t_{i−1})]^{n_i−n_{i−1}} / (n_i − n_{i−1})!

if 0 < t_1 < ··· < t_k and 0 ≤ n_1 ≤ ··· ≤ n_k (take n_0 = t_0 = 0).

The finite-dimensional distributions do not, however, contain all the mathematically interesting information about the process in the case of continuous time. Because of (23.3), (23.4), and the definition (23.5), for each fixed ω, N_t(ω) as a function of t has the regularity properties given in the second version of Condition 0°. These properties are used in an essential way in the proofs.

Suppose that f(t) is t or 0 according as t is rational or irrational. Let N_t be defined as before, and let
(23.28) M_t(ω) = N_t(ω) + f(t + X_1(ω)).

If R is the set of rationals, then P[ω: f(t + X_1(ω)) ≠ 0] = P[ω: X_1(ω) ∈ R − t] = 0 for each t, because R − t is countable and X_1 has a density. Thus P[M_t = N_t] = 1 for each t, and so the stochastic process [M_t: t ≥ 0] has the same finite-dimensional distributions as [N_t: t ≥ 0]. For ω fixed,
however, M_t(ω) as a function of t is everywhere discontinuous and is neither monotone nor exclusively integer-valued.

The functions obtained by fixing ω and letting t vary are called the path functions or sample paths of the process. The example above shows that the finite-dimensional distributions do not suffice to determine the character of the path functions. In specifying a stochastic process as a model for some phenomenon, it is natural to place conditions on the character of the sample paths as well as on the finite-dimensional distributions. Condition 0° was imposed throughout this section to ensure that the sample paths are nondecreasing, right-continuous, integer-valued step functions, a natural condition if N_t is to represent the number of events in [0, t]. Stochastic processes in continuous time are studied further in Chapter 7.
PROBLEMS

Assume that the Poisson processes here satisfy Condition 0° as well as Condition 1°.

23.1. Show that the minimum of independent exponential waiting times is again exponential and that the parameters add.

23.2. 20.23 ↑ Show that the time S_n of the nth event in a Poisson stream has the gamma density f(x; α, n) as defined by (20.47). This is sometimes called the Erlang density.

23.3. Let A_t = t − S_{N_t} be the time back to the most recent event in the Poisson stream (or to 0), and let B_t = S_{N_t+1} − t be the time forward to the next event. Show that A_t and B_t are independent, that B_t is distributed as X_1 (exponentially with parameter α), and that A_t is distributed as min{X_1, t}: P[A_t ≤ x] is 0, 1 − e^{−αx}, or 1 as x < 0, 0 ≤ x < t, or x ≥ t.

23.4. ↑ Let L_t = A_t + B_t = S_{N_t+1} − S_{N_t} be the length of the interarrival interval covering t.
(a) Show that L_t has density

f(x) = α²xe^{−αx} if 0 < x ≤ t, f(x) = (1 + αt)αe^{−αx} if x > t.

(b) Show that E[L_t] converges to 2E[X_1] as t → ∞. This seems paradoxical because L_t is one of the X_n. Give an intuitive resolution of the apparent paradox.

23.5. Merging Poisson streams. Define a process {N_t} by (23.5) for a sequence {X_n} of random variables satisfying (23.4). Let {X_n'} be a second sequence of random variables, on the same probability space, satisfying (23.4), and define {N_t'} by N_t' = max[n: X_1' + ··· + X_n' ≤ t]. Define {N_t''} by N_t'' = N_t + N_t'. Show that, if σ(X_1, X_2, ...) and σ(X_1', X_2', ...) are independent and {N_t} and {N_t'} are Poisson processes with respective rates α and β, then {N_t''} is a Poisson process with rate α + β.

23.6. ↑ The nth and (n + 1)st events in the process {N_t} occur at times S_n and S_{n+1}.
(a) Find the distribution of the number N'_{S_{n+1}} − N'_{S_n} of events in the other process during this time interval.
(b) Generalize to N'_{S_m} − N'_{S_n}.

23.7. For a Poisson stream and a bounded linear Borel set A, let N(A) be the number of events that occur at times lying in A. Show that N(A) is a random variable having the Poisson distribution with mean α times the Lebesgue measure of A. Show that if A and B are disjoint, then N(A) and N(B) are independent.

23.8. Suppose that X_1, X_2, ... are independent and exponentially distributed with parameter α, so that (23.5) defines a Poisson process {N_t}. Suppose that Y_1, Y_2, ... are independent and identically distributed and that σ(X_1, X_2, ...) and σ(Y_1, Y_2, ...) are independent. Put Z_t = Σ_{k≤N_t} Y_k. This is the compound Poisson process. If, for example, the event at time S_n in the original process represents an insurance claim, and if Y_n represents the amount of the claim, then Z_t represents the total claims to time t.
(a) If Y_k = 1 with probability 1, then {Z_t} is an ordinary Poisson process.
(b) Show that Z_t has independent increments and that Z_{s+t} − Z_s has the same distribution as Z_t.
(c) Show that, if Y_k assumes the values 1 and 0 with probabilities p and 1 − p (0 < p ≤ 1), then {Z_t} is a Poisson process with rate pα.

23.9. Suppose a process satisfies Condition 0° and has independent, Poisson-distributed increments and no fixed discontinuities. Show that it has the form {N_{φ(t)}}, where {N_t} is a standard Poisson process and φ is a nondecreasing, continuous function on [0, ∞) with φ(0) = 0.

23.10. If the waiting times X_n are independent and exponentially distributed with parameter α, then S_n/n → α^{−1} with probability 1, by the strong law of large numbers. From lim_{t→∞} N_t = ∞ and S_{N_t} ≤ t ≤ S_{N_t+1} deduce that lim_{t→∞} N_t/t = α with probability 1.

23.11. ↑ (a) Suppose that X_1, X_2, ... are positive and assume directly that S_n/n → m with probability 1, as happens if the X_n are independent and identically distributed with mean m. Show that lim_t N_t/t = 1/m with probability 1.
(b) Suppose now that S_n/n → ∞ with probability 1, as happens if the X_n are independent and identically distributed and have infinite mean. Show that lim_t N_t/t = 0 with probability 1.

The results in Problem 23.11 are theorems in renewal theory: A component of some mechanism is replaced each time it fails or wears out. The X_n are the lifetimes of the successive components, and N_t is the number of replacements, or renewals, to time t.

23.12. 20.8 23.11 ↑ Consider a persistent, irreducible Markov chain, and for a fixed state j let N_n be the number of passages through j up to time n. Show that N_n/n → 1/m with probability 1, where m = Σ_{k=1}^∞ k f_{jj}^{(k)} is the mean return time (replace 1/m by 0 if this mean is infinite). See Lemma 3 in Section 8.

23.13. Suppose that X and Y have Poisson distributions with parameters α and β. Show that |P[X = i] − P[Y = i]| ≤ |α − β|. Hint: Suppose that α < β and represent Y as X + D, where X and D are independent and have Poisson distributions with parameters α and β − α.

23.14. ↑ Use the methods in the proof of Theorem 23.2 to show that the error in (23.15) is bounded uniformly in i by |λ − λ_n| + λ_n max_k p_nk.
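The two-piece density in Problem 23.4(a) can be checked numerically. For α = 1, integrating x against it gives E[L_t] = 2 − e^{−t} (this closed form is my own calculation, not the text's), which indeed tends to 2E[X_1] = 2. The following sketch, with names of my choosing, verifies both the normalization and the mean with a midpoint Riemann sum:

```python
import math

def density(x, t, alpha=1.0):
    # Density of L_t from Problem 23.4(a): alpha^2 x e^{-alpha x} on (0, t],
    # and (1 + alpha t) alpha e^{-alpha x} for x > t.
    if x <= 0:
        return 0.0
    if x <= t:
        return alpha ** 2 * x * math.exp(-alpha * x)
    return (1 + alpha * t) * alpha * math.exp(-alpha * x)

def integrate(f, upper=60.0, steps=200_000):
    # Midpoint Riemann sum on (0, upper]; the exponential tail beyond is negligible.
    h = upper / steps
    return h * sum(f(h * (k + 0.5)) for k in range(steps))

t = 3.0
total = integrate(lambda x: density(x, t))       # should be 1
mean = integrate(lambda x: x * density(x, t))    # should be 2 - e^{-t}, near 2 for large t
```

The mean exceeding E[X_1] = 1 is the inspection paradox of part (b): the interval covering a fixed t is length-biased.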
SECTION 24. QUEUES AND RANDOM WALK*

Results in queueing theory can be derived from facts about random walk, facts having an independent interest of their own.

The Single-Server Queue

Suppose that customers arrive at a server in sequence, their arrival times being 0, A_1, A_1 + A_2, .... The customers are numbered 0, 1, 2, ..., and customer n arrives at time A_1 + ··· + A_n, customer 0 arriving by convention at time 0. Assume that the sequence {A_n} of interarrival times is independent and that the A_n all have the same distribution, concentrated on (0, ∞). The A_n may be exponentially distributed, in which case the arrivals form a Poisson process as in the preceding section. Another possibility is P[A_n = a] = 1 for a positive constant a; here the arrival stream is deterministic. At the outset, no special conditions are placed on the distribution common to the A_n.

Let B_0, B_1, ... be the service times of the successive customers. The B_n are by assumption independent of each other and of the A_n, and they have a common distribution concentrated on (0, ∞). Nothing further is assumed initially of this distribution: it may be concentrated at a single point, for example, or it may be exponential. Another possibility is that the service-time distribution is the convolution of several exponential distributions; this represents a service consisting of several stages, the service times of the constituent stages being independent and themselves exponentially distributed.

If a customer arrives to find the server busy, he joins the end of a queue and his service begins when all the preceding customers have been served. If he arrives to find the server free, his service begins immediately; the server is by convention free at time 0 when customer 0 arrives.

*This section may be omitted.
Airplanes coming into an airport form a queueing system. Here the A_n are the times between arrivals at the holding pattern, B_n is the time the plane takes to land and clear the runway, and the queue is the holding pattern itself.

Let W_0, W_1, W_2, . . . be the waiting times of the successive customers: W_n is the length of time customer n must wait until his service begins; thus W_n + B_n is the total time spent at the service facility. Because of the convention that the server is free at time 0, W_0 = 0. Customer n arrives at time A_1 + ··· + A_n; abbreviate this as t for the moment. His service begins at time t + W_n and ends at time t + W_n + B_n. Moreover, customer n + 1 arrives at time t + A_{n+1}. If customer n + 1 arrives before t + W_n + B_n (that is, if t + A_{n+1} < t + W_n + B_n, or W_n + B_n − A_{n+1} > 0), then his waiting time is W_{n+1} = (t + W_n + B_n) − (t + A_{n+1}) = W_n + B_n − A_{n+1}; if he arrives after t + W_n + B_n, he finds the server free and his waiting time is W_{n+1} = 0. Therefore, W_{n+1} = max[0, W_n + B_n − A_{n+1}]. Define

(24.1)    X_n = B_{n−1} − A_n,    n = 1, 2, . . . ;

the recursion then becomes

(24.2)    W_{n+1} = max[0, W_n + X_{n+1}],    n = 0, 1, . . . .

Note that W_n is measurable σ(X_1, . . . , X_n) and hence is measurable σ(B_0, . . . , B_{n−1}, A_1, . . . , A_n).

The equations (24.1) and (24.2) also describe inventory systems. Suppose that on day n the inventory at a storage facility is augmented by the amount B_{n−1} and the day's demand is A_n. Since the facility cannot supply what it does not have, (24.1) and (24.2) describe the inventory W_n at the end of day n.

A problem of obvious interest in queueing theory is to compute or approximate the distribution of the waiting time W_n. Write
(24.3)    S_0 = 0,    S_n = X_1 + ··· + X_n,    n = 1, 2, . . . .

The following induction argument shows that

(24.4)    W_n = max_{0≤k≤n} (S_n − S_k).

As this is obvious for n = 0, assume that it holds for a particular n. If W_n + X_{n+1} is nonnegative, it coincides with W_{n+1}, which is therefore the maximum of S_n − S_k + X_{n+1} = S_{n+1} − S_k over 0 ≤ k ≤ n; this maximum, being nonnegative, is unchanged if extended over 0 ≤ k ≤ n + 1,
whence (24.4) follows for n + 1. On the other hand, if W_n + X_{n+1} < 0, then S_{n+1} − S_k = S_n − S_k + X_{n+1} < 0 for 0 ≤ k ≤ n, so that max_{0≤k≤n+1}(S_{n+1} − S_k) and W_{n+1} are both 0. Thus (24.4) holds for all n.

Since the components of (X_1, X_2, . . . , X_n) are independent and have a common distribution, this random vector has the same distribution over R^n as (X_n, X_{n−1}, . . . , X_1). Therefore, the partial sums S_0, S_1, S_2, . . . , S_n have the same joint distribution as 0, X_n, X_n + X_{n−1}, . . . , X_n + ··· + X_1; that is, they have the same joint distribution as S_n − S_n, S_n − S_{n−1}, S_n − S_{n−2}, . . . , S_n − S_0. Therefore, (24.4) has the same distribution as max_{0≤k≤n} S_k. The following theorem sums up these facts.
Theorem 24.1. Let {X_n} be an independent sequence of random variables with common distribution, and define W_n by (24.2) and S_n by (24.3). Then (24.4) holds and W_n has the same distribution as max_{0≤k≤n} S_k.
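Both the recursion (24.2) and the distributional identity of Theorem 24.1 are easy to check by simulation. The sketch below is illustrative, not from the text: it takes exponential interarrival times (rate 2) and exponential service times (rate 3), so the traffic is light, and compares the empirical value of P[W_n ≤ 1] with that of P[max_{0≤k≤n} S_k ≤ 1].

```python
import random

rng = random.Random(7)
ARR, SRV, N, REPS = 2.0, 3.0, 30, 20_000   # illustrative rates, horizon, sample size

def waiting_time(n):
    """W_n via the Lindley recursion (24.2): W_{k+1} = max(0, W_k + X_{k+1})."""
    w = 0.0
    for _ in range(n):
        w = max(0.0, w + rng.expovariate(SRV) - rng.expovariate(ARR))  # X = B - A
    return w

def max_walk(n):
    """max_{0<=k<=n} S_k for the random walk with the same step distribution."""
    s = best = 0.0
    for _ in range(n):
        s += rng.expovariate(SRV) - rng.expovariate(ARR)
        best = max(best, s)
    return best

# Theorem 24.1: W_n and max_{0<=k<=n} S_k have the same distribution.
w_cdf = sum(waiting_time(N) <= 1.0 for _ in range(REPS)) / REPS
m_cdf = sum(max_walk(N) <= 1.0 for _ in range(REPS)) / REPS
print(w_cdf, m_cdf)   # the two empirical values of P[. <= 1] nearly coincide
```

The two estimates differ only by Monte Carlo error, as the theorem predicts.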
In this and the other theorems of this section, the X_n are any independent random variables with a common distribution; it is only in the queueing-theory context that they are assumed defined by (24.1) in terms of positive, independent random variables A_n and B_n. The problem is to investigate the distribution function
(24.5)    M_n(x) = P[W_n ≤ x] = P[max_{0≤k≤n} S_k ≤ x],    x ≥ 0.

(Since W_n ≥ 0, M_n(x) = 0 for x < 0.) By the zero-one law (Theorem 22.3), the event [sup_k S_k = ∞] has probability 0 or 1. In the latter case max_{0≤k≤n} S_k → ∞ with probability 1, so that
(24.6)    lim_n M_n(x) = 0

for every x, however large. On the other hand, if [sup_k S_k = ∞] has probability 0, then

(24.7)    M(x) = P[sup_{k≥0} S_k ≤ x]

is a well-defined distribution function. Moreover, [max_{0≤k≤n} S_k ≤ x] ↓ [sup_k S_k ≤ x] for each x, and hence

(24.8)    M_n(x) ↓ M(x).

If the traffic intensity ρ = E[B_n]/E[A_n] exceeds 1, then E[X_n] > 0 by (24.1), and since n^{−1}S_n → E[X_1] with probability 1 by the strong law of large numbers, sup_n S_n = ∞ with probability 1. In this case (24.6) holds: the waiting time of the nth customer has a distribution that may be said to move out to +∞; the service time is so large with respect to the arrival rate that the server cannot accommodate the customers. It can be shown that sup_n S_n = ∞ with probability 1 also if ρ = 1, in which case E[X_n] = 0. The analysis to follow covers only the case ρ < 1. Here the strong law gives n^{−1}S_n → E[X_1] < 0, S_n → −∞, and sup_n S_n < ∞ with probability 1, so that (24.8) applies.
Random Walk and Ladder Indices
it helps to visualize the Sn as the successive positions in a random walk. The integer n is a ladder index for the random walk if Sn is the maximum position up to time n:
To calculate (24.7)
(24 .1 0 )
Note that the maximum is nonnegative because S0 = 0. Define a finite measure L by (23 .11) L ( A ) =
P[ max n sk n- l OO
P sup Sn e A
] = ( 1 - p ) l/J ( A ) ,
A c [0, oo ) .
If A c (0, oo), then P[Hn e A] = p n - 1L(A), as will be seen in the course of the proof; thus (24.15) does not assert the independence of the Hn - obvi ously Hn = 0 implies that Hn + 1 = 0.
327 24. QUEUES AND RANDOM WALK (i) For n = 1, (24.15) is just the definition (24.1 1). Fix subsets PROOF. .A1 , . . . , A n of (0, oo) and for k > n - 1 let Bk be the w-set where H; A; for i � n - 1 and T1 + · · · + Tn - l k. Then Bk E o( X1, . . . , Xk ) and so independence and the assumption that the X; all have the same distribuSECTION
=
by tion,
E
Summing over m = 1, 2, . . . and then over k = n - 1, n, n + 1, . . . gives P[H; e A ; , i � n] = P [ H; E A ; , i < n - 1] L ( A n ). Thus (24.15) follows by induction. Taking A 1 = · · · = A n = (0, oo) in (24.15) shows that the probability of at least n ladder indices is P[H; > 0, i < n] = p n . (ii) If p = 1, then there must be infinitely many ladder indices with probability 1. In this case the Hn are independent random variables with distribution L concentrated on (0, oo ) ; since 0 < E[ Hn ] < oo, it follows by the strong law of large numbers that n - 1 E'ic,_1Hk converges to a positive number or to oo, so that (24.13) is infinite with probability 1. (iii) Suppose that p < 1. The chance that there are exactly n ladder indices is p n - p n +l = p n (1 - p) and with probability 1 there are only finitely many. Moreover, the chance that there are exactly n ladder indices and that H; E A ; for i < n is L(A 1 ) · • • L(A n ) - L(A 1 ) · • · L(A n )L(O, oo ) = L(A 1 ) • • • L(A n )(1 - p). Conditionally on there being exactly n ladder indices, H1 , . . . , Hn are therefore independent, each with distribution p - 1L. Thus the probability that there are exactly n ladder indices and H1 + · · · + Hn E A is (1 - p)L n *(A) for A c (0, oo). This 0 holds also for n = 0: (1 - p )L * consists of a mass 1 - p = P[sup k Sk = 0] at the point 0. Adding over n = 0, 1, 2, . . . gives (24.16). Note that in this case there are with probability 1 only finitely many positive terms on the • right in (24.13). Exponential Right Tail
Further information about the p and 1/1 in (24.16) can be obtained by a time-reversal argument like that leading to Theorem 24.1. As observed before, the two random vectors (S0, S1 , . • . , Sn ) and (Sn - Sn, Sn S,._ 1 , • • • , Sn - S0) have the same distribution over R n + t . Therefore, the event [max o k n sk < sn E A] has the same probability as the event
.. E ( X1 )s ( s - " ) !F+ ( s ) - s + " ( 1 - F(O)) '
s > 0.
This, a form of the Khinchine-Po//aczek formula, is valid under the hypothe ses of Theorem 24.5. Because of Theorem 22.2, it determines the distribu tion of supn 0Sn . 'i!!
SECTION
24.
333
QUEUES AN D RAN DOM WALK
Example 24. 2.
Suppose that the queueing variables A ,, are exponen tially distributed with parameter a, so that the arrival stream is a Poisson process . Let b = E [ Bn], and assume that the traffic intensity p = ab is less than 1 . If the B, have distribution function B( x ) an d transform ffl( s ) = JOOe -sx dB ( x ), then F(x ) has densityt ae axdl( a ) ,
F (x) = '
ae ax
J
x < O
e - a v dB ( y ) ,
y>x
X>
0.
Note that F(O) = !11 ( a). Thus Theorem 24.5 applies for A = a, and revers ing integrals shows that �+ (s) = a( a - s ) - 1 ( 91(s) - F(O)). Therefore, (24. 3 4) becomes (24.35)
.K
(s ) =
(1 - p )s s - a + adl( s ) '
the
usual form of the Khinchine-Pollaczek formula. If the B,. are exponen tially distributed with parameter p, this checks with the Laplace transform • of the distribution function M derived in Example 24.1. Queue Size
Suppose that p < 1 and let Qn be the size of the queue left behind by customer n when his service terminates. His service terminates at time A I + . . . + A n + wn + Bn , and customer n + k arrives at time A l + A n + k · Therefore, Qn < k if and only if A 1 + · · + A n + k > A 1 + + + A n + Wn + Bn . (This requires the convention that a customer arriving exactly when service te rminates is not counted as being left behind in the queue.) Thus ·
·
·
·
·
·
·
k
> 1.
Since W,. is measurable o( B0, , Bn _ 1 , A 1 , , A n ) (see (24.2)), the three random variables Wn , Bn , and - ( A n + l + + A n + k ) are independent. If Mn , B, and Vk are the distribution functions of these random variables, then by the convolution formula, P[Qn < k ] = j 00 00 Mn ( -y )( B • Vk )( dy ). If M is the distribution function of supn Sn , then (see (24.8)) M,. converges to M and hence p [ Q n < k ] converges to I 00 00 M( -y )( B * vk )( dy ), which is the • • •
• • •
·
tJust
as
·
·
(20.37) gives the density for X + Y, f� 00 g( x - y ) dF( x ) gives the density for X - Y.
334
RANDOM VARIABLES AND EXPECTED VALUES
same as /00 00 Vk ( -y )( M * B )( dy) because convolution is commutative. Dif ferencing for successive values of k now shows that = k ) = qk , P(Qn n -+ oo
( 24.36)
lim
k > 1,
where
These relations hold under the sole assumption that p < 1. If the A n are exponential with parameter a, the integrand in (24.37) can be written down explicitly:
( 24.38) Example
Suppose that the A n and Bn are exponential with parameters p, as in Example 24.1. Then (see (24.29)) M corresponds to a mass of 1 - a;P = 1 - p at 0 together with a density p(p - a ) e - < IJ - a > x over (0, oo ) . By (20.37), M * B has density 24.3. a and
1
fJe - fl < y - x) dM ( x ) = ( fJ
[O. y ]
- a ) e - < fl - a) y .
By (24.38),
Use inductive integration by parts (or the gamma function) to reduce the last integral to k ! Therefore, qk = (1 - p)pk , a fundamental formula of • queuing theory. See Examples 8.11 and 8.12.
C HAP T ER
5
Convergence of Distributions
SECI'ION 25. WEAK CONVERGENCE
Many of the best known theorems in probability have to do with the asymptotic behavior of distributions. This chapter covers both general methods for deriving such theorems and specific applications. The present section concerns the general limit theory for distributions on the real line, and the methods of proof use in an essential way the order structure of the line. For the theory in R k , see Section 29. Definitions
Distribution functions Fn were in Section the distribution function F if
14 defined to converge weakly to
(25 .1 ) for every continuity point x of F; this is expressed by writing Fn � F. Examples 14.1, 14.2, and 14.3 illustrate this concept in connection with the asymptotic distribution of maxima. Example 14.4 shows the point of allowing (25.1) to fail if F is discontinuous at x; see also Example 25.4. Theorem 25.8 and Example 25.9 show why this exemption is essential to a useful theory. If p. n and p. are the probability measures on ( R 1 , !1/ 1 ) corresponding to Fn and F, then Fn � F if and only if lim p. n ( A )
(25 .2 ) for every A of the form
n
A
=
(
-
oo ,
=
p. ( A )
x] for which p. { x } = O-see
(20.5). In 335
336
CONVERGENCE OF DISTRIBUTIONS
this case the distributions themselves are said to converge weakly, which is expressed by writing P. n � p.. Thus Fn � F and P. n => p. are only different expressions of the same fact. From weak convergence it follows that (25.2) holds for many sets A besides half-infinite intervals; see Theorem 25.8.
Example
Let Fn be the distribution function corresponding to a unit mass at n: Fn = /[ n . oo > · Then lim n Fn ( x ) = 0 for every x, so that (25. 1 ) is satisfied if F( x) = 0. But Fn � F does not hold, because F is not a distribution function. Weak convergence is in this section defined only for functions Fn and F that rise from 0 at - oo to 1 at + oo -that is, it is defined only for probability measures p. n and p.. t For another example of the same sort, take Fn as the distribution function of the waiting time W, for the n th customer in a queue. As observed after (24.9), Fn (x) --+ 0 for each x if the traffic intensity exceeds 1 . 25. 1.
•
Example
Poisson approximation to the binomial. Let P. n be the binomial distribution (20.6) for p = 'A/n and let p. be the Poisson distribu tion (20.7). For nonnegative integers k , ( n ) "' k A n - k P. n { k } = k n 1 - n n k k-1 'A/n 'A 1 1 i ) ( x = n 1 -k n k! ( 1 - 'Ajn ) i = O 25. 2.
( )(
)
(
)
if n > k . As n --+ oo the second factor on the right goes to 1 for fixed k , and so P. n { k } --+ p. { k } ; this is a special case of Theorem 23.2. By the series form of Scheffe's theorem (the corollary to Theorem 16. 1 1 ), (25.2) holds for every set A of nonnegative integers. Since the nonnegative integers support p. and the P. n ' (25.2) even holds for every linear Borel set A. Certainly P. n converges • weakly to p. in this case.
Example 25.3. Let P. n correspond to a mass of n - 1 at each point kjn, k = 0, 1 , . . . , n - 1 ; let p. be Lebesgue measure confined to the unit interval. The corresponding distribution functions satisfy Fn (x) = ([nx] + 1 )/n --+ F(x) for 0 < x < 1 , and so Fn � F. In this case (25. 1) holds for every x, but (25.2) does not as in the preceding example hold for every Borel set
A:
t There is (see Section 28) a related notion of vague convergence in which p. may be defective in the sense that p.( R 1 ) < 1 . Weak convergence is in this context sometimes called complete convergence.
SECTION
25.
WEAK CONVERGENCE
337
is the set of rationals, then P. n (A) = 1 does not converge to p. (A) = 0. Despite this, P. n does converge weakly to p.. •
if A
Example
If P. n is a unit mass at x n and p. is a unit mass at x, then p , :::o p. if and only if x n --+ x. If x n > x for infinitely many n , then (25.1) • fails at the discontinuity point of F. 25.4.
Unifonn Distribution Modulo 1*
For a sequence x1 , x 2 , of real numbers, consider the corresponding sequence of their fractional parts { xn } == xn - [ xn ]. For each n , define a probability measure p. n •
•
•
by
(25 .3)
P.n ( A ) == -1 , # ( k : 1 < k < n , { xk } E A ] ; n
has mass n - 1 at the points { x1 }, , { xn }, and if several of these points coincide, the masses add. The problem is to find the weak 1imi t of { p. n } in number-theoretically interesting cases. Suppose that xn == nplq, where plq is a rational fraction in lowest terms. Then { kplq } and { k'plq } coincide if and only if (k - k')plq is an integer, and since p and q have no factor in common, this holds if and only if q divides k - k'. Thus for any q successive values of k the fractional parts { kp1q } consist of the q points
P.n
•
0 1
•
•
,..., q'q
(25 .4)
q-1 q
in
some order. If O < x < 1, then P.n [O, x] == P.n [O, y ], where y is the point ilq just left of x, namely y == [ qx]lq. If r == [ nlq], then the number of { kpjq } in [0, y] for 1 s k < n is between r([ qx] + 1) and ( r + 1)([ qx] + 1) and so differs from nq- 1 ([ qx] + 1) by at most 2 q. Therefore, {25 .5)
- [ qx 1 + 1 2q p." [ 0 ' x 1 < n ' q
0 < X < 1.
If _, consists of a mass of q - 1 at each of the points (25.4), then (25.5) implies that l'n � _, . Suppose next that xn == n 8, where 8 is irrational. If pIq is a highly accurate rational approximation to 8 (q large), then the measures (25.3) corresponding to { n fJ } and to { npIq } should be close to one another. If ., is Lebesgue measure restricted to the unit interval, then ., is for large q near ., (see Example 25.3), and so there is reason to believe that the measures (25.3) for { n8 } will converge weakly to
P.
If the #Ln defined by (25.3) converge weakly to Lebesgue measure restricted to the unit interval, the sequence x1 , x 2 , is said to be uniformly distributed modulo 1. In •
•
This topic may be omitted.
•
•
338
CONVERGENCE OF DISTRIBUTIONS
this case every subinterval has asymptotically its proportional share of the points { x n } ; by Theorem 25.8 below, the same is true of every subset whose boundary has Lebesgue measure 0.
Theorem 25.1. PROOF.
Define
For () irrational, { n6 } is uniformly distributed modulo 1 . #Ln by (25.3) for xn #Ln ( X , y]
( 25 .6)
n6. It suffices to show that
==
--+ y -
X,
0 < x < y < 1.
Suppose that 0 < £ < min { x , 1 - y } .
(25 .7)
It follows by the elementary theory of continued fractionst that for n exceeding some n 0 == n 0 ( £ , () ) there exists an irreducible fraction pIq (depending on £ , n , and () ) such that
p n 6 - -q < £ .
q a. that is. if and only if
if x < a , if x > a . If (25.1 1 ) holds for all positive £ , Xn may be said to converge to a in probability. As this does not require that the Xn be defined on the same space, it is not really a special case of convergence in probability as defined by (25.10). Convergence in probability in this new sense will be denoted xn � a, in accordance with the theorem just proved. Example 1 4.4 restates the weak law of large numbers in terms of this concept. Indeed, if X1 , X2 , . . . are independent, identically distributed ran dom variables with finite mean m , and if Sn = X1 + + Xn , the weak law 1 of large numbers is the assertion n - sn � m. Example 6.4 provides another illustration : If Sn is the number of cycles in a random permutation on n letters, then Snf log n � 1 . ·
·
·
Example 25. 7. Suppose that Xn � X and 8n � 0. Given £ and 11 choose x so that P [ I XI > x ] < 11 and P[ X = ± x] = 0, and then choose n 0 so that n � n 0 implies that 1 8n l < £/X and I P [ Xn < y] - P [ X < y ] l < 11 for y = ± x. Then P [ l 8n Xn 1 > £] < 311 for n > n 0 . Thus Xn � X and 8n --+ 0 imply that 8n Xn � 0, a restatement of Lemma 2 of Section 14 (p. 195). • The
asymptotic properties of a random variable should remain un affected if it is altered by the addition of a random variable that goes to 0 in probability. Let ( Xn , Yn ) be a two-dimensional random vector. Theorem 25.4.
�
X and Xn - Yn
�
0, then Yn
Suppose that y ' < x < y " and P [ X = y'] x - y' and £ < y" - x, then
PROOF.
(
x
as
n -+
_ lim lim sup n P ( I x,� u ) Y,. l > £ ]
( 25 .14) for positive
u,
u
£ , then Y,.
-+
oo , if
two-dimen
x< u> => X as
u -+
oo .
== 0
X.
Replace X,. by X,� u> i n (25.12). If P( X = y') = 0 P( x< u> == y ' ] and P[ X = y"] == 0 P[ x p., there exist random variables Yn and Y having these measures as distributions and satisfying Y,, � Y. According to the following theorem, the Yn and Y can be constructed on t he same probability space, and even in such a way that Yn( w ) � Y( w ) for every w- a condition which is of course much stronger than Yn => Y. This result, Skorohod 's theorem , makes possible very simple and transparen t proofs of many important facts.
SECTION
25.
343
WEAK CONVERGENCE
neorem 25.6. Suppose that P.n and p. are probability measures on ( R 1 , 9l 1 ) and p. n => p.. There exist random variables Yn and Y on a common probability space ( 0 , !F .. P ) such that Yn has distribution p. , , Y has distribution p., and Yn ( w ) __... Y( w ) for each w.
For the probability space ( 0, !F, P ), take 0 (0, 1 ), let !F consist of the Borel subsets of (0, 1), and for P( A ) take the Lebesgue measure of A . The construction is related to that in the proofs of Theorems 14.1 and 20 .4. Consider the distribution functions F, and F corresponding to P.n and p.. For 0 < w < 1 , put Yn( w ) = in f[ x : w < Fn( x )] and Y( w) = in f[ x : w < F(x )]. Since w < Fn(x) if and only if Yn( w ) < x (see the argument follow ing (14.5)), P [ w: Yn( w ) < x] = P[w: w < Fn( x )] = Fn( x ). Thus Yn has distribution function Fn; similarly, Y has distribution function F. It remains to show that Yn( w ) --. Y( w ). The idea is that Y,, and Y are essentially inverse functions to Fn and F; if the direct functions converge, so must the inverses. Suppose that 0 < w < 1 . Given £, choose x so that Y( w ) - £ < x < Y( w) and p. { x } = 0. Then F( x) < w; Fn( x ) --. F( x ) now implies that, for n large enough, Fn( x ) < w and hence Y( w ) - £ < x < Y,, ( w ). Thus lim infn Yn( w ) > Y( w ). If w < w' and £ is positive, choose a y for which Y( w') < y < Y( w') + £ and p. { y } = 0. Now w < w' < F( Y( w')) < F( y ), and so, for n large enough, w < Fn( Y ) and hence Yn( w ) < y < Y( w') + £. Thus lim supn Yn( w) < Y( w') if w < w'. Therefore, Yn ( w) __... Y( w) if Y is continuous at w. Since Y is nondecreasing on (0, 1 ), it has at most countably many discontinuities. At discontinuity points w of Y, redefine Yn( w) = Y( w ) = 0. With this change, Yn( w) --. Y( w ) for every w. Since Y and the Yn have been altered only on a set of Lebesgue measure 0, their distributions are still p. n and p.. • =
PROOF.
Note that this proof uses the order structure of the real line in an essential way. The proof of the corresponding result in R k (Theorem 29.6) is more complicated. The following mapping theorem is of very frequent use. lbeorem 25.7. Suppose that h : R1 --. R1 is measurable and that the set Dh of its discontinuities is measurable . t If P. n => p. and p.( Dh ) = 0, then P.n h - 1 � ,., h - 1 . t
That Dh lies in 9l1 is generally obvious in applications. In point of fact. it always holds (even
h is not measurable): Let A ( ( , B ) be the set of x l x - Y l < B. l x - z l < B. and l h ( y ) - h ( z ) l > ( . if
A( ( . B ). where
( and
B
for which there exist y and Then
range over the positive rationals.
A( (. B )
z
such that
is open and Dh
=
U( n8
344
CONVERGENCE OF DISTRIBUTIONS
Recall (see (1 3.7)) that
p.h - 1
has value p.( h - 1A ) at
A.
PROOF. Consider the random variables Yn and Y of Theorem 25.6. Since Yn ( w ) � Y( w ), Y(w) � Dh implies that h( Y,,( w)) � h ( Y(w)). Since P[w: Y( w ) E Dh] = p.(Dh) = 0, h(Yn (w)) � h(Y(w)) holds with probability 1. Hence h ( Yn ) => h ( Y) by Theorem 25.2. Since P[h(Y) E A] = P[ Y E h - 1A] = p. ( h - 1A ), h ( Y) has distribution p.h - 1 ; similarly, h ( Yn ) has distribution P. n h - 1 • Thus h ( Yn ) => h(Y) is the same thing as P. n h- 1 => p.h - 1 • • Because of the definition of convergence in distribution, this result has an equivalent statement in terms of random variables: Corollary 1.
Take
If Xn => X and P[ X E Dh] = 0, then h( Xn ) => h( X).
X = a:
Corollary 2.
If Xn � a and h is continuous at a, then h( Xn ) => h(a).
From Xn => X aXn + b => aX + b. Suppose also a ) Xn => 0 by Example 25.7, and now an Xn + bn => aX + b follows
it follows directly by the theorem that that a n � a and bn � b. Then (a n so (anXn + bn) - (aX, + b) => 0. And by Theorem 25 . 4 : If Xn => X, an � a, and bn � b, then anXn + bn => aX + b. This fact was stated and proved • differently in Section 14-see Le mma 1 on p. 195. Example 25.8.
By definition, p. n => p. means that the corresponding distribution func tions converge weakly. The following theorem characterizes weak conver gence without reference to distribution functions. The boundary aA of A consists of the points that are limits of sequences in A and are also limits of sequences in Ac; alternatively, aA is the closure of A minus its interior. A set A is a p.-continuity set if it is a Borel set and p.( a A ) = 0. Theorem 25.8. The following three conditions are equivalent. (i) p. n p.; (ii) f f dp. n � f fdp. for every bounded, continuous real function f; (iii) P.n( A ) � p.(A) for every p.-continuity set A. :::0
Suppose that P.n => p. and consider the random variables Yn and Y of Theorem 25.6. Suppose that f is a bounded function such that p.( D1) = 0, where D1 is the set of points of discontinuity of f. Since P [ Y E D1] = p.( D1 ) = 0, f( Yn) � f( Y) with probability 1, and so by change of variable (see (21 .1)) and the bounded convergence theorem, ffdp.n = E[/( Yn)] � PROOF.
SECTION
25.
WEAK CONVERGENCE
345
E[ /( Y )] = f fdp.. Thus P.n => p. and p.( D1 ) = 0 together imply that f fdp.n __. f f dp. if f is bounded. In particular, (i) implies (ii). Further, if f = IA ' then D1 = aA , and from p.( aA ) = O and P.n => p. follows p.,,( A ) = f fdp.n __. f I dp. = p. ( A ). Thus (i) also implies (iii). Since a( - 00 , x ] = { X } , obviously (iii) implies (i). It therefore remains only to deduce p. n => p. from (ii). Consider the corresponding distribution functions. Suppose that x < y, and let /( t ) be 1 for t < x, 0 for t > y, and interpolate linearly on [x, y]: /( t ) = ( y - t )/( y - x ) for x < t < y . Since Fn (x ) < f fdp. n and f fdp. < F( y ), it follows from (ii) that lim supn Fn(x ) < F( y ); letting y � x shows that lim supn Fn(x ) < F( x ). Similarly, F( u ) < lim infn Fn( x ) for u < x and hence F( x - ) < lim infn Fn( x). This implies • convergence at continuity points. The function f in this last part of the proof is uniformly continuous. Hence P.n => p. follows if J fdp.n � f fdp. for every bounded and uniformly continuous f. Example 25. 9. The distributions of Example 25.3 satisfy P.n => p., but Hence this A P n ( A ) does not converge to p.( A) if A is the set of rationals. • cannot be a p.-continuity set; in fact, of course, a A = R 1 •
The concept of weak convergence would be nearly useless if (25.2) were not allowed to fail when p.( aA ) > 0. Since F(x) - F(x - ) = p. { x } = p.( a( - 00 , X )), it is therefore natural in the original definition tO allow (25.}) to fail when x is not a continuity point of F. Belly's 1beorem
One of the most frequently used results in analysis is the Helly selection theorem: Theorem 25.9. For every sequence { Fn } of distribution functions there exists a subsequence { Fn " } and a nondecreasing, right-continuous function F such that lim k Fn "(x) = F(x) at continuity points x of F. An application of the diagonal method [A14] gives a sequence PROOF. { n k } of integers along which the limit G( r ) = lim k Fn"( r ) exists for every rational r. Define F( x) = inf(G( r ): x < r ] . Clearly F is nondecreasing. To each x and £ there is an r for which x < r and G ( r ) < F( x ) + £. If x s y < r, then F( y ) < G(r) < F(x) + £. Hence F is continuous from the right. If F is continuous at x, choose y < x so that F( x ) - £ < F( y ) ; now choose rational r and s so that y < r < x < s and G(s ) < F( x) + £. From
346
CONVERGENCE OF DISTRIBUTIONS
F( x ) - £ < G ( r ) < G(s ) < F(x ) + £ and Fn ( r ) < Fn (x ) < Fn( s ) it follows that as k goes to infinity Fn k(x) has limits superior and inferior within £ of F( x ). • The F in this theorem necessarily satisfies 0 < F(x) < 1 . But F need not be a distribution function; if Fn has a unit jump at n , for example, F( x ) 0 is the only possibility. It is important to have a condition which ensures that for some subsequence the limit F is a distribution function. A sequence of probability measures P.n on (R 1 , !1/ 1 ) is said to be tight if for each £ there exists a finite interval ( a, b] such that p. n( a, b] > 1 - £ for all n . In terms of the corresponding distribution functions Fn, the condition is that for each £ there exist x and y such that Fn(x) < £ and F,, ( y ) > 1 - £ for all n . If p. n is a unit mass at n , { p. n } is not tight in this sense- the mass of p. n "escapes to infinity." Tightness is a condition preventing this escape of mass. =
Theorem 25.10. Tightness is a necessary and sufficient condition that for every subsequence { p. n k } there exist a further subsequence { p. n } and a probability measure p. such that p. n k ( J ) � p. as j � oo . k(})
Only the sufficiency of the condition in this theorem is used follows.
in
what
Apply Helly' s theorem to the subsequence { F,, " } of corresponding distribution functions. There exists a further subsequence { Fn k j ) } such that lim 1.Fn k ( J )( x ) = F(x ) at continuity points of F, where F is � nonaecreasing and right-continuous. There exists by Theorem 12.4 a measure p. on ( R 1 , !1/ 1 ) such that p.(a, b] = F( b ) - F( a ) . Given £, choose a and b so that P. n( a , b ] > 1 - £ for all n , which is possible by tightness. By decreasing a and increasing b, one can ensure that they are continuity points of F. But then p. ( a, b] > 1 - £. Therefore, p. is a probability mea • sure, and of course p. n k ( J ) � p. . PROOF :
SUFFICIENCY.
If { p. n } is not tight, there exists a positive £ such that for each finite interval ( a, b ], p. n( a, b] < 1 - £ for some n . Choose n " so that P. n k ( - k , k ] < 1 - £. Suppose that some subsequence { P. n A ( J ) } of { P. n A } we re to converge weakly to some probability measure p. . Choose ( a, b] so that p. { a } = p. { b } = 0 and p. ( a, b ] > 1 - £. For large enough }, ( a , b) c ( - k ( j ), k ( j )], and so 1 - £ > P. n A (J ) ( - k ( j ), k ( j )] > p. , A ( J ) ( a, b] � p. ( a � b] . • Thus p. ( a, b ] < 1 - £, a contradiction. NECESSITY.
Corollary. If { P.n } is a tight sequence of probability measures, and if each subsequence that converges weakly at all converges weakly to the probability measure p. , then p. n � p. .
SECTION
25.
WEAK CONVERGENCE
347
By the theorem, each subsequence { p. n " } contains a further subsequence { p. n } converging weakly ( j � oo) to some limit, and that limit must by hypothesis be p.. Thus every subsequence { p., A } contains a further su bsequence { p. n lc ( J ) } converging weakly to p.. Suppose that p. n � p. is false. Then there exists some x such that p { x } = 0 but P. n( - oo, x] does not converge to p.( - oo , x]. But then there exists a positive £ such that I P. n "( - oo , x] - p. ( - oo , x ] l > £ for an infinite sequence { n k } of integers, and no subsequence of { P. n A } can converge • weakly to p.. This contradiction shows that P.n � p.. PROOF.
A(})
If ,., n is a unit mass at X n ' then { ,., n } is tight if and only if { X n } is bounded. The theorem above and its corollary reduce in this case to standard facts about the real line; see Example 25.4 and A10: tightness of sequences of probability measures is analogous to boundedness of sequences of real numbers. Let P.n be the normal distribution with mean m n and variance on2• If m n and on2 are bounded, then the second moment of P.n is bounded, and it follows by Markov' s inequality (21 .12) that { P.n } is tight. The conclusion of Theorem 25.10 can also be checked directly: If { n k (J) } is chosen so that lim1.m n lc ( J ) = m and lim 1.on2lc ( J ) = o 2, then P.n A t J ) � p., where p. is normal with mean m and variance o2 (a unit mass at m if o2 = 0). If m n > b, then P.n(b, oo) > ! ; if m n < a, then p. , ( - oo , a] > t · Hence { p. n } cannot be tight if m n is unbounded. If m n is bounded, say by K, then P n ( - oo , a ] > v ( - oo, ( a - K )on- 1 ], where v is the standard normal distri bution. If on is unbounded, then v ( - oo, (a - K )on- 1 ] � i along some subsequence, and { p. n } cannot be tight. Thus a sequence of normal distribu • tions is tight if and only if the means and variances are bounded. Example 25. 10.
Integration to the Limit 1beorem 25. 1 1.
If Xn � X, then E [ I XI ] < lim infn E [ 1 Xn 1 1 -
Apply Skorohod ' s Theorem 25.6 to the distributions of X, and X: There exist on a common probability space random variables Y,, and Y such that Y = lim n Yn with probability 1 , Yn has the distribution of Xn , and Y has the distribution of X. By Fatou ' s lemma, E[ I Yl ] < lim inf, E[ I Y,, l ]. Since 1 XI and 1 Yl have the same distribution, they have the same expected value (see (21 .6)), and similarly for 1 xn 1 and I Yn 1 • PROOF.
The random variables Xn are said to be uniformly integrable if (25 . 1 5 )
lim sup a , .....
oo
£
[ I xn 1 � a J
I X,. I dP
=
0;
348
CONVERGENCE OF DISTRIBUTIONS
see (16.21). This implies (see (16.22)) that sup E ( I Xn l ) < 00 .
( 25 .16 )
n
If Xn � X and the Xn are uniformly integrable, then X is
Theorem 25. 12.
integrable and ( 25 .17)
Construct random variables Yn and Y as in the preceding proof. Since Yn � Y with probability 1 and the Yn are uniformly integrable in the sense of (16.21 ), E [ Xn ] = E [ Yn ] � E [ Y ] = E [ X] by Theorem 16.13. • PROOF.
If supn E [ I xn 1 1 + ( ] < 00 for some positive integrable because
£'
then the xn are uniformly
(25 .18 )
Since Xn :::o X implies that x; � xr by Theorem 25.7, there is the follow ing consequence of the theorem.
Let r be a positive integer. If Xn :::o X and supn E [ I Xn l r + f ] < > 0, then E [ I X I r] < 00 and E [ x;] � E[ xr].
Corollary.
here
w
£
oo ,
The Xn are also uniformly integrable if there is an integrable random variable Z such that P[ I Xn l > t ] s P[ I Z I > t ] for t > 0, because then ( 2 1 . 1 0 ) gives
1
[ I X,.I � a)
I X,. I dP
a ]
I Z I dP.
From this the dominated convergence theorem follows again.

PROBLEMS

25.1. (a) Show by example that distribution functions having densities can converge weakly even if the densities do not converge. Hint: Consider f_n(x) = 1 + cos 2πnx on [0, 1].
(b) Let f_n be 2ⁿ times the indicator of the set of x in the unit interval for which d_{n+1}(x) = ··· = d_{2n}(x) = 0, where d_k(x) is the kth dyadic digit. Show that f_n(x) → 0 except on a set of Lebesgue measure 0; on this exceptional set, redefine f_n(x) = 0 for all n, so that f_n(x) → 0 everywhere. Show that the distributions corresponding to these densities converge weakly to Lebesgue measure confined to the unit interval.
(c) Show that distributions with densities can converge weakly to a limit that has no density (even to a unit mass).
(d) Show that discrete distributions can converge weakly to a distribution that has a density.
(e) Construct an example, like that of Example 25.3, in which μ_n(A) → μ(A) fails but in which all the measures come from continuous densities on [0, 1].

25.2. 14.12 ↑ Give a simple proof of the Glivenko–Cantelli theorem (Theorem 20.6) under the extra hypothesis that F is continuous.

25.3. Initial digits. (a) Show that the first significant digit of a positive number x is d (in the scale of 10) if and only if {log₁₀ x} lies between log₁₀ d and log₁₀(d + 1), d = 1, ..., 9, where the braces denote fractional part.
(b) For positive numbers x₁, x₂, ..., let N_n(d) be the number among the first n that have initial digit d. Show that

(25.19)    lim_n n⁻¹ N_n(d) = log₁₀(d + 1) − log₁₀ d,    d = 1, ..., 9,

if the sequence log₁₀ x_n, n = 1, 2, ..., is uniformly distributed modulo 1. This is true, for example, of x_n = θⁿ if log₁₀ θ is irrational.
(c) Let D_n be the first significant digit of a positive random variable X_n. Show that

(25.20)    lim_n P[D_n = d] = log₁₀(d + 1) − log₁₀ d,    d = 1, ..., 9,

if {log₁₀ X_n} ⇒ U, where U is uniformly distributed over the unit interval.

25.4. Show that for each probability measure μ on the line there exist probability measures μ_n with finite support such that μ_n ⇒ μ. Show further that μ_n{x} can be taken rational and that each point in the support can be taken rational. Thus there exists a countable set of probability measures such that every μ is the weak limit of some sequence from the set. The space of distribution functions is thus separable in the Lévy metric (see Problem 14.9).
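The Benford frequencies in Problem 25.3 can be checked numerically for x_n = 2ⁿ. The sketch below (my own illustration; the sample size and tolerance are arbitrary) tracks the fractional part of n·log₁₀ 2 rather than the huge integers 2ⁿ themselves.

```python
from math import log10

# Leading digits of 2^n: {n log10 2} is uniformly distributed mod 1 since
# log10 2 is irrational, so digit d should appear with limiting frequency
# log10(d+1) - log10(d)  (about 30.1% for d = 1, 4.6% for d = 9).
N = 100_000
counts = [0] * 10
frac = 0.0
for _ in range(N):
    frac = (frac + log10(2)) % 1.0       # fractional part of n*log10(2)
    counts[int(10 ** frac)] += 1         # leading digit of 2^n
for d in range(1, 10):
    assert abs(counts[d] / N - (log10(d + 1) - log10(d))) < 0.01
```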
25.5. Show that (25.10) implies that P([X ≤ x] Δ [X_n ≤ x]) → 0 if P[X = x] = 0.

25.6. For arbitrary random variables X_n there exist positive constants a_n such that a_nX_n ⇒ 0.

25.7. Generalize Example 25.8 by showing for three-dimensional random vectors (A_n, B_n, X_n) and constants a and b, a > 0, that, if A_n ⇒ a, B_n ⇒ b, and X_n ⇒ X, then A_nX_n + B_n ⇒ aX + b.

25.8. Suppose that X_n ⇒ X and that h_n and h are Borel functions. Let E be the set of x for which h_nx_n → hx fails for some sequence x_n → x. Suppose that E ∈ ℛ¹ and P[X ∈ E] = 0. Show that h_nX_n ⇒ hX.
25.9. Suppose that the distributions of random variables X_n and X have densities f_n and f. Show that if f_n(x) → f(x) for x outside a set of Lebesgue measure 0, then X_n ⇒ X.

25.10. ↑ Suppose that X_n assumes as values y_n + kδ_n, k = 0, ±1, ..., where δ_n > 0. Suppose that δ_n → 0 and that, if k_n is an integer varying with n in such a way that y_n + k_nδ_n → x, then P[X_n = y_n + k_nδ_n]δ_n⁻¹ → f(x), where f is the density of a random variable X. Show that X_n ⇒ X.

25.11. ↑ Let S_n have the binomial distribution with parameters n and p. Assume as known that

(25.21)    P[S_n = k_n](np(1 − p))^{1/2} → (2π)^{−1/2} e^{−x²/2}

if (k_n − np)(np(1 − p))^{−1/2} → x. Deduce the De Moivre–Laplace theorem: (S_n − np)(np(1 − p))^{−1/2} ⇒ N, where N has the standard normal distribution. This is a special case of the central limit theorem; see Section 27.

25.12. Prove weak convergence in Example 25.3 by using Theorem 25.8 and the theory of the Riemann integral.

25.13. (a) Show that probability measures satisfy μ_n ⇒ μ if μ_n(a, b] → μ(a, b] whenever μ{a} = μ{b} = 0.
(b) Show that, if ∫f dμ_n → ∫f dμ for all continuous f with bounded support, then μ_n ⇒ μ.

25.14. ↑ Let μ be Lebesgue measure confined to the unit interval; let μ_n correspond to a mass of x_{ni} − x_{n,i−1} at some point in (x_{n,i−1}, x_{ni}], where 0 = x_{n0} < x_{n1} < ··· < x_{nn} = 1. Show by considering the distribution functions that μ_n ⇒ μ if max_{i≤n}(x_{ni} − x_{n,i−1}) → 0. Deduce that a bounded Borel function continuous almost everywhere on the unit interval is Riemann integrable. See Problem 17.6.

25.15. 2.15 5.16 ↑ A function f of positive integers has distribution function F if F is the weak limit of the distribution function P_n[m: f(m) ≤ x] of f under the measure having probability 1/n at each of 1, ..., n (see (2.20)). In this case D[m: f(m) ≤ x] = F(x) (see (2.21)) for continuity points x of F. Show that φ(m)/m (see (2.23)) has a distribution:
(a) Show by the mapping theorem that it suffices to prove that f(m) = log(φ(m)/m) = Σ_p δ_p(m) log(1 − 1/p) has a distribution.
(b) Let f_u(m) = Σ_{p≤u} δ_p(m) log(1 − 1/p), and show by (5.41) that f_u has distribution function F_u(x) = P[Σ_{p≤u} X_p log(1 − 1/p) ≤ x], where the X_p are independent random variables (one for each prime p) such that P[X_p = 1] = 1/p and P[X_p = 0] = 1 − 1/p.
(c) Show that Σ_p X_p log(1 − 1/p) converges with probability 1. Hint: Use Theorem 22.6.
(d) Show that lim_{u→∞} sup_n E_n[|f − f_u|] = 0 (see (5.42) for the notation).
(e) Conclude by Markov's inequality and Theorem 25.5 that f has the distribution of the sum in (c).
25.16. 5.16 ↑ Suppose that f(n) → x. Show that P_n[m: |f(m) − x| ≥ ε] → 0. From Theorem 25.12 conclude that E_n[f] → x (see (5.42); use the fact that f is bounded). Deduce the consistency of Cesàro summation [A30].

25.17. For A ∈ ℛ¹ and T > 0, put λ_T(A) = λ([−T, T] ∩ A)/2T, where λ is Lebesgue measure. The relative measure of A is

(25.22)    ρ(A) = lim_{T→∞} λ_T(A),

provided that this limit exists. This is a continuous analogue of density (see (2.21)) for sets of integers. A Borel function f has a distribution under λ_T; if this converges weakly to F, then

(25.23)    ρ[x: f(x) ≤ u] = F(u)

at continuity points u of F.

25.19. Let f(x) = |x|^a with a > 0; this is an f of the kind in Problem 25.18. Construct a tight sequence {μ_n} for which ∫|x|^a μ_n(dx) → ∞; construct one for which ∫|x|^a μ_n(dx) = ∞.
25.20. 23.4 ↑ Show that the random variables A_n and L_n in Problems 23.3 and 23.4 converge in distribution. Show that the moments converge.

25.21. In the applications of Theorem 9.2, only a weaker result is actually needed: For each K there exists a positive α = α(K) such that if E[X] = 0, E[X²] = 1, and E[X⁴] ≤ K, then P[X > 0] ≥ α. Prove this by using tightness and the corollary to Theorem 25.12.

25.22. Find uniformly integrable random variables X_n for which there is no integrable Z satisfying P[|X_n| > t] ≤ P[|Z| > t] for t > 0.

SECTION 26. CHARACTERISTIC FUNCTIONS

Definition
The characteristic function of a probability measure μ on the line is defined for real t by

φ(t) = ∫_{−∞}^{∞} e^{itx} μ(dx);
see the end of Section 16 for integrals of complex-valued functions.† A random variable X with distribution μ has characteristic function

φ(t) = E[e^{itX}] = ∫_{−∞}^{∞} e^{itx} μ(dx).
The characteristic function is thus defined as the moment generating function but with the real argument s replaced by it; it has the advantage that it always exists because e^{itx} is bounded. In nonprobabilistic contexts the characteristic function is called the Fourier transform.

The characteristic function has three fundamental properties to be established here:

(i) If μ₁ and μ₂ have respective characteristic functions φ₁(t) and φ₂(t), then μ₁ * μ₂ has characteristic function φ₁(t)φ₂(t). Although convolution is essential to the study of sums of independent random variables, it is a complicated operation, and it is often simpler to study the products of the corresponding characteristic functions.

(ii) The characteristic function uniquely determines the distribution. This shows that in studying the products in (i), no information is lost.

(iii) From the pointwise convergence of characteristic functions follows the weak convergence of the corresponding distributions. This makes it possible, for example, to investigate the asymptotic distributions of sums of independent random variables by means of their characteristic functions.

Moments and Derivatives
It is convenient first to study the relation between a characteristic function and the moments of the distribution it comes from. Of course, φ(0) = 1, and by (16.25), |φ(t)| ≤ 1 for all t. By Theorem 16.8(i), φ(t) is continuous in t. In fact, |φ(t + h) − φ(t)| ≤ ∫ |e^{ihx} − 1| μ(dx), and so it follows by the bounded convergence theorem that φ(t) is uniformly continuous.
Integration by parts shows that

(26.1)    ∫₀ˣ (x − s)ⁿ e^{is} ds = xⁿ⁺¹/(n + 1) + (i/(n + 1)) ∫₀ˣ (x − s)ⁿ⁺¹ e^{is} ds,

and it follows by induction that

(26.2)    e^{ix} = Σ_{k=0}^{n} (ix)ᵏ/k! + (iⁿ⁺¹/n!) ∫₀ˣ (x − s)ⁿ e^{is} ds

for n ≥ 0. Replace n by n − 1 in (26.1), solve for the integral on the right,

† From complex variable theory only De Moivre's formula and the simplest properties of the exponential function are needed here.
and substitute this for the integral in (26.2); this gives

(26.3)    e^{ix} = Σ_{k=0}^{n} (ix)ᵏ/k! + (iⁿ/(n − 1)!) ∫₀ˣ (x − s)ⁿ⁻¹ (e^{is} − 1) ds.
Estimating the integrals in (26.2) and (26.3) (consider separately the cases x ≥ 0 and x < 0) now leads to

(26.4)    |e^{ix} − Σ_{k=0}^{n} (ix)ᵏ/k!| ≤ min( |x|ⁿ⁺¹/(n + 1)!, 2|x|ⁿ/n! ).

If X and Y are independent, then E[e^{it(X+Y)}] = E[e^{itX}]E[e^{itY}]. This extends to sums of three or more: If X₁, ..., X_n are independent, then

(26.12)    E[exp(it Σ_{k=1}^{n} X_k)] = Π_{k=1}^{n} E[e^{itX_k}].

If X has characteristic function φ(t), then aX + b has characteristic function

(26.13)    E[e^{it(aX+b)}] = e^{itb} φ(at).

In particular, −X has characteristic function φ(−t), which is the complex conjugate of φ(t).
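The product rule (26.12) is easy to check by Monte Carlo with empirical characteristic functions. The sketch below is my own illustration (the uniform and exponential laws, the sample size, and the value of t are arbitrary choices).

```python
import cmath
import random

random.seed(7)

def ecf(sample, t):
    """Empirical characteristic function: average of exp(itx) over the sample."""
    return sum(cmath.exp(1j * t * x) for x in sample) / len(sample)

n, t = 100_000, 1.7
xs = [random.uniform(0.0, 1.0) for _ in range(n)]
ys = [random.expovariate(1.0) for _ in range(n)]

# (26.12) for two summands: the characteristic function of an independent
# sum is the product of the individual characteristic functions.
lhs = ecf([x + y for x, y in zip(xs, ys)], t)
rhs = ecf(xs, t) * ecf(ys, t)
assert abs(lhs - rhs) < 0.02
```

The agreement is to within the usual n^{−1/2} Monte Carlo error; the same computation on dependent X and Y would show a visible gap.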
Inversion and the Uniqueness Theorem
A characteristic function φ uniquely determines the measure μ it comes from. This fundamental fact will be derived by means of an inversion formula through which μ can in principle be recovered from φ. Define

S(T) = ∫₀ᵀ (sin x)/x dx,    T > 0.
In Example 18.3 it is shown that

(26.14)    lim_{T→∞} S(T) = π/2;
S(T) is therefore bounded. If sgn θ is +1, 0, or −1 as θ is positive, 0, or negative, then

(26.15)    ∫₀ᵀ (sin tθ)/t dt = sgn θ · S(T|θ|),    T > 0.

Theorem 26.2. If the probability measure μ has characteristic function φ, and if μ{a} = μ{b} = 0, then

(26.16)    μ(a, b] = lim_{T→∞} (1/2π) ∫_{−T}^{T} ((e^{−ita} − e^{−itb})/(it)) φ(t) dt.

Distinct measures cannot have the same characteristic function.
Note: By (26.41), the integrand here converges as t → 0 to b − a, which is to be taken as its value for t = 0. For fixed a and b the integrand is thus continuous in t, and by (26.40) it is bounded. If μ is a unit mass at 0, then φ(t) ≡ 1 and the integral in (26.16) cannot be extended over the whole line.

PROOF. The inversion formula will imply uniqueness: It will imply that if μ and ν have the same characteristic function, then μ(a, b] = ν(a, b] if μ{a} = ν{a} = μ{b} = ν{b} = 0; but such intervals (a, b] form a π-system generating ℛ¹.

Denote by I_T the quantity inside the limit in (26.16). By Fubini's theorem,

(26.17)    I_T = (1/2π) ∫_{−∞}^{∞} [ ∫_{−T}^{T} ((e^{it(x−a)} − e^{it(x−b)})/(it)) dt ] μ(dx).
This interchange is legitimate because the double integral extends over a set of finite product measure and by (26.40) the integrand is bounded by |b − a|. Rewrite the integrand by De Moivre's formula. Since sin s and cos s are odd and even, respectively, (26.15) gives

I_T = ∫_{−∞}^{∞} [ (sgn(x − a)/π) S(T·|x − a|) − (sgn(x − b)/π) S(T·|x − b|) ] μ(dx).

The integrand here is bounded and converges as T → ∞ to the function

(26.18)    ψ_{a,b}(x) = 0 if x < a;  1/2 if x = a;  1 if a < x < b;  1/2 if x = b;  0 if b < x.

Thus I_T → ∫ ψ_{a,b} dμ, which implies that (26.16) holds if μ{a} = μ{b} = 0. ∎
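The inversion formula (26.16) can also be checked numerically. The sketch below is my own illustration, not part of the text: the truncation T = 50 and the midpoint quadrature grid are arbitrary choices that happen to work well here because the normal characteristic function decays so fast.

```python
import cmath
from math import erf, exp, pi, sqrt

def phi(t):
    """Characteristic function of the standard normal law."""
    return exp(-t * t / 2.0)

def mu_interval(a, b, T=50.0, steps=200_000):
    """Midpoint-rule approximation of (26.16):
    (1/2pi) * integral over [-T, T] of (e^{-ita}-e^{-itb})/(it) * phi(t)."""
    h = 2 * T / steps
    total = 0.0
    for k in range(steps):
        t = -T + (k + 0.5) * h          # never hits t = 0
        term = (cmath.exp(-1j * t * a) - cmath.exp(-1j * t * b)) / (1j * t)
        total += (term * phi(t)).real * h
    return total / (2 * pi)

exact = erf(1 / sqrt(2))                 # P(-1 < N <= 1) for N standard normal
assert abs(mu_interval(-1.0, 1.0) - exact) < 1e-3
```

Applying the same routine to an interval endpoint carrying an atom would show why the hypothesis μ{a} = μ{b} = 0 is needed: the limit then produces the half-weights of (26.18) rather than μ(a, b].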
The inversion formula contains further information. Suppose that

(26.19)    ∫_{−∞}^{∞} |φ(t)| dt < ∞.

In this case the integral in (26.16) can be extended over R¹. By (26.40),

|(e^{−ita} − e^{−itb})/(it)| = |(1 − e^{−it(b−a)})/t| ≤ |b − a|;

therefore, μ(a, b] ≤ ((b − a)/2π) ∫_{−∞}^{∞} |φ(t)| dt, and there can be no point masses. By (26.16), the corresponding distribution function satisfies

(F(x + h) − F(x))/h = (1/2π) ∫_{−∞}^{∞} ((e^{−itx} − e^{−it(x+h)})/(ith)) φ(t) dt

(whether h is positive or negative). The integrand is by (26.40) dominated by |φ(t)| and goes to e^{−itx}φ(t) as h → 0. Therefore, F has derivative

(26.20)    f(x) = (1/2π) ∫_{−∞}^{∞} e^{−itx} φ(t) dt.

Since f is continuous for the same reason φ is, it integrates to F by the fundamental theorem of the calculus (see (17.6)). Thus (26.19) implies that μ has the continuous density (26.20). Moreover, this is the only continuous density. In this result, as in the Riemann–Lebesgue theorem, conditions on the size of φ(t) for large |t| are connected with smoothness properties of μ.

The inversion formula (26.20) has many applications. In the first place, it can be used for a new derivation of (26.14). As pointed out in Example 17.3, the existence of the limit in (26.14) is easy to prove. Denote this limit temporarily by π₀/2, without assuming that π₀ = π. Then (26.16) and (26.20) follow as before if π is replaced by π₀. Applying the latter to the standard normal density (see (26.8)) gives

(26.21)    (1/√(2π)) e^{−x²/2} = (1/2π₀) ∫_{−∞}^{∞} e^{−itx} e^{−t²/2} dt,

where the π on the left is that of analysis and geometry; it comes ultimately from the quadrature (18.10). An application of (26.8) with x and t interchanged reduces the right side of (26.21) to (√(2π)/2π₀) e^{−x²/2}, and therefore π₀ does equal π.
Consider the densities in the table below. The characteristic function for the normal distribution has already been calculated. For the uniform distribution over (0, 1), the computation is of course straightforward; note that in this case the density cannot be recovered from (26.20) because φ(t) is not integrable; this is reflected in the fact that the density has discontinuities at 0 and 1. The characteristic function for the exponential distribution is easily calculated; compare Example 21.3. As for the double exponential or Laplace distribution, e^{−|x|}e^{itx} integrates over (0, ∞) to (1 − it)⁻¹ and over (−∞, 0) to (1 + it)⁻¹, which gives the result. By (26.20), then,

e^{−|x|} = (1/π) ∫_{−∞}^{∞} e^{itx} (1 + t²)⁻¹ dt.

For x = 0 this gives the standard integral ∫_{−∞}^{∞} dt/(1 + t²) = π.
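The displayed identity for e^{−|x|} can be verified by direct quadrature. The sketch below is my own check (the truncation T = 2000 is an arbitrary choice; the tail it discards contributes less than 2/(πT)).

```python
from math import cos, exp, pi

def laplace_inv(x, T=2000.0, steps=400_000):
    """(1/pi) * integral of e^{itx}/(1+t^2) over the line, computed as
    (2/pi) * integral of cos(tx)/(1+t^2) over (0, T) by the midpoint rule."""
    h = T / steps
    s = 0.0
    for k in range(steps):
        t = (k + 0.5) * h
        s += cos(t * x) / (1.0 + t * t) * h
    return 2.0 * s / pi

for x in (0.0, 0.7, 2.0):
    assert abs(laplace_inv(x) - exp(-abs(x))) < 1e-3
```

At x = 0 the routine is just computing (2/π) arctan(T), confirming the standard integral ∫ dt/(1 + t²) = π.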
Distribution    Density    Interval

1. Normal    (2π)^{−1/2} e^{−x²/2}    −∞ < x < ∞

2. Uniform    1    0 < x < 1
Theorem 27.2. Suppose that for each n the sequence X_{n1}, ..., X_{nr_n} is independent and satisfies (27.7). If (27.8) holds for all positive ε, then S_n/s_n ⇒ N.
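Theorem 27.2 can be illustrated in simulation with an array that is not identically distributed. The sketch below is my own illustration (the uniform spreads cycling through 1, 2, 3 and the sample sizes are arbitrary choices); bounded summands with s_n → ∞ satisfy Lindeberg's condition (27.8).

```python
import random
from statistics import NormalDist

random.seed(1)

# X_{nk} uniform on [-c_k, c_k], c_k cycling through 1, 2, 3: independent but
# not identically distributed.  Var[X_{nk}] = c_k^2 / 3, so s_n grows like
# sqrt(n) while the summands stay bounded, and (27.8) holds.
n = 600
c = [1.0 + (k % 3) for k in range(n)]
s = sum(ck * ck / 3.0 for ck in c) ** 0.5       # s_n

reps, hits = 4000, 0
for _ in range(reps):
    sn = sum(random.uniform(-ck, ck) for ck in c) / s
    hits += sn <= 0.5
# Compare the empirical distribution of S_n/s_n with the standard normal cdf.
assert abs(hits / reps - NormalDist().cdf(0.5)) < 0.03
```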
This theorem contains the preceding one: Suppose that X_{nk} = X_k and r_n = n, where the entire sequence {X_k} is independent and the X_k all have the same distribution with mean 0 and variance σ². Then (27.8) reduces to

(27.9)    (1/σ²) E[X₁²; |X₁| ≥ εσ√n] → 0,

which holds because [|X₁| ≥ εσ√n] ↓ ∅ as n ↑ ∞.

PROOF OF THE THEOREM. Replacing X_{nk} by X_{nk}/s_n shows that there is no loss of generality in assuming

(27.10)    s_n² = Σ_{k=1}^{r_n} σ_{nk}² = 1.

By (26.42), the characteristic function φ_{nk} of X_{nk} satisfies

(27.11)    |φ_{nk}(t) − (1 − ½t²σ_{nk}²)| ≤ E[min(|tX_{nk}|³/6, |tX_{nk}|²)];

note that the expected value here is finite.
For positive
£
the right side of (27.11) is at most
Since the on2k add to 1 and condition that
£
is arbitrary, it follows by the Lindeberg
(27 .12) for each fixed t. This is the analogue of (27.4), and the objective now is to show that
(27 .13) 'n
- ne k-l
For
£
_ l 2a�k 12 + o(1) = e - 1 2 /2 + o(1).
positive
and so it follows by the Lindeberg condition (recall that
sn
is now
1) that
(27 .14) For large enough n , 1 - �t 2on2k are all between 0 and 1, and by (27.3) , nk.._ 1 cpn k (t) and n k.._1 (1 - �t 2on2k ) differ by at most the sum in (27.12). This establishes the first of the asymptotic relations in (27 .13). Now (27.3) also implies that
r,. n
k-1
( 1 - 2 t 2on2k ) 1
� e _ t 2a�lc /2 £] :+ 2£ 1tl . By (27.20), max k s rJ cpn k (t) - 1 1 -+ 0. By (26.4 1 ), lcpn k (t) - 1 1 1, n > 1. Sup pose that a n --+ 0, the idea being that xk and xk + n are then approximately independent for large n. In this case the sequence { Xn } is said to be a-mixing. If the distribution of the random vector ( Xn , Xn + 1 , , Xn +j ) does not depend on n, the sequence is said to be stationary. •
•
.
Let { Yn } be a Markov chain with finite state space and positive transition probabilities Pij ' and suppose that xn = /( Yn ), where I is some real function on the state space. If the initial probabilities P ; are the stationary ones (see Theorem 8.9), then clearly { Xn } is stationary. More n over, by (8.42), IP��> - pj l < p , where p < 1. By (8.11), P [Y1 = i1, , Yk n = i k ' Yk + n = 1·0 ' · · · Y:k + n + l = 1·I ] = P· P· · · · · P'�c - t ' lc P'J� Jo> P·Jolt · · · P·h- t11 ' which differs from P[Y1 = i 1 , Yk = i k ]P[Yk + n = }0, , Yk + n + l = }1 ] by n at most p,.1 p,.1 · · · p,.lc - 1 ,·lc p p1.oJ1 p1.1 - .vl It follows by addition that, if r is the number of states, then for sets of the form A = [( Y1 , , Yk ) E H) and n B = [( Yk+ n ' . . . , Yk + n +l ) E H'], (27.25) holds with a n = rp . These sets (for k and n fixed) form fields generating a-fields which contain o( X1, , Xk ) and o( Xk + n ' Xk + n + 1 , ), respectively. For fixed A the set of B satisfying (27.25) is a monotone class, and similarly if A and B are interchanged. It follows by the monotone class theorem (Theorem 3.4) that { Xn } is a-mixing with a n = rpn . • Example 27.6.
•
'
,·
,
•
•
•
•
2
'•
•
•
'• '2
,
·
·
•
•
•
•
•
•
•
•
•
* This topic may be omitted.
•
•
•
•
•
•
376
CONVERGENCE OF DISTRIBUTIONS
The sequence is m-dependent if ( X1 , . . . , Xk ) and ( Xk + n ' . . . , Xk + n + l) are independent whenever n > m. In this case the sequence is a-mixing with a n = 0 for n > m. In this terminology an independent sequence is 0-depen dent.
Example
Let Y1 , Y2 , . . . be independent and identically distrib + uted, and put xn = /( Yn , . . . ' yn + m > for a real function I on R m 1 • Then { xn } is stationary and m-dependent. • 27. 7.
Suppose that X1 , X2 , . . . is stationary and a-mixing with a n = O(n - 5 ) and that E[Xn ] = 0 and E [ X: 2 ] < oo. If Sn = X1 + · · · + Xn then Theorem 27.5.
,
n - 1 Var (Sn] -+ o 2 = E [ X12 ] + 2
( 27.26)
00
L E [ X1 X1 + k ] ,
k-l
where the series converges absolutely. If o > 0, then S,!o..fi � N. The conditions an = O(n - 5 ) and E [ X: 2 ] < oo are stronger than neces
sary; they are imposed to avoid technical complications in the proof. The idea of the proof, which goes back to Markov, is this: Split the sum X1 + · · · + Xn into alternate blocks of length bn (the big blocks) and In (the little blocks). Namely, let
( 2 7 27 ) un = x( i - 1)( bII + /11) + 1 + •
j
where rn is the largest integer ther, let
( 27 .28 )
i
•
•
•
+ x(i- 1 )( bII + /11) + bII '
for which
(i - l )(bn + In ) + bn < n.
Fur
vn i = x(i - 1 )(bll + /ll) +bll + 1 + . . . + xi (bll +/II ) ' Vn rII = X is the 0 0 um·on of the A '. for which �1: .. = + 1 and A < 1 > = 0 - A < > ' and if B < > and B< 1 > are defined similarly in terms of the .,.,1 , then E;1 � ; 'f'/1 d;1 < E u v i P(A < u > n B < v > ) - P(A < ">)P ( B < v >) l < 4an . • PROOF.
If Y is measurable o( X1 , • • • , Xk ) and E[Y4 ] < C, and if Z is measurable o( Xk + n ' Xk + n + t ' . . . ) and E[Z 4 ] s D, then
Lemma 3.
I E [ YZ ] - E [ Y ] E [ Z ] I < 8(1
(27 . 30 )
+ C + D ) a!(2 •
Let Yo = Y/[ IYI s a]' yl = Y/[ IYI � a ]' ZJ( IZ I > a J · By Lemma 2, 1 E[Y0Z0] - E[Y0]E[Z0] 1 PROOF.
Zo = Z/[ IZI s a ]' zl < 4a 2an . Further,
I E ( Y0Z1 ] - E ( Y0] E ( Z1 ]1 < E (l Y0 - E ( Y0]1 · IZ1 - E ( Z1 ]1 ] < 2 a · 2 E [ I Z1 1 ] < 4aE [ I Z1 1 · I Z1 /a 1 3 ] < 4DIa 2 • Similarly,
1E[ Y1 Z0] - E[Y1 ]E[Z0] 1 < 4Cja 2 • Finally,
1 E ( Y1 Z1 ] - E [ Y1 ] E [ Z1 ]1
s
Var 1 12 [ Y1 ] Var 1 12 [ Z1 ]
< E 112 [ Y12 ] E 1 12 [ Zf ]
Adding these inequalities gives 4a 2an + 4(C + D)a - 2 + C 1 12D 1 12a - 2 as a bound for the left side of (27 .3 0). Take a = a; 1 14 and observe that 4 + 4( C + D ) + C 1 12D 1 12 < 4 + 4(C 1 12 + D 1 12 ) 2 < 4 + 8( C + D ). •
27.5. By Lemma 3, 1 E [ X1 X1 + k ] l < 8(1 + 2 E [ Xi ]) a!(2 = O (n - 512 ) , and so the series in (27.26) converges absolutely. If P k = E[ X1 X1 + k ] , then by stationarity E[S,?] = np0 + 2E;::�(n - k)p k and therefore l o 2 - n - 1E[S,?] I S 2Ek-n1P k l + 2n - 1 E 7..:-l Ek-;I P k l ; hence PROOF
OF
THEOREM
(27.26). By stationarity again,
where the indices in the sum are constrained by
i, j, k > 0 and i + j + k
- p1 ), then the o 2 in (27.26) is E1iP;1 ( /(i) - m )( /( j) - m), and EZ - 1 /( Yk ) is ap,e_roximately normally dis tributed with mean nm and standard deviation ovn . If /(i) = Bio i' then EZ - 1 /( Yk ) is the number of passages through the state i0 in the first n steps of the process. In this case m = P;0 and • a 2 = P i0(1 - P io ) + 2p;0 Lk- t ( P f;� - P io ). Exllltlple 27.8.
If the Xn are stationary and m-dependent and have mean 0, Theorem 27.5 applies and o 2 = E[X12 ] + 2Ek' 1 E[ X1 X1 + k ]. Exam ple 27.7 is a case in point. Taking m = 1 and f( x , y ) = x - y in that example gives an instance where a 2 = 0. • Example 27.9.
PROBLEMS 27.1. 17.2.
17.3.
Prove Theorem 23.2 by means o( characteristic functions. Hint: Use (27.3) to compare the characteristic function of Ek,.- t Znk with exp[l:k Pnk ( e ; ' - 1)]. If { Xn } is independent and the Xn all have the same distribution with finite first moment, then n - 1sn --+ E[ X1 ] with probability 1 (Theorem 22.1), so that n - 1Sn � E[ X1 ]. Prove the latter fact by characteristic functions. Hint: Use (27.3). For a Poisson variable Y" with mean A , show that ( Y" - A )/ A � A --+ oo . Show that (22.3) fails for t = 1 .
t The 4 in (27.29) has become 1 6 to allow for splitting into real and imaginary parts.
N
as
380
27.4. 27.5.
27.6.
CONVERGENCE OF DISTRIBUTIONS
Suppose that I Xn k I S Mn with probability 1 and Mnfsn --+ 0. Verify Lyapounov's condition and then Lindeberg's condition. Suppose that the random variables in any single row of the triangular array are identically distributed. To what do Lindeberg's and Lyapounov's condi tions reduce? Suppose that Z1 , Z2 , are independent and identically distributed with mean 0 and variance 1 and suppose that Xnk == an k Z . Write down the Lindeberg condition and show that it holds if max k a�k == o(Lk.._ 1 an2k ). Construct an example where Lindeberg's condition holds but Lyapounov' s does not. 22.10 t Prove a central limit theorem for the number Rn of records up to time n . 6.3 f Let Sn be the number of inversions in a random permutation on n letters. Prove a central limit theorem for Sn . The B-method. Suppose that Theorem 27.1 applies to { Xn }, so that {n a - 1 ( Xn - c) � N, where Xn == n - 1 EZ - t Xk . Use Theorem 25.6 as in Example 27.2 to show that, if f( x) has a nonzero derivative at c, then {n (f( Xn ) - f( c))/af'( c) � N. Example 27.2 is the case f(x) == 1jx. f The mean volume of grains of sand can be estimated by measuring the amount of water a sample of them displaces. Use Problem 27.10 to find a method for estimating the mean diameter of the grains and analyze the variance of the estimator. The hypotheses of the theorems in this section impose conditions on moments. But there can be asymptotic normality even if there are no moments at all. For ann example, let Xn assume the values + 1 and - 1 with probability !(1 - 2 - ) each and the value 2k with probability 2 - k for k > n . Show that ' n- 1 12 Ei: _ 1 Xk � N even though the Xn have no finite moments. Let dn ( w) be the dyadic digits of a point w drawn at random from the unit interval. For a k-tuple ( u1 , . , uk ) of O's and 1 's, let Nn ( u 1 , , u k ; ) be the number of m < n for which ( dm (w ), . . . , dm + k I (w )) == ( u 1 , , u k ). - Problem 6.12.) 
Prove a central· limit theorem for Nn ( u1 , , uk ; w ) . (See The central limit theorem for a random number of summands . Let X1 , X2 , be independent, identically distributed random variables with mean 0 and variance a 2 ' and let sn == XI + . . . + xn . For each positive t ' let .,, be a random variable assuming positive integers as values; it need not be independent of the Xn . Suppose that there exist positive constants a, and 8 such that ., _ • • •
s ,
27.7. 27.8. 27.9. 27. 10.
27. 11.
27. 12.
27.13.
II
• •
• • •
• • •
27. 14.
•
•
w
•
.
a, --+
as
t --+
( 27 .33 )
oo .
oo ,
I
a,
�
(}
Show by the following steps that
S, --
, � N'
aF,
s,
___;,. , .
-
a{BQ,
_
�
N.
• .
SECTION
(a) (b)
27.
381
THE CENTRAL LIMIT THEOREM
Show that it may be assumed that fJ == 1 and the are integers. Show that it suffices to prove the second relation in (27.33). (c) Show tha t it suffices to prove ( S,, - Sa,)/ {Q; � 0. (d) Show that a1
P ( I S,, - Sa, l > ({Q;) < P ( l v, - a, l +
> f 3a , ]
I sk - sa, I P [ l k max < - a,l £ 3
a,
>
(
F.
]
,
and conclude from Kolmogorov's inequality that the last probability is at most 2 £a 2 • %7.15. 21.14 23.11 27.14 f A central limit theorem in renewal theory. Let XI ' x2 ' . . . be independent, identically distributed positive random variables with mean m and variance a 2 , and as in Problem 23.11 let N, be the maximum n for which S,. < t. Prove by the following steps that
(a) Show by the results in Problems 21.14 and 23.11 that �
0.
( SN, - t)/ {i
(b) Show that it suffices to prove that
(c) Show (Problem 23.11) that N,/t � m - 1 and apply the theorem in
Problem 27.14. 27.16. Show by partial integration that (27 .34)
as x --+ oo . %7.17. f Suppose that XI ' x2 ' . . . are independent and identically distributed with mean 0 and variance 1, and suppose that an --+ oo . A formal combina tion of the central limit theorem and (27.34) gives (27 .35) where fn --+ 0 if an --+ oo . For a case in which this does hold, see Theorem 9.4. 27.18. 21.2 f Stirling ' s formula. Let sn == XI + . . . + xn , where the xn are inde pendent and each has the Poisson distribution with parameter 1. Prove
382
CONVERGENCE OF DISTRIBUTIONS
successively: (a)
(b) (c)
E
[ ( Sn - n ) - ] = e - n kr.-0 ( n - ) n: - _nn_+_on;_,_2)_e-_'1 rn n
( Sni r � [E ( Sni n r ] �
rn
N-
n
E
(d) n ! - �2 '1T n + ( l /2> e - n
[ N- ] =
k
k.
k.
of O' s starting at the n th place in the dyadic expansion of a point w drawn at random from the unit interval; see Example 4.1. (a) Show that /1 , /2 , • • • is an a-mixing sequence, where a n = 4/2 11 • (b) Show that E;: _ 1 /�c is approximately normally distributed with mean n and variance 6n . hypotheses of Theorem 27.5 that Sn/n --+ 0 with probabil 27.20. Prove under the ' ity 1. Hint: use (27.31). 27.21. 26.1 26.28 f Let XI ' x2 ' . . . be independent and identically distributed, and suppose that the distribution common to the X,. is supported by [0, 2 '1T ] and is not a lattice distribution. Let sn = XI + . . . + xn ' where the sum is reduced modulo 2w. Show that Sn � U, where U is uniformly distributed over [0, 2 , ]. 27. 19. Let /n ( w) be the length of the
run
SECI'ION 28. INFINITELY DIVISIBLE DISTRIBUTIONS *
Suppose that Z" has the Poisson distribution with parameter A and that Xn 1 , • • • , Xnn are independent and P[ Xn k = 1] = 'Ajn, P[ Xnk = 0] = 1 A /n . According to Example 25.2, xnl + . . . + xnn => ZA. This contrasts with the central limit theorem, in which the limit law is normal. What is the class of all possible limit laws for independent triangular arrays? A suitably restricted form of this question will be answered here. Vague Convergence
The theory requires two preliminary facts about convergence of measures. Let #L n and p. be finite measures on ( R1 , 9P1 ). If l'n(a, b) --+ p.(a, b) for every finite interval for which p. { a } = p. { b } = 0, then #Ln converges vaguely to p., written #Ln --+ v p.. If P.n and p. are probability measures, it is not hard to see that this is equivalent to weak convergence: P.n � p.. On the other hand, if #Ln is a unit mass at n and p.( R1 ) = 0, then P.n --+ v p., but P.n � p. makes no sense because p. is not a probabil ity measure. The first fact needed is this: Suppose that #Ln --+ v p. and sup p.n ( R 1 ) < oo ; (28.1 )
n
* This section may be omitted.
28.
SECTION
383
INFINITELY DIVISIBLE DISTRIBUTIONS
then
ffdp.n --+ Jfdp.
(28.2)
for every continuous real f that vanishes at + oo in the sense that lim 1 t l - oc. f( x ) = 0. Indeed, choose M so that p.( R 1 ) < M and #Ln ( R 1 ) < M for all n . Given £ , choose a and b so that p. { a } == p. { b } == 0 and 1/ (x) l < £/M if x ll. A == ( a , b] . Then 1 / Acf dp.n l < £ and 1 / A fdp. l < £. If p. ( A ) > 0, define v ( B ) = p.( B n A )/p.( A ) and ,,. ( B) == P.n ( B n A )/P.n ( A ). It is easy to see that "n � v , so that ffd vn --+ f fd v . But then 1 / A fdp.n - f A fdp. l < £ for large n , and hence 1 / fdp. n - ffdp. l < 3£ for large n. If p. ( A ) == 0, then f A fdp.n --+ 0, and the argument is even simpler. The other fact needed below is this: If (28.1) holds, then there is a subsequence { l'n } and a finite measure p. such that p., --+ p. as k --+ oo . Indeed, let F, ( x) == p,,. ( :_ oo , x ]. Since the F, are uniformly b�unded because of (28.1 ), the proof of c
,,
Helly's theorem shows there exists a subsequence { F, } and a bounded, nondecreas ing, right-continuous function F such that lim* F, ( ; ) == F( x ) at continuity points x of F. If p. is the measure for which p.( a , b] == F� b) - F( a ) (Theorem 12.4), then clearly l'n --+ p.. l'
"
The
Possible Limits
Let Xnl, , Xn r , n = 1, 2, . . . , be a triangular array as in the preceding section. The random variables in each row are independent, and •
•
•
II
rll
s; = E on2k · k=l
(28.3) Put Sn = Xn1 + bounded :
·
·
·
+ Xn r . Here it will be assumed that the total variance is
sup sn2
(28.4) In
order that the
(28 .5)
II
n
xn k
(x 2 dFn k (x) --+ 0, which is exactly Lindeberg' s condi Example 28.4.
•
tion. Thus Theorem 28.4 contains Theorems 27.2 and 27.4.
The Poisson case. Let Zn 1 , . . . , Zn ," be an independent triangular array, and suppose xn k = zn k - m n k satisfies the conditions of the theorem, where m n k = E[Zn k 1 · If ZA has the Poisson distribution with parameter A, then Ek Xn k => ZA - A if and only if P. n converges vaguely to a mass of A at 1 (see Example 28.2). If s; --+ A, the requirement is P. n [1 1 + £] -+ A, or Example 28.5.
£,
(28 .11 ) for positive
£.
If
s; and Ek m n k both converge to A, (28.1 1 ) is
a
necessary
388
CONVERGENCE OF DISTRIBUTIONS
and sufficient condition that Ek Znk => Z". The conditions are easily checked under the hypotheses of Theorem 23.2: Znk assumes the values 1 and 0 with probabilities Pnk and 1 Pnk ' E k Pnk -+ A, and max k Pnk --+ 0. •
-
SECTION 28. INFINITELY DIVISIBLE DISTRIBUTIONS

PROBLEMS

28.1. Show that μ_n →_v μ implies μ(R¹) ≤ lim inf_n μ_n(R¹). Thus in vague convergence mass can "escape to infinity" but mass cannot "enter from infinity."

28.2. (a) Show that μ_n →_v μ if and only if (28.2) holds for every continuous f with bounded support.
(b) Show that if μ_n →_v μ but (28.1) does not hold, then there is a continuous f vanishing at ±∞ for which (28.2) does not hold.

28.3. 23.8 ↑ Suppose that N, Y₁, Y₂, ... are independent, the Y_i have a common distribution function F, and N has the Poisson distribution with mean α. Then S = Y₁ + ··· + Y_N has the compound Poisson distribution.
(a) Show that the distribution of S is infinitely divisible. Note that S may not have a mean.
(b) The distribution function of S is Σ_{n=0}^∞ e^{-α} αⁿ Fⁿ*(x)/n!, where Fⁿ* is the n-fold convolution of F (a unit jump at 0 for n = 0). The characteristic function of S is exp[α ∫_{-∞}^∞ (e^{itx} - 1) dF(x)].
(c) Show that, if F has mean 0 and finite variance, then the canonical measure μ in (28.6) is specified by μ(A) = α ∫_A x² dF(x).

28.4. (a) Let ν be a finite measure and define

(28.12)  φ(t) = exp[iγt + ∫_{-∞}^∞ (e^{itx} - 1 - itx/(1 + x²))((1 + x²)/x²) ν(dx)],

where the integrand is -t²/2 at the origin. Show that this is the characteristic function of an infinitely divisible distribution.
(b) Show that the Cauchy distribution (see the table on p. 358) is the case where γ = 0 and ν has density π⁻¹(1 + x²)⁻¹ with respect to Lebesgue measure.

28.5. Show that the Cauchy, exponential, and gamma (see (20.47)) distributions are infinitely divisible.

28.6. Find the canonical representation (28.6) of the exponential distribution with mean 1:
(a) The characteristic function is ∫₀^∞ e^{itx} e^{-x} dx = (1 - it)⁻¹ = φ(t).
(b) Show that (use the principal branch of the logarithm, or else operate formally for the moment) d(log φ(t))/dt = iφ(t) = i ∫₀^∞ e^{itx} e^{-x} dx. Integrate with respect to t to obtain

(28.13)  1/(1 - it) = exp ∫₀^∞ (e^{itx} - 1)(e^{-x}/x) dx.

Verify (28.13) after the fact by showing that the ratio of the two sides has derivative 0.
(c) Multiply the two sides of (28.13) by e^{-it} to center the exponential distribution at its mean: the canonical measure μ has density xe^{-x} over (0, ∞).
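Identity (28.13) can also be checked numerically; the sketch below evaluates the integral on the right by a simple trapezoid rule. The truncation point, the step count, and the test value of t are arbitrary numerical choices.

```python
import cmath

def canonical_cf(t, upper=60.0, steps=60000):
    """Evaluate exp of the integral in (28.13) by the trapezoid rule."""
    h = upper / steps

    def f(x):
        if x == 0.0:
            return 1j * t  # limit of (e^{itx} - 1) e^{-x} / x as x -> 0
        return (cmath.exp(1j * t * x) - 1.0) * cmath.exp(-x) / x

    total = 0.5 * (f(0.0) + f(upper))
    for i in range(1, steps):
        total += f(i * h)
    return cmath.exp(total * h)

t = 1.3
lhs = 1.0 / (1.0 - 1j * t)   # characteristic function of the exponential law
rhs = canonical_cf(t)
print(abs(lhs - rhs))        # small numerical error only
```

The two sides agree to within the quadrature error, as the derivative argument in 28.6(b) predicts.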
28.7. ↑ If X and Y are independent and each has the exponential density e^{-x}, then X - Y has the double exponential density ½e^{-|x|} (see the table on p. 358). Show that its characteristic function is

1/(1 + t²) = exp ∫_{-∞}^∞ (e^{itx} - 1 - itx)(1/x²) |x|e^{-|x|} dx.

28.8. ↑ Suppose X₁, X₂, ... are independent and each has the double exponential density. Show that Σ_{n=1}^∞ X_n/n converges with probability 1. Show that the distribution of the sum is infinitely divisible and that its canonical measure has density |x|e^{-|x|}/(1 - e^{-|x|}) = Σ_{n=1}^∞ |x|e^{-n|x|}.

28.9. 26.8 ↑ Show that for the gamma density e^{-x}x^{u-1}/Γ(u) the canonical measure has density uxe^{-x} over (0, ∞).
The remaining problems require the notion of a stable law. A distribution function F is stable if for each n there exist constants a_n and b_n, a_n > 0, such that, if X₁, ..., X_n are independent and have distribution function F, then a_n⁻¹(X₁ + ··· + X_n) + b_n also has distribution function F.
28.10. Suppose that for all a, a′, b, b′ there exist a″, b″ (here a, a′, a″ are all positive) such that F(ax + b) * F(a′x + b′) = F(a″x + b″). Show that F is stable.

28.11. Show that a stable law is infinitely divisible.

28.12. Show that the Poisson law, although infinitely divisible, is not stable.

28.13. Show that the normal and Cauchy laws are stable.

28.14. 28.10 ↑ Suppose that F has mean 0 and variance 1 and that the dependence of a″, b″ on a, a′, b, b′ is such that

F(x/a₁) * F(x/a₂) = F(x/√(a₁² + a₂²)).

Show that F is the standard normal distribution.

28.15. (a) Let Y_{nk} be independent random variables having the Poisson distribution with mean c nᵅ/|k|^{1+α}, where c > 0 and 0 < α < 2. Let Z_n = n⁻¹ Σ_{k=-n²}^{n²} k Y_{nk} (omit k = 0 in the sum) and show that if c is properly chosen then the characteristic function of Z_n converges to e^{-|t|^α}.
(b) Show for 0 < α < 2 that e^{-|t|^α} is the characteristic function of a symmetric stable distribution; it is called the symmetric stable law of exponent α. The case α = 2 is the normal law, and α = 1 is the Cauchy law.
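Stability of the Cauchy law (Problem 28.13, with a_n = n and b_n = 0) is easy to see in simulation. The sketch below is illustrative only; the sample sizes and the seed are arbitrary, and it compares the empirical quartiles of a mean of n standard Cauchy variables with the quartiles -1, 0, 1 of a single one.

```python
import math
import random

random.seed(11)

def cauchy():
    # standard Cauchy via the probability integral transform
    return math.tan(math.pi * (random.random() - 0.5))

n, trials = 50, 4000
means = sorted(sum(cauchy() for _ in range(n)) / n for _ in range(trials))

# stability: the mean of n standard Cauchy variables is again standard
# Cauchy, so its quartiles should be near -1, 0, and 1
q1, med, q3 = means[trials // 4], means[trials // 2], means[3 * trials // 4]
print(round(q1, 2), round(med, 2), round(q3, 2))
```

Averaging does not concentrate the distribution at all, in sharp contrast with laws that have a finite variance.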
SECTION 29. LIMIT THEOREMS IN R^k
If F_n and F are distribution functions on R^k, F_n converges weakly to F, written F_n ⇒ F, if lim_n F_n(x) = F(x) for all continuity points x of F. The corresponding distributions μ_n and μ are in this case also said to converge weakly: μ_n ⇒ μ. If X_n and X are k-dimensional random vectors (possibly on different probability spaces), X_n converges in distribution to X, written X_n ⇒ X, if the corresponding distribution functions converge weakly. The definitions are thus exactly as for the line.

The Basic Theorems

The closure A⁻ of a set A in R^k is the set of limits of sequences in A; the interior is A° = R^k - (R^k - A)⁻; and the boundary is ∂A = A⁻ - A°. A Borel set A is a μ-continuity set if μ(∂A) = 0. The first theorem is the k-dimensional version of Theorem 25.8.
Theorem 29.1. For probability measures μ_n and μ on (R^k, ℛ^k), each of the following conditions is equivalent to the weak convergence of μ_n to μ:
(i) lim_n ∫ f dμ_n = ∫ f dμ for bounded continuous f;
(ii) lim sup_n μ_n(C) ≤ μ(C) for closed C;
(iii) lim inf_n μ_n(G) ≥ μ(G) for open G;
(iv) lim_n μ_n(A) = μ(A) for μ-continuity sets A.
PROOF. It will first be shown that (i) through (iv) are all equivalent.

(i) implies (ii): Consider the distance dist(x, C) = inf[|x - y|: y ∈ C] from x to C. It is continuous in x. Let

φ_j(t) = 1 if t ≤ 0, φ_j(t) = 1 - jt if 0 ≤ t ≤ j⁻¹, φ_j(t) = 0 if j⁻¹ ≤ t.

Then f_j(x) = φ_j(dist(x, C)) is continuous and bounded by 1, and f_j(x) ↓ I_C(x) as j ↑ ∞ because C is closed. If (i) holds, then lim sup_n μ_n(C) ≤ lim_n ∫ f_j dμ_n = ∫ f_j dμ. As j ↑ ∞, ∫ f_j dμ ↓ ∫ I_C dμ = μ(C).

(ii) is equivalent to (iii): Take C = R^k - G.

(ii) and (iii) imply (iv): From (ii) and (iii) follows μ(A°) ≤ lim inf_n μ_n(A°) ≤ lim inf_n μ_n(A) ≤ lim sup_n μ_n(A) ≤ lim sup_n μ_n(A⁻) ≤ μ(A⁻); if A is a μ-continuity set, the extreme terms both equal μ(A), and (iv) follows.

(iv) implies μ_n ⇒ μ: Consider the corresponding distribution functions. If S_x = [y: y_i ≤ x_i, i = 1, ..., k], then F is continuous at x if and only if μ(∂S_x) = 0; see the argument following (20.18). Therefore, if F is continuous at x, then F_n(x) = μ_n(S_x) → μ(S_x) = F(x), and F_n ⇒ F.

μ_n ⇒ μ implies (iii): Since only countably many parallel hyperplanes can have positive μ-measure, there is a dense set D of reals such that μ[x: x_i = d] = 0 for d ∈ D and i = 1, ..., k. Let 𝒜 be the class of rectangles A = [x: a_i < x_i ≤ b_i, i = 1, ..., k] for which the a_i and the b_i all lie in D. All 2^k vertices of such a rectangle are continuity points of F, and so F_n ⇒ F implies (see (12.12)) that μ_n(A) = Δ_A F_n → Δ_A F = μ(A). It follows by the inclusion-exclusion formula that μ_n(B) → μ(B) for finite unions B of elements of 𝒜. Since D is dense on the line, an open set G in R^k is a countable union of sets A_m in 𝒜. But μ(∪_{m≤M} A_m) = lim_n μ_n(∪_{m≤M} A_m) ≤ lim inf_n μ_n(G). Letting M → ∞ gives (iii). •
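A toy example shows why (ii) and (iii) involve inequalities rather than equalities: take μ_n to be a point mass at 1/n, so that μ_n ⇒ μ = point mass at 0. The particular sets and test function below are arbitrary illustrations, not part of the text.

```python
import math

def mu_n(indicator, n):
    # mu_n is a point mass at 1/n
    return 1.0 if indicator(1.0 / n) else 0.0

def mu(indicator):
    # mu is a point mass at 0
    return 1.0 if indicator(0.0) else 0.0

# (i): integrals of a bounded continuous f converge to f(0)
f = math.cos
print([round(f(1.0 / n), 4) for n in (1, 10, 1000)])

# (ii) with C = {0} closed: lim sup mu_n(C) = 0 < 1 = mu(C)
# (iii) with G = complement of {0}, open: lim inf mu_n(G) = 1 > 0 = mu(G)
C = lambda x: x == 0.0
G = lambda x: x != 0.0
print(mu_n(C, 1000), mu(C), mu_n(G, 1000), mu(G))
```

The boundary of C = {0} carries all the limit mass here, which is exactly why C fails to be a μ-continuity set and (iv) does not apply to it.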
Theorem 29.2. Suppose that h: R^k → R^j is measurable and that the set D_h of its discontinuities is measurable.† If μ_n ⇒ μ in R^k and μ(D_h) = 0, then μ_n h⁻¹ ⇒ μh⁻¹ in R^j.

PROOF. Let C be a closed set in R^j. The closure (h⁻¹C)⁻ in R^k satisfies (h⁻¹C)⁻ ⊂ D_h ∪ h⁻¹C. If μ_n ⇒ μ, then part (ii) of Theorem 29.1 gives

lim sup_n μ_n h⁻¹(C) ≤ lim sup_n μ_n((h⁻¹C)⁻) ≤ μ((h⁻¹C)⁻) ≤ μ(D_h) + μ(h⁻¹C).

If μ(D_h) = 0, using (ii) again gives μ_n h⁻¹ ⇒ μh⁻¹. •
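For a concrete instance of the condition μ(D_h) = 0, take k = j = 1, h the indicator of (0, ∞) (discontinuous only at 0), and μ_n the law of Z + 1/n with Z standard normal. The values below are exact, computed from the normal distribution function; the specific h and μ_n are illustrative choices, not part of the text.

```python
import math

def Phi(x):
    # standard normal distribution function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# h = indicator of (0, infinity); D_h = {0} and mu(D_h) = 0 for mu = N(0,1).
# mu_n = law of Z + 1/n, so mu_n h^{-1}{1} = P(Z > -1/n) = Phi(1/n).
vals = [Phi(1.0 / n) for n in (1, 10, 100, 1000)]
print([round(v, 4) for v in vals])  # approaches mu h^{-1}{1} = 1/2
```

Had μ put positive mass at the discontinuity point 0, the image measures need not have converged.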
Theorem 29.2 is the k-dimensional version of the mapping theorem, Theorem 25.7. The two proofs just given provide in the case k = 1 a second approach to the theory of Section 25, which there was based on Skorohod's theorem (Theorem 25.6).† Skorohod's theorem does extend to R^k (see Theorem 29.6), but the proof is harder.

Theorems 29.1 and 29.2 can of course be stated in terms of random vectors. For example, X_n ⇒ X if and only if P[X ∈ G] ≤ lim inf_n P[X_n ∈ G] for all open sets G.

A sequence {μ_n} of probability measures on (R^k, ℛ^k) is tight if for every ε there is a bounded rectangle A such that μ_n(A) > 1 - ε for all n.

† The argument in the footnote on p. 343 shows that in fact D_h ∈ ℛ^k always holds.
Theorem 29.3. If {μ_n} is a tight sequence of probability measures, there is a subsequence {μ_{n_i}} and a probability measure μ such that μ_{n_i} ⇒ μ as i → ∞.

FIRST PROOF. The proof of Helly's theorem (Theorem 25.9) carries over to R^k: For points x and y in R^k, interpret x ≤ y as meaning x_u ≤ y_u, u = 1, ..., k, and x < y as meaning x_u < y_u, u = 1, ..., k. Consider rational points r (points whose coordinates are all rational), and by the diagonal method [A14] choose a sequence {n_i} along which lim_i F_{n_i}(r) = G(r) exists for each such r. As before, define F(x) = inf[G(r): x < r]. Although F is clearly nondecreasing in each variable, a further argument is required to prove Δ_A F ≥ 0 (see (12.12)).

Given ε and a rectangle A = (a₁, b₁] × ··· × (a_k, b_k], choose a δ such that if z = (δ, ..., δ), then for each of the 2^k vertices x of A, x < r < x + z implies |F(x) - G(r)| < ε/2^k. Now choose rational points r and s such that a < r < a + z and b < s < b + z. If B = (r₁, s₁] × ··· × (r_k, s_k], then |Δ_A F - Δ_B G| < ε. Since Δ_B G = lim_i Δ_B F_{n_i} ≥ 0 and ε is arbitrary, it follows that Δ_A F ≥ 0.

With the present interpretation of the symbols, the proof of Theorem 25.9 shows that F is continuous from above and lim_i F_{n_i}(x) = F(x) for continuity points x of F. By Theorem 12.5, there is a measure μ on (R^k, ℛ^k) such that μ(A) = Δ_A F for rectangles A. If μ is a probability measure, then certainly μ_{n_i} ⇒ μ. But given ε, choose a rectangle A such that μ_n(A) > 1 - ε for all n. By enlarging A, it is possible to ensure that none of the (k - 1)-dimensional hyperplanes containing the 2k faces of A have positive μ-measure (Theorem 10.2(iv)). But then F is continuous at each of the 2^k vertices of A (see the discussion following (20.18)), and so 1 ≥ μ(R^k) ≥ μ(A) = Δ_A F = lim_i Δ_A F_{n_i} = lim_i μ_{n_i}(A) ≥ 1 - ε. Thus μ is a probability measure. •

† The approach of this section carries over to general metric spaces; for this theory and its applications, see BILLINGSLEY¹.
There is a proof that avoids an appeal to Theorem 12.5.†

SECOND PROOF. Let 𝒜 be the class of bounded, closed rectangles whose vertices have rational coordinates, and let ℋ consist of the empty set and the finite unions of 𝒜-sets. Then ℋ is countable and closed under the formation of finite unions, and each set in ℋ is compact. Select by the diagonal procedure a subsequence {μ_{n_i}} along which the limits

(29.1)  α(H) = lim_i μ_{n_i}(H)

exist for all H in ℋ. Suppose it can be shown that there exists a probability measure μ such that

(29.2)  μ(G) = sup[α(H): H ⊂ G, H ∈ ℋ]

for all open sets G. Then H ⊂ G will imply that α(H) = lim_i μ_{n_i}(H) ≤ lim inf_i μ_{n_i}(G), so that μ(G) ≤ lim inf_i μ_{n_i}(G) for open G by (29.2). Therefore (condition (iii) in Theorem 29.1), it suffices to produce a probability measure satisfying (29.2).

Clearly, α(H), defined by (29.1) for all H in ℋ, has these properties:

(29.3)  α(H₁) ≤ α(H₂) if H₁ ⊂ H₂;
(29.4)  α(H₁ ∪ H₂) = α(H₁) + α(H₂) if H₁ and H₂ are disjoint;
(29.5)  α(H₁ ∪ H₂) ≤ α(H₁) + α(H₂).

Define

(29.6)  β(G) = sup[α(H): H ⊂ G, H ∈ ℋ]

for open sets G, and then for arbitrary A define

γ(A) = inf[β(G): A ⊂ G],

where G ranges over open sets. Clearly, γ(G) = β(G) for open G. Suppose it is shown that γ is an outer measure and that each closed set is γ-measurable. In the notation of Theorem 11.1, it will then follow that ℛ^k ⊂ ℳ(γ), and the restriction μ of γ to ℛ^k will be a measure satisfying μ(G) = γ(G) = β(G), and so (29.2) will hold for open G, as required. By tightness, there is for given ε an ℋ-set H such that μ_n(H) > 1 - ε for all n. But then 1 ≥ α(H) ≥ 1 - ε by (29.1), and μ(R^k) = β(R^k) = 1 by (29.2).

The first step in proving that γ is an outer measure is to show that β is finitely subadditive (on open sets): If H ⊂ G₁ ∪ G₂ and H ∈ ℋ, define C₁ = [x ∈ H: dist(x, G₁ᶜ) ≥ dist(x, G₂ᶜ)] and C₂ = [x ∈ H: dist(x, G₁ᶜ) ≤ dist(x, G₂ᶜ)]. If x ∈ C₁ and x ∉ G₁, then x ∈ G₂, so that, since G₂ᶜ is closed, dist(x, G₂ᶜ) > 0 = dist(x, G₁ᶜ), a contradiction. Thus C₁ ⊂ G₁; similarly, C₂ ⊂ G₂.

If x ∈ C₁, there is an 𝒜-set A_x such that x ∈ A_x° ⊂ A_x ⊂ G₁. These A_x° cover C₁, which is compact (a closed subset of the compact set H), and so finitely many of them cover C₁; the union H₁ of the corresponding sets A_x is in ℋ and satisfies C₁ ⊂ H₁ ⊂ G₁. Similarly, there is an ℋ-set H₂ for which C₂ ⊂ H₂ ⊂ G₂. But then α(H) ≤ α(H₁ ∪ H₂) ≤ α(H₁) + α(H₂) ≤ β(G₁) + β(G₂) by (29.3), (29.5), and (29.6). Since H can be any ℋ-set in G₁ ∪ G₂, another application of (29.6) gives β(G₁ ∪ G₂) ≤ β(G₁) + β(G₂).

Next, β is countably subadditive (on open sets): If H ⊂ ∪_n G_n, then by compactness H ⊂ ∪_{n≤n₀} G_n for some n₀, and therefore by finite subadditivity α(H) ≤ β(∪_{n≤n₀} G_n) ≤ Σ_{n≤n₀} β(G_n) ≤ Σ_n β(G_n). Letting H vary over the ℋ-sets in ∪_n G_n gives β(∪_n G_n) ≤ Σ_n β(G_n).

And now γ is an outer measure: Clearly, γ is monotone. To prove it subadditive, suppose that A_n are arbitrary sets and ε > 0. Choose open sets G_n such that A_n ⊂ G_n and β(G_n) < γ(A_n) + ε/2ⁿ. By the countable subadditivity of β, γ(∪_n A_n) ≤ β(∪_n G_n) ≤ Σ_n β(G_n) ≤ Σ_n γ(A_n) + ε; ε being arbitrary, γ(∪_n A_n) ≤ Σ_n γ(A_n) follows.

It remains only to prove that each closed set is γ-measurable:

(29.7)  γ(A) ≥ γ(A ∩ C) + γ(A ∩ Cᶜ)

for C closed and A arbitrary (see (11.3)). To prove (29.7) it suffices to prove

(29.8)  β(G) ≥ γ(G ∩ C) + γ(G ∩ Cᶜ)

for open G, because then G ⊃ A implies that β(G) ≥ γ(A ∩ C) + γ(A ∩ Cᶜ), and taking the infimum over G will give (29.7).

To prove (29.8), for given ε choose an ℋ-set H₁ for which H₁ ⊂ G ∩ Cᶜ and α(H₁) > β(G ∩ Cᶜ) - ε. Now choose an ℋ-set H₂ for which H₂ ⊂ G ∩ H₁ᶜ and α(H₂) > β(G ∩ H₁ᶜ) - ε. Since H₁ and H₂ are disjoint and are contained in G, it follows by (29.4) that β(G) ≥ α(H₁ ∪ H₂) = α(H₁) + α(H₂) > β(G ∩ Cᶜ) + β(G ∩ H₁ᶜ) - 2ε ≥ γ(G ∩ Cᶜ) + γ(G ∩ C) - 2ε. Hence (29.8). •

† It extends to general metric spaces; see LOÈVE, p. 196.
Obviously Theorem 29.3 implies that tightness is a sufficient condition that each subsequence of {μ_n} contain a further subsequence converging weakly to some probability measure. (An easy modification of the proof of Theorem 25.10 shows that tightness is necessary for this as well.) And clearly the corollary to Theorem 25.10 now goes through:

Corollary. If {μ_n} is a tight sequence of probability measures, and if each subsequence that converges weakly at all converges weakly to the probability measure μ, then μ_n ⇒ μ.
Characteristic Functions

Consider a random vector X = (X₁, ..., X_k) and its distribution μ in R^k. Let t·x = Σ_{u=1}^k t_u x_u denote inner product. The characteristic function of X and of μ is defined over R^k by

(29.9)  φ(t) = E[e^{it·X}] = ∫_{R^k} e^{it·x} μ(dx).

To a great extent its properties parallel those of the one-dimensional characteristic function and can be deduced by parallel arguments. The inversion formula (26.16) takes this form: For a bounded rectangle A = [x: a_u < x_u ≤ b_u, u ≤ k] such that μ(∂A) = 0,

(29.10)  μ(A) = lim_{T→∞} (2π)^{-k} ∫_{B_T} ∏_{u=1}^k [(e^{-it_u a_u} - e^{-it_u b_u})/(it_u)] φ(t) dt,

where B_T = [t ∈ R^k: |t_u| ≤ T, u ≤ k] and dt is short for dt₁ ··· dt_k. To prove it, replace φ(t) by the middle term in (29.9) and reverse the integrals as in (26.17). The inner integral may be evaluated by Fubini's theorem in R^k, which gives for the integral in (29.10)

I_T = ∫_{R^k} ∏_{u=1}^k [sgn(x_u - a_u) S(T|x_u - a_u|)/π - sgn(x_u - b_u) S(T|x_u - b_u|)/π] μ(dx).

Since the integrand converges to ∏_{u=1}^k ψ_{a_u,b_u}(x_u) (see (26.18)), (29.10) follows as in the case k = 1.
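The inversion formula (29.10) can be tried numerically. The sketch below takes k = 2, μ the standard normal in R² (so φ(t) = e^{-|t|²/2}) and A = (-1, 1] × (-1, 1]; since this integrand factors, one axis is computed and squared. The truncation T, the step size, and the rectangle are arbitrary numerical choices.

```python
import math

T, steps = 8.0, 3200
h = 2.0 * T / steps

def g(t):
    # (e^{-ita} - e^{-itb}) / (it) for a = -1, b = 1 equals 2 sin(t) / t
    return 2.0 if t == 0.0 else 2.0 * math.sin(t) / t

one_dim = 0.0
for i in range(steps + 1):
    t = -T + i * h
    w = 0.5 if i in (0, steps) else 1.0
    one_dim += w * g(t) * math.exp(-t * t / 2.0) * h
one_dim /= 2.0 * math.pi

mu_A = one_dim ** 2                          # the double integral factors
exact = math.erf(1.0 / math.sqrt(2.0)) ** 2  # (Phi(1) - Phi(-1))^2
print(round(mu_A, 4), round(exact, 4))
```

The truncated integral already matches μ(A) closely because φ decays rapidly; for a μ with a slowly decaying characteristic function the limit in T matters more.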
The proof that weak convergence implies (iii) in Theorem 29.1 shows that for probability measures μ and ν on R^k there exists a dense set D of reals such that μ(∂A) = ν(∂A) = 0 for all rectangles A whose vertices have coordinates in D. If μ(A) = ν(A) for such rectangles, then μ and ν are identical by Theorem 3.3. Thus the characteristic function φ uniquely determines the probability measure μ.

Further properties of the characteristic function can be derived from the one-dimensional case by means of the following device of Cramér and Wold. For t ∈ R^k, define h_t: R^k → R¹ by h_t(x) = t·x. For real α, [x: t·x ≤ α] is a half-space, and its μ-measure is

(29.11)  μ[x: t·x ≤ α] = μh_t⁻¹(-∞, α].

By change of variable, the characteristic function of μh_t⁻¹ is

(29.12)  ∫_{R¹} e^{isy} μh_t⁻¹(dy) = ∫_{R^k} e^{is(t·x)} μ(dx) = φ(st).

To know the μ-measure of every half-space is (by (29.11)) to know each μh_t⁻¹ and hence is (by (29.12) for s = 1) to know φ(t) for every t; and to know the characteristic function φ of μ is to know μ. Thus μ is uniquely determined by the values it gives to the half-spaces. This result, very simple in its statement, seems to require Fourier methods; no elementary proof is known.

If μ_n ⇒ μ for probability measures on R^k, then φ_n(t) → φ(t) for the corresponding characteristic functions by Theorem 29.1. But suppose that the characteristic functions converge pointwise. It follows by (29.12) that for each t the characteristic function of μ_n h_t⁻¹ converges pointwise on the line to the characteristic function of μh_t⁻¹; by the continuity theorem for characteristic functions on the line, then, μ_n h_t⁻¹ ⇒ μh_t⁻¹. Take the uth component of t to be 1 and the others 0; then the μ_n h_t⁻¹ are the marginals for the uth coordinate. Since {μ_n h_t⁻¹} is weakly convergent, there is a bounded interval (a_u, b_u] such that μ_n[x ∈ R^k: a_u < x_u ≤ b_u] = μ_n h_t⁻¹(a_u, b_u] > 1 - ε/k for all n. But then μ_n(A) > 1 - ε for the bounded rectangle A = [x: a_u < x_u ≤ b_u, u = 1, ..., k]. The sequence {μ_n} is therefore tight. If a subsequence {μ_{n_i}} converges weakly to ν, then φ_{n_i}(t) converges to the characteristic function of ν, which is therefore φ(t). By uniqueness, ν = μ, so that μ_{n_i} ⇒ μ. By the corollary to Theorem 29.3, μ_n ⇒ μ. This proves the continuity theorem for k-dimensional characteristic functions: μ_n ⇒ μ if and only if φ_n(t) → φ(t) for all t.
The Cramér-Wold idea leads also to the following result, by means of which certain limit theorems can in a routine way be reduced to the one-dimensional case.

Theorem 29.4. For random vectors X_n = (X_{n1}, ..., X_{nk}) and Y = (Y₁, ..., Y_k), a necessary and sufficient condition for X_n ⇒ Y is that Σ_{u=1}^k t_u X_{nu} ⇒ Σ_{u=1}^k t_u Y_u for each (t₁, ..., t_k) in R^k.

PROOF. The necessity follows from a consideration of the continuous mapping h_t above; use Theorem 29.2. As for sufficiency, the condition implies by the continuity theorem for one-dimensional characteristic functions that for each (t₁, ..., t_k),

E[exp(is Σ_{u=1}^k t_u X_{nu})] → E[exp(is Σ_{u=1}^k t_u Y_u)]

for all real s. Taking s = 1 shows that the characteristic function of X_n converges pointwise to that of Y. •

Normal Distributions in R^k

By Theorem 20.4 there is (on some probability space) a random vector
X = (X₁, ..., X_k) with independent components each having the standard normal distribution. Since each X_u has density e^{-x²/2}/√(2π), X has density (see (20.25))

(29.13)  f(x) = (2π)^{-k/2} e^{-|x|²/2},

where |x|² = Σ_{u=1}^k x_u² denotes the square of the Euclidean norm. This distribution plays the role of the standard normal distribution in R^k. Its characteristic function is

(29.14)  E[e^{it·X}] = ∏_{u=1}^k e^{-t_u²/2} = e^{-|t|²/2}.

Let A = [a_uv] be a k × k matrix and put Y = AX, where X is viewed as a column vector. Since E[X_α X_β] = δ_αβ, the matrix Σ = [σ_uv] of the covariances of Y has entries σ_uv = E[Y_u Y_v] = Σ_{α=1}^k a_{uα} a_{vα}. Thus Σ = AA′, where the prime denotes transpose. The matrix Σ is symmetric and nonnegative definite: Σ_{u,v} σ_uv x_u x_v = |A′x|² ≥ 0. View t also as a column vector with transpose t′, and note that t·x = t′x. The characteristic function of AX is thus

E[e^{it′AX}] = e^{-|A′t|²/2} = e^{-t′Σt/2}.

Define a centered normal distribution as any probability measure whose characteristic function has this form for some symmetric nonnegative definite Σ.

If Σ is symmetric and nonnegative definite, then for an appropriate orthogonal matrix U, U′ΣU = D is a diagonal matrix whose diagonal elements are the eigenvalues of Σ and hence are nonnegative. If D₀ is the diagonal matrix whose elements are the square roots of those of D, and if A = UD₀, then Σ = AA′. Thus for every nonnegative definite Σ there exists a unique centered normal distribution (namely the distribution of AX) with covariance matrix Σ and characteristic function exp(-½t′Σt).

If Σ is nonsingular, so is the A just constructed. Since X has density (29.13), Y = AX has by the Jacobian transformation formula (20.20) density f(A⁻¹x)|det A⁻¹|. Since Σ = AA′, |det A⁻¹| = (det Σ)^{-1/2}. Moreover, Σ⁻¹ = (A′)⁻¹A⁻¹, so that |A⁻¹x|² = x′Σ⁻¹x. Thus the centered normal distribution has density (2π)^{-k/2}(det Σ)^{-1/2} exp(-½x′Σ⁻¹x) if Σ is nonsingular. If Σ is singular, the A constructed above must be singular as well, so that AX is confined to some hyperplane of dimension k - 1 and the distribution can have no density.

If M is a j × k matrix and Y has in R^k the centered normal distribution with covariance matrix Σ, then MY has in R^j the characteristic function exp(-½(M′t)′Σ(M′t)) = exp(-½t′(MΣM′)t) for t ∈ R^j. Hence MY has the centered normal distribution in R^j with covariance matrix MΣM′. Thus a linear transformation of a normal distribution is itself normal.

These normal distributions are special in that all the first moments vanish. The general normal distribution is a translation of one of these centered distributions.
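The construction Σ = AA′ is easy to exercise in a small simulation. For a 2 × 2 covariance matrix the factor A can be written down directly; the sketch uses a Cholesky factor rather than the UD₀ of the text (either satisfies Σ = AA′), and the particular Σ, seed, and sample size are arbitrary.

```python
import math
import random

random.seed(3)

# target covariance matrix (symmetric, positive definite)
s11, s12, s22 = 2.0, 0.6, 1.0

# a factor A with Sigma = A A' (a Cholesky factor; the text's U D_0 works too)
a11 = math.sqrt(s11)
a21 = s12 / a11
a22 = math.sqrt(s22 - a21 * a21)

n = 20000
y1, y2 = [], []
for _ in range(n):
    x1, x2 = random.gauss(0, 1), random.gauss(0, 1)  # standard normal X
    y1.append(a11 * x1)                 # Y = A X, first coordinate
    y2.append(a21 * x1 + a22 * x2)      # second coordinate

c11 = sum(v * v for v in y1) / n
c12 = sum(u * v for u, v in zip(y1, y2)) / n
c22 = sum(v * v for v in y2) / n
print(round(c11, 2), round(c12, 2), round(c22, 2))  # near 2.0, 0.6, 1.0
```

The sample covariances of Y = AX reproduce the prescribed Σ, which is the content of the identity Σ = AA′.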
The Central Limit Theorem

Let X_n = (X_{n1}, ..., X_{nk}) be independent random vectors all having the same distribution. Suppose that E[X²_{nu}] < ∞; let the vector of means be c = (c₁, ..., c_k), where c_u = E[X_{nu}], and let the covariance matrix be Σ = [σ_uv], where σ_uv = E[(X_{nu} - c_u)(X_{nv} - c_v)]. Put S_n = X₁ + ··· + X_n.

Theorem 29.5. Under these assumptions, the distribution of the random vector (S_n - nc)/√n converges weakly to the centered normal distribution with covariance matrix Σ.

PROOF. Let Y = (Y₁, ..., Y_k) be a normally distributed random vector with 0 means and covariance matrix Σ. For given t = (t₁, ..., t_k), let Z_n = Σ_{u=1}^k t_u(X_{nu} - c_u) and Z = Σ_{u=1}^k t_u Y_u. By Theorem 29.4, it suffices to prove that n^{-1/2} Σ_{j=1}^n Z_j ⇒ Z (for arbitrary t). But this is an instant consequence of the Lindeberg-Lévy theorem (Theorem 27.1). •
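Theorem 29.5 can be watched in action. Below, the X_i are two-dimensional with dependent uniform coordinates, so Σ = [[1/3, 1/3], [1/3, 2/3]]; the sample covariance of S_n/√n over many replications approaches Σ. The dimensions, seed, and sizes are arbitrary choices for illustration.

```python
import random

random.seed(5)

n, trials = 400, 2000
pts = []
for _ in range(trials):
    s1 = s2 = 0.0
    for _ in range(n):
        u = random.uniform(-1, 1)
        v = random.uniform(-1, 1)
        s1 += u        # first coordinate of X_i
        s2 += u + v    # second coordinate of X_i (correlated with the first)
    pts.append((s1 / n ** 0.5, s2 / n ** 0.5))

# limit law: centered normal with Sigma = [[1/3, 1/3], [1/3, 2/3]]
c11 = sum(a * a for a, _ in pts) / trials
c12 = sum(a * b for a, b in pts) / trials
c22 = sum(b * b for _, b in pts) / trials
print(round(c11, 2), round(c12, 2), round(c22, 2))
```

By the Cramér-Wold device used in the proof, one could equally check that each linear combination t·S_n/√n is approximately N(0, t′Σt).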
Skorohod's Theorem in R^k*

Suppose that Y_n and Y are k-dimensional random vectors and Y_n → Y with probability 1. If G is open, then

[Y ∈ G] ⊂ lim inf_n [Y_n ∈ G] ∪ [Y_n → Y]ᶜ,

and Theorem 4.1 implies that P[Y ∈ G] ≤ lim inf_n P[Y_n ∈ G], so that Y_n ⇒ Y by Theorem 29.1. Thus convergence with probability 1 implies convergence in distribution.
Theorem 29.6. Suppose that μ_n and μ are probability measures on (R^k, ℛ^k) and μ_n ⇒ μ. Then there exist random vectors Y_n and Y on a common probability space (Ω, ℱ, P) such that Y_n has distribution μ_n, Y has distribution μ, and Y_n(ω) → Y(ω) for each ω.
PROOF. As in the proof of Theorem 25.6, the probability space (Ω, ℱ, P) will be taken to be the unit interval with Lebesgue measure. It will be convenient to take Ω = [0, 1), closed on the left and open on the right.

Let B₀ be the set of real s such that μ[x: x_i = s] > 0 for some i = 1, ..., k, and let B consist of all rational multiples of elements of B₀. Then B is countable, and so there exists some positive t outside B. By the construction the set

D = [n2^{-u}t: u = 0, 1, 2, ...; n = 0, ±1, ±2, ...]

is disjoint from B₀; of course, D is dense on the line.

Let [C^{(1)}_{i₁}: i₁ = 1, 2, ...] be the countable decomposition of R^k into cubes of the form [x: n_i t < x_i ≤ (n_i + 1)t, i = 1, ..., k] for integers n₁, ..., n_k. For each i₁ let [C^{(2)}_{i₁i₂}: i₂ = 1, ..., 2^k] be the decomposition of C^{(1)}_{i₁} into 2^k cubes of side t/2. Now decompose each C^{(2)}_{i₁i₂} into 2^k cubes of side t/2², and continue inductively in this way to obtain for u = 1, 2, ... decompositions [C^{(u)}_{i₁...i_u}] of R^k into cubes whose boundaries have μ-measure 0.

Therefore, if A = [x: a_i < x_i ≤ b_i, i = 1, ..., k] and the a_i and b_i all lie in the set (29.15), then A is for some u the union of countably many sets C^{(u)}_{i₁...i_u}, so that by Scheffé's theorem lim_v P[Y^{(v)} ∈ A] = μ(A). Since (29.15) is dense, the distribution of Y^{(v)} converges weakly to μ.† Since Y^{(v)} → Y by (29.21), μ and the distribution of Y must coincide.
Construction of the Y_n. The construction is as for Y. For n, u ≥ 1, let [I^{(n,u)}_{i₁...i_u}] be a decomposition of [0, 1) into subintervals for which

(29.23)  I^{(n,u-1)}_{i₁...i_{u-1}} = ∪_{i_u=1}^{2^k} I^{(n,u)}_{i₁...i_{u-1}i_u}

and

(29.24)  P(I^{(n,u)}_{i₁...i_u}) = μ_n(C^{(u)}_{i₁...i_u}).

This time arrange inductively* that i < j if and only if I^{(n,u)}_{i₁...i_{u-1}i} lies to the left of I^{(n,u)}_{i₁...i_{u-1}j}. For the same x^{(u)}_{i₁...i_u} as in (29.20), define Y_n^{(u)}(ω) = x^{(u)}_{i₁...i_u} for ω in I^{(n,u)}_{i₁...i_u}. As before, |Y_n^{(u+1)}(ω) - Y_n^{(u)}(ω)| ≤ kt/2^{u-1}, so that the limit

(29.25)  Y_n(ω) = lim_u Y_n^{(u)}(ω)

exists, and Y_n has distribution μ_n.

† See the proof on p. 391 that μ_n ⇒ μ implies (iii).
* The construction in the footnote on p. 400 has the property that i < j if and only if I_i lies to the left of I_j.
Convergence. It remains to consider the convergence of Y_n(ω) to Y(ω). Let

I^{(n,u)}_{i₁...i_u} = [a^{(n,u)}_{i₁...i_u}, b^{(n,u)}_{i₁...i_u}).

If Σ′ denotes summation over those (j₁, ..., j_u) that precede (i₁, ..., i_u) (here i₁, ..., i_u are fixed), then by (29.24),

a^{(n,u)}_{i₁...i_u} = Σ′ P(I^{(n,u)}_{j₁...j_u}) = Σ′ μ_n(C^{(u)}_{j₁...j_u}).

Since μ_n ⇒ μ by hypothesis, it follows by (29.17) that this finite sum converges to

Σ′ μ(C^{(u)}_{j₁...j_u}) = Σ′ P(I^{(u)}_{j₁...j_u}) = a^{(u)}_{i₁...i_u}.

Thus lim_n a^{(n,u)}_{i₁...i_u} = a^{(u)}_{i₁...i_u} for each u and i₁, ..., i_u; similarly, lim_n b^{(n,u)}_{i₁...i_u} = b^{(u)}_{i₁...i_u}. If ω is interior to I^{(u)}_{i₁...i_u}, it is therefore also interior to I^{(n,u)}_{i₁...i_u} for large n, so that Y_n^{(u)}(ω) = Y^{(u)}(ω) for large n.

PROBLEMS

29.2. (a) Show that {μ_n} is tight if and only if for each ε there is a compact set K such that μ_n(K) > 1 - ε for all n.
(b) Show that {μ_n} is tight if and only if each of the k sequences of marginal distributions is tight on the line.

29.3. Assume of (X_n, Y_n) that X_n ⇒ X and Y_n ⇒ c. Show that (X_n, Y_n) ⇒ (X, c). This is an example of Problem 29.2(b) where X_n and Y_n need not be assumed independent.
19.4.
19.5. 19.6. '1.9.7. 19.8.
Let p. and v be a pair of centered normal distributions on R 2 • Suppose that all the variances are 1 but that the two covariances differ. Show that the mixture !P. + t v is not normal even though its marginal distributions are. Prove analogues for R k of the corollaries to Theorem 26.3. Devise a version of Lindeberg's theorem for R k . Suppose that estimates U and V of a and fJ are approximately normally distributed with means a and fJ and small variances aJ and al � , respec tively. Suppose that U and V are independent and fJ > 0. Use Example 29. 1 for l(x, y ) == xjy to show that U/ V is approximately normally distributed
404
CONVERGENCE OF DISTRIBUTIONS
with mean aj{J and variance
A physical example is this: U estimates the charge a == e on the electron, V estimates the charge-to-mass ratio {J == ejm, and U/ V estimates the mass a.j{J == m. 29.9. To obtain a uniform distribution over the surface of the unit sphere in R k , fill in the details in the following argument. Let X have the centered normal distribution with the identity as covariance matrix. For U orthogonal, UX has the same distribution. Let cf-(x) == xf l x l on R k (take ct-(0) == 0, say). Then Y == ct-(X) lies on the unit sphere and, as UY == ct-(UX) for U orthogonal, the distribution of Y is invariant under rotations. 29.10. Assume that
is positive definite, invert it explicitly, and show that the corresponding two-dimensional normal density is
29.1 1. 29.12.
29.13.
29. 14.
where D == a1 1 a22 - af2 • 21.15 29.10 t Al�ough uncorrelated random variables are not in general independent, they are if they are jointly normally distributed. Suppose that /( X) and g( Y) are uncorrelated for all bounded continuous I and g. Show that X and Y are independent. Hint: Use characteris tic functions. 20.22 f Suppose that the random vector X has a centered k-dimensional normal distribution whose covariance matrix has 1 as an eigenvalue of multiplicity r and 0 as an eigenvalue of multiplicity k - r. Show that 1 X 1 2 has the chi-squared distribution with r degrees of freedom. f Multinomial sampling. Let p1 , • • • , Pk be positive and add to 1, and let z. ' z2 ' • . . be independent k-dimensional random vectors such that zn has with probability P; a 1 in the i th component and O's elsewhere. Then In = < ln 1 , • • • , In k ) == E :, _ 1 Zm is the frequency count for a sample of size n from a multinomial population with cell probabilities P; · Put Xn ; == ( In ; np; )/{ni; and xn == ( xnl ' . . . ' xn k ). (a) Show that Xn has mean values 0 and covariances oiJ == ( BiJ p1 - P; p1 )/
VP; PJ .
(b) Show that the chi-squared statistic Ef.... 1 < In ; - np; ) 2 jnp; has asympto t
ically the chi-squared distribution with k - 1 degrees of freedom.
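Part (b) of the multinomial problem can be checked by simulation: a chi-squared variable with k - 1 = 2 degrees of freedom has mean 2. The cell probabilities, seed, and sizes below are arbitrary illustrative choices.

```python
import random

random.seed(9)

p = [0.2, 0.3, 0.5]
n, trials = 600, 2000
stats = []
for _ in range(trials):
    f = [0, 0, 0]
    for _ in range(n):
        u = random.random()
        f[0 if u < p[0] else 1 if u < p[0] + p[1] else 2] += 1
    # the chi-squared statistic of 29.14(b)
    stats.append(sum((f[i] - n * p[i]) ** 2 / (n * p[i]) for i in range(3)))

mean = sum(stats) / trials
print(round(mean, 2))  # close to k - 1 = 2
```

In fact the expected value of the statistic is exactly k - 1 for every n; the limit theorem concerns its whole distribution, not just the mean.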
SECTION
30.
405
THE METHOD OF MOMENTS ==
(
�
, nX,,, ) is uniformly distributed over the surface of a sphere of radius vn in R . Fix t and show that xnl ' . . . ' xnt are in the limit independent, each with the standard normal distribution. Hint: If the components of Y, == ( Yn 1 , , Y, n ) are independent, each with the standard normal distribution, then xn has the same distribution as {n Y,, / 1 Y, I · (b) Suppose that the distribution of Xn ( Xn 1 , . . . , Xnn ) is spherically symmetric in the sense that Xn/ I Xn l is uniformly distributed over the unit sphere. Assume that 1 xn 1 2 /n � 1 and show that Xn1 , . . . , xn , are asymptot ically independent and normal. 29.16. 19.13 29.15 f Prove the result in Problem 29.15(a) by the method of Problem 19.12. 29.17. Let Xn ( Xn . , . . . , Xn k ) , n == 1, 2, . . . , be random vectors satisfying the mixing condition (27.25) with an O( n - 5 ). Suppose that the sequence is stationary (the distribution of ( Xn , . . . , Xn +.i ) is the same for all n), that E[ Xn u 1 0, and that the Xn u are uniformly bounded. Show that if Sn X1 + · · · + Xn , then Sn/ {n has in the limit the centered normal distribution with covariances
29.15. 29.9 t
•
A
•
theorem of Poincare. (a) Suppose that Xn
,
.
•
•
•
=-
==
==
==
==
00
00
E [ x. u xl v ] + E E [ x. u x. +j. v ] + E E [ x. +j. u Xll, ] .
j- 1
j- 1
Hint: Use the Cramer-Wold device. 29.18. f As in Example 27.6, let { Y, } be a Markov chain with finite state space S == {1, . . . , k }, say. Suppose the transition probabilities Puv are all positive and the initial probabilities Pu are the stationary ones. Let fn u be the number of i for which 1 < i < n and Y; = u . Show that the normalized frequency count
n - 1 12 ( fn1 - np l ' · · · , fn k - npk ) has in the limit the centered normal distribution with covariances 00
00
Buv - Pu Pv + E ( P!t) - Pu Pv } + E ( P!�) - PvPu } ·
j- 1
j- 1
SECI10N 30. THE METHOD OF MOMENTS * 1be
Moment Problem
For some distributions the characteristic function is intractable but mo ments can nonetheless be calculated. In these cases it is sometimes possible *This section may be omitted.
406
CONVERGENCE OF DISTRIBUTIONS
to prove weak convergence of the distributions by establishing that the moments converge. This approach requires conditions under which a distri bution is uniquely determined by its moments, and this is for the same reason that the continuity theorem for characteristic functions requires for its proof the uniqueness theorem.
Let p. be a probability measure on the line having finite moments a k f (X) (X) x kp.( dx ) of all orders. If the power series Ekak r k/k ! has a positive radius of convergence, then p. is the only probability measure with the moments a 1 , a 2 , PROOF. Let P k f (X) (X) I x l kp.(dx) be the absolute moments. The first step is Theorem 30. 1. =
• • •
•
=
to show that
k --+
(30.1)
00 ,
for some positive r. By hypothesis there exists an s, 0 < s < 1, such that a k s kjk! 0. Choose 0 < r < s; then 2kr 2 k - t < s 2 k for large k. Since .-.
lx l 2k - t � 1
+
l x l 2 k,
r2k- l ( 2k - 1)!
+
p2 ks 2 k ( 2k )!
for large k. Hence (30.1) holds as k goes to infinity through odd values; since Pk = a k for k even, (30.1) follows. By (26.4),
$$\left| e^{ihx} - \sum_{k=0}^{n} \frac{(ihx)^k}{k!} \right| \le \frac{|hx|^{n+1}}{(n+1)!},$$

and therefore the characteristic function $\varphi$ of $\mu$ satisfies

$$\left| \varphi(t+h) - \sum_{k=0}^{n} \frac{h^k}{k!}\int_{-\infty}^{\infty} (ix)^k e^{itx}\,\mu(dx) \right| \le \frac{|h|^{n+1}\beta_{n+1}}{(n+1)!}.$$

By (26.10), the integral here is $\varphi^{(k)}(t)$. By (30.1), the right side goes to 0 as $n \to \infty$ if $|h| < r$, and therefore

(30.2)  $\varphi(t+h) = \displaystyle\sum_{k=0}^{\infty} \frac{\varphi^{(k)}(t)}{k!}\,h^k, \qquad |h| < r.$
If $\nu$ is another probability measure with moments $a_k$ and characteristic function $\psi(t)$, the same argument gives

(30.3)  $\psi(t+h) = \displaystyle\sum_{k=0}^{\infty} \frac{\psi^{(k)}(t)}{k!}\,h^k, \qquad |h| < r.$

Take t = 0; since $\varphi^{(k)}(0) = i^k a_k = \psi^{(k)}(0)$ (see (26.9)), $\varphi$ and $\psi$ agree in $(-r, r)$ and hence have identical derivatives there. Taking $t = r - \epsilon$ and $t = -r + \epsilon$ in (30.2) and (30.3) shows that $\varphi$ and $\psi$ also agree in $(-2r + \epsilon, 2r - \epsilon)$ and hence in $(-2r, 2r)$. But then they must by the same argument agree in $(-3r, 3r)$ as well, and so on.† Thus $\varphi$ and $\psi$ coincide, and by the uniqueness theorem for characteristic functions, so do $\mu$ and $\nu$. •
A probability measure satisfying the conclusion of the theorem is said to be determined by its moments.

Example 30.1. For the standard normal distribution, $|a_k| \le k!$, and so the theorem implies that it is determined by its moments. •
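A quick computational check of the bound (the even normal moments are $a_{2k} = (2k)!/(2^k k!) = 1\cdot 3\cdots(2k-1)$ and the odd ones vanish, so the series of Theorem 30.1 in fact converges for every r):

```python
from math import factorial

def a(k):
    # standard normal moments: 0 for odd k, (k-1)!! for even k
    if k % 2:
        return 0
    return factorial(k) // (2 ** (k // 2) * factorial(k // 2))

for k in range(1, 30):
    assert abs(a(k)) <= factorial(k)  # the bound used in the example

# nonzero terms of sum_k a_k r^k / k! are r^(2k) / (2^k k!), which -> 0 for any r
r = 3.0
assert a(58) * r ** 58 / factorial(58) < 1e-6
```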
But not all measures are determined by their moments:

Example 30.2. If N has the standard normal density, then $e^N$ has the log-normal density
$$f(x) = \begin{cases} \dfrac{1}{\sqrt{2\pi}}\,\dfrac{1}{x}\,e^{-(\log x)^2/2} & \text{if } x > 0, \\[4pt] 0 & \text{if } x \le 0. \end{cases}$$
Put $g(x) = f(x)\big(1 + \sin(2\pi \log x)\big)$. If
$$\int_0^{\infty} x^k f(x)\sin(2\pi \log x)\,dx = 0, \qquad k = 0, 1, 2, \ldots,$$
then g, which is nonnegative, will be a probability density and will have the same moments as f. But a change of variable $\log x = s + k$ reduces the integral above to
$$\frac{e^{k^2/2}}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-s^2/2}\sin(2\pi s)\,ds,$$
which vanishes because the integrand is odd. •

†This process is a version of analytic continuation.
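The odd-integrand argument is easy to confirm numerically; the sketch below evaluates $\int e^{-s^2/2}\sin(2\pi s)\,ds$, to which the change of variable reduces every one of the moment integrals, on a symmetric grid.

```python
import numpy as np

# After the substitution log x = s + k, each integral above is a constant
# times the integral over the line of exp(-s^2/2) * sin(2*pi*s), which is
# odd in s; a symmetric trapezoid rule makes the cancellation visible.
s = np.linspace(-12.0, 12.0, 200001)
y = np.exp(-s ** 2 / 2) * np.sin(2 * np.pi * s)
integral = float(np.sum((y[:-1] + y[1:]) / 2) * (s[1] - s[0]))
assert abs(integral) < 1e-8  # vanishes, as the oddness argument predicts
```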
Theorem 30.2. Suppose that the distribution of X is determined by its moments, that the $X_n$ have moments of all orders, and that $\lim_n E[X_n^r] = E[X^r]$ for $r = 1, 2, \ldots$. Then $X_n \Rightarrow X$.

PROOF. Let $\mu_n$ and $\mu$ be the distributions of $X_n$ and X. Since $E[X_n^2]$ converges, it is bounded, say by K. By Markov's inequality, $P[|X_n| \ge x] \le K/x^2$, which implies that the sequence $\{\mu_n\}$ is tight. Suppose that $\mu_{n_k} \Rightarrow \nu$, and let Y be a random variable with distribution $\nu$. If u is an even integer exceeding r, the convergence and hence boundedness of $E[X_n^u]$ implies that $E[X_{n_k}^r] \to E[Y^r]$ by the corollary to Theorem 25.12. By the hypothesis, then, $E[Y^r] = E[X^r]$; that is, $\nu$ and $\mu$ have the same moments. Since $\mu$ is by hypothesis determined by its moments, $\nu$ must be the same as $\mu$, and so $\mu_{n_k} \Rightarrow \mu$. The conclusion now follows by the corollary to Theorem 25.10. •

Convergence to the log-normal distribution cannot be proved by establishing convergence of moments (take X to have density f and the $X_n$ to have density g in Example 30.2). Because of Example 30.1, however, this approach will work for a normal limit.
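For a concrete (and hypothetical) instance of Theorem 30.2: let $X_n$ be uniform on $\{1/n, 2/n, \ldots, 1\}$. Its moments converge to $1/(r+1)$, the moments of the uniform distribution on [0, 1]; that distribution is bounded and hence determined by its moments exactly as in Example 30.1, so the theorem gives $X_n \Rightarrow X$.

```python
# r-th moment of X_n, uniform on the n points {1/n, ..., n/n}
def discrete_moment(n, r):
    return sum((k / n) ** r for k in range(1, n + 1)) / n

# the uniform [0,1] distribution has r-th moment 1/(r+1);
# the discrete moments converge to it as n grows
for r in range(1, 8):
    assert abs(discrete_moment(5000, r) - 1 / (r + 1)) < 1e-3
```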
Moment Generating Functions
Suppose that $\mu$ has a moment generating function M(s) for $s \in [-s_0, s_0]$, $s_0 > 0$. By (21.22), the hypothesis of Theorem 30.1 is satisfied, and so $\mu$ is determined by its moments, which are in turn determined by M(s) via (21.23). Thus $\mu$ is determined by M(s) if it exists in a neighborhood of 0.† The version of this for one-sided transforms was proved in Section 22; see Theorem 22.2.

Suppose that $\mu_n$ and $\mu$ have moment generating functions in a common interval $[-s_0, s_0]$, $s_0 > 0$, and suppose that $M_n(s) \to M(s)$ in this interval. Since $\mu_n[(-a, a)^c] \le e^{-s_0 a}\big(M_n(-s_0) + M_n(s_0)\big)$, it follows easily that $\{\mu_n\}$ is tight. Since M(s) determines $\mu$, the usual argument now gives $\mu_n \Rightarrow \mu$.

Central Limit Theorem by Moments
To understand the application of the method of moments, consider once again a sum $S_n = X_{n1} + \cdots + X_{nk_n}$, where $X_{n1}, \ldots, X_{nk_n}$ are independent

†For another proof, see Problem 26.7. The present proof does not require the idea of analyticity.
and have moments of all orders, and such that

(30.4)  $E[X_{nk}] = 0, \qquad s_n^2 = \displaystyle\sum_{k=1}^{k_n} \sigma_{nk}^2.$
Suppose further that for each n there is an $M_n$ such that $|X_{nk}| \le M_n$, $k = 1, \ldots, k_n$, with probability 1. Finally, suppose that

(30.5)  $\dfrac{M_n}{s_n} \to 0.$

All moments exist, and†

(30.6)  $S_n^r = \displaystyle\sum{}' \frac{r!}{r_1! \cdots r_u!\,u!} \sum{}'' X_{ni_1}^{r_1} \cdots X_{ni_u}^{r_u},$

where $\sum'$ extends over the u-tuples of positive integers satisfying $r_1 + \cdots + r_u = r$ and $\sum''$ extends over the u-tuples $(i_1, \ldots, i_u)$ of distinct integers in the range $1 \le i_a \le k_n$. By independence, then,

(30.7)  $E\!\left[\dfrac{S_n^r}{s_n^r}\right] = \displaystyle\sum{}' \frac{r!}{r_1! \cdots r_u!\,u!}\, A_n(r_1, \ldots, r_u),$

where

(30.8)  $A_n(r_1, \ldots, r_u) = \displaystyle\sum{}'' \frac{E[X_{ni_1}^{r_1}] \cdots E[X_{ni_u}^{r_u}]}{s_n^r},$

and $\sum'$ and $\sum''$ have the same ranges as before. To prove that (30.7) converges to the rth moment of the standard normal distribution, it suffices to show that

(30.9)  $A_n(r_1, \ldots, r_u) \to \begin{cases} 1 & \text{if } r_1 = \cdots = r_u = 2, \\ 0 & \text{otherwise.} \end{cases}$

If r is even, all terms in (30.7) will then go to 0 except the one for which $u = r/2$ and $r_a \equiv 2$, which will go to $r!/(r_1! \cdots r_u!\,u!) = 1 \cdot 3 \cdot 5 \cdots (r-1)$. And if r is odd, the terms will go to 0 without exception.

†To deduce this from the multinomial formula, restrict the inner sum to u-tuples satisfying $1 \le i_1 < \cdots < i_u \le k_n$ and compensate by striking out the $1/u!$.
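The combinatorial identity invoked here, $r!/(2^{r/2}(r/2)!) = 1\cdot 3\cdot 5\cdots(r-1)$ for even r, is easily verified:

```python
from math import factorial

for r in range(2, 21, 2):
    u = r // 2
    # r!/(r_1!...r_u! u!) with every r_a = 2
    coeff = factorial(r) // (2 ** u * factorial(u))
    double_fact = 1
    for j in range(1, r, 2):  # 1 * 3 * 5 * ... * (r - 1)
        double_fact *= j
    assert coeff == double_fact  # equals the r-th moment of N
```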
If $r_\alpha = 1$ for some $\alpha$, then (30.9) holds because by (30.4) each summand in (30.8) vanishes. Suppose that $r_\alpha \ge 2$ for each $\alpha$ and $r_\alpha > 2$ for some $\alpha$. Then $r > 2u$, and since $|E[X_{ni}^{r_\alpha}]| \le M_n^{r_\alpha - 2}\sigma_{ni}^2$, it follows that $A_n(r_1, \ldots, r_u) \le (M_n/s_n)^{r-2u} A_n(2, \ldots, 2)$. But this goes to 0 because (30.5) holds and because $A_n(2, \ldots, 2)$ is bounded by 1 (it increases to 1 if the sum in (30.8) is enlarged to include all the u-tuples $(i_1, \ldots, i_u)$).

It remains only to check (30.9) for $r_1 = \cdots = r_u = 2$. As just noted, $A_n(2, \ldots, 2)$ is at most 1, and it differs from 1 by $s_n^{-2u}\sum \sigma_{ni_1}^2 \cdots \sigma_{ni_u}^2$, the sum extending over the $(i_1, \ldots, i_u)$ with at least one repeated index. Since $\sigma_{ni}^2 \le M_n^2$, the terms for example with $i_u = i_{u-1}$ sum to at most $M_n^2 s_n^{-2u} \sum \sigma_{ni_1}^2 \cdots \sigma_{ni_{u-1}}^2 \le M_n^2 s_n^{-2}$. Thus $1 - A_n(2, \ldots, 2) \le u^2 M_n^2 s_n^{-2} \to 0$. This proves that the moments (30.7) converge to those of the normal distribution and hence that $S_n/s_n \Rightarrow N$.
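The simplest array satisfying the hypotheses is $X_{nk} = \pm 1$ with probability $\frac12$ each, so that $\sigma_{nk}^2 = 1$, $s_n^2 = n$, $M_n = 1$, and $M_n/s_n = n^{-1/2} \to 0$. The exact standardized moments, computable from the binomial distribution, visibly approach $1\cdot 3\cdots(r-1)$:

```python
from math import comb

def standardized_moment(n, r):
    # E[(S_n / sqrt(n))^r] for S_n a sum of n independent +-1 variables
    num = sum(comb(n, b) * (2 * b - n) ** r for b in range(n + 1))
    return num / (2 ** n * n ** (r // 2))

def normal_moment(r):
    out = 1
    for j in range(1, r, 2):
        out *= j
    return out  # (r-1)!! for even r

n = 400
for r in (2, 4, 6):
    assert abs(standardized_moment(n, r) - normal_moment(r)) < 0.1
```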
Application to Sampling Theory

Suppose that n numbers $x_{n1}, \ldots, x_{nn}$, not necessarily distinct, are associated with the elements of a population of size n. Suppose that these numbers are normalized by the requirement

(30.10)  $\displaystyle\sum_{h=1}^{n} x_{nh} = 0, \qquad \sum_{h=1}^{n} x_{nh}^2 = 1,$

and put $M_n = \max_h |x_{nh}|$. Let $X_{n1}, \ldots, X_{nk_n}$ be the results of $k_n$ random drawings from the population, without replacement, and put $S_n = X_{n1} + \cdots + X_{nk_n}$. Then $S_n/s_n \Rightarrow N$ if

(30.11)  $s_n^2 = \dfrac{k_n}{n} \to 0, \qquad \dfrac{M_n}{s_n} \to 0, \qquad k_n \to \infty.$

Since $M_n^2 \ge n^{-1}$ by (30.10), the second condition here in fact implies the third. The moments again have the form (30.7), but this time $E[X_{ni_1}^{r_1} \cdots X_{ni_u}^{r_u}]$ cannot be factored as in (30.8). On the other hand, this expected value is by symmetry the same for each of the $(k_n)_u = k_n(k_n - 1)\cdots(k_n - u + 1)$
choices of the indices $i_a$ in the sum $\sum''$. Thus

$$A_n(r_1, \ldots, r_u) = \frac{(k_n)_u}{s_n^r}\, E\big[X_{n1}^{r_1} \cdots X_{nu}^{r_u}\big].$$

The problem again is to prove (30.9). The proof goes by induction on u. Now $A_n(r) = k_n s_n^{-r} n^{-1} \sum_{h=1}^{n} x_{nh}^r$, so that $A_n(1) = 0$ and $A_n(2) = 1$. If $r \ge 3$, then $|x_{nh}^r| \le M_n^{r-2} x_{nh}^2$, and so $|A_n(r)| \le (M_n/s_n)^{r-2} \to 0$ by (30.11).

Next suppose as induction hypothesis that (30.9) holds with u − 1 in place of u. Since the sampling is without replacement, $E[X_{n1}^{r_1} \cdots X_{nu}^{r_u}] = \sum x_{nh_1}^{r_1} \cdots x_{nh_u}^{r_u}/(n)_u$, where the summation extends over the u-tuples $(h_1, \ldots, h_u)$ of distinct integers in the range $1 \le h_a \le n$. In this last sum enlarge the range by requiring of $(h_1, h_2, \ldots, h_u)$ only that $h_2, \ldots, h_u$ be distinct, and then compensate by subtracting away the terms where $h_1 = h_2$, where $h_1 = h_3$, and so on. The result is

$$E\big[X_{n1}^{r_1} \cdots X_{nu}^{r_u}\big] = \frac{(n)_{u-1}}{(n)_u}\left[\Big(\sum_{h=1}^{n} x_{nh}^{r_1}\Big) E\big[X_{n1}^{r_2} \cdots X_{n,u-1}^{r_u}\big] - \sum_{a=2}^{u} E\big[X_{n1}^{r_2} \cdots X_{n,a-1}^{r_1 + r_a} \cdots X_{n,u-1}^{r_u}\big]\right].$$

This takes the place of the factorization made possible in (30.8) by the assumed independence there. It gives

$$A_n(r_1, \ldots, r_u) = \frac{(k_n - u + 1)\,n}{(n - u + 1)\,k_n}\, A_n(r_1)\,A_n(r_2, \ldots, r_u) - \frac{k_n - u + 1}{n - u + 1}\sum_{a=2}^{u} A_n(r_2, \ldots, r_1 + r_a, \ldots, r_u).$$

By the induction hypothesis the last sum is bounded, and the factor in front of it goes to 0 by (30.11). As for the first term on the right, the factor in front goes to 1. If $r_1 \ne 2$, $A_n(r_1) \to 0$ and $A_n(r_2, \ldots, r_u)$ is bounded, and so $A_n(r_1, \ldots, r_u) \to 0$. The same holds by symmetry if $r_a \ne 2$ for some a other than 1. If $r_1 = \cdots = r_u = 2$, then $A_n(r_1) = 1$, and $A_n(r_2, \ldots, r_u) \to 1$ by the induction hypothesis. Thus (30.9) holds in all cases, and $S_n/s_n \Rightarrow N$ follows by the method of moments. •
Application to Number Theory
Let g(m) be the number of distinct prime factors of the integer m; for example $g(3^4 \cdot 5^2) = 2$. Since there are infinitely many primes, g(m) is unbounded above; for the same reason, it drops back to 1 for infinitely many m (at the primes and their powers). Since g fluctuates in an irregular way, it is natural to inquire into its average behavior.

On the space $\Omega$ of positive integers, let $P_n$ be the probability measure that places mass 1/n at each of 1, 2, ..., n, so that among the first n positive integers the proportion that are contained in a given set A is just $P_n(A)$. The problem is to study $P_n[m: g(m) \le x]$ for large n. If $\delta_p(m)$ is 1 or 0 according as the prime p divides m or not, then

(30.12)  $g(m) = \displaystyle\sum_p \delta_p(m).$

Probability theory can be used to investigate this sum because under $P_n$ the $\delta_p(m)$ behave somewhat like independent random variables.† If $p_1, \ldots, p_u$ are distinct primes, then by the fundamental theorem of arithmetic, $\delta_{p_1}(m) = \cdots = \delta_{p_u}(m) = 1$, that is, each $p_i$ divides m, if and only if the product $p_1 \cdots p_u$ divides m. The probability under $P_n$ of this is just $n^{-1}$ times the number of m in the range $1 \le m \le n$ that are multiples of $p_1 \cdots p_u$, and this number is the integer part of $n/p_1 \cdots p_u$. Thus

(30.13)  $P_n[m: \delta_{p_i}(m) = 1,\ i = 1, \ldots, u] = \dfrac{1}{n}\left\lfloor \dfrac{n}{p_1 \cdots p_u} \right\rfloor$
for distinct $p_i$.

Now let $X_p$ be independent random variables (on some probability space, one variable for each prime p) satisfying

$$P[X_p = 1] = \frac{1}{p}, \qquad P[X_p = 0] = 1 - \frac{1}{p}.$$

If $p_1, \ldots, p_u$ are distinct, then

(30.14)  $P[X_{p_i} = 1,\ i = 1, \ldots, u] = \dfrac{1}{p_1 \cdots p_u}.$

For fixed $p_1, \ldots, p_u$, (30.13) converges to (30.14) as $n \to \infty$. Thus the

†See also Problems 2.15, 5.16, and 6.17.
behavior of the $X_p$ can serve as a guide to that of the $\delta_p(m)$. If $m \le n$, (30.12) is $\sum_{p \le n} \delta_p(m)$ because no prime exceeding m can divide it. The idea is to compare this sum with the corresponding sum $\sum_{p \le n} X_p$. This will require from number theory the elementary estimate†

(30.15)  $\displaystyle\sum_{p \le x} \frac{1}{p} = \log\log x + O(1).$
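The estimate (30.15) can be watched numerically: sieving the primes up to $10^6$ and tabulating $\sum_{p\le x} 1/p - \log\log x$ shows a difference that settles near a constant (about 0.26, the Mertens constant), consistent with the O(1).

```python
from math import log

limit = 10 ** 6
sieve = bytearray([1]) * (limit + 1)  # sieve[m] == 1 iff m is prime (after the loop)
sieve[0] = sieve[1] = 0
for p in range(2, int(limit ** 0.5) + 1):
    if sieve[p]:
        sieve[p * p::p] = b"\x00" * ((limit - p * p) // p + 1)

diffs = []
for x in (10 ** 3, 10 ** 4, 10 ** 5, 10 ** 6):
    s = sum(1.0 / p for p in range(2, x + 1) if sieve[p])
    diffs.append(s - log(log(x)))

assert all(0.1 < d < 0.5 for d in diffs)  # bounded: the O(1) in (30.15)
assert abs(diffs[-1] - diffs[0]) < 0.05   # and already nearly constant
```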
The mean and variance of $\sum_{p \le n} X_p$ are $\sum_{p \le n} p^{-1}$ and $\sum_{p \le n} p^{-1}(1 - p^{-1})$; since $\sum_p p^{-2}$ converges, each of these two sums is asymptotically $\log\log n$. Comparing $\sum_{p \le n} \delta_p(m)$ with $\sum_{p \le n} X_p$ then leads one to conjecture the Erdős–Kac central limit theorem for the prime divisor function:

Theorem 30.3. For all x,

(30.16)  $P_n\!\left[m: \dfrac{g(m) - \log\log n}{\sqrt{\log\log n}} \le x\right] \to \dfrac{1}{\sqrt{2\pi}} \displaystyle\int_{-\infty}^{x} e^{-u^2/2}\,du.$
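Before the proof, a numerical illustration (convergence in (30.16) is notoriously slow, but the centering is already visible for moderate n): a sieve computes g(m) for all $m \le n$, and the average of g(m) stays within a bounded distance of $\log\log n$, the difference tending to a Mertens-type constant near 0.26.

```python
from math import log

n = 100000
g = [0] * (n + 1)  # g[m] will be the number of distinct prime factors of m
for p in range(2, n + 1):
    if g[p] == 0:  # p has no smaller prime factor, so p is prime
        for multiple in range(p, n + 1, p):
            g[multiple] += 1

assert g[12] == 2 and g[30] == 3 and g[97] == 1
mean_g = sum(g[2:]) / (n - 1)
assert abs(mean_g - log(log(n))) < 0.5  # mean of g is log log n + O(1)
```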
PROOF. The argument uses the method of moments. The first step is to show that (30.16) is unaffected if the range of p in (30.12) is further restricted. Let $\{\alpha_n\}$ be a sequence going to infinity slowly enough that

(30.17)  $\dfrac{\log \alpha_n}{\log n} \to 0$

but fast enough that

(30.18)  $\displaystyle\sum_{\alpha_n < p \le n} \frac{1}{p} = o\big((\log\log n)^{1/2}\big).$

Because of (30.15), these two requirements are met if, for example, $\log \alpha_n = \log n/\log\log n$. Now define

(30.19)  $g_n(m) = \displaystyle\sum_{p \le \alpha_n} \delta_p(m).$

For a function f of positive integers, let

$$E_n[f] = n^{-1} \sum_{m=1}^{n} f(m)$$

†See, for example, Problem 18.18, or HARDY & WRIGHT, Chapter XXII.
denote its expected value computed with respect to $P_n$. By (30.13) for u = 1, $E_n[\delta_p] \le 1/p$, and hence by (30.18)

$$E_n[g - g_n] = \sum_{\alpha_n < p \le n} E_n[\delta_p] = o\big((\log\log n)^{1/2}\big).$$

By Markov's inequality,

$$P_n\big[m: g(m) - g_n(m) \ge \epsilon(\log\log n)^{1/2}\big] \to 0.$$

Therefore (Theorem 25.4), (30.16) is unaffected if $g_n(m)$ is substituted for g(m).
Now compare (30.19) with the corresponding sum $S_n = \sum_{p \le \alpha_n} X_p$. The mean and variance of $S_n$ are

$$c_n = \sum_{p \le \alpha_n} \frac{1}{p}, \qquad s_n^2 = \sum_{p \le \alpha_n} \frac{1}{p}\Big(1 - \frac{1}{p}\Big),$$

and each is $\log\log n + o(\log\log n)^{1/2}$ by (30.18). Thus (see Example 25.8), (30.16) with g(m) replaced as above is equivalent to

(30.20)  $P_n\!\left[m: \dfrac{g_n(m) - c_n}{s_n} \le x\right] \to \dfrac{1}{\sqrt{2\pi}} \displaystyle\int_{-\infty}^{x} e^{-u^2/2}\,du.$
$\int_{R^k} x_1^{r_1} \cdots x_k^{r_k}\,\nu(dx)$ for all $r_1, \ldots, r_k$, then $\nu$ coincides with $\mu$.
(b) Show that a k-dimensional normal distribution is determined by its moments.

30.6. ↑ Let $\mu_n$ and $\mu$ be probability measures on $R^k$. Suppose that for each i, (30.27) has a positive radius of convergence, and suppose that

$$\int_{R^k} x_1^{r_1} \cdots x_k^{r_k}\,\mu_n(dx) \to \int_{R^k} x_1^{r_1} \cdots x_k^{r_k}\,\mu(dx)$$

for all nonnegative integers $r_1, \ldots, r_k$. Show that $\mu_n \Rightarrow \mu$.
30.7. 30.5 ↑ Suppose that X and Y are bounded random variables and that $X^m$ and $Y^n$ are uncorrelated for m, n = 1, 2, .... Show that X and Y are independent.

30.8. 26.17 30.6 ↑ (a) In the notation (26.32), show for $\lambda \ne 0$ that

(30.28)  $\mathrm{M}\{\cos^r \lambda x\} = \dfrac{1}{2^r}\dbinom{r}{r/2}$

for even r and that the mean is 0 for odd r. It follows by the method of moments that $\cos \lambda x$ has a distribution in the sense of (25.23), and in fact the relative measure is

(30.29)  $\rho[x: \cos \lambda x \le u] = 1 - \dfrac{1}{\pi}\arccos u, \qquad -1 \le u \le 1.$

(b) Suppose that $\lambda_1, \lambda_2, \ldots$ are linearly independent over the field of rationals in the sense that, if $n_1\lambda_1 + \cdots + n_m\lambda_m = 0$ for integers $n_i$, then $n_1 = \cdots = n_m = 0$. Show that

(30.30)  $\mathrm{M}\big\{\cos^{r_1}\lambda_1 x \cdots \cos^{r_k}\lambda_k x\big\} = \displaystyle\prod_{i=1}^{k} \mathrm{M}\{\cos^{r_i}\lambda_i x\}$

for nonnegative integers $r_1, \ldots, r_k$.
(c) Let $X_1, X_2, \ldots$ be independent and have the distribution function on the right in (30.29). Show that

(30.31)  $\rho\Big[x: \displaystyle\sum_{i=1}^{k} \cos \lambda_i x \le u\Big] = P\Big[\displaystyle\sum_{i=1}^{k} X_i \le u\Big].$

(d) Show that

(30.32)  $\displaystyle\lim_{k} \rho\Big[x: u_1 < \sqrt{2/k}\sum_{i=1}^{k} \cos \lambda_i x \le u_2\Big] = \frac{1}{\sqrt{2\pi}} \int_{u_1}^{u_2} e^{-u^2/2}\,du.$

For a signal that is the sum of a large number of pure cosine signals with incommensurable frequencies, (30.32) describes the relative amount of time the signal is between $u_1$ and $u_2$.

30.9. 6.17 ↑ From (30.16), deduce once more the Hardy–Ramanujan theorem (see (6.14)).

30.10. ↑ (a) Prove that (if $P_n$ puts probability 1/n at 1, ..., n)

(30.33)  $\displaystyle\lim_{n} P_n\left[m: \left|\frac{\log\log m - \log\log n}{\sqrt{\log\log n}}\right| > \epsilon\right] = 0.$
(b) From (30.16) deduce that (see (2.21) for the notation)

(30.34)  $\nu\left[m: \dfrac{g(m) - \log\log m}{\sqrt{\log\log m}} \le x\right] = \dfrac{1}{\sqrt{2\pi}} \displaystyle\int_{-\infty}^{x} e^{-u^2/2}\,du.$
30.11. ↑ Let G(m) be the number of prime factors in m with multiplicity counted; in the notation of Problem 5.16, $G(m) = \sum_p \alpha_p(m)$.
(a) Show for $k \ge 1$ that $P_n[m: \alpha_p(m) - \delta_p(m) \ge k] \le 1/p^{k+1}$; hence $E_n[\alpha_p - \delta_p] \le 2/p^2$.
(b) Show that $E_n[G - g]$ is bounded.
(c) Deduce from (30.16) that

$$P_n\left[m: \frac{G(m) - \log\log n}{\sqrt{\log\log n}} \le x\right] \to \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-u^2/2}\,du.$$

(d) Prove for G the analogue of (30.34).

30.12. ↑ Prove the Hardy–Ramanujan theorem in the form

$$\nu\left[m: \left|\frac{g(m)}{\log\log m} - 1\right| \ge \epsilon\right] = 0.$$

Prove this with G in place of g.
CHAPTER 6

Derivatives and Conditional Probability

SECTION 31. DERIVATIVES ON THE LINE*
This section on Lebesgue's theory of derivatives for real functions of a real variable serves to introduce the general theory of Radon–Nikodym derivatives, which underlies the modern theory of conditional probability. The results here are interesting in themselves and will be referred to later for purposes of illustration and comparison, but they will not be required in subsequent proofs.

The Fundamental Theorem of Calculus
To what extent are the operations of integration and differentiation inverse to one another? A function F is by definition an indefinite integral of another function f on [a, b] if

(31.1)  $F(x) - F(a) = \displaystyle\int_a^x f(t)\,dt$

for $a \le x \le b$; F is by definition a primitive of f if it has derivative f:

(31.2)  $F'(x) = f(x)$

for $a \le x \le b$. According to the fundamental theorem of calculus (see
* This section may be omitted.
(17.5)), these concepts coincide in the case of continuous f:

Theorem 31.1. Suppose that f is continuous on [a, b].
(i) An indefinite integral of f is a primitive of f: if (31.1) holds for all x in [a, b], then so does (31.2).
(ii) A primitive of f is an indefinite integral of f: if (31.2) holds for all x in [a, b], then so does (31.1).
A basic problem is to investigate the extent to which this theorem holds if
f is not assumed continuous. First consider part (i). Suppose f is integrable, so that the right side of (31.1) makes sense. If f is 0 for x < m and 1 for x ≥ m (a < m < b), then an F satisfying (31.1) has no derivative at m. It is thus too much to ask that (31.2) hold for all x. On the other hand, according to a famous theorem of Lebesgue, if (31.1) holds for all x, then (31.2) holds almost everywhere, that is, except for x in a set of Lebesgue measure 0. In this section almost everywhere will refer to Lebesgue measure only. This result, the most one could hope for, will be proved below (Theorem 31.3).

Now consider part (ii) of Theorem 31.1. Suppose that (31.2) holds almost everywhere, as in Lebesgue's theorem just stated. Does (31.1) follow? The answer is no: If f is identically 0, and if F(x) is 0 for x < m and 1 for x ≥ m (a < m < b), then (31.2) holds almost everywhere, but (31.1) fails for x ≥ m. The question was wrongly posed, and the trouble is not far to seek: If f is integrable and (31.1) holds, then
(31.3)  $F(x+h) - F(x) = \displaystyle\int_a^b I_{(x,\,x+h]}(t)\,f(t)\,dt \to 0$

as $h \downarrow 0$ by the dominated convergence theorem. Together with a similar argument for $h \uparrow 0$ this shows that F must be continuous. Hence the question becomes this: If F is continuous and f is integrable, and if (31.2) holds almost everywhere, does (31.1) follow? The answer strangely enough is still no: In Example 31.1 there is constructed a continuous, strictly increasing F for which F'(x) = 0 except on a set of Lebesgue measure 0, and (31.1) is of course impossible if f vanishes almost everywhere and F is strictly increasing. This leads to the problem of characterizing those F for which (31.1) does follow if (31.2) holds outside a set of Lebesgue measure 0 and f is integrable. In other words, which functions are the integrals of their (almost everywhere) derivatives? Theorem 31.8 gives the characterization.

It is possible to extend part (ii) of Theorem 31.1 in a different direction. Suppose that (31.2) holds for every x, not just almost everywhere. In Example 17.4 there was
given a function F, everywhere differentiable, whose derivative f is not integrable, and in this case the right side of (31.1) has no meaning. If, however, (31.2) holds for every x, and if f is integrable, then (31.1) does hold for all x. For most purposes of probability theory, it is natural to impose conditions only almost everywhere, and so this theorem will not be proved here.†
The program then is first to show that (31.1) with f integrable implies that (31.2) holds almost everywhere, and second to characterize those F for which the reverse implication is valid. For the most part, f will be nonnegative and F will be nondecreasing. This is the case of greatest interest for probability theory; F can be regarded as a distribution function and f as a density. In Chapters 4 and 5 many distribution functions F were either shown to have a density f with respect to Lebesgue measure or were assumed to have one, but such F's were never intrinsically characterized, as they will be in this section.

Derivatives of Integrals
The first step is to show that a nondecreasing function has a derivative almost everywhere. This requires two preliminary results. Let $\lambda$ denote Lebesgue measure.
Lemma 1. Let A be a bounded linear Borel set, and let $\mathscr{I}$ be a collection of open intervals covering A. Then $\mathscr{I}$ has a finite, disjoint subcollection $I_1, \ldots, I_k$ for which $\sum_{i=1}^{k} \lambda(I_i) \ge \lambda(A)/6$.

PROOF. By regularity (Theorem 12.3) A contains a compact subset K satisfying $\lambda(K) \ge \lambda(A)/2$. Choose in $\mathscr{I}$ a finite subcollection $\mathscr{I}_0$ covering K. Let $I_1$ be an interval in $\mathscr{I}_0$ of maximal length; discard from $\mathscr{I}_0$ the interval $I_1$ and all the others that intersect $I_1$. Among the intervals remaining in $\mathscr{I}_0$, let $I_2$ be one of maximal length; discard $I_2$ and all intervals that intersect it. Continue this way until $\mathscr{I}_0$ is exhausted. The $I_i$ are disjoint. Let $J_i$ be the interval with the same midpoint as $I_i$ and three times the length. If I is an interval in $\mathscr{I}_0$ that is cast out because it meets $I_i$, then $I \subset J_i$. Thus each discarded interval is contained in one of the $J_i$, and so the $J_i$ cover K. Hence $\sum \lambda(I_i) = \sum \lambda(J_i)/3 \ge \lambda(K)/3 \ge \lambda(A)/6$. • If
(31.4)  $\Delta:\ a = a_0 < a_1 < \cdots < a_k = b$

is a partition of [a, b], put $\|F\|_\Delta = \sum_{i=1}^{k} |F(a_i) - F(a_{i-1})|$. Then

†For a proof, see RUDIN, p. 179.
$$\|F\|_\Delta = \sum{}'\big(F(a_{i-1}) - F(a_i)\big) + \Big|\sum{}''\big(F(a_i) - F(a_{i-1})\big)\Big| = \sum{}'\big(F(a_{i-1}) - F(a_i)\big) + \Big|\big(F(b) - F(a)\big) + \sum{}'\big(F(a_{i-1}) - F(a_i)\big)\Big|,$$

where $\sum'$ extends over the i with $F(a_{i-1}) > F(a_i)$ and $\sum''$ over the rest. As all the differences in this last expression are nonnegative, the absolute value bars can be suppressed; therefore,

$$\|F\|_\Delta \ge F(b) - F(a) + 2\sum{}'\big(F(a_{i-1}) - F(a_i)\big) \ge F(b) - F(a) + 2\theta\sum{}'(a_i - a_{i-1}).$$

A function F has at each x four derivates, the upper and lower right derivates

$$DF(x) = \limsup_{h \downarrow 0} \frac{F(x+h) - F(x)}{h}, \qquad \underline{D}F(x) = \liminf_{h \downarrow 0} \frac{F(x+h) - F(x)}{h},$$

and the upper and lower left derivates

$$FD(x) = \limsup_{h \downarrow 0} \frac{F(x) - F(x-h)}{h}, \qquad F\underline{D}(x) = \liminf_{h \downarrow 0} \frac{F(x) - F(x-h)}{h}.$$
There is a derivative at x if and only if these four quantities have a common value. Suppose that F has derivative F '(x) at x. If u < x < v, then
F( v ) - F ( u ) v-u
_
F '( x )
F F v x v ) x ( ( ) s -v-u v-x
_
F F x u) ) u x ( ( F' ( x) + v - u x-u
_
F' ( x ) .
Therefore,
(31 .8)
F( v) - F( u ) v-u
-+
F' ( x )
u f x and v � x; that is to say, for each £ there is a 8 such that u < x < v and 0 < v - u < 8 together imply that the quantities on either side of the
as
arrow differ by less than £. Suppose that F is measurable and that it is continuous except possibly at countably many points. This will be true if F is nondecreasing or is the
424
DERIVATIVES AND CONDITIONAL PROBABILITY
difference of two nondecreasing functions. Let M be a countable, dense set containing all the discontinuity points of F; let $r_n(x)$ be the smallest number of the form k/n exceeding x. Then each derivate can be computed through the countably many difference ratios $(F(y) - F(x))/(y - x)$ for y in M, and the four derivates are therefore measurable.

If instead of $G(a) < G(b)$ the reverse inequality holds, choose $a_x$ and $b_x$ so that the ratio in (31.11) exceeds $\theta$, which is possible because $DG(x) > \theta$ for $x \in C_\theta$. Again (31.12) follows.

In each interval [a, b] there is thus a partition (31.4) satisfying (31.12). Apply this to each interval $[a_{i-1}, a_i]$ in the partition. This gives a partition $\Delta_1$ that refines $\Delta$, and adding the corresponding inequalities (31.12) leads to

$$\|G\|_{\Delta_1} \ge \|G\|_{\Delta} + \tfrac{1}{3}\theta\lambda\big((a, b) \cap C_\theta\big).$$

Continuing leads to a sequence of successively finer partitions $\Delta_n$ such that

(31.13)  $\|G\|_{\Delta_n} \ge \|G\|_{\Delta} + \dfrac{n}{3}\,\theta\lambda\big((a, b) \cap C_\theta\big).$

Now $\|G\|_{\Delta}$ is bounded by $|F(b) - F(a)| + \tfrac{1}{2}|\alpha + \beta|(b - a)$ because F is monotonic. Thus (31.13) is impossible unless $\lambda((a, b) \cap C_\theta) = 0$. Since (a, b) can be any interval, $\lambda(C_\theta) = 0$. This proves (31.10) and establishes the differentiability of F almost everywhere.

It remains to prove (31.9). Let

(31.14)  $f_n(x) = \dfrac{F(x + n^{-1}) - F(x)}{n^{-1}}.$
Now $f_n$ is nonnegative, and by what has been shown, $f_n(x) \to F'(x)$ except on a set of Lebesgue measure 0. By Fatou's lemma and the fact that F is nondecreasing,

$$\int_a^b F'(x)\,dx \le \liminf_n \int_a^b f_n(x)\,dx = \liminf_n\left[\, n\int_b^{b+n^{-1}} F(x)\,dx - n\int_a^{a+n^{-1}} F(x)\,dx \right] \le F(b) - F(a).$$

$F'(x) \ge f(x)$ almost everywhere, and so $\int_a^b F'(x)\,dx \ge \int_a^b f(x)\,dx = F(b) - F(a)$. But the reverse inequality is a consequence of (31.9). Therefore, $\int_a^b (F'(x) - f(x))\,dx = 0$, and as before $F' = f$ except on a set of Lebesgue measure 0. Since a and b were arbitrary, the proof is complete. •
Singular Functions
If f(x) is nonnegative and integrable, differentiating its indefinite integral $\int_{-\infty}^{x} f(t)\,dt$ leads back to f(x) except perhaps on a set of Lebesgue measure 0. That is the content of Theorem 31.3. The converse question is this: If F(x) is nondecreasing and hence has almost everywhere a derivative F'(x), does integrating F'(x) lead back to F(x)? As stated before, the answer turns out to be no even if F(x) is assumed continuous:

Example 31.1. Let $X_1, X_2, \ldots$ be independent, identically distributed random variables such that $P[X_n = 0] = p_0$ and $P[X_n = 1] = p_1 = 1 - p_0$, and let $X = \sum_{n=1}^{\infty} X_n 2^{-n}$. Let $F(x) = P[X \le x]$ be the distribution function of X. For an arbitrary sequence $u_1, u_2, \ldots$ of 0's and 1's, $P[X_n = u_n,\ n = 1, 2, \ldots] = \lim_n p_{u_1} \cdots p_{u_n} = 0$; since x can have at most two dyadic expansions $x = \sum_n u_n 2^{-n}$, $P[X = x] = 0$. Thus F is everywhere continuous. Of course, F(0) = 0 and F(1) = 1.

For $0 \le k < 2^n$, $k2^{-n}$ has the form $\sum_{i=1}^{n} u_i 2^{-i}$ for some n-tuple $(u_1, \ldots, u_n)$ of 0's and 1's. Since F is continuous,

(31.15)  $F\big((k+1)2^{-n}\big) - F\big(k2^{-n}\big) = P[X_i = u_i,\ i \le n] = p_{u_1} \cdots p_{u_n} > 0.$

This shows that F is strictly increasing over the unit interval. If $p_0 = p_1 = \frac{1}{2}$, the right side of (31.15) is $2^{-n}$, and a passage to the limit shows that F(x) = x for $0 \le x \le 1$. Assume, however, that $p_0 \ne p_1$. It will be shown that F'(x) = 0 except on a set of Lebesgue measure 0 in this case.

Obviously the derivative is 0 outside the unit interval, and by Theorem 31.2 it exists almost everywhere inside it. Suppose then that 0 < x < 1 and that F has a derivative F'(x) at x. It will be shown that F'(x) = 0. For each n choose $k_n$ so that x lies in the interval $I_n = (k_n 2^{-n}, (k_n + 1)2^{-n}]$; $I_n$ is that dyadic interval of rank n that contains x. By (31.8),
[Figure: Graph of F(x) for $p_0 = .25$, $p_1 = .75$. Because of the recursion (31.17), the part of the graph over [0, .5] and the part over [.5, 1] are identical, apart from changes in scale, with the whole graph. Each segment of the curve therefore contains scaled copies of the whole; the extreme irregularity which this implies is obscured by the fact that the accuracy is only to within the width of the printed line.]
$$\frac{P[X \in I_n]}{2^{-n}} \to F'(x).$$

If F'(x) is distinct from 0, the ratio of two successive terms here must go to 1, so that

(31.16)  $\dfrac{P[X \in I_{n+1}]}{P[X \in I_n]} \to \dfrac{1}{2}.$

If $I_n$ consists of the reals with nonterminating base-2 expansions beginning with the digits $u_1, \ldots, u_n$, then $P[X \in I_n] = p_{u_1} \cdots p_{u_n}$ by (31.15). But $I_{n+1}$ must for some $u_{n+1}$ consist of the reals beginning $u_1, \ldots, u_n, u_{n+1}$ ($u_{n+1}$ is 1 or 0 according as x lies to the right of the midpoint of $I_n$ or not). Thus $P[X \in I_{n+1}]/P[X \in I_n] = p_{u_{n+1}}$ is either $p_0$ or $p_1$, and (31.16) is possible only if $p_0 = p_1$, which was excluded by hypothesis.
Thus F is continuous and strictly increasing over [0, 1], but F'(x) = 0 except on a set of Lebesgue measure 0. For $0 \le x \le \frac{1}{2}$, independence gives $F(x) = P[X_1 = 0,\ \sum_{n=2}^{\infty} X_n 2^{-n+1} \le 2x] = p_0 F(2x)$. Similarly, $F(x) - p_0 = p_1 F(2x - 1)$ for $\frac{1}{2} \le x \le 1$. Thus

(31.17)  $F(x) = \begin{cases} p_0 F(2x) & \text{if } 0 \le x \le \frac{1}{2}, \\ p_0 + p_1 F(2x - 1) & \text{if } \frac{1}{2} \le x \le 1. \end{cases}$
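The recursion (31.17) also computes F numerically: each application halves the interval containing x, so the value at any dyadic rational is obtained exactly in finitely many steps, and elsewhere the truncation error after d steps is at most $\max(p_0, p_1)^d$. A sketch with $p_0 = .25$, as in the figure:

```python
def F(x, p0=0.25, depth=40):
    # evaluate the singular distribution function via the recursion (31.17);
    # truncating at `depth` costs at most max(p0, 1 - p0) ** depth
    p1 = 1.0 - p0
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    if depth == 0:
        return 0.0
    if x <= 0.5:
        return p0 * F(2.0 * x, p0, depth - 1)
    return p0 + p1 * F(2.0 * x - 1.0, p0, depth - 1)

assert F(0.5) == 0.25                 # F(1/2) = p0
assert abs(F(0.25) - 0.0625) < 1e-12  # F(1/4) = p0^2
vals = [F(k / 1024) for k in range(1025)]
assert all(a < b for a, b in zip(vals, vals[1:]))  # strictly increasing
```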
In Section 7, F(x) (there denoted Q(x)) entered as the probability of success at bold play; see (7.30) and (7.33). •

A function is singular if it has derivative 0 except on a set of Lebesgue measure 0. Of course, a step function constant over intervals is singular. What is remarkable (indeed, singular) about the function in the preceding example is that it is continuous and strictly increasing but nonetheless has derivative 0 except on a set of Lebesgue measure 0. Note that there is strict inequality in (31.9) for this F.

Further properties of nondecreasing functions can be discovered through a study of the measures they generate. Assume from now on that F is nondecreasing, that F is continuous from the right (this is only a normalization), and that $0 = \lim_{x \to -\infty} F(x) \le \lim_{x \to +\infty} F(x) = m < \infty$. Call such an F a distribution function, even though m need not be 1. By Theorem 12.4 there exists a unique measure $\mu$ on the Borel sets of the line for which
(31.18)  $\mu(a, b] = F(b) - F(a).$

Of course, $\mu(R^1) = m$ is finite. The larger F' is, the larger $\mu$ is:
Theorem 31.4. Suppose that F and $\mu$ are related by (31.18) and that F'(x) exists throughout a Borel set A.
(i) If $F'(x) \le \alpha$ for $x \in A$, then $\mu(A) \le \alpha\lambda(A)$.
(ii) If $F'(x) \ge \alpha$ for $x \in A$, then $\mu(A) \ge \alpha\lambda(A)$.

PROOF. It is no restriction to assume A bounded. Fix $\epsilon$ for the moment. Let E be a countable, dense set, and let $A_n$ be the intersection of the sets $A \cap I$, where the intersection extends over the intervals $I = (u, v]$ for which $u, v \in E$, $0 < \lambda(I) < n^{-1}$, and
By compactness, K has a finite subcover $I_{x_1}, \ldots, I_{x_l}$. If some three of these have a nonempty intersection, one of them must be contained in the union of the other two. Such superfluous intervals can be removed from the subcover, and it is therefore possible to assume that no point of K lies in more than two of the $I_{x_i}$. But then

$$\mu(K) \le \mu\Big(\bigcup_i I_{x_i}\Big) \le \sum_i \mu(I_{x_i}) \le n\sum_i \lambda(I_{x_i});$$

since $\epsilon$ was arbitrary, $\lambda(K) = 0$, as required. •
Example 31.4. Restrict the F of Examples 31.1 and 31.3 to (0, 1), and let g be the inverse. Thus F and g are continuous, strictly increasing mappings of (0, 1) onto itself. If $A = [x \in (0, 1): F'(x) = 0]$, then $\lambda(A) = 1$, as shown in the examples, while $\mu(A) = 0$.

Let H be a set in (0, 1) that is not a Lebesgue set. Since H − A is contained in a set of Lebesgue measure 0, it is a Lebesgue set; hence $H_0 = H \cap A$ is not a Lebesgue set, since otherwise $H = H_0 \cup (H - A)$ would be a Lebesgue set. If B = (0, x], then $\lambda g^{-1}(B) = \lambda(0, F(x)] = F(x) = \mu(B)$, and it follows that $\lambda g^{-1}(B) = \mu(B)$ for all Borel sets B. Since $g^{-1}H_0$ is a subset of $g^{-1}A$ and $\lambda(g^{-1}A) = \mu(A) = 0$, $g^{-1}H_0$ is a Lebesgue set. On the other hand, if $g^{-1}H_0$ were a Borel set, $H_0 = F^{-1}(g^{-1}H_0)$ would also be a Borel set. Thus $g^{-1}H_0$ provides an example of a Lebesgue set that is not a Borel set.† •

†See Problem 31.8.
Integrals of Derivatives
Return now to the problem of extending part (ii) of Theorem 31.1, to the problem of characterizing those distribution functions F for which F' integrates back to F:

(31.25)  $F(x) = \displaystyle\int_{-\infty}^{x} F'(t)\,dt.$

The first step is easy: If (31.25) holds, then F has the form

(31.26)  $F(x) = \displaystyle\int_{-\infty}^{x} f(t)\,dt$

for a nonnegative, integrable f (a density), namely f = F'. On the other hand, (31.26) implies by Theorem 31.3 that F' = f outside a set of Lebesgue measure 0, whence (31.25) follows. Thus (31.25) holds if and only if F has the form (31.26) for some f, and the problem is to characterize functions of this form. The function of Example 31.1 is not among them.

As observed earlier (see (31.3)), an F of the form (31.26) with f integrable is continuous. It has a still stronger property: For each $\epsilon$ there exists a $\delta$ such that

(31.27)  $\displaystyle\int_A f(x)\,dx < \epsilon \quad \text{if } \lambda(A) < \delta.$

Indeed, if $A_n = [x: f(x) > n]$, then $A_n \downarrow \varnothing$, and since f is integrable, the dominated convergence theorem implies that $\int_{A_n} f(x)\,dx < \epsilon/2$ for large n. Fix such an n and take $\delta = \epsilon/2n$. If $\lambda(A) < \delta$, then $\int_A f(x)\,dx \le \int_{A - A_n} f(x)\,dx + \int_{A_n} f(x)\,dx \le n\lambda(A) + \epsilon/2 < \epsilon$.

If F is given by (31.26), then $F(b) - F(a) = \int_a^b f(x)\,dx$, and (31.27) has this consequence: For every $\epsilon$ there exists a $\delta$ such that for each finite collection $[a_i, b_i]$, $i = 1, \ldots, k$, of nonoverlapping* intervals,

(31.28)  $\displaystyle\sum_{i=1}^{k} |F(b_i) - F(a_i)| < \epsilon \quad \text{if } \sum_{i=1}^{k} (b_i - a_i) < \delta.$
by (14.5) and prove that it is continuous and singular.

31.5. Suppose that $\mu$ and F are related by (31.18). If F is not absolutely continuous, then $\mu(A) > 0$ for some set A of Lebesgue measure 0. It is an interesting fact, however, that almost all translates of A must have $\mu$-measure 0: from Fubini's theorem and the fact that $\lambda$ is invariant under translation and reflection through 0, show that, if $\lambda(A) = 0$ and $\mu$ is $\sigma$-finite, then $\mu(A + x) = 0$ for x outside a set of Lebesgue measure 0.

31.6. 17.2 ↑ Show that F is absolutely continuous if and only if for each Borel set A, $\mu(A + x)$ is continuous in x.

31.7. Let $F_*(x) = \lim_{\delta \to 0} \inf (F(v) - F(u))/(v - u)$, where the infimum extends over u and v such that $u \le x \le v$ and $0 < v - u < \delta$. Define $F^*(x)$ as this limit with the infimum replaced by the supremum. Show that in Theorem 31.4, F' can be replaced by $F^*$ in part (i) and by $F_*$ in part (ii).

31.8. ↑ Show that in Theorem 31.6, FD can be replaced by $F^*$ (note that $F_*(x) \le FD(x)$).

31.9. Lebesgue's density theorem. A point x is a density point of a Borel set A if $\lambda((u, v] \cap A)/(v - u) \to 1$ as $u \uparrow x$ and $v \downarrow x$. From Theorems 31.2 and 31.4 deduce that almost all points of A are density points, that is, that the
438
31.10. 31.11.
31.12.
31. 13.
DERIVATIVES AND CONDITIONAL PROBAB ILITY
points of A that are not density points form a set of Lebesgue measure 0. Similarly, A (( u, v ] n A)/( v - u ) --+ 0 almost everywhere on Ac. 19.3 f Let f: [a, b ] --+ R * be an arc; /( t) = (/1 ( t ), . . . , /" ( t)). Show that the arc is rectifiable if and only if each /; is of bounded variation over [ a, b ]. F(O) = 0, f Suppose that F is continuous and nondecreasing and that 2 F(1) = 1. Then f(x) = (x, F(x)) defines an arcf: [0, 1] --+ R • It is easy to see by monotonicity that the arc is rectifiable and that, in fact, its length satisfies L( / ) < 2. It is also easy, given £, to produce functions F for which L( / ) > 2 - £. Show by the arguments in the proof of Theorem 31.4 tha t L( / ) = 2 if F is singular. Suppose that the characteristic function of F satisfies lim sup, _ lcp( t) 1 = 1. Show that F is singular. Compare the lattice case (Problem 26.1). Hint: Use the Lebesgue decomposition and the Riemann- Lebesgue theorem. 26.22 f Suppose that X1 , X2 , • • • are independent andn assume the values + 1 with probability t each, and let X = �_ 1 Xn/2 . Show that X is uniformly distributed over [ - 1, + 1]. Calculate the characteristic functions of X and Xn and deduce (1.36). Conversely, establish (1.36) by trigonome try and conclude that X is uniformly distributed over [ - 1, + 1]. oo
31.14. (a) Suppose that X_1, X_2, … are independent and assume the values 0 and 1 with probability ½ each. Let F and G be the distribution functions of ∑_{n=1}^∞ X_{2n−1}/2^{2n−1} and ∑_{n=1}^∞ X_{2n}/2^{2n}. Show that F and G are singular but that F * G is absolutely continuous.
(b) Show that the convolution of an absolutely continuous distribution function with an arbitrary distribution function is absolutely continuous.

31.15. 31.2 ↑ Show that the Cantor function is the distribution function of ∑_{n=1}^∞ X_n/3^n, where the X_n are independent and assume the values 0 and 2 with probability ½ each. Express its characteristic function as an infinite product.

31.16. Show for the F of Example 31.1 that F^D(1) = ∞ and F_D(0) = 0 if p_0 < ½. From (31.17) deduce that F^D(x) = ∞ and F_D(x) = 0 for all dyadic rationals x. Analyze the case p_0 > ½ and sketch the graph.

31.17. 6.14 ↑ Let F be as in Example 31.1, and let μ be the corresponding probability measure on the unit interval. Let d_n(x) be the nth digit in the nonterminating binary expansion of x, and let s_n(x) = ∑_{k=1}^n d_k(x). If I_n(x) is the dyadic interval of order n containing x, then

(31.33)  −(1/n) log μ(I_n(x)) = −((n − s_n(x))/n) log p_0 − (s_n(x)/n) log p_1.

(a) Show that (31.33) converges on a set of μ-measure 1 to the entropy h = −p_0 log p_0 − p_1 log p_1. From the fact that this entropy is less than log 2 if p_0 ≠ ½, deduce that on a set of μ-measure 1, F does not have a finite derivative if p_0 ≠ ½.

SECTION 31. DERIVATIVES ON THE LINE

(b) Show that (31.33) converges to −½ log p_0 − ½ log p_1 on a set of Lebesgue measure 1. If p_0 ≠ ½ this limit exceeds log 2 (arithmetic versus geometric means), and so μ(I_n(x))/2^{−n} → 0 except on a set of Lebesgue measure 0. This does not prove that F′(x) exists almost everywhere, but it does show that, except for x in a set of Lebesgue measure 0, if F′(x) does exist, then it is 0.

(c) Show that, if (31.33) converges to l, then

(31.34)  μ(I_n(x))/λ(I_n(x)) → ∞ or 0 according as l < log 2 or l > log 2.
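The two limits in Problem 31.17 can be checked by simulation. The sketch below draws binary digits under μ (digit 1 with probability p_1) and under Lebesgue measure (fair digits) and averages the summands of (31.33); the parameter p_0 = 0.3, the seed, and the sample size are arbitrary choices for the illustration.

```python
# Simulation sketch for Problem 31.17: under mu, (31.33) tends to the
# entropy -p0*log(p0) - p1*log(p1); under Lebesgue measure it tends to
# -(log(p0) + log(p1))/2.
import math
import random

p0 = 0.3
p1 = 1.0 - p0
n = 200_000
rng = random.Random(7)

def avg_31_33(prob_one):
    """Average of -log p_{d_k} over n simulated binary digits."""
    s = sum(1 for _ in range(n) if rng.random() < prob_one)  # count of 1s
    return -((n - s) / n) * math.log(p0) - (s / n) * math.log(p1)

h_mu = avg_31_33(p1)       # digits drawn from mu
h_leb = avg_31_33(0.5)     # digits drawn from Lebesgue measure
print(h_mu, -p0 * math.log(p0) - p1 * math.log(p1))
print(h_leb, -0.5 * (math.log(p0) + math.log(p1)))
```

Both empirical averages land within Monte Carlo error of the predicted limits, and the second exceeds log 2 ≈ 0.693, as (b) requires when p_0 ≠ ½.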
31.18.
31.19.
31.20.
31.21.
31.22.
SECTION 32. THE RADON-NIKODYM THEOREM

Theorem 32.1. There exist sets A⁺ and A⁻ such that A⁺ ∪ A⁻ = Ω, A⁺ ∩ A⁻ = ∅, φ(E) ≥ 0 for all E in ℱ with E ⊂ A⁺, and φ(E) ≤ 0 for all E in ℱ with E ⊂ A⁻.

A set A is positive if φ(E) ≥ 0 for E ⊂ A and negative if φ(E) ≤ 0 for E ⊂ A. The A⁺ and A⁻ in the theorem decompose Ω into a positive and a negative set. This is the Hahn decomposition. If φ(A) = ∫_A f dμ (see Example 32.1), the result is easy: take A⁺ = [f ≥ 0] and A⁻ = [f < 0].

PROOF. Let α = sup[φ(A): A ∈ ℱ]. Suppose that there exists a set A⁺ with φ(A⁺) = α (which implies that α is finite). Let A⁻ = Ω − A⁺. If A ⊂ A⁺ and φ(A) < 0, then φ(A⁺ − A) > α, an impossibility; hence A⁺ is a positive set. If A ⊂ A⁻ and φ(A) > 0, then φ(A⁺ ∪ A) > α, an impossibility; hence A⁻ is a negative set.

It is therefore only necessary to construct a set A⁺ for which φ(A⁺) = α. Choose sets A_n such that φ(A_n) → α, and let A = ∪_n A_n. For each n consider the 2^n sets B_{ni} (some perhaps empty) that are intersections of the form ∩_{k=1}^n A_k′, where each A_k′ is either A_k or else A − A_k. The collection 𝒟_n = [B_{ni}: 1 ≤ i ≤ 2^n] of these sets partitions A. Clearly, 𝒟_n refines 𝒟_{n−1}: each B_{ni} is contained in exactly one of the B_{n−1,j}. Let C_n be the union of those B_{ni} in 𝒟_n for which φ(B_{ni}) > 0. Since A_n is the union of certain of the B_{ni}, φ(A_n) ≤ φ(C_n). Since the partitions 𝒟_1, 𝒟_2, … are successively finer, m ≤ n implies that (C_m ∪ ⋯ ∪ C_{n−1} ∪ C_n) − (C_m ∪ ⋯ ∪ C_{n−1}) is the union (perhaps empty) of certain of the sets B_{ni}; the B_{ni} in this union must satisfy φ(B_{ni}) > 0 because they are contained in C_n. Therefore, φ(C_m ∪ ⋯ ∪ C_{n−1}) ≤ φ(C_m ∪ ⋯ ∪ C_n), so that by induction φ(A_m) ≤ φ(C_m) ≤ φ(C_m ∪ ⋯ ∪ C_n). If D_m = ∪_{n=m}^∞ C_n, then by Lemma 1 (take E_v = C_m ∪ ⋯ ∪ C_{m+v}) φ(A_m) ≤ φ(D_m). Let A⁺ = ∩_{m=1}^∞ D_m (note that A⁺ = lim sup_n C_n), so that D_m ↓ A⁺. By Lemma 1, α = lim_m φ(A_m) ≤ lim_m φ(D_m) = φ(A⁺). Thus A⁺ does have maximal φ-value. ∎

If φ⁺(A) = φ(A ∩ A⁺) and φ⁻(A) = −φ(A ∩ A⁻), then φ⁺ and φ⁻ are finite measures. Thus

(32.2)  φ(A) = φ⁺(A) − φ⁻(A)

represents the set function φ as the difference of two finite measures having disjoint supports. If E ⊂ A, then φ(E) ≤ φ⁺(E) ≤ φ⁺(A), and there is equality if E = A ∩ A⁺. Therefore, φ⁺(A) = sup_{E⊂A} φ(E). Similarly, φ⁻(A) = −inf_{E⊂A} φ(E). The measures φ⁺ and φ⁻ are called the upper and lower variations of φ, and the measure |φ| with value φ⁺(A) + φ⁻(A) at A is called the total variation. The representation (32.2) is the Jordan decomposition.
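On a finite space the Hahn and Jordan decompositions can be computed directly, since φ is determined by point masses. The sketch below is a made-up illustration; the masses and the set E are arbitrary.

```python
# Finite-space sketch of the Hahn and Jordan decompositions: A+ and A-
# are simply the points of nonnegative and negative mass.
mass = {"a": 2.0, "b": -1.5, "c": 0.5, "d": -0.25}   # made-up point masses

def phi(E):
    return sum(mass[x] for x in E)

A_plus = {x for x in mass if mass[x] >= 0}   # the positive set
A_minus = set(mass) - A_plus                 # the negative set

def phi_plus(E):    # upper variation: phi restricted to A+
    return phi(set(E) & A_plus)

def phi_minus(E):   # lower variation: -phi restricted to A-
    return -phi(set(E) & A_minus)

E = {"a", "b", "d"}
# Jordan decomposition (32.2): phi = phi+ - phi-.
print(phi(E), phi_plus(E) - phi_minus(E))
# Total variation |phi|(Omega):
print(phi_plus(mass) + phi_minus(mass))
```

Every subset of A_plus has nonnegative φ-value and every subset of A_minus nonpositive φ-value, which is exactly the content of Theorem 32.1 here.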
Absolute Continuity and Singularity
Measures μ and ν on (Ω, ℱ) are by definition mutually singular if they have disjoint supports, that is, if there exist sets S_μ and S_ν such that

(32.3)  μ(Ω − S_μ) = 0,  ν(Ω − S_ν) = 0,  S_μ ∩ S_ν = ∅.
In this case μ is also said to be singular with respect to ν and ν singular with respect to μ. Singularity is sometimes indicated by μ ⊥ ν. Note that measures are automatically singular if one of them is identically 0.

According to Theorem 31.5, a finite measure on R¹ with distribution function F is in the sense of (32.3) singular with respect to Lebesgue measure if and only if F′(x) = 0 except on a set of Lebesgue measure 0. In Section 31 the latter condition was taken as the definition of singularity, but of course it is the requirement of disjoint supports that can be generalized from R¹ to an arbitrary Ω.

The measure ν is absolutely continuous with respect to μ if for each A in ℱ, μ(A) = 0 implies ν(A) = 0. In this case ν is also said to be dominated by μ, and the relation is indicated by ν ≪ μ. For finite ν, absolute continuity is equivalent to an ε-δ condition:

(32.4)  For each ε there exists a δ such that ν(A) < ε whenever μ(A) < δ.

Indeed, (32.4) implies ν ≪ μ directly. If, on the other hand, (32.4) fails, there are an ε and sets A_n with ν(A_n) ≥ ε and μ(A_n) < 2^{−n}; if A = lim sup_n A_n, then μ(A) = 0 by the first Borel–Cantelli lemma, while ν(A) ≥ ε > 0 by the right-hand inequality in (4.10) (which applies because ν is finite). Hence ν ≪ μ fails, and so (32.4) follows if ν is finite and ν ≪ μ.

The Radon-Nikodym theorem says that, under σ-finiteness, domination is equivalent to the existence of a density:

Theorem 32.2. If μ and ν are σ-finite measures such that ν ≪ μ, then there exists a nonnegative f, measurable ℱ, such that ν(E) = ∫_E f dμ for all E in ℱ.

Lemma 2. If μ and ν are finite measures and are not mutually singular, then there exist a set A and a positive ε such that μ(A) > 0 and εμ(E) ≤ ν(E) for all E ⊂ A.

PROOF. Let A_n⁺ ∪ A_n⁻ be a Hahn decomposition for the set function ν − n^{−1}μ; put M = ∪_n A_n⁺, so that M^c = ∩_n A_n⁻. Since M^c is in the negative set A_n⁻ for ν − n^{−1}μ, ν(M^c) ≤ n^{−1}μ(M^c); since this holds for all n, ν(M^c) = 0. Thus M supports ν, and from the fact that μ and ν are not mutually singular it follows that M^c cannot support μ, that is, that μ(M) must be positive. Therefore, μ(A_n⁺) > 0 for some n. Take A = A_n⁺ and ε = n^{−1}. ∎

Example 32.2. Suppose that (Ω, ℱ) = (R¹, ℛ¹), μ is Lebesgue measure λ, and ν(a, b] = F(b) − F(a). If ν and λ do not have disjoint supports, then by Theorem 31.5, λ[x: F′(x) > 0] > 0, and hence for some ε the set A = [x: F′(x) > ε] satisfies λ(A) > 0. If E = (a, b] is a sufficiently small interval about an x in A, then ν(E)/λ(E) = (F(b) − F(a))/(b − a) > ε, which is the same thing as ελ(E) < ν(E). ∎

Thus Lemma 2 ties in with derivatives and quotients ν(E)/μ(E) for "small" sets E. Martingale theory links Radon-Nikodym derivatives with such quotients; see Theorem 35.8 and Example 35.10.

PROOF OF THEOREM 32.2. Suppose that μ and ν are finite measures satisfying ν ≪ μ. Let 𝒢 be the class of nonnegative functions g such that ∫_E g dμ ≤ ν(E) for all E. If g and g′ lie in 𝒢, then max(g, g′) also lies in 𝒢
because

∫_E max(g, g′) dμ = ∫_{E∩[g≥g′]} g dμ + ∫_{E∩[g′>g]} g′ dμ ≤ ν(E∩[g≥g′]) + ν(E∩[g′>g]) = ν(E).
f§ is closed under the formation of finite maxima. Suppose that functions gn lie in f§ and gn i g . Then fEg dp. = limn fEgn dP. < JI( E ) by the monotone convergence theorem, so that g lies in f§. Thus f§ is closed under Thus
nondecreasing passages to the limit. Let a = sup f g dp. for g ranging over f§ (a < JI ( D )). Choose gn in f§ so 1 that f gn dp. > a - n - . If In = max(g 1 , . . . , gn ) and I = limln , then I lies in t6 and f Idp. = lim n f In dp. > limn / gn dp. = a . Thus I is an element of f§ for which f Idp. is maximal. Define "ac by "ac (E) = fEidp. and "s by "s (E) = J1 (E) - "uc (E) . Thus
v ( E ) = �'ac( E ) + vs ( E ) = f. Jdp. + vs ( E ) .
(32.8)
E
Since I is in f§, "s as well as "a c is a finite measure. Of course, "ac is absolutely continuous with respect to p.. Suppose that "s fails to be singular with respect to p.. It then follows from Lemma 2 that there is a set A and a positive £ such that p.(A) > 0 and €I'(E) s "s( E) for all E c A. Then for every E
L(/ +
dA
) dp. = /jdp. + t p. ( E () A ) < /jdp. + �'s ( E () A ) = f. fdp. + vs( E n A ) + f. f dp. En A
E- A
= v ( E n A ) + f. fdp. < v ( E n A ) + v ( E - A ) E- A
= "( E ).
446
DERIVATIVES AND CONDITIONAL PROBABILITY
In other words, f + f.(� lies in rl; since f (/ + f./A ) dp. = a + f. P, ( A ) > a , this contradicts the maximality of f. Therefore, p. and "s are mutually singular, and there exists an S such tha t vs ( S ) = p. ( S c ) = 0. But since " (except the empty set), that there is no Hahn decomposition, and that cp does not have bounded range. SECI'ION 33. CONDmONAL PROBABILITY
The concepts of conditional probability and expected value with respect to a a-field underlie much of modem probability theory. The difficulty in under standing these ideas has to do not with mathematical detail so much as with probabilistic meaning, and the way to get at this meaning is through calculations and examples, of which there are many in this section and the next. The Discrete Case
Consider first the conditional probability of a set A with respect to another set B. It is defined of course by P( A I B) = P( A n B)/P( B), unless P( B) vanishes, in which case it is not defined at all. It is helpful to consider conditional probability in terms of an observer in possession of partial information. t A probability space ( D, �, P) describes the working of a mechanism, governed by chance, which produces a result w distributed according to P; P( A) is for the observer the probability that the point w produced lies in A. Suppose now that w lies in B and that the observer learns this fact and no more. From the point of view of the observer, now in possession of this partial information about w, the probability that w also lies in A is P(A IB) rather than P( A). This is the idea lying back of the definition. If, on the other hand, w happens to lie in B e and the observer learns of this, his probability instead becomes P( A I Be). These two conditional probabilities can be linked together by the simple function ( 33 . 1 )
{ P( A IB ) /( ) = P( A I B") "'
The observer learns whether w lies in B or in Be; his new probability for the event w E A is then just w ). Although the observer does not in
/(
tAs always, obseroer , information , know , and so on are informal, nonmathematical terms; see the related discussion in Section 4 (p. 52).
SECTION
33.
449
CONDITIONAL PROBABILITY
general know the argument w of f, he can calculate the value /( w ) because he knows which of B and B e contains w. (Note conversely that from the value /( w ) it is possible to determine whether w lies in B or in Be, unless P(A I B) = P( A I B e)-that is, unless A and B are independent, in which case the conditional probability coincides with the unconditional one any way.) The sets B and Be partition D, and these ideas carry over to the general be a finite or countable partition of D into jli:sets, partition. Let B 1 , B2 , and let f§ consist of all the unions of the B;. Then f§ is the a-field generated by the B;. For A in �, consider the function with values •
•
•
P ( A n B; ) ( 3 3 . 2 ) /( "' ) = P ( A I B; ) = P( B; )
i = 1 , 2, . . .
if "' e B; ,
.
If the observer learns which element B; of the partition it is that contains w, then his new probability for the event w E A is /( w ). The partition { B; }, or equivalently the a-field f§, can be regarded as an experiment, and to learn which B; it is that contains w is to learn the outcome of the experiment. For this reason the function or random variable f defined by (33.2) is called the conditional probability of A given f§ and is denoted P[AIIf§]. This is written P[AIIf§]"' whenever the argument w needs to be explicitly shown. Thus P[AIIf§] is the function whose value on B; is the ordinary condi tional probability P(A I B;) · This definition needs to be completed, because P(A IB; ) is not defined if P(B;) = 0. In this case P[AIIf§] will be taken to have any constant value on B;; the value is arbitrary but must be the same over all of the set B;. If there are nonempty sets B; for which P( B;) = 0, P (AIIf§] therefore stands for any one of a family of functions on D. A specific such function is for emphasis often called a version of the condi tional probability. Note that any two versions are equal except on a set of probability 0. Consider the Poisson process. Suppose that 0 < s < t, and let A = [Ns = 0] and B; = [N1 = i], i = 0, 1, . . . Since the increments are independent (Section 23), P( A I B;) = P[Ns = O]P[N, - Ns = i]/P[N, = i], and since they have Poisson distributions (see (23.9)), a simple calculation reduces this to Example 33. 1.
.
(33.3 ) P [ N. = 0 11� )., = ( 1 - � Since i =
(33.4 )
N1 ( w ) on
B;,
r
if "'
this can be written
E
B; ,
i = 0, 1 , 2, . . . .
P [ Ns = 0 II f§ ] "' = ( 1 - tS ) N, ( "' ) .
450
DERIVATIVES AND CONDITIONAL PROBABILITY
Here the experiment or observation corresponding to { B; } or f§ de termines the number of events-telephone calls, say-occurring in the time interval [0, t ]. For an observer who knows this number but not the locations of the calls within [0, t ], (33.4) gives his probability for the event that none of them occurred before time s. Although this observer does not know w, he knows N1 ( w ), which is all he needs to calculate the right side of (33.4). • Example 33. 2.
Suppose that X0, X1 , space S as in Section 8. The events
• • •
is a Markov chain with state
( 33.5 ) form a finite or countable partition of D as i 0, , i n range over S. If f§n is the a-field generated by this partition, then by the defining condition (8.2) for Markov chains, P[ Xn + l = j ll f§n ]"' = P;n.� holds for w in (33.5). The sets •
•
•
( 33.6 ) for i e S also partition D, and they generate a a-field f§n° smaller than f§n . Now (8.2) also stipulates P[ Xn + l = j ll f§n°1"' = P;j for w in (33.6), and the essence of the Markov property is that
( 33 .7 )
•
The General Case
If f§ is the a-field generated by a partition B 1 , B , , then the general 2 element of l§ is a disjoint union B;1 u B; 2 u . . . , finite or countable, of certain of the B;. To know which set B; it is that contains w is the same thing as to know which sets in f§ contain w and which do not. This second way of looking at the matter carries over to the general a-field f§ contained in !F. (As always, the probability space is (D, ofF, P ).) The a-field f§ will not in general come from a partition as above. One can imagine an observer who knows for each G in f§ whether w e G or w e Gc. Thus the a-field f§ can in principle be identified with an experiment or observation. This is the point of view adopted in Section 4; see p. 52. It is natural to try and define conditional probabilities P[ A ll r9' ] with respect to the experiment f§. To do this, fix an A in ofF and define a finite measure " on f§ by •
11
(G)
= P( A n G) ,
G E f§.
•
•
SECTION
33.
CONDITIONAL PROBABILITY
451
Then P( G) = 0 implies that I'( G) = 0. The Radon-Nikodym theorem can be applied to the measures , and P on the measurable space ( 0, f§ ) because the first one is absolutely continuous with respect to the second.t It follows that there exists a function or random variable /, measure f§ and integrable with respect to P, such thatt P( A n G) = I'( G ) = fGfdP for all G in f§. Denote this function f by P[ AIIf§]. It is a random variable with two properties: (i) P[ A IIf§ ] is measurable f§ and integrable. (ii) P[ A II f§] satisfies the functional equation
(33 .8 )
1 P [ AII � ) dP = P( A n G ) ,
G
G
E
f§.
There will in general be many such random variables P[AIIf§], but any two of them are equal �with probability 1. A specific such random variable is called a version of the conditional probability. If f§ is generated by a partition B 1 , B2 , , the function f defined by (33. 2) is measurable f§ because [ w: /( w) E H ] is the union of those B; over which the constant value of f lies in H. Any G in f§ is a disjoint union G = Uk B;" ' and P( A n G) = Ek P( A I B;" )P( B;" ), so that (33.2) satisfies (33.8) as well. Thus the general definition is an extension of the one for the discrete case. Condition (i) in the definition above in effect requires that the values of P[ A IIf§] depend only on the sets in f§. An observer who knows the outcome of f§ viewed as an experiment knows for each G in f§ whether it contains w or not; for each x he knows this in particular for the set [w': P[AIIf§] w' = x], and hence he in principle knows the functional value P[ AIIf§ ]"' even if he does not know w itself. In Example 33.1 a knowledge of N,( w) suffices to determine the value of (33.4)-w itself is not needed. Condition (ii) in the definition has a gambling interpretation. Suppose that the observer, after he has learned the outcome of f§ , is offered the opportunity to bet on the event A (unless A lies in f§, he does not yet know whether or not it occurred). He is required to pay an entry fee of P[AIIf§] units and will win 1 unit if A occurs and nothing otherwise. If the observer decides to bet and pays his fee, he gains 1 P[AIIf§] if A occurs and - P [ A IIf§] otherwise, so that his gain is •
•
•
-
t Let P0 be the restriction of P to '§ (Example 10.4) and find on ( 0 , '§) a density f for Jl with resp ect to P0 . Then for G e '§, P(G ) = JGfdP0 = fGfdP (Example 16.4). If g is another such density, then P [ / =I= g ] = P0 [ f =I= g] = 0.
452
DERIVATIVES AND CONDITIONAL PROBABILITY
If he declines to bet, his gain is of course 0. Suppose that he adopts the strategy of betting if G occurs but not otherwise, where G is some set in f§. He can actually carry out this strategy, since after learning the outcome of the experiment f§ he knows whether or not G occurred. His expected gain with this strategy is his gain integrated over G:
foU� - P [ AII � ] ) dP . But (33.8) is exactly the requirement that this vanish for each G in f§. Condition (ii) requires then that each strategy be fair in the sense that the observer stands neither to win nor to lose on the average. Thus P[AIIf§] is the just entry fee, as intuition requires. 33.3. EX111ple 11
Suppose that A e f§, which will always hold if f§ coin cides with the whole a-field !F. Then IA satisfies conditions (i) and (ii), so that P[AIIl§] = lA with probability 1. If A e f§, then to know the outcome of l§ viewed as an experiment is in particular to know whether or not A has • occurred. If f§ is {0, 0 }, the smallest possible a-field, every func tion measurable f§ must be constant. Therefore, P[AII f§ ] "' = P( A ) for all w • in this case. The observer learns nothing from the experiment f§. Example 33.4.
According to these two examples, P[A II{O, O }] is identically P( A ), whereas lA is a version of P[AII�]. For any f§, the function identically equal to P( A ) satisfies condition (i) in the definition of conditional prob ability, whereas lA satisfies condition (ii). Condition (i) becomes more stringent as f§ decreases, and condition (ii) becomes more stringent as f§ increases. The two conditions work in opposite directions and between them delimit the class of vetsions of P [A II f§ ]. Let 0 be the plane R 2 and let � be the class gJ 2 of planar Borel sets. A point of 0 is a pair ( x, y) of reals. Let f§ be the a-field consisting of the vertical strips, the product sets E x R 1 = [(x, y): x e £ ], where E is a linear Borel set. If the observer knows for each strip E X R 1 whether or not it contains (x, y), then, as he knows this for each one-poin t set E, he knows the value of x. Thus the experiment f§ consists in the determination of the first coordinate of the sample point. Suppose now that P is a probability measure on gJ 2 having a density f(x, y) with respect to planar Lebesgue measure: P( A) = ffAf(x, y ) dx dy. Let A be a horizon tal strip R 1 x F = [( x , y): y e F], F being a linear Borel set. The conditional probability P[ AIIf§] can be calculated explicitly. Extunple 33.5.
SECTION
33.
CONDITIONAL PROBABILITY
453
Put
cp(x, y) =
{33 .9 )
/j (x, t ) dt
fRlf(x, t ) dt
.
Set cp(x, y) = 0, say, at points where the denominator here vanishes; these points form a set of P-measure 0. Since cp (x, y ) is a function of x alone, it 1 is measurable f§. The general element of f§ being E X R , it will follow that fP(X, y) is a version of P[AIIf§] ( x , y ) if it is shown that •
(33.10)
f
F: x
R
A X R1 ) ) . = dP(x, ) ) E ( y y n P cp(x, ( 1
Since A = R 1 x F, the right side here is P(E X F). Since P has density f, Theorem 16.10 and Fubini ' s theorem reduce the left side to
fe{ L.q> (x , Y )f (x , y ) dy } dx = fe{ /j(x , t ) dt } dx =
JJ
f(x, y) dx dy = P( E x F ) .
EX F
Thus (33.9) does give a version of P[R 1 X
F 11 �1 < x. y > ·
•
The right side of (33.9) is the classical formula for the conditional probability of the event R 1 X F (the event that y e F) given the event { x } X R 1 (given the value of x ) . Since the event { x } X R 1 has probability 0, the formula P ( A I B) = P(A n B)/P(B) does not work here. The whole point of this section is the systematic development of a notion of condi tional probability that covers conditioning with respect to events of prob ability 0. This is accomplished by conditioning with respect to collections of events-that is, with respect to a-fields f§. The set A is by definition independent of the a-field f§ if it is independent of each G in f§: P(A n G ) = P( A )P(G ). This being the same thing as P(A n G) = JG P(A) dP, A is independent of f§ if and only if P[A II�] = P( A ) with probability 1. • EX111ple 33.6. 11
454
DERIVATIVES AND CONDITIONAL PROBABILITY
X) He [w: X(w) e A X P[AIIX] P[AIIa(X)] X), X( x, x ( x,
X
The a-field a( generated by a random variable consists of the sets H) for .91 1 ; see Theorem 20.1. The conditional probability of given is defined as and is denoted Thus = by definition. From the experiment corresponding to contains w one learns which of the sets = the a-field a( and hence learns the value . Example 33.5 is a case of this: take y) = for y) in the sample space 0 = R 2 there. This definition applies without change to a random vector, or, equiv alently, to a finite set of random variables. It can be adapted to arbitrary sets of random variables as well. For any such set the a-field t a [ X1, t it generates is the smallest a-field with respect to which each is measurable. It is generated by the collection of sets of the form H ] for t in and in .91 1 . The conditional probability t of with respect to this set of random variables is by definition the conditional probability of with respect to the a-field t a [ X1, t In this notation the property (33.7) of Markov chains becomes
P[AI I a(X)]
X( w )
P[AIIX]. [ w' : X( w' ) x ] [X,, e T ],
e T] X, ( w ) e e T] A e T].
T
H P[AIIa[X,, e T]] A
X, [w: P[AI I X,,
( 33 .11 )
[Xn + I j]
= is the same for someone who The conditional probability of knows the present state as for someone who knows the present state xn and the past states as well.
xn Xo, . . . ' xn - I Example 33. 7. Let X and Y be random vectors of dimensions j and k, let p. be the distribution of X over R j, and suppose that X and Y are
independent. According to (20.30),
P ( X e H, ( X, Y) e J] = jHP (( x , Y) e J ) p. ( dx ) e
Je
_9lj + k . This is a consequence of Fubini' s theorem; it for H _9lj and has a conditional probability interpretation. For each in R j put
( 33 .12)
x f ( x ) = P [( x , Y ) e J] = P [ w': ( x , Y( w' )) e J] .
X and Y are independent, then ( 33 .13 ) f ( X( w )) = P( ( X, Y ) e J IIX ]
If
with probability
1.
By the theory of Section
w
18, f( X( w ))
is measurable
SECTION
33.
455
CONDITIONAL PROBABILITY
o( X), and since p. is the distribution of X, a change of variable gives
1
/( X ( w )) P ( dw ) =
[ XE H )
j f( x ) p. ( dx ) = P( [ ( X, Y ) H
Since [ X E H] is the general element of The fact just proved can be written
P [( X, Y ) E JIIX]
w
E
J]
n (X E H]).
o( X), (33.13) follows.
=
P ( ( X( w ) , Y ) E J]
=
P [ w': ( X( w ) , Y( w' ) ) e J ] .
Replacing w' by w on the right here causes a notational collision analogous to the one replacing y by x causes in J:f(x, y) dy. • Suppose that X and Y are independent random variables and that Y has distribution function F. For J = [(u, v): max{ u, v } < m], (33.12) is 0 for m < x and F(m) for m > x; if M = max{ X, Y }, then (33.13) gives (33 .14) with probability 1. All equations involving conditional probabilities must in this way be qualified by the phrase with probability 1 because the condi tional probability is unique only to within a set of probability 0. The following theorem is helpful in calculating conditional probabilities.
Let gJ be a 'IT-system generating the a-field f§, and suppose that 0 is a finite or countable union of sets in 9'. An integrable function f is a version of P[AIIf§] if it is measurable f§ and if
Theorem 33.1.
f JdP = P ( A n G )
( 33 .15)
G
holds for all G in gJ_ PROOF.
Apply Theorem 10.4.
•
The condition that 0 is a finite or countable union of gJ-sets cannot be suppressed; see Example 10.5. Emmple 33.8. Suppose that X and Y are independent random variables with a common distribution function F that is positive and continuous. What is the conditional probability of [ X � x] given the random variable M == max{ X, Y }? As
DERIVATIVES AND CONDITIONAL PROBABILITY 456 it should clearly be 1 if M < x, suppose that M > x. Since X � x requires M Y, the chance of which is ! by symmetry, the conditional probability of [ X < x] should by independence be ! F( x )/F( m ) = !P[ X < x i X < m] with the random variable M substituted for m. Intuition thus gives =
( 33 .16) It suffices to check (33.15) for sets G = [ M < m ], because these form a w-system generating a( M). The functional equation reduces to 1 . P ( M < nun { x, m } ] + 2
{ 33 .17)
j
F( x ) dP F M ) x< M�m (
= P( M < m , X < x] . Since the other case is easy, suppose that x < m. Since the distribution of ( X, Y) is product measure, it follows by Fubini's theorem and the assumed continuity of F that
+
J£
l dF u dF = 2( F( m ) - F( x ) ) ) v) ( ( F( u ) 0] of random variables is a Markov process in continuous time if for k > 1, 0 < t1 � · · · < tk � u, and H e !1/ 1 , Example 33. 9.
(33 .18) with probability 1. The analogue for discrete time is (33.11). (The xn there have countable range as well, and the transition probabilities are constant in time, conditions that are not imposed here.) Suppose that t � u . Looking on the right side of (33.18) as a version of the conditional probability on the left shows that
1 P [ X,. e HII X,) dP = P ([ X,. e H) n G )
(33 .19) if 0 < t 1 � let k and
G
• •
t1,
•
tk = t s u and G e o( X11, , X,k ). Fix t, u, and H and , t k vary. Consider the class gJ = Uo(X11, , X,k ), the
�
• • •
• • •
• • •
SECTION
33.
457
CONDITIONAL PROBABILITY
union extending over all k > 1 and all k-tuples satisfying 0 < t 1 < · · · � tk = t. If A E o( x,1 , . . . ' x,k ) and B E o( XS1 ' . . . ' xs ), then A n B E o ( X,1 , . . . ' xr ), where the ra are the Sp and the ly merged, together. Thus gJ is a w-systeln . Since gJ generates o[Xs : s < t] and P[ Xu e HII X,] is measurable with respect to this a-field, it follows by (33.19) and Theorem 33.1 that P[ Xu E HII X,] is a version of P[ Xu e HII Xs , s < t]:
t < u,
(33.20 )
with probability 1. This says that for calculating conditional probabilities about the future, the present o( X,) is equivalent to the present and the entire past o[Xs : • s < t]. This follows from the apparently weaker condition (33.18). The Poisson process [N,: t > 0] has independent incre ments (Section 23). Suppose that 0 < t 1 < < t k < u. The random vector ( N,1 , N, 2 - N,1 , . . . , N,k - N,k- 1 ) is independent of Nu - N, , and so (Theorem 20.2) (N11 , N,2, , N,k ) is independent of Nu - N,k . If J is the set k l of points ( x 1 , , xk, y) in R + such that xk + y e H, where H e !l/ 1 , and if Jl is the distribution of Nu - N, k, then (33.12) is P[( x 1 , , xk, Nu N, ) e J ] = P[ xk + Nu - N,k e H) = J1(H - xk ). Therefore, (33.13) gives lc P [Nu e HIIN,1 , , N,k ] = J1(H - N,k ). This holds also if k = 1, and hence P [ Nu e HIIN,1 , , N,k ] = P[Nu e HIIN,k ]. The Poisson process thus has the Markov property (33.18); this is a consequence solely of the indepen dence of the increments. The extended Markov property (33.20) follows. • Example 33. 10.
· · ·
k
• • •
• • •
• • •
• • •
• • •
Properties of Conditional Probability 1beorem 33.2.
For each A
0 < P (AIIfl ] < 1 ( 33 .21 ) with probability 1. If A 1 , A 2 , is a finite or countable sequence of disjoint sets, then (3 3 . 2 2 ) • • •
n
with probability 1. For each version of the conditional probability, fG P[A IIfl] dP = P( A n G) > 0 for each G in fl ; since P[A IIfl] is measurable fl, it must be nonnegative except on a set of P-measure 0. The other inequality in (33.21) is proved the same way. PROO F.
458
DERIVATIVES AND CONDITIONAL PROBABILITY
If the A n are disjoint and if G lies in f§, it follows (Theorem 16.6) that
Thus E n P l A n llf§], which is certainly measurable f§, satisfies the functional equation for P[UnAnllf§], and so must coincide with it except perhaps on a set of P-measure 0. Hence (33.22). • ( 33 .23)
P( B - A ll�) == P( Bll�) - P( All�) ,
c
A B , then P( A ll�) < P( B ll�) .
Additional useful facts can be established by similar arguments. If The inclusion-exclusion formula
P u- 1 A ;ll� holds. If A n f A , then ( 33 .24)
[
.
I
]
==
� P[ A ;II�] - �. P ( A ; n A1 11 �] I
I <J
+ ···
( 33 .25) and if A n � A , then ( 33 .26) Further,
P ( A ) == 1 implies that
( 33 .27) and
P( A ll�) == 1
P ( A ) == 0 implies that
( 33 .28)
P[ A II�] == 0.
Of course, (33.23) through (33.28) hold with probability 1 only. Difficulties and Curiosities
This section has been devoted almost entirely to examples connecting the abstract definition (33.8) with the probabilistic idea lying back of it. There are pathological examples showing that the interpretation of conditional probability in terms of an observer with partial information breaks down in certain cases. Let ( 0, � ' P) be the unit interval 0 with Lebesgu e measure P on the a-field � of Borel subsets of 0. Take f§ to be the a-field Example 33. 11.
SECTION
33.
CONDITIONAL PROBABILITY
459
of sets that are either countable or co-countable. Then the function identi cally equal to P(A) is a version of P[AIIrl] because P(G) is either 0 or 1 for every G in rl; therefore,
P [ A II rl ] "' = P ( A )
(33 .29)
with probability 1. But since r9 contains all one-point sets, to know which elements of r9 contain w is to know w itself. Thus r9 viewed as an experiment should be completely informative-the observer given the infor mation in r9 should know w exactly-and so one might expect that
P [ A II !J ).,
(33 .30)
=
{�
if w E A, if w � A .
•
This is Example 4.9 in a new form.
The mathematical definition gives (33.29); the heuristic considerations lead to (33.30). Of course, (33.29) is right and (33.30) is wrong. The heuristic view breaks down in certain cases but is nonetheless illuminating and cannot, since it does not intervene in proofs, lead to any difficulties.

The point of view in this section has been "global": to each fixed $A$ in $\mathcal{F}$ has been attached a function (usually a family of functions) $P[A\|\mathcal{G}]_\omega$ defined over all of $\Omega$. What happens if the point of view is reversed, so that $\omega$ is fixed and $A$ varies over $\mathcal{F}$? Will this result in a probability measure on $\mathcal{F}$? Intuition says it should, and if it does, then (33.21) through (33.28) all reduce to standard facts about measures.

Suppose that $B_1, \ldots, B_r$ is a partition of $\Omega$ into $\mathcal{F}$-sets, and let $\mathcal{G} = \sigma(B_1, \ldots, B_r)$. If $P(B_1) = 0$ and $P(B_i) > 0$ for the other $i$, then one version of $P[A\|\mathcal{G}]$ is
$$P[A\|\mathcal{G}]_\omega = \begin{cases} 17 & \text{if } \omega \in B_1, \\ \dfrac{P(A \cap B_i)}{P(B_i)} & \text{if } \omega \in B_i,\ i = 2, \ldots, r. \end{cases}$$
With this choice of version for each $A$, $P[A\|\mathcal{G}]_\omega$ is, as a function of $A$, a probability measure on $\mathcal{F}$ if $\omega \in B_2 \cup \cdots \cup B_r$, but not if $\omega \in B_1$. The "wrong" versions have been chosen. If, for example,
$$P[A\|\mathcal{G}]_\omega = \begin{cases} P(A) & \text{if } \omega \in B_1, \\ \dfrac{P(A \cap B_i)}{P(B_i)} & \text{if } \omega \in B_i,\ i = 2, \ldots, r, \end{cases}$$
then $P[A\|\mathcal{G}]_\omega$ is a probability measure in $A$ for each $\omega$. Clearly, versions such as this one exist if $\mathcal{G}$ is finite.
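The arbitrariness on the null cell, and the repair just described, can be made concrete. A hypothetical sketch (the sample space, the constant 17 on the null cell echoing the text, and the helper `version` are illustrative assumptions):

```python
# Illustrative sketch (not from the text): on a partition with a null cell B1,
# the value of a version of P[A || G] on B1 is arbitrary (even the constant 17),
# but choosing P(A) there makes every omega-section a probability measure in A.
from fractions import Fraction
from itertools import chain, combinations

omega = [0, 1, 2, 3]
p = {0: Fraction(0), 1: Fraction(1, 2), 2: Fraction(1, 4), 3: Fraction(1, 4)}
B1, B2 = {0}, {1, 2, 3}                        # P(B1) = 0
prob = lambda A: sum(p[w] for w in A)

def version(A, on_null):
    """A version of P[A || sigma(B1, B2)]: arbitrary value on the null cell."""
    def f(w):
        return on_null(A) if w in B1 else prob(A & B2) / prob(B2)
    return f

events = [set(s) for s in chain.from_iterable(combinations(omega, r) for r in range(5))]
wrong = {tuple(sorted(A)): version(A, lambda _: 17)(0) for A in events}
right = {tuple(sorted(A)): version(A, prob)(0) for A in events}
assert wrong[(0, 1, 2, 3)] == 17               # not a measure: total mass is 17
assert right[(0, 1, 2, 3)] == 1                # corrected version: total mass 1
assert right[(1,)] + right[(2, 3)] == right[(1, 2, 3)]   # and it is additive
```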
It might be thought that for an arbitrary $\sigma$-field $\mathcal{G}$ in $\mathcal{F}$ versions of the various $P[A\|\mathcal{G}]$ can be so chosen that $P[A\|\mathcal{G}]_\omega$ is for each fixed $\omega$ a probability measure as $A$ varies over $\mathcal{F}$. It is possible to construct a counterexample showing that this is not so.† The example is possible because the exceptional $\omega$-set of probability 0 where (33.22) fails depends on the sequence $A_1, A_2, \ldots$; if there are uncountably many such sequences, it can happen that the union of these exceptional sets has positive probability whatever versions $P[A\|\mathcal{G}]$ are chosen.

The existence of such pathological examples turns out not to matter. Example 33.9 illustrates the reason why. From the assumption (33.18) the notably stronger conclusion (33.20) was reached. Since the set $[X_u \in H]$ is fixed throughout the argument, it does not matter that conditional probabilities may not, in fact, be measures. What does matter for the theory is Theorem 33.2 and its extensions.

Consider a point $\omega_0$ with the property that $P(G) > 0$ for every $G$ in $\mathcal{G}$ that contains $\omega_0$. This will be true if the one-point set $\{\omega_0\}$ lies in $\mathcal{G}$ and has positive probability. Fix any versions of the $P[A\|\mathcal{G}]$. For each $A$ the set $[\omega\colon P[A\|\mathcal{G}]_\omega < 0]$ lies in $\mathcal{G}$ and has probability 0; it therefore cannot contain $\omega_0$. Thus $P[A\|\mathcal{G}]_{\omega_0} \ge 0$. Similarly, $P[\Omega\|\mathcal{G}]_{\omega_0} = 1$, and, if the $A_n$ are disjoint, $P[\bigcup_n A_n\|\mathcal{G}]_{\omega_0} = \sum_n P[A_n\|\mathcal{G}]_{\omega_0}$. Therefore, $P[A\|\mathcal{G}]_{\omega_0}$ is a probability measure as $A$ ranges over $\mathcal{F}$.

Thus conditional probabilities behave like probabilities at points of positive probability. That they may not do so at points of probability 0 causes no problem, because individual such points have no effect on the probabilities of sets. Of course, sets of points individually having probability 0 do have an effect, but here the global point of view reenters.
Conditional Probability Distributions
Let $X$ be a random variable on $(\Omega, \mathcal{F}, P)$, and let $\mathcal{G}$ be a $\sigma$-field in $\mathcal{F}$.

Theorem 33.3. There exists a function $\mu(H, \omega)$, defined for $H$ in $\mathcal{B}^1$ and $\omega$ in $\Omega$, with these two properties:
(i) For each $\omega$ in $\Omega$, $\mu(H, \omega)$ is, as a function of $H$, a probability measure on $\mathcal{B}^1$.
(ii) For each $H$ in $\mathcal{B}^1$, $\mu(H, \omega)$ is, as a function of $\omega$, a version of $P[X \in H\|\mathcal{G}]_\omega$.

The probability measure $\mu(\cdot, \omega)$ is a conditional distribution of $X$ given $\mathcal{G}$. If $\mathcal{G} = \sigma(Z)$, it is a conditional distribution of $X$ given $Z$.
† The argument is outlined in Problem 33.13. It depends on the construction of certain nonmeasurable sets.
PROOF. For each rational $r$, let $F(r, \omega)$ be a version of $P[X \le r\|\mathcal{G}]_\omega$. If $r \le s$, then by (33.23),
$$F(r, \omega) \le F(s, \omega) \tag{33.31}$$
for $\omega$ outside a $\mathcal{G}$-set $A_{rs}$ of probability 0. By (33.26),
$$F(r, \omega) = \lim_n F(r + n^{-1}, \omega) \tag{33.32}$$
for $\omega$ outside a $\mathcal{G}$-set $B_r$ of probability 0. Finally, by (33.25) and (33.26),
$$\lim_{r \to -\infty} F(r, \omega) = 0, \qquad \lim_{r \to \infty} F(r, \omega) = 1 \tag{33.33}$$
outside a $\mathcal{G}$-set $C$ of probability 0. As there are only countably many of these exceptional sets, their union $E$ lies in $\mathcal{G}$ and has probability 0.

For $\omega \notin E$ extend $F(\cdot, \omega)$ to all of $R^1$ by setting $F(x, \omega) = \inf[F(r, \omega)\colon x < r]$. For $\omega \in E$ take $F(x, \omega) = F(x)$, where $F$ is some arbitrary but fixed distribution function. Suppose that $\omega \notin E$. By (33.31) and (33.32), $F(x, \omega)$ agrees with the first definition on the rationals and is nondecreasing; it is right-continuous; and by (33.33) it is a probability distribution function. Therefore, there exists a probability measure $\mu(\cdot, \omega)$ on $(R^1, \mathcal{B}^1)$ with distribution function $F(\cdot, \omega)$. For $\omega \in E$, let $\mu(\cdot, \omega)$ be the probability measure corresponding to $F(x)$. Then condition (i) is satisfied.

The class of $H$ for which $\mu(H, \omega)$ is measurable $\mathcal{G}$ is a $\lambda$-system containing the sets $H = (-\infty, r]$ for rational $r$; therefore $\mu(H, \omega)$ is measurable $\mathcal{G}$ for $H$ in $\mathcal{B}^1$. By construction, $\mu((-\infty, r], \omega) = P[X \le r\|\mathcal{G}]_\omega$ with probability 1 for rational $r$. For $H = (-\infty, r]$, then,
$$\int_G \mu(H, \omega)\,P(d\omega) = P([X \in H] \cap G)$$
for all $G$ in $\mathcal{G}$. Fix $G$. Each side of this equation is a measure as a function of $H$, and so the equation must hold for all $H$ in $\mathcal{B}^1$. ∎

Example 33.12. Let $X$ and $Y$ be random variables whose joint distribution $\nu$ in $R^2$ has density $f(x, y)$ with respect to Lebesgue measure: $P[(X, Y) \in A] = \nu(A) = \iint_A f(x, y)\,dx\,dy$. Let $g(x, y) = f(x, y)/\int_{R^1} f(x, t)\,dt$, and let $\mu(H, x) = \int_H g(x, y)\,dy$ have probability density $g(x, \cdot)$; if $\int_{R^1} f(x, t)\,dt = 0$, let $\mu(\cdot, x)$ be an arbitrary probability measure on the line. Then $\mu(H, X(\omega))$ will serve as the conditional distribution of $Y$ given $X$. Indeed, (33.10) is the same thing as $\int_{E \times R^1} \mu(F, x)\,d\nu(x, y) = \nu(E \times F)$, and a change of variable gives $\int_{[X \in E]} \mu(F, X(\omega))\,P(d\omega) = P[X \in E, Y \in F]$. Thus $\mu(F, X(\omega))$ is a version of $P[Y \in F\|X]_\omega$. This is Example 33.5 in a new guise. ∎
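Example 33.12 can be checked numerically on a grid. A sketch under invented assumptions (the particular joint density and the grid resolution are mine, not the text's):

```python
# Numerical sketch of Example 33.12 (illustrative, not from the text):
# g(x, y) = f(x, y) / ∫ f(x, t) dt is a probability density in y for each x,
# and reweighting by g recovers the joint mass of product sets.
import numpy as np

n = 200
dx = 1.0 / n
xs = (np.arange(n) + 0.5) * dx                     # midpoint grid on [0, 1]
X, Y = np.meshgrid(xs, xs, indexing="ij")
f = (6.0 / 7.0) * (X + Y) ** 2                     # a joint density on the unit square
marg_x = f.sum(axis=1) * dx                        # ∫ f(x, t) dt for each grid x
g = f / marg_x[:, None]                            # conditional density g(x, y)

# g(x, .) integrates to 1 for every x ...
assert np.allclose(g.sum(axis=1) * dx, 1.0)
# ... and ∫_E mu(F, x) marg_x(x) dx recovers P[X in E, Y in F] for E = F = [0, 1/2].
E = xs < 0.5
mu_F = g[:, E].sum(axis=1) * dx                    # mu(F, x) = ∫_F g(x, y) dy
lhs = (mu_F[E] * marg_x[E]).sum() * dx
rhs = f[np.ix_(E, E)].sum() * dx * dx
assert abs(lhs - rhs) < 1e-9
```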
PROBLEMS

33.1. 20.5 ↑ Borel's paradox. Suppose that a random point on the sphere is specified by longitude $\theta$ and latitude $\phi$, but restrict $\theta$ by $0 \le \theta < \pi$, so that $\theta$ specifies only the complete meridian circle (not semicircle) containing the point, and compensate by letting $\phi$ range over $(-\pi, \pi]$.
(a) Show that for given $\theta$ the conditional distribution of $\phi$ has density $\frac14|\cos\phi|$ over $(-\pi, \pi]$. If the point lies on, say, the meridian circle through Greenwich, it is thus not uniformly distributed over that great circle.
(b) Show that for given $\phi$ the conditional distribution of $\theta$ is uniform over $[0, \pi)$. If the point lies on the equator ($\phi$ is 0 or $\pi$), it is thus uniformly distributed over that great circle.
Since the point is uniformly distributed over the spherical surface and great circles are indistinguishable, (a) and (b) stand in apparent contradiction. This shows again the inadmissibility of conditioning with respect to an isolated event of probability 0. The relevant $\sigma$-field must not be lost sight of.

33.2. 20.22 ↑ Let $X$ and $Y$ be independent, each having the standard normal distribution, and let $(R, \Theta)$ be the polar coordinates for $(X, Y)$.
(a) Show that $X + Y$ and $X - Y$ are independent and that $R^2 = [(X + Y)^2 + (X - Y)^2]/2$, and conclude that the conditional distribution of $R^2$ given $X - Y$ is the chi-squared distribution with one degree of freedom translated by $(X - Y)^2/2$.
(b) Show that the conditional distribution of $R^2$ given $\Theta$ is chi-squared with two degrees of freedom.
(c) If $X - Y = 0$, the conditional distribution of $R^2$ is chi-squared with one degree of freedom. If $\Theta = \pi/4$ or $\Theta = 5\pi/4$, the conditional distribution of $R^2$ is chi-squared with two degrees of freedom. But the events $[X - Y = 0]$ and $[\Theta = \pi/4] \cup [\Theta = 5\pi/4]$ are the same. Resolve the apparent contradiction.

33.3. ↑ Paradoxes of a somewhat similar kind arise in very simple cases.
(a) Of three prisoners, call them 1, 2, and 3, two have been chosen by lot for execution. Prisoner 3 says to the guard, "Which of 1 and 2 is to be executed? One of them will be, and you give me no information about myself in telling me which it is." The guard finds this reasonable and says, "Prisoner 1 is to be executed." And now 3 reasons, "I know that 1 is to be executed; the other will be either 2 or me, and so my chance of being executed is now only $\frac12$, instead of the $\frac23$ it was before." Apparently, the guard has given him information. If one looks for a $\sigma$-field, it must be the one describing the guard's answer, and it then becomes clear that the sample space is incompletely specified. Suppose that, if 1 and 2 are to be executed, the guard's response is "1" with probability $p$ and "2" with probability $1 - p$; and, of course, suppose that, if 3 is to be executed, the guard names the other victim. Calculate the conditional probabilities.
(b) Assume that among families with two children the four sex distributions are equally likely. You have been introduced to one of the two children in such a family, and he is a boy. What is the conditional probability that the other is a boy as well?

33.4. Extend Examples 33.4 and 33.11: If $P(G)$ is 0 or 1 for every $G$ in $\mathcal{G}$, then $P[A\|\mathcal{G}] = P(A)$ with probability 1.

33.5. There is a slightly different approach to conditional probability. Let $(\Omega, \mathcal{F}, P)$ be a probability space, $(\Omega', \mathcal{F}')$ a measurable space, and $T\colon \Omega \to \Omega'$ a mapping measurable $\mathcal{F}/\mathcal{F}'$. For fixed $A$ in $\mathcal{F}$, define a measure $\nu$ on $\mathcal{F}'$ by $\nu(A') = P(A \cap T^{-1}A')$ for $A' \in \mathcal{F}'$. Prove that there exists a function $p(A|\omega')$ on $\Omega'$, measurable $\mathcal{F}'$ and integrable $PT^{-1}$, such that $\int_{A'} p(A|\omega')\,PT^{-1}(d\omega') = P(A \cap T^{-1}A')$ for all $A'$ in $\mathcal{F}'$. Intuitively, $p(A|\omega')$ is the conditional probability that $\omega \in A$ for someone who knows that $T\omega = \omega'$. Let $\mathcal{G} = [T^{-1}A'\colon A' \in \mathcal{F}']$; show that $\mathcal{G}$ is a $\sigma$-field and that $p(A|T\omega)$ is a version of $P[A\|\mathcal{G}]_\omega$.

33.6. ↑ Suppose that $T = X$ is a random variable, $(\Omega', \mathcal{F}') = (R^1, \mathcal{B}^1)$, and $x$ is the general point of $R^1$. In this case $p(A|x)$ is sometimes written $P[A\,|\,X = x]$. What is the problem with this notation?

33.7. For the Poisson process (see Example 33.1) show that for $0 \le s \le t$,
$$P[N_s = k\|N_t] = \binom{N_t}{k}\Big(\frac{s}{t}\Big)^{k}\Big(1 - \frac{s}{t}\Big)^{N_t - k}, \qquad 0 \le k \le N_t.$$
Thus the conditional distribution (in the sense of Theorem 33.3) of $N_s$ given $N_t$ is binomial with parameters $N_t$ and $s/t$.

33.8. 29.10 ↑ Suppose that $(X_1, X_2)$ has the centered normal distribution, that is, has in the plane the distribution with density (29.29). Express the quadratic form in the exponential as
$$\frac{x_1^2}{a_{11}} + \frac{(x_2 - a_{12}a_{11}^{-1}x_1)^2}{\tau},$$
where $\tau = a_{22} - a_{12}^2 a_{11}^{-1}$. Describe the conditional distribution of $X_2$ given $X_1$.

33.9. Prove this form of Bayes' theorem:
$$P(G|A) = \frac{\int_G P[A\|\mathcal{G}]\,dP}{\int_\Omega P[A\|\mathcal{G}]\,dP}$$
for $G \in \mathcal{G}$.
33.10. Suppose that $\mu(H, \omega)$ has property (i) in Theorem 33.3, and suppose that $\mu(H, \cdot)$ is a version of $P[X \in H\|\mathcal{G}]$ for $H$ in a $\pi$-system generating $\mathcal{B}^1$. Show that $\mu(\cdot, \omega)$ is a conditional distribution of $X$ given $\mathcal{G}$.

33.11. ↑ Deduce from (33.16) that the conditional distribution of $X$ given $M$ is
$$\frac12 I_H(M(\omega)) + \frac12\,\frac{\mu(H \cap (-\infty, M(\omega)])}{\mu((-\infty, M(\omega)])},$$
where $\mu$ is the distribution corresponding to $F$ (positive and continuous). Hint: First check $H = (-\infty, x]$.

33.12. It is shown in Example 33.9 that (33.18) implies (33.20). Prove the reverse implication.

33.13. 4.13 12.4 ↑ The following construction shows that conditional probabilities may not give measures. Complete the details.
In Problem 4.13 it is shown that there exist a probability space $(\Omega, \mathcal{F}, P)$, a $\sigma$-field $\mathcal{G}$ in $\mathcal{F}$, and a set $H$ in $\mathcal{F}$, such that $P(H) = \frac12$, $H$ and $\mathcal{G}$ are independent, $\mathcal{G}$ contains all the singletons, and $\mathcal{G}$ is generated by a countable subclass. The countable subclass generating $\mathcal{G}$ can be taken to be a $\pi$-system $\mathcal{B} = \{B_1, B_2, \ldots\}$ (pass to the finite intersections of the sets in the original class).
Assume that it is possible to choose versions $P[A\|\mathcal{G}]$ so that $P[A\|\mathcal{G}]_\omega$ is for each $\omega$ a probability measure as $A$ varies over $\mathcal{F}$. Let $C_n$ be the $\omega$-set where $P[B_n\|\mathcal{G}]_\omega = I_{B_n}(\omega)$; show (Example 33.3) that $C = \bigcap_n C_n$ has probability 1. Show that $\omega \in C$ implies that $P[G\|\mathcal{G}]_\omega = I_G(\omega)$ for all $G$ in $\mathcal{G}$ and hence that $P[\{\omega\}\|\mathcal{G}]_\omega = 1$. Now $\omega \in H \cap C$ implies that $P[H\|\mathcal{G}]_\omega \ge P[\{\omega\}\|\mathcal{G}]_\omega = 1$, and $\omega \in H^c \cap C$ implies that $P[H\|\mathcal{G}]_\omega \le P[\Omega - \{\omega\}\|\mathcal{G}]_\omega = 0$. Thus $\omega \in C$ implies that $P[H\|\mathcal{G}]_\omega = I_H(\omega)$. But since $H$ and $\mathcal{G}$ are independent, $P[H\|\mathcal{G}]_\omega = P(H) = \frac12$ with probability 1, a contradiction.
This example is related to Example 4.9 but concerns mathematical fact instead of heuristic interpretation.
33.14. Use Theorem 12.5 to extend Theorem 33.3 to the case in which $X$ is a random vector.

33.15. Let $\alpha$ and $\beta$ be $\sigma$-finite measures on the line, and let $f(x, y)$ be a probability density with respect to $\alpha \times \beta$. Define
$$g_x(y) = \frac{f(x, y)}{\int_{R^1} f(x, t)\,\beta(dt)} \tag{33.34}$$
unless the denominator vanishes, in which case take $g_x(y) = 0$, say. Show that, if $(X, Y)$ has density $f$ with respect to $\alpha \times \beta$, then the conditional distribution of $Y$ given $X$ has density $g_X(y)$ with respect to $\beta$. This generalizes Examples 33.5 and 33.12, where $\alpha$ and $\beta$ are Lebesgue measure.
33.16. 18.25 ↑ Suppose that $\mu$ and $\nu_x$ (one for each real $x$) are probability measures on the line, and suppose that $\nu_x(B)$ is a Borel function in $x$ for each $B \in \mathcal{B}^1$. Then (see Problem 18.25)
$$\pi(E) = \int_{R^1} \nu_x[y\colon (x, y) \in E]\,\mu(dx) \tag{33.35}$$
defines a probability measure on $(R^2, \mathcal{B}^2)$. Suppose that $(X, Y)$ has distribution $\pi$ and show that $\nu_X$ is a version of the conditional distribution of $Y$ given $X$.

33.17. ↑ Let $\alpha$ and $\beta$ be $\sigma$-finite measures on the line. Specialize the setup of Problem 33.16 by supposing that $\mu$ has density $f(x)$ with respect to $\alpha$ and $\nu_x$ has density $g_x(y)$ with respect to $\beta$. Assume that $g_x(y)$ is measurable $\mathcal{B}^2$ in the pair $(x, y)$, so that $\nu_x(B)$ is automatically measurable in $x$. Show that (33.35) has density $f(x)g_x(y)$ with respect to $\alpha \times \beta$: $\pi(E) = \iint_E f(x)g_x(y)\,\alpha(dx)\,\beta(dy)$. Show that (33.34) is consistent with $f(x, y) = f(x)g_x(y)$. Put
$$p_y(x) = \frac{f(x)g_x(y)}{\int_{R^1} f(t)g_t(y)\,\alpha(dt)}.$$
Suppose that $(X, Y)$ has density $f(x)g_x(y)$ with respect to $\alpha \times \beta$ and show that $p_y(x)$ is a density with respect to $\alpha$ for the conditional distribution of $X$ given $Y$. In the language of Bayes, $f(x)$ is the prior density of a parameter $x$, $g_x(y)$ is the conditional density of the observation $y$ given the parameter, and $p_y(x)$ is the posterior density of the parameter given the observation.

33.18. ↑ Now suppose that $\alpha$ and $\beta$ are Lebesgue measure, that $g_x(y)$ is positive, continuous, and bounded, and that $g_x(y) = e^{-\cdots}$
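The Bayes reading of Problem 33.17 admits a simple numerical sketch: discretize, multiply prior by likelihood, normalize. The normal prior and normal likelihood here are illustrative choices, not part of the problem.

```python
# Discretized sketch of the posterior density p_y(x) ∝ f(x) g_x(y)
# (illustrative, not from the text): standard normal prior, N(x, 1) likelihood.
import numpy as np

xs = np.linspace(-5.0, 5.0, 2001)
dx = xs[1] - xs[0]
prior = np.exp(-xs**2 / 2) / np.sqrt(2 * np.pi)          # f(x)
y = 1.3
likelihood = np.exp(-(y - xs)**2 / 2) / np.sqrt(2 * np.pi)   # g_x(y)
posterior = prior * likelihood
posterior /= posterior.sum() * dx                         # normalize to get p_y(x)

assert abs(posterior.sum() * dx - 1.0) < 1e-9             # a density in x
post_mean = (xs * posterior).sum() * dx
assert abs(post_mean - y / 2) < 1e-3                      # conjugate case: mean y/2
```

In this conjugate normal case the posterior is known in closed form (mean $y/2$, variance $1/2$), which gives the check in the last line.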
SECTION 34. CONDITIONAL EXPECTATION

If $P(B_i) = 0$, the value of $E[X\|\mathcal{G}]$ over $B_i$ is constant but arbitrary. ∎

Example 34.2. For an indicator $I_A$ the defining properties of $E[I_A\|\mathcal{G}]$ and $P[A\|\mathcal{G}]$ coincide; therefore, $E[I_A\|\mathcal{G}] = P[A\|\mathcal{G}]$ with probability 1. It is easily checked that, more generally, $E[X\|\mathcal{G}] = \sum_i a_i P[A_i\|\mathcal{G}]$ with probability 1 for a simple function $X = \sum_i a_i I_{A_i}$. ∎

In analogy with the case of conditional probability, if $[X_t,\ t \in T]$ is a collection of random variables, $E[X\|X_t,\ t \in T]$ is by definition $E[X\|\mathcal{G}']$ with $\sigma[X_t,\ t \in T]$ in the role of $\mathcal{G}'$.

Properties of Conditional Expectation
Theorem 34.1. Let $\mathcal{P}$ be a $\pi$-system generating the $\sigma$-field $\mathcal{G}$, and suppose that $\Omega$ is a finite or countable union of sets in $\mathcal{P}$. An integrable function $f$ is a version of $E[X\|\mathcal{G}]$ if it is measurable $\mathcal{G}$ and if
$$\int_G f\,dP = \int_G X\,dP$$
holds for all $G$ in $\mathcal{P}$.

PROOF. Apply part (iii) of Example 16.9 to $f$ and $X$ on $(\Omega, \mathcal{G}, P)$. ∎

In most applications it is clear that $\Omega \in \mathcal{P}$. All the equalities and inequalities in the following theorem hold with probability 1.
Theorem 34.2. Suppose that $X$, $Y$, $X_n$ are integrable.
(i) If $X = a$ with probability 1, then $E[X\|\mathcal{G}] = a$.
(ii) For constants $a$ and $b$, $E[aX + bY\|\mathcal{G}] = aE[X\|\mathcal{G}] + bE[Y\|\mathcal{G}]$.
(iii) If $X \le Y$ with probability 1, then $E[X\|\mathcal{G}] \le E[Y\|\mathcal{G}]$.
(iv) $|E[X\|\mathcal{G}]| \le E[|X|\,\|\mathcal{G}]$.
(v) If $\lim_n X_n = X$ with probability 1, $|X_n| \le Y$, and $Y$ is integrable, then $\lim_n E[X_n\|\mathcal{G}] = E[X\|\mathcal{G}]$ with probability 1.

PROOF. If $X = a$ with probability 1, the function identically equal to $a$ satisfies conditions (i) and (ii) in the definition of $E[X\|\mathcal{G}]$, and so (i) above follows by uniqueness. As for (ii), $aE[X\|\mathcal{G}] + bE[Y\|\mathcal{G}]$ is integrable and measurable $\mathcal{G}$, and
$$\int_G (aE[X\|\mathcal{G}] + bE[Y\|\mathcal{G}])\,dP = a\int_G E[X\|\mathcal{G}]\,dP + b\int_G E[Y\|\mathcal{G}]\,dP = a\int_G X\,dP + b\int_G Y\,dP = \int_G (aX + bY)\,dP$$
for all $G$ in $\mathcal{G}$, so that this function satisfies the functional equation.

If $X \le Y$ with probability 1, then $\int_G (E[Y\|\mathcal{G}] - E[X\|\mathcal{G}])\,dP = \int_G (Y - X)\,dP \ge 0$ for all $G$ in $\mathcal{G}$. Since $E[Y\|\mathcal{G}] - E[X\|\mathcal{G}]$ is measurable $\mathcal{G}$, it must be nonnegative with probability 1 (consider the set $G$ where it is negative). This proves (iii), which clearly implies (iv) as well as the fact that $E[X\|\mathcal{G}] = E[Y\|\mathcal{G}]$ if $X = Y$ with probability 1.

To prove (v), consider $Z_n = \sup_{k \ge n}|X_k - X|$. Now $Z_n \downarrow 0$ with probability 1, and by (ii), (iii), and (iv), $|E[X_n\|\mathcal{G}] - E[X\|\mathcal{G}]| \le E[Z_n\|\mathcal{G}]$. It suffices, therefore, to show that $E[Z_n\|\mathcal{G}] \downarrow 0$ with probability 1. By (iii) the sequence $E[Z_n\|\mathcal{G}]$ is nonincreasing and hence has a limit $Z$; the problem is to prove that $Z = 0$ with probability 1, or, $Z$ being nonnegative, that $E[Z] = 0$. But $0 \le Z_n \le 2Y$, and so (34.1) and the dominated convergence theorem give $E[Z] \le \int E[Z_n\|\mathcal{G}]\,dP = E[Z_n] \to 0$. ∎
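For a finite partition, the defining functional equation of $E[X\|\mathcal{G}]$ can be checked exactly. An illustrative sketch (the space and the helper `cond_exp` are invented, not from the text):

```python
# Sketch (not from the text): for a finite partition, a version of E[X || G]
# is the cell average, and it satisfies ∫_G E[X||G] dP = ∫_G X dP for every
# G in the sigma-field generated by the cells.
from fractions import Fraction
from itertools import chain, combinations

omega = range(6)
p = {w: Fraction(1, 6) for w in omega}
X = {w: Fraction(w * w) for w in omega}
cells = [{0, 1}, {2, 3}, {4, 5}]

def cond_exp(X, cells, p):
    e = {}
    for B in cells:
        avg = sum(X[w] * p[w] for w in B) / sum(p[w] for w in B)
        for w in B:
            e[w] = avg                     # constant on each cell
    return e

E = cond_exp(X, cells, p)
# check the functional equation on every union of cells
for idx in chain.from_iterable(combinations(range(3), r) for r in range(4)):
    G = set().union(*(cells[i] for i in idx)) if idx else set()
    assert sum(E[w] * p[w] for w in G) == sum(X[w] * p[w] for w in G)
```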
The properties (33.21) through (33.28) can be derived anew from Theorem 34.2. Part (ii) shows once again that $E[\sum_i a_i I_{A_i}\|\mathcal{G}] = \sum_i a_i P[A_i\|\mathcal{G}]$ for simple functions. If $X$ is measurable $\mathcal{G}$, then clearly $E[X\|\mathcal{G}] = X$ with probability 1. The following generalization of this is used constantly. For an observer with the information in $\mathcal{G}$, $X$ is effectively a constant if it is measurable $\mathcal{G}$:
Theorem 34.3. If $X$ is measurable $\mathcal{G}$, and if $Y$ and $XY$ are integrable, then
$$E[XY\|\mathcal{G}] = XE[Y\|\mathcal{G}] \tag{34.2}$$
with probability 1.

PROOF. It will be shown first that the right side of (34.2) is a version of the left side if $X = I_{G_0}$ and $G_0 \in \mathcal{G}$. Since $I_{G_0}E[Y\|\mathcal{G}]$ is certainly measurable $\mathcal{G}$, it suffices to show that it satisfies the functional equation $\int_G I_{G_0}E[Y\|\mathcal{G}]\,dP = \int_G I_{G_0}Y\,dP$, $G \in \mathcal{G}$. But this reduces to $\int_{G \cap G_0} E[Y\|\mathcal{G}]\,dP = \int_{G \cap G_0} Y\,dP$, which holds by the definition of $E[Y\|\mathcal{G}]$. Thus (34.2) holds if $X$ is the indicator of an element of $\mathcal{G}$.

It follows by Theorem 34.2(ii) that (34.2) holds if $X$ is a simple function measurable $\mathcal{G}$. For the general $X$ that is measurable $\mathcal{G}$, there exist simple functions $X_n$, measurable $\mathcal{G}$, such that $|X_n| \le |X|$ and $\lim_n X_n = X$ (Theorem 13.5). Since $|X_nY| \le |XY|$ and $|XY|$ is integrable, Theorem 34.2(v) implies that $\lim_n E[X_nY\|\mathcal{G}] = E[XY\|\mathcal{G}]$ with probability 1. But $E[X_nY\|\mathcal{G}] = X_nE[Y\|\mathcal{G}]$ by the case already treated, and of course $\lim_n X_nE[Y\|\mathcal{G}] = XE[Y\|\mathcal{G}]$. (Note that $|X_nE[Y\|\mathcal{G}]| = |E[X_nY\|\mathcal{G}]| \le E[|X_nY|\,\|\mathcal{G}] \le E[|XY|\,\|\mathcal{G}]$, so that the limit $XE[Y\|\mathcal{G}]$ is integrable.) Thus (34.2) holds in general. Notice that $X$ has not been assumed integrable. ∎
Example 34.3. Suppose that $X$ and $Y$ are random vectors of dimensions $j$ and $k$, that $X$ is measurable $\mathcal{G}$, and that $Y$ is independent of $\mathcal{G}$. If $J \in \mathcal{B}^{j+k}$ and $f_J(x) = P[(x, Y) \in J]$, then
$$P[(X, Y) \in J\|\mathcal{G}] = f_J(X) \tag{34.3}$$
with probability 1, which generalizes (33.13). To prove it, suppose first that $J = K \times L$ ($K \in \mathcal{B}^j$, $L \in \mathcal{B}^k$); then by (34.2) and Example 33.6, $P[X \in K, Y \in L\|\mathcal{G}] = I_{[X \in K]}P[Y \in L\|\mathcal{G}] = I_K(X)P[Y \in L]$, and this reduces to (34.3). Since the class of $J$ for which (34.3) holds is a $\lambda$-system, it holds for all $J$ in $\mathcal{B}^{j+k}$.

Let $\varphi(x, y)$ be a Borel function on $R^j \times R^k$, and for simplicity assume it is bounded. If $f_\varphi(x) = E[\varphi(x, Y)]$, then
$$E[\varphi(X, Y)\|\mathcal{G}] = f_\varphi(X) \tag{34.4}$$
with probability 1. If $\varphi = I_J$, this is just (34.3). It follows by linearity that (34.4) holds for simple functions, and by Theorem 34.2(v) (and Theorem 13.5) that it holds for the general $\varphi$. ∎

Taking a conditional expected value can be thought of as an averaging or smoothing operation. This leads one to expect that averaging $X$ with respect
to $\mathcal{G}_2$, and then averaging the result with respect to a coarser (smaller) $\sigma$-field $\mathcal{G}_1$, should lead to the same result as would averaging with respect to $\mathcal{G}_1$ in the first place:

Theorem 34.4. If $X$ is integrable and the $\sigma$-fields $\mathcal{G}_1$ and $\mathcal{G}_2$ satisfy $\mathcal{G}_1 \subset \mathcal{G}_2$, then
$$E\big[E[X\|\mathcal{G}_2]\,\big\|\,\mathcal{G}_1\big] = E[X\|\mathcal{G}_1] \tag{34.5}$$
with probability 1.

PROOF. The left side of (34.5) is measurable $\mathcal{G}_1$, and so to prove that it is a version of $E[X\|\mathcal{G}_1]$, it is enough to verify $\int_G E[E[X\|\mathcal{G}_2]\|\mathcal{G}_1]\,dP = \int_G X\,dP$ for $G \in \mathcal{G}_1$. But if $G \in \mathcal{G}_1$, then $G \in \mathcal{G}_2$, and the left side here is $\int_G E[X\|\mathcal{G}_2]\,dP = \int_G X\,dP$. ∎

If $\mathcal{G}_2 = \mathcal{F}$, then $E[X\|\mathcal{G}_2] = X$, so that (34.5) is trivial. If $\mathcal{G}_1 = \{\emptyset, \Omega\}$ and $\mathcal{G}_2 = \mathcal{G}$, then (34.5) becomes
$$E\big[E[X\|\mathcal{G}]\big] = E[X], \tag{34.6}$$
the special case of (34.1) for $G = \Omega$. If $\mathcal{G}_1 \subset \mathcal{G}_2$, then $E[X\|\mathcal{G}_1]$, being measurable $\mathcal{G}_1$, is also measurable $\mathcal{G}_2$, so that taking an expected value with respect to $\mathcal{G}_2$ does not alter it: $E[E[X\|\mathcal{G}_1]\|\mathcal{G}_2] = E[X\|\mathcal{G}_1]$. Therefore, if $\mathcal{G}_1 \subset \mathcal{G}_2$, taking iterated expected values in either order gives $E[X\|\mathcal{G}_1]$.

The remaining result of a general sort needed here is Jensen's inequality for conditional expected values: If $\varphi$ is a convex function on the line and $X$ and $\varphi(X)$ are both integrable, then
$$\varphi(E[X\|\mathcal{G}]) \le E[\varphi(X)\|\mathcal{G}] \tag{34.7}$$
with probability 1. For each $x_0$ take a support line [A33] through $(x_0, \varphi(x_0))$: $\varphi(x_0) + \lambda(x_0)(x - x_0) \le \varphi(x)$. The slope $\lambda(x_0)$ can be taken as the right-hand derivative of $\varphi$, so that it is nondecreasing in $x_0$. Now
$$\varphi(E[X\|\mathcal{G}]) + \lambda(E[X\|\mathcal{G}])(X - E[X\|\mathcal{G}]) \le \varphi(X).$$
Suppose that $E[X\|\mathcal{G}]$ is bounded. Then all three terms here are integrable (if $\varphi$ is convex on $R^1$, then $\varphi$ and $\lambda$ are bounded on bounded sets), and taking expected values with respect to $\mathcal{G}$ and using (34.2) on the middle term gives (34.7).

To prove (34.7) in general, let $G_n = [\,|E[X\|\mathcal{G}]| \le n\,]$. Then $E[I_{G_n}X\|\mathcal{G}] = I_{G_n}E[X\|\mathcal{G}]$ is bounded, and so (34.7) holds for $I_{G_n}X$: $\varphi(I_{G_n}E[X\|\mathcal{G}]) \le$
SECTION
34.
CONDITIONAL EXPECTATION
471
$E[\varphi(I_{G_n}X)\|\mathcal{G}]$. Now $E[\varphi(I_{G_n}X)\|\mathcal{G}] = E[I_{G_n}\varphi(X) + I_{G_n^c}\varphi(0)\|\mathcal{G}] = I_{G_n}E[\varphi(X)\|\mathcal{G}] + I_{G_n^c}\varphi(0) \to E[\varphi(X)\|\mathcal{G}]$. Since $\varphi(I_{G_n}E[X\|\mathcal{G}])$ converges to $\varphi(E[X\|\mathcal{G}])$ by the continuity of $\varphi$, (34.7) follows. If $\varphi(x) = |x|$, (34.7) gives part (iv) of Theorem 34.2 again.
Conditional Distributions and Expectations
Theorem 34.5. Let $\mu(\cdot, \omega)$ be a conditional distribution with respect to $\mathcal{G}$ of a random variable $X$, in the sense of Theorem 33.3. If $\varphi\colon R^1 \to R^1$ is a Borel function for which $\varphi(X)$ is integrable, then $\int_{R^1}\varphi(x)\,\mu(dx, \omega)$ is a version of $E[\varphi(X)\|\mathcal{G}]_\omega$.

PROOF. If $\varphi = I_H$ and $H \in \mathcal{B}^1$, this is an immediate consequence of the definition of conditional distribution, and by Theorem 34.2(ii) it follows for $\varphi$ a simple function over $R^1$. For the general nonnegative $\varphi$, choose simple $\varphi_n$ such that $0 \le \varphi_n(x) \uparrow \varphi(x)$ for each $x$ in $R^1$. By the case already treated, $\int_{R^1}\varphi_n(x)\,\mu(dx, \omega)$ is a version of $E[\varphi_n(X)\|\mathcal{G}]_\omega$. The integral converges by the monotone convergence theorem in $(R^1, \mathcal{B}^1, \mu(\cdot, \omega))$ to $\int_{R^1}\varphi(x)\,\mu(dx, \omega)$ for each $\omega$, the value $+\infty$ not excluded, and $E[\varphi_n(X)\|\mathcal{G}]_\omega$ converges to $E[\varphi(X)\|\mathcal{G}]_\omega$ with probability 1 by Theorem 34.2(v). Thus the result holds for nonnegative $\varphi$, and the general case follows from splitting into positive and negative parts. ∎

It is a consequence of the proof above that $\int_{R^1}\varphi(x)\,\mu(dx, \omega)$ is measurable $\mathcal{G}$ and finite with probability 1. If $X$ is itself integrable, it follows by the theorem for the case $\varphi(x) = x$ that
$$E[X\|\mathcal{G}]_\omega = \int_{-\infty}^{\infty} x\,\mu(dx, \omega) \tag{34.8}$$
with probability 1. If $\varphi(X)$ is integrable as well, then
$$E[\varphi(X)\|\mathcal{G}]_\omega = \int_{-\infty}^{\infty} \varphi(x)\,\mu(dx, \omega)$$
with probability 1. By Jensen's inequality (21.14) for unconditional expected values, the right side here is at least $\varphi\big(\int_{-\infty}^{\infty} x\,\mu(dx, \omega)\big)$ if $\varphi$ is convex. This gives another proof of (34.7).

Sufficient Subfields*
Suppose that for each $\theta$ in an index set $\Theta$, $P_\theta$ is a probability measure on $(\Omega, \mathcal{F})$. In statistics the problem is to draw inferences about the unknown parameter $\theta$ from an observation $\omega$.

* This topic may be omitted.
472
DERIVATIVES AND CONDITIONAL PROBABILITY
Denote by P9 [ A 11�1 and E9 [ XII�] conditional probabilities and expected values calculated with respect to the probability measure P9 on ( 0 , '> · A a-field � in 9i" is sufficient for the family [ P9 : 8 e 8 ] if versions P9 [ A 11�1 can be chosen that are independent of 8- that is, if there exists a function p ( A , w ) of A e � and w e 0 such that, for each A e !F and 8 e 8 , p ( A , · ) is a version of P9 [ A II� ] . There is no requirement that p ( · , w ) be a measure for w fixed. The idea is that although there may be information in !F not already contained in �, this information is irrelevant to the drawing of inferences about 8 .t A sufficient statistic is a random variable or random vector T such that a( T) is a sufficient subfield. A family Jt of measures dominates another family .Y if, for each A , from p.( A ) == 0 for all p. in J/, it follows that J�(A) == 0 for all J' in %. If each of Jl and .Y dominates the other, they are equivalent. For sets consisting of a single measure these are the concepts introduced in Section 32.
Theorem 34.6. Suppose that [ P9 : 8 e 8 ] is dominated by the a-finite measure p.. A
necessary and sufficient condition that � be sufficient is that the density f9 of P9 with respect to p. can be put in the form f9 == g9 h for a g9 measurable �. It is assumed throughout that g9 and h are nonnegative and of course that h is measurable !F. Theorem 34.6 is called the factorization theorem , the condition being that the density f9 splits into a factor depending on w only through � and a factor independent of 8. Although g9 and h are not assumed integrable p., their product f9 , as the density of a finite measure, must be. Before proceeding to the proof, consider an application. Ex1111le 1p 34.4. Let ( 0 , '> == ( R k , &f k ) and for 8 > 0 let P9 be the measure having with respect to k-dimensional Lebesgue measure the density if 0 � X; < 8 , i == 1 , . . . , k , otherwise. If X; is the function on R k defined by X; ( x) == X; , then under P9 , X1 , . . . , Xk are independent random variables, each uniformly distributed over [0, 8 ]. Let T( x) == max; s k X; ( x ). If g9 (t) is 9 - k for 0 < t < 8 and 0 otherwise, and if h ( x ) is 1 or 0 according as all x; are nonnegative or not, then f9 ( x) == g9 ( T( x)) h ( x ). The factorization criterion is thus satisfied, and T is a sufficient statistic. Sufficiency is clear on intuitive grounds as well : 8 is not involved in the conditional distribution of X1 , , Xk given T because, roughly speaking, a random one of them equals T and the others are independent and uniform over [0, T ]. If this is true, the distribution of X; given T ought to have a mass of k - 1 at T and a uniform distribution of mass 1 - k - 1 over [0, T], so that • • •
( 34 .9) t See Problem 34.20.
1 E, [ X; II T ] == k T + k -k I 2T == k 2+k l T.
34. CONDITIONAL EXPECTATION For a proof of this fact, needed later, note that by (21.9) SECTION
1
(34.10)
Ts. t
X; dP9 -
10
«>
473
P9 ( T .S t, X; � u ) du
tk 1 11 t - u ( -t ) k - 1 du == k == 0
8
8
+
--
28
if 0 s t s 8. On the otherk hand, P9 [ T s t ] == ( tj8 ) k , so that under P9 the distribu tion of T has density kt - 1 j8 k over [0, 8 ]. Thus (34.11 )
1
k + 1 T dP9 == k + 1 ' u k - 1 == t k + 1 . uk k du 2 k 2 k 0 Ts. t 8 28 k
1
Since (34.10) and (34.11) agree, (34.9) follows by Theorem 34.1.
•
The essential ideas in the proof of Theorem 34.6 are most easily understood through a preliminary consideration of special cases. For the sufficiency of the condition, suppose first that $h$ is integrable $\mu$.

Lemma 1. Suppose that $[P_\theta\colon \theta \in \Theta]$ is dominated by a measure $\mu$ and that each $P_\theta$ has with respect to $\mu$ a density $f_\theta = g_\theta h$, where $g_\theta$ is measurable $\mathcal{G}$ and $h$ is integrable $\mu$. Then $\mathcal{G}$ is sufficient.

PROOF. Put $a = \int h\,d\mu$ and replace $g_\theta$ by $g_\theta a$ and $h$ by $h/a$. This shows that it is no restriction to assume that $\int h\,d\mu = 1$. Define $P$ on $\mathcal{F}$ by $P(A) = \int_A h\,d\mu$, so that $P_\theta(A) = \int_A g_\theta\,dP$. For $G$ in $\mathcal{G}$, (34.2) gives
$$\int_G P[A\|\mathcal{G}]\,dP_\theta = \int_G P[A\|\mathcal{G}]\,g_\theta\,dP = \int_G E[I_A g_\theta\|\mathcal{G}]\,dP = \int_G I_A g_\theta\,dP = P_\theta(A \cap G).$$
Therefore, $P[A\|\mathcal{G}]$ (the conditional probability calculated with respect to $P$) serves as a version of $P_\theta[A\|\mathcal{G}]$ for each $\theta$ in $\Theta$. Thus $\mathcal{G}$ is sufficient for the family $[P_\theta\colon \theta \in \Theta]$, even for this family augmented by $P$ (which might happen to lie in the family to start with). ∎

For the necessity, suppose first that the family is dominated by one of its members.

Lemma 2. Suppose that $[P_\theta\colon \theta \in \Theta]$ is dominated not only by $\mu$ but by $P_{\theta_0}$ for some $\theta_0 \in \Theta$. If $\mathcal{G}$ is sufficient, then each $P_\theta$ has with respect to $\mu$ a density $f_\theta = g_\theta h$, where $g_\theta$ is measurable $\mathcal{G}$ and $h$ is integrable $\mu$.
PROOF. Let $d_\theta$ be any density of $P_\theta$ with respect to $P_{\theta_0}$. By a number of applications of (34.2),
$$\int_A E_{\theta_0}[d_\theta\|\mathcal{G}]\,dP_{\theta_0} = \int E_{\theta_0}\big\{I_A E_{\theta_0}[d_\theta\|\mathcal{G}]\,\big\|\,\mathcal{G}\big\}\,dP_{\theta_0} = \int E_{\theta_0}\{I_A\|\mathcal{G}\}\,E_{\theta_0}[d_\theta\|\mathcal{G}]\,dP_{\theta_0} = \int E_{\theta_0}\big[E_{\theta_0}\{I_A\|\mathcal{G}\}\,d_\theta\,\big\|\,\mathcal{G}\big]\,dP_{\theta_0} = \int E_{\theta_0}\{I_A\|\mathcal{G}\}\,d_\theta\,dP_{\theta_0} = \int P_{\theta_0}[A\|\mathcal{G}]\,dP_\theta = \int P_\theta[A\|\mathcal{G}]\,dP_\theta = P_\theta(A),$$
the next-to-last equality by sufficiency. Thus $g_\theta = E_{\theta_0}[d_\theta\|\mathcal{G}]$, which is measurable $\mathcal{G}$, can serve as a density for $P_\theta$ with respect to $P_{\theta_0}$. Take for $h$ a density for $P_{\theta_0}$ with respect to $\mu$. ∎
To complete the proof of Theorem 34.6 requires one more lemma of a technical sort.

Lemma 3. If $[P_\theta\colon \theta \in \Theta]$ is dominated by a $\sigma$-finite measure, then it is equivalent to some countable subfamily.

PROOF. If $\mu$ is $\sigma$-finite, there is a countable partition of $\Omega$ into $\mathcal{F}$-sets $A_n$ such that $0 < \mu(A_n) < \infty$. The measure with value $\sum_n 2^{-n}\mu(A \cap A_n)/\mu(A_n)$ at $A$ is finite and equivalent to $\mu$. In proving the lemma it is therefore no restriction to assume that the family is dominated by a finite measure $\mu$.

Each $P_\theta$ is dominated by $\mu$ and hence has a density $f_\theta$ with respect to it. Let $S_\theta = [\omega\colon f_\theta(\omega) > 0]$. Then $P_\theta(A) = P_\theta(A \cap S_\theta)$ for all $A$, and $P_\theta(A) = 0$ if and only if $\mu(A \cap S_\theta) = 0$. In particular, $S_\theta$ supports $P_\theta$.

Call a set $B$ in $\mathcal{F}$ a kernel if $B \subset S_\theta$ for some $\theta$, and call a finite or countable union of kernels a chain. Let $\alpha$ be the supremum of $\mu(C)$ over chains $C$. Since $\mu$ is finite and a countable union of chains is a chain, $\alpha$ is finite and $\mu(C) = \alpha$ for some chain $C$. Suppose that $C = \bigcup_n B_n$, where each $B_n$ is a kernel, and suppose that $B_n \subset S_{\theta_n}$. The problem is to show that $[P_\theta\colon \theta \in \Theta]$ is dominated by $[P_{\theta_n}\colon n = 1, 2, \ldots]$ and hence equivalent to it.

Suppose that $P_{\theta_n}(A) = 0$ for all $n$. Then $\mu(A \cap S_{\theta_n}) = 0$, as observed above. Since $C \subset \bigcup_n S_{\theta_n}$, $\mu(A \cap C) = 0$, and it follows that $P_\theta(A \cap C) = 0$ whatever $\theta$ may be. But suppose that $P_\theta(A - C) > 0$. Then $P_\theta((A - C) \cap S_\theta) = P_\theta(A - C)$ is positive, and so $(A - C) \cap S_\theta$ is a kernel, disjoint from $C$, of positive $\mu$-measure; this is impossible because of the maximality of $C$. Thus $P_\theta(A - C)$ is 0 along with $P_\theta(A \cap C)$, and so $P_\theta(A) = 0$. ∎

Suppose that $[P_\theta\colon \theta \in \Theta]$ is dominated by a $\sigma$-finite $\mu$, as in Theorem 34.6, and let $[P_{\theta_n}\colon n = 1, 2, \ldots]$ be the sequence whose existence is guaranteed by Lemma 3. Define a probability measure $P$ on $\mathcal{F}$ by
$$P = \sum_n 2^{-n}P_{\theta_n}. \tag{34.12}$$
SECTION 34. CONDITIONAL EXPECTATION  475
Clearly, P is equivalent to [P_{θ_n}: n = 1, 2, …] and hence to [P_θ: θ ∈ Θ], and μ dominates P.

PROOF OF SUFFICIENCY IN THEOREM 34.6. If each P_θ has density g_θ h with respect to μ, then by the construction (34.12), P has density fh with respect to μ, where f = Σ_n 2^{−n} g_{θ_n}. Put r_θ = g_θ/f if f > 0 and r_θ = 0 (say) if f = 0. If each g_θ is measurable 𝒢, the same is true of f and hence of the r_θ. Since P[f = 0] = 0 and P is equivalent to the entire family, P_θ[f = 0] = 0 for all θ. Therefore,

    ∫_A r_θ dP = ∫_A r_θ fh dμ = ∫_{A∩[f>0]} r_θ fh dμ = ∫_{A∩[f>0]} g_θ h dμ = P_θ(A ∩ [f > 0]) = P_θ(A).
Each P_θ thus has with respect to the probability measure P a density measurable 𝒢, and it follows by Lemma 1 for μ = P and h ≡ 1 that 𝒢 is sufficient. ∎

PROOF OF NECESSITY IN THEOREM 34.6. Let p(A, ω) be a function such that, for each A and θ, p(A, ·) is a version of P_θ[A ‖ 𝒢], as required by the definition of sufficiency. For P as in (34.12) and G ∈ 𝒢,

    ∫_G p(A, ω) P(dω) = Σ_n 2^{−n} ∫_G p(A, ω) P_{θ_n}(dω) = Σ_n 2^{−n} ∫_G P_{θ_n}[A ‖ 𝒢] dP_{θ_n} = Σ_n 2^{−n} P_{θ_n}(A ∩ G) = P(A ∩ G).
Thus p(A, ·) serves as a version of P[A ‖ 𝒢] as well, and 𝒢 is still sufficient if P is added to the family. Since P dominates the augmented family, Lemma 2 gives the required factorization. ∎

Minimum-Variance Estimation*
To illustrate sufficiency, suppose that Θ is a subset of the line. An estimate of θ is a random variable Z, and the estimate is unbiased if E_θ[Z] = θ for all θ. A measure of the accuracy of the estimate Z is E_θ[(Z − θ)^2]. If 𝒢 is sufficient, it follows by linearity (Theorem 34.2(ii)) that E_θ[X ‖ 𝒢] has for X simple a version that is independent of θ. Since there are simple X_n such that |X_n| ≤ |X| and X_n → X, the same is true of any X that is integrable with respect to each P_θ (use Theorem 34.2(v)). Suppose that 𝒢 is, in fact, sufficient, and denote by E[X ‖ 𝒢] a version of E_θ[X ‖ 𝒢] that is independent of θ.

Theorem 34.7. Suppose that E_θ[(Z − θ)^2] < ∞ for all θ and that 𝒢 is sufficient. Then

(34.13)    E_θ[(E[Z ‖ 𝒢] − θ)^2] ≤ E_θ[(Z − θ)^2]

for all θ. If Z is unbiased, then so is E[Z ‖ 𝒢].

* This topic may be omitted.
476  DERIVATIVES AND CONDITIONAL PROBABILITY
PROOF. By Jensen's inequality (34.7) for φ(x) = (x − θ)^2, (E[Z ‖ 𝒢] − θ)^2 ≤ E_θ[(Z − θ)^2 ‖ 𝒢]. Applying E_θ to each side gives (34.13). The second statement follows from the fact that E_θ[E[Z ‖ 𝒢]] = E_θ[Z]. ∎
This, the Rao–Blackwell theorem, says that E[Z ‖ 𝒢] is at least as good an estimate as Z if 𝒢 is sufficient.

Example 34.5. Returning to Example 34.4, note that each X_i has mean θ/2 under P_θ, so that if X̄ = k^{−1} Σ_{i=1}^k X_i is the sample mean, then 2X̄ is an unbiased estimate of θ. But there is a better one. By (34.9), E_θ[2X̄ ‖ T] = (k + 1)T/k = T′, and by the Rao–Blackwell theorem, T′ is an unbiased estimate with variance at most that of 2X̄. In fact, for an arbitrary unbiased estimate Z, E_θ[(T′ − θ)^2] ≤ E_θ[(Z − θ)^2]. To prove this, let B = T′ − E[Z ‖ T]. By Theorem 20.1(ii), B = f(T) for some Borel function f, and E_θ[f(T)] = 0 for all θ. Taking account of the density for T leads to the conclusion that f(x)x^{k−1} integrates to 0 over every interval (0, θ]. Therefore, f(x)x^{k−1}, and with it f(x), vanishes for x > 0 except on a set of Lebesgue measure 0, and hence P_θ[f(T) = 0] = 1 and P_θ[T′ = E[Z ‖ T]] = 1 for all θ. Therefore, E_θ[(T′ − θ)^2] = E_θ[(E[Z ‖ T] − θ)^2] ≤ E_θ[(Z − θ)^2] for Z unbiased, and T′ has minimum variance among all unbiased estimates of θ. ∎
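The variance reduction in Example 34.5 can be seen numerically. The following is a small simulation sketch of ours (not part of the text): it draws samples of k uniforms on (0, θ], compares the mean-squared errors of the two unbiased estimates 2X̄ and T′ = (k + 1)T/k, and should show T′ winning by roughly the theoretical factor (k + 2)/3.

```python
import random

def simulate_mse(k=10, theta=1.0, trials=20000, seed=7):
    """Compare the unbiased estimators 2*Xbar and T' = (k+1)T/k for
    a sample of k uniforms on (0, theta], where T = max(X_1,...,X_k)."""
    rng = random.Random(seed)
    mse_mean, mse_rb = 0.0, 0.0
    for _ in range(trials):
        xs = [rng.uniform(0.0, theta) for _ in range(k)]
        est1 = 2.0 * sum(xs) / k        # 2 * sample mean
        est2 = (k + 1) * max(xs) / k    # Rao-Blackwellized estimate T'
        mse_mean += (est1 - theta) ** 2
        mse_rb += (est2 - theta) ** 2
    return mse_mean / trials, mse_rb / trials

mse_mean, mse_rb = simulate_mse()
# Theory: Var(2*Xbar) = theta^2/(3k), Var(T') = theta^2/(k(k+2))
print(mse_mean, mse_rb)
```

With k = 10 and θ = 1 the theoretical variances are 1/30 ≈ 0.033 and 1/120 ≈ 0.0083, so the simulated mean-squared errors should land near those values.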
PROBLEMS

34.1. Work out for conditional expected values the analogue of Problem 33.5.

34.2. In the context of Examples 33.5 and 33.12, show that the conditional expected value of Y (if it is integrable) given X is g(X), where

    g(x) = (∫_{−∞}^{∞} y f(x, y) dy) / (∫_{−∞}^{∞} f(x, y) dy).
34.3. Show that the independence of X and Y implies that E[Y ‖ X] = E[Y], which in turn implies that E[XY] = E[X]E[Y]. Show by examples in an Ω of three points that the reverse implications are both false.

34.4. (a) Let B be an event with P(B) > 0, and define a probability measure P_0 by P_0(A) = P(A|B). Show that P_0[A ‖ 𝒢] = P[A ∩ B ‖ 𝒢]/P[B ‖ 𝒢] on a set of P_0-measure 1.
(b) Suppose that ℋ is generated by a partition B_1, B_2, …, and let 𝒢 ∨ ℋ = σ(𝒢 ∪ ℋ). Show that with probability 1,

    P[A ‖ 𝒢 ∨ ℋ] = Σ_n I_{B_n} P[A ∩ B_n ‖ 𝒢]/P[B_n ‖ 𝒢].
34.5. The equation (34.5) was proved by showing that the left side is a version of the right side. Prove it by showing that the right side is a version of the left side.
34.6. Prove for bounded X and Y that E[Y E[X ‖ 𝒢]] = E[X E[Y ‖ 𝒢]].

34.7. 33.14 ↑ Generalize Theorem 34.5 by replacing X with a random vector.

34.8. Assume that X is nonnegative but not necessarily integrable. Show that it is still possible to define a nonnegative random variable E[X ‖ 𝒢], measurable 𝒢, such that (34.1) holds. Prove versions of the monotone convergence theorem and Fatou's lemma.

34.9. (a) Show for nonnegative X that E[X ‖ 𝒢] = ∫_0^∞ P[X > t ‖ 𝒢] dt with probability 1.
(b) Generalize Markov's inequality: P[|X| ≥ α ‖ 𝒢] ≤ α^{−k} E[|X|^k ‖ 𝒢] with probability 1.
(c) Similarly generalize Chebyshev's and Hölder's inequalities.
34.10. (a) Show that, if 𝒢_1 ⊂ 𝒢_2 and E[X^2] < ∞, then E[(X − E[X ‖ 𝒢_2])^2] ≤ E[(X − E[X ‖ 𝒢_1])^2]. The dispersion of X about its conditional mean becomes smaller as the σ-field grows.
(b) Define Var[X ‖ 𝒢] = E[(X − E[X ‖ 𝒢])^2 ‖ 𝒢]. Prove that Var[X] = E[Var[X ‖ 𝒢]] + Var[E[X ‖ 𝒢]].

34.11. Let 𝒢_1, 𝒢_2, 𝒢_3 be σ-fields in ℱ, let 𝒢_{ij} be the σ-field generated by 𝒢_i ∪ 𝒢_j, and let A_i be the generic set in 𝒢_i. Consider three conditions:
(i) P[A_3 ‖ 𝒢_{12}] = P[A_3 ‖ 𝒢_2] for all A_3.
(ii) P[A_1 ∩ A_3 ‖ 𝒢_2] = P[A_1 ‖ 𝒢_2] P[A_3 ‖ 𝒢_2] for all A_1 and A_3.
(iii) P[A_1 ‖ 𝒢_{23}] = P[A_1 ‖ 𝒢_2] for all A_1.
If 𝒢_1, 𝒢_2, and 𝒢_3 are interpreted as descriptions of the past, present, and future, respectively, (i) is a general version of the Markov property: the conditional probability of a future event A_3 given the past and present 𝒢_{12} is the same as the conditional probability given the present 𝒢_2 alone. Condition (iii) is the same with time reversed. And (ii) says that past and future events A_1 and A_3 are conditionally independent given the present 𝒢_2. Prove the three conditions equivalent.
34.12. 33.7 34.11 ↑ Use Example 33.10 to calculate P[N_s = k ‖ N_u, u ≥ t] (s < t) for the Poisson process.
34.13. Let L^2 be the Hilbert space of square-integrable random variables on (Ω, ℱ, P). For 𝒢 a σ-field in ℱ, let M_𝒢 be the subspace of elements of L^2 that are measurable 𝒢. Show that the operator P_𝒢 defined for X ∈ L^2 by P_𝒢 X = E[X ‖ 𝒢] is the perpendicular projection on M_𝒢.

34.14. ↑ Suppose in Problem 34.13 that 𝒢 = σ(Z) for a random variable Z in L^2. Let S_Z be the one-dimensional subspace spanned by Z. Show that S_Z may be much smaller than M_{σ(Z)}, so that E[X ‖ Z] (for X ∈ L^2) is by no means the projection of X on Z. Hint: Take Z the identity function on the unit interval with Lebesgue measure.
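The projection property in Problems 34.13 and 34.14 is easy to verify directly on a finite probability space, where E[X ‖ 𝒢] for 𝒢 generated by a partition is the cell-by-cell average. The following sketch is our own construction (the six-point space and the numbers in it are arbitrary choices); it checks the defining orthogonality relation.

```python
# On a finite probability space, E[X||G] for G generated by a partition
# is the partition-average of X; check the projection (orthogonality)
# property: X - E[X||G] is orthogonal in L^2 to every G-measurable Y.

omega = range(6)
prob = [0.1, 0.2, 0.1, 0.25, 0.15, 0.2]   # P on six points
partition = [{0, 1}, {2, 3, 4}, {5}]      # generates the sigma-field G
X = [3.0, -1.0, 2.0, 0.5, 4.0, -2.0]

def cond_exp(X):
    """E[X||G]: average X over each partition cell."""
    out = [0.0] * len(X)
    for cell in partition:
        pc = sum(prob[w] for w in cell)
        avg = sum(prob[w] * X[w] for w in cell) / pc
        for w in cell:
            out[w] = avg
    return out

def inner(U, V):
    """L^2 inner product E[UV]."""
    return sum(prob[w] * U[w] * V[w] for w in omega)

CX = cond_exp(X)
# Any G-measurable Y is constant on cells; test one such Y.
Y = [1.0, 1.0, -3.0, -3.0, -3.0, 7.0]
residual = [X[w] - CX[w] for w in omega]
print(inner(residual, Y))   # orthogonality: 0 up to rounding
```

The same computation also confirms E[E[X ‖ 𝒢]] = E[X], the case A = Ω of the defining equation (34.1).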
34.15. ↑ Problem 34.13 can be turned around to give an alternative approach to conditional probability and expected value. For a σ-field 𝒢 in ℱ, let P_𝒢 be the perpendicular projection on the subspace M_𝒢. Show that P_𝒢 X has for X ∈ L^2 the two properties required of E[X ‖ 𝒢]. Use this to define E[X ‖ 𝒢] for X ∈ L^2 and then extend it to all integrable X via approximation by random variables in L^2. Now define conditional probability.
34.16. Mixing sequences. A sequence A_1, A_2, … of ℱ-sets in a probability space (Ω, ℱ, P) is mixing with constant α if

(34.14)    lim_n P(A_n ∩ E) = αP(E)

for every E in ℱ.
(a) Show that {A_n} is mixing with constant α if and only if

(34.15)    lim_n ∫_{A_n} X dP = α ∫_Ω X dP

for each integrable X (measurable ℱ).
(b) Suppose that (34.14) holds for E ∈ ℬ, where ℬ is a π-system, Ω ∈ ℬ, and A_n ∈ σ(ℬ) for all n. Show that {A_n} is mixing. Hint: First check (34.15) for X measurable σ(ℬ) and then use conditional expected values with respect to σ(ℬ).
(c) Show that, if P_0 is a probability measure on (Ω, ℱ) and P_0 ≪ P, then mixing is preserved if P is replaced by P_0.
34.17. ↑ Application of mixing to the central limit theorem. Let X_1, X_2, … be random variables on (Ω, ℱ, P), independent and identically distributed with mean 0 and variance σ^2, and put S_n = X_1 + ⋯ + X_n. Then S_n/σ√n ⇒ N by the Lindeberg–Lévy theorem. Show by the steps below that this still holds if P is replaced by any probability measure P_0 on (Ω, ℱ) that P dominates. For example, the central limit theorem applies to the sums Σ_{k=1}^n r_k(ω) of Rademacher functions if ω is chosen according to the uniform density over the unit interval, and this result shows that the same is true if ω is chosen according to an arbitrary density.

Let Y_n = S_n/σ√n and Z_n = (S_n − S_{⌊log n⌋})/σ√n, and take ℬ to consist of the sets of the form [(X_1, …, X_k) ∈ H], k ≥ 1, H ∈ ℛ^k. Prove successively:
(a) P[Y_n ≤ x] → P[N ≤ x].
(b) P(|Y_n − Z_n| ≥ ε) → 0.
(c) P[Z_n ≤ x] → P[N ≤ x].
(d) P(E ∩ [Z_n ≤ x]) → P(E)P[N ≤ x] for E ∈ ℬ.
(e) P(E ∩ [Z_n ≤ x]) → P(E)P[N ≤ x] for E ∈ ℱ.
(f) P_0[Z_n ≤ x] → P[N ≤ x].
(g) P_0[|Y_n − Z_n| ≥ ε] → 0.
(h) P_0[Y_n ≤ x] → P[N ≤ x].

34.18. Let (X, 𝒳, μ) and (Y, 𝒴, ν) be probability measure spaces and let (Ω, ℱ, P) be their product. Let 𝒢 = [A × Y: A ∈ 𝒳] be the σ-field of vertical strips. For a random variable f = f(x, y), put g(x, y) = ∫_Y f(x, z) ν(dz) and show by Fubini's theorem that g is a version of E[f ‖ 𝒢].
34.19. Here is an alternative construction of conditional expectation. It uses the Hahn decomposition and in effect incorporates one of the proofs of the Radon–Nikodym theorem. Suppose that X is nonnegative and integrable. For each nonnegative dyadic rational t, find by the Hahn result (Theorem 32.1) a 𝒢-set A_t such that

    ∫_G X dP ≥ tP(G)    if G ⊂ A_t, G ∈ 𝒢,
    ∫_G X dP ≤ tP(G)    if G ⊂ A_t^c, G ∈ 𝒢.
Show that P(A_v − A_u) = 0 if v > u, and replace A_t by A_t − ⋃_{u<v} (A_v − A_u). Show that it is possible to take A_0 = Ω. Show by the integrability of X that P(⋂_{s>0} A_s) = 0, and replace A_t by A_t − ⋂_{s>0} A_s for t > 0. Thus it can be assumed that the A_t are nonincreasing (over dyadic rational t), that A_0 = Ω, and that ⋂_{t>0} A_t = ∅. Put G_{nk} = A_{k/2^n} − A_{(k+1)/2^n} and consider the successively finer decompositions {G_{nk}: k = 0, 1, …} of Ω. Show that

(34.16)    (k/2^n) P(G ∩ G_{nk}) ≤ ∫_{G∩G_{nk}} X dP ≤ ((k + 1)/2^n) P(G ∩ G_{nk})

for G ∈ 𝒢.
Put f_n(ω) = k/2^n for ω ∈ G_{nk}. If ω ∈ G_{nk}, then ω lies in G_{n+1,2k} or in G_{n+1,2k+1}, and so lim_n f_n(ω) = f(ω) exists, where |f(ω) − f_n(ω)| ≤ 2^{−n}. Clearly, ∫_G f_n dP → ∫_G f dP. If G ∈ 𝒢, then

    |∫_G f_n dP − ∫_G X dP| ≤ Σ_k P(G ∩ G_{nk}) 2^{−n} ≤ 2^{−n}
by (34.16). Hence ∫_G f_n dP → ∫_G X dP, and the 𝒢-measurable function f satisfies ∫_G f dP = ∫_G X dP for G ∈ 𝒢.
34.20. ↑ Suppose that 𝒢 is a sufficient subfield for the family of probability measures P_θ, θ ∈ Θ, on (Ω, ℱ). Suppose that for each θ and A, p(A, ω) is a version of P_θ[A ‖ 𝒢], and suppose further that for each ω, p(·, ω) is a probability measure on ℱ. Define Q_θ on ℱ by Q_θ(A) = ∫_Ω p(A, ω) P_θ(dω), and show that Q_θ = P_θ. The idea is that an observer with the information in 𝒢 (but ignorant of ω itself) in principle knows the values p(A, ω) because each p(A, ·) is measurable 𝒢. If he has the appropriate randomization device, he can draw an ω′ from Ω according to the probability measure p(·, ω), and his ω′ will have the same distribution Q_θ = P_θ that ω has. Thus, whatever the value of the unknown θ, the observer can on the basis of the information in 𝒢 alone, and without knowing ω itself, construct a probabilistic replica of ω.
SECTION 35. MARTINGALES

Definition
Let X_1, X_2, … be a sequence of random variables on a probability space (Ω, ℱ, P), and let ℱ_1, ℱ_2, … be a sequence of σ-fields in ℱ. The sequence {(X_n, ℱ_n): n = 1, 2, …} is a martingale if these four conditions hold:
(i) ℱ_n ⊂ ℱ_{n+1};
(ii) X_n is measurable ℱ_n;
(iii) E[|X_n|] < ∞;
(iv) with probability 1,
(35.1)    E[X_{n+1} ‖ ℱ_n] = X_n.

Alternatively, the sequence X_1, X_2, … is said to be a martingale relative to the σ-fields ℱ_1, ℱ_2, …. Condition (ii) is expressed by saying the X_n are adapted to the ℱ_n.

If X_n represents the fortune of a gambler after the nth play and ℱ_n represents his information about the game at that time, (35.1) says that his expected fortune after the next play is the same as his present fortune. Thus a martingale represents a fair game, and sums of independent random variables with mean 0 give one example. As will be seen below, martingales arise in very diverse connections.

The sequence X_1, X_2, … is defined to be a martingale if it is a martingale relative to some sequence ℱ_1, ℱ_2, …. In this case, the σ-fields 𝔉_n = σ(X_1, …, X_n) always work: Obviously, 𝔉_n ⊂ 𝔉_{n+1} and X_n is measurable 𝔉_n, and if (35.1) holds, then E[X_{n+1} ‖ 𝔉_n] = E[E[X_{n+1} ‖ ℱ_n] ‖ 𝔉_n] = E[X_n ‖ 𝔉_n] = X_n by (34.5). For these special σ-fields 𝔉_n, (35.1) reduces to

(35.2)    E[X_{n+1} ‖ X_1, …, X_n] = X_n.
Since σ(X_1, …, X_n) ⊂ ℱ_n for each n if and only if each X_n is measurable ℱ_n, the σ(X_1, …, X_n) are the smallest σ-fields with respect to which the X_n are a martingale.

The essential condition is embodied in (35.1) and in its specialization (35.2). Condition (iii) is of course needed to ensure that E[X_{n+1} ‖ ℱ_n] exists. Condition (iv) says that X_n is a version of E[X_{n+1} ‖ ℱ_n]; since X_n is measurable ℱ_n, the requirement reduces to

(35.3)    ∫_A X_{n+1} dP = ∫_A X_n dP,    A ∈ ℱ_n.
Since the ℱ_n are nested, A ∈ ℱ_n implies that ∫_A X_n dP = ∫_A X_{n+1} dP = ⋯ = ∫_A X_{n+k} dP. Therefore, X_n, being measurable ℱ_n, is a version of E[X_{n+k} ‖ ℱ_n]:
(35.4)    E[X_{n+k} ‖ ℱ_n] = X_n

with probability 1 for k ≥ 1. Note that for A = Ω, (35.3) gives
(35.5)    E[X_1] = E[X_2] = ⋯.

The defining conditions for a martingale can also be given in terms of the differences

(35.6)    Δ_n = X_n − X_{n−1}

(Δ_1 = X_1). By linearity, (35.1) is the same thing as

(35.7)    E[Δ_{n+1} ‖ ℱ_n] = 0.

Note that, since X_k = Δ_1 + ⋯ + Δ_k and Δ_k = X_k − X_{k−1}, the random variables X_1, …, X_n and Δ_1, …, Δ_n generate the same σ-field:
(35.8)    σ(X_1, …, X_n) = σ(Δ_1, …, Δ_n).

Example 35.1. Let Δ_1, Δ_2, … be independent, integrable random variables such that E[Δ_n] = 0 for n ≥ 2. If ℱ_n is the σ-field (35.8), then by independence E[Δ_{n+1} ‖ ℱ_n] = E[Δ_{n+1}] = 0. If Δ is another random variable, independent of the Δ_n, and if ℱ_n is replaced by σ(Δ, Δ_1, …, Δ_n), then the X_n = Δ_1 + ⋯ + Δ_n are still a martingale relative to the ℱ_n. It is natural and convenient in the theory to allow σ-fields ℱ_n larger than the minimal ones (35.8). ∎
Example 35.2. Let (Ω, ℱ, P) be a probability space, let ν be a finite measure on ℱ, and let ℱ_1, ℱ_2, … be a nondecreasing sequence of σ-fields in ℱ. Suppose that P dominates ν when both are restricted to ℱ_n; that is, suppose that A ∈ ℱ_n and P(A) = 0 together imply that ν(A) = 0. There is then a density or Radon–Nikodym derivative X_n of ν with respect to P when both are restricted to ℱ_n; X_n is a function that is measurable ℱ_n and integrable with respect to P, and it satisfies

(35.9)    ν(A) = ∫_A X_n dP,    A ∈ ℱ_n.
If A ∈ ℱ_n, then A ∈ ℱ_{n+1} as well, so that ∫_A X_{n+1} dP = ν(A); this and (35.9) give (35.3). Thus the X_n are a martingale with respect to the ℱ_n. ∎

Example 35.3. For a specialization of the preceding example, let P be Lebesgue measure on the σ-field ℱ of Borel subsets of Ω = (0, 1], and let ℱ_n be the finite σ-field generated by the partition of Ω into dyadic intervals (k2^{−n}, (k + 1)2^{−n}], 0 ≤ k < 2^n. If A ∈ ℱ_n and P(A) = 0, then A is empty. Hence P dominates every finite measure ν on ℱ_n. The Radon–Nikodym derivative is

(35.10)    X_n(ω) = 2^n ν((k2^{−n}, (k + 1)2^{−n}])    for ω ∈ (k2^{−n}, (k + 1)2^{−n}].
There is no need here to assume that P dominates ν when they are viewed as measures on all of ℱ. Suppose that ν is the distribution of Σ_{k=1}^∞ Z_k 2^{−k} for independent Z_k assuming values 1 and 0 with probabilities p and 1 − p. This is the measure in Examples 31.1 and 31.3 (there denoted by μ), and for p ≠ ½, ν is singular with respect to Lebesgue measure P. It is nonetheless absolutely continuous with respect to P when both are restricted to ℱ_n. ∎
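The dyadic-interval derivative of Example 35.3 is easy to compute explicitly when ν can be integrated in closed form. In the sketch below (ours, not part of the text; the choice ν((a, b]) = b^2 − a^2, that is, density 2x, is an arbitrary illustration), X_n is constant on each dyadic cell, integrating it over ℱ_n-sets recovers ν, and X_n(ω) approaches the density 2ω as n grows.

```python
# X_n on the dyadic sigma-field F_n: X_n = nu(I)/P(I) on each dyadic
# interval I = (k 2^-n, (k+1) 2^-n].  Here nu((a,b]) = b^2 - a^2
# (density 2x), a choice of ours; P is Lebesgue measure on (0,1].

def nu(a, b):
    return b * b - a * a

def X_n(n, omega):
    """Value of the F_n-density at the point omega in (0,1)."""
    k = min(int(omega * 2**n), 2**n - 1)   # index of the dyadic cell
    a, b = k / 2**n, (k + 1) / 2**n
    return nu(a, b) / (b - a)              # = 2^n * nu(cell)

# Density check: integrating X_n over each F_n-cell recovers nu.
n = 5
for k in range(2**n):
    a, b = k / 2**n, (k + 1) / 2**n
    integral = X_n(n, (a + b) / 2) * (b - a)   # X_n is constant on the cell
    assert abs(integral - nu(a, b)) < 1e-12

# X_n(omega) tends to 2*omega, the density of nu (compare Theorem 35.8(i)).
print(X_n(20, 0.3))
```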
Example 35.4. For another specialization of Example 35.2, suppose that ν is a probability measure Q on (Ω, ℱ) and that ℱ_n = σ(Y_1, …, Y_n) for random variables Y_1, Y_2, …. Suppose that under the measure P the distribution of the random vector (Y_1, …, Y_n) has density p_n(y_1, …, y_n) with respect to n-dimensional Lebesgue measure and that under Q it has density q_n(y_1, …, y_n). To avoid technicalities, assume that p_n is everywhere positive. Then the Radon–Nikodym derivative for Q with respect to P on ℱ_n is

(35.11)    X_n = q_n(Y_1, …, Y_n)/p_n(Y_1, …, Y_n).
To see this, note that the general element of ℱ_n is [(Y_1, …, Y_n) ∈ H], H ∈ ℛ^n; by the change-of-variable formula,

    ∫_{[(Y_1,…,Y_n)∈H]} X_n dP = ∫_H (q_n(y)/p_n(y)) p_n(y) dy = Q[(Y_1, …, Y_n) ∈ H].
In statistical terms, (35.11) is a likelihood ratio: p_n and q_n are rival densities, and the larger X_n is, the more strongly one prefers q_n as an explanation of the observation (Y_1, …, Y_n). The analysis is carried out under the assumption that P is the measure actually governing the Y_n; that is, X_n is a martingale under P and not in general under Q.

In the most common case the Y_n are independent and identically distributed under both P and Q: p_n(y_1, …, y_n) = p(y_1) ⋯ p(y_n) and q_n(y_1, …, y_n) = q(y_1) ⋯ q(y_n) for densities p and q on the line, where p is assumed everywhere positive for simplicity. Suppose that the measures corresponding to the densities p and q are not identical, so that P[Y_n ∈ H] ≠ Q[Y_n ∈ H] for some H ∈ ℛ^1. If Z_n = I_{[Y_n ∈ H]}, then by the strong law of large numbers, n^{−1} Σ_{k=1}^n Z_k converges to P[Y_1 ∈ H] on a set (in ℱ) of P-measure 1 and to Q[Y_1 ∈ H] on a (disjoint) set of Q-measure 1. Thus P and Q are mutually singular on ℱ even though P dominates Q on ℱ_n. ∎
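The martingale property of the likelihood ratio under P comes down to the one-step identity E_P[q(Y)/p(Y)] = Σ_y q(y) = 1, which can be checked exactly in a discrete analogue of Example 35.4 (mass functions in place of densities; the particular p and q below are our own toy choices).

```python
from fractions import Fraction as F

# Discrete analogue of the likelihood-ratio martingale
# X_n = prod q(Y_i)/p(Y_i): under P the one-step factor has expectation
# sum_y p(y) * q(y)/p(y) = sum_y q(y) = 1, so E_P[X_{n+1}|Y_1..Y_n] = X_n.

p = {0: F(1, 2), 1: F(1, 3), 2: F(1, 6)}   # governing distribution P
q = {0: F(1, 4), 1: F(1, 4), 2: F(1, 2)}   # rival distribution Q

step_mean = sum(p[y] * (q[y] / p[y]) for y in p)
print(step_mean)   # exactly 1: the ratio is a martingale under P

def E_Xn(n):
    """Exact E_P[X_n], summing over all length-n sample paths."""
    paths = [[]]
    for _ in range(n):
        paths = [path + [y] for path in paths for y in p]
    total = F(0)
    for path in paths:
        prob = F(1)
        ratio = F(1)
        for y in path:
            prob *= p[y]
            ratio *= q[y] / p[y]
        total += prob * ratio
    return total
```

Exact rational arithmetic makes the constancy of E_P[X_n] (equation (35.5)) visible without rounding error.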
Example 35.5. Suppose that Z is an integrable random variable on (Ω, ℱ, P) and the ℱ_n are nondecreasing σ-fields in ℱ. If
(35.12)    X_n = E[Z ‖ ℱ_n],

then the first three conditions in the martingale definition are satisfied, and by (34.5), E[X_{n+1} ‖ ℱ_n] = E[E[Z ‖ ℱ_{n+1}] ‖ ℱ_n] = E[Z ‖ ℱ_n] = X_n. Thus X_n is a martingale relative to ℱ_n. ∎

Example 35.6. Let N_{nk}, n, k = 1, 2, …, be an independent array of identically distributed random variables assuming the values 0, 1, 2, …. Define Z_0, Z_1, Z_2, … inductively by Z_0(ω) = 1 and Z_n(ω) = N_{n,1}(ω) + ⋯ + N_{n,Z_{n−1}(ω)}(ω); Z_n(ω) = 0 if Z_{n−1}(ω) = 0. If N_{nk} is thought of as the number of progeny of an organism, and if Z_{n−1} represents the size at time n − 1 of a population of these organisms, then Z_n represents the size at time n. If the expected number of progeny is E[N_{nk}] = m, then E[Z_n ‖ Z_{n−1}] = Z_{n−1} m, so that X_n = Z_n/m^n, n = 0, 1, 2, …, is a martingale. The sequence Z_0, Z_1, … is a branching process. ∎
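The identity E[Z_n] = m^n behind the branching-process martingale of Example 35.6 can be verified exactly for a small offspring law by propagating the full distribution of Z_n (the particular offspring distribution below is our own choice).

```python
from fractions import Fraction as F

# Offspring distribution (a toy choice): 0 or 2 progeny, probability 1/2
# each, so m = E[N] = 1 and X_n = Z_n/m^n = Z_n has constant expectation 1.
offspring = {0: F(1, 2), 2: F(1, 2)}
m = sum(k * pr for k, pr in offspring.items())

def next_dist(dist):
    """Distribution of Z_n given that of Z_{n-1}: each of the z
    individuals reproduces independently, so convolve z copies."""
    out = {}
    for z, pz in dist.items():
        totals = {0: F(1)}
        for _ in range(z):                 # convolve offspring law z times
            new = {}
            for t, pt in totals.items():
                for k, pk in offspring.items():
                    new[t + k] = new.get(t + k, F(0)) + pt * pk
            totals = new
        for t, pt in totals.items():
            out[t] = out.get(t, F(0)) + pz * pt
    return out

dist = {1: F(1)}                           # Z_0 = 1
for n in range(1, 4):
    dist = next_dist(dist)
    mean = sum(z * pz for z, pz in dist.items())
    print(n, mean)                         # E[Z_n] = m^n = 1 exactly
```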
In the preceding definition and examples, n ranges over the positive integers. The definition makes sense if n ranges over 1, 2, …, N; here conditions (ii) and (iii) are required for 1 ≤ n ≤ N and conditions (i) and (iv) only for 1 ≤ n < N. It is, in fact, clear that the definition makes sense if the indices range over an arbitrary ordered set. Although martingale theory with an interval of the line as the index set is of interest and importance, here the index set will be discrete.

Submartingales
Random variables X_n are a submartingale relative to σ-fields ℱ_n if (i), (ii), and (iii) of the definition above hold and if this condition holds in place of (iv):
(iv′) with probability 1,

(35.13)    E[X_{n+1} ‖ ℱ_n] ≥ X_n.

As before, the X_n are a submartingale if they are a submartingale with respect to some sequence ℱ_n, and the special sequence ℱ_n = σ(X_1, …, X_n) works if any does. The requirement (35.13) is the same thing as

(35.14)    ∫_A X_n dP ≤ ∫_A X_{n+1} dP,    A ∈ ℱ_n.
This extends inductively (see the argument for (35.4)), and so

(35.15)    E[X_{n+k} ‖ ℱ_n] ≥ X_n

for k ≥ 1. Taking expected values in (35.15) gives

(35.16)    E[X_1] ≤ E[X_2] ≤ ⋯.

Example 35.7. Suppose that the Δ_n are independent and integrable, as in Example 35.1, but assume that E[Δ_n] is for n ≥ 2 nonnegative rather than 0. Then the partial sums Δ_1 + ⋯ + Δ_n form a submartingale. ∎
Example 35.8. Suppose that the X_n are a martingale relative to the ℱ_n. Then |X_n| is measurable ℱ_n and integrable, and by Theorem 34.2(iv), E[|X_{n+1}| ‖ ℱ_n] ≥ |E[X_{n+1} ‖ ℱ_n]| = |X_n|. Thus the |X_n| are a submartingale relative to the ℱ_n. Note that even if X_1, …, X_n generate ℱ_n, |X_1|, …, |X_n| will in general generate a σ-field smaller than ℱ_n. ∎
Reversing the inequality in (35.13) gives the definition of a supermartingale. The inequalities in (35.14), (35.15), and (35.16) become reversed as well. The theory for supermartingales is of course symmetric to that of submartingales.
Gambling
Consider again the gambler whose fortune after the nth play is X_n and whose information about the game at that time is represented by the σ-field ℱ_n. If ℱ_n = σ(X_1, …, X_n), he knows the sequence of his fortunes and nothing else, but ℱ_n could be larger. The martingale condition (35.1) stipulates that his expected or average fortune after the next play equals his present fortune, and so the martingale is the model for a fair game. Since the condition (35.13) for a submartingale stipulates that he stands to gain (or at least not lose) on the average, a submartingale represents a game favorable to the gambler. Similarly, a supermartingale represents a game unfavorable to the gambler.† Examples of such games were studied in Section 7, and some of the results there have immediate generalizations.

Start the martingale at n = 0, X_0 representing the gambler's initial fortune. The difference Δ_n = X_n − X_{n−1} represents the amount the gambler wins on the nth play,* a negative win being of course a loss. Suppose instead that Δ_n represents the amount he wins if he puts up unit stakes. If instead of unit stakes he wagers the amount W_n on the nth play, W_n Δ_n represents his gain on that play. Suppose that W_n ≥ 0, and that W_n is measurable ℱ_{n−1}, to exclude prevision: before the nth play the information available to the gambler is that in ℱ_{n−1}, and his choice of stake W_n must be based on this alone. For simplicity take W_n bounded. Then W_n Δ_n is integrable and measurable ℱ_n if Δ_n is, and if X_n is a martingale, then E[W_n Δ_n ‖ ℱ_{n−1}] = W_n E[Δ_n ‖ ℱ_{n−1}] = 0 by (34.2). Thus
•
•
(35 .17) is a martingale relative to the �- The sequence W1 , W2 , represents a betting system, and transforming a fair game by a betting system preserves fairness; that is, transforming xn into (35.1 7) preserves the martingale property. The various betting systems discussed in Section 7 give rise to various martingales, and these martingales are not in general sums of independent •
•
•
t There is a reversal of terminology here: a sub fair game (Section 7) is against the gambler,
while a submartingale favors him. * The notation has, of course, changed. The F, and xn of Section 7 have become xn and � n ·
486
DERIVATIVES AND CONDITIONAL PROBABILITY
random variables-are not in general the special martingales of Example 35. 1 . If Wn assumes only the values 0 and 1, the betting system is a selection system; see Section 7. If the game is unfavorable to the gambler-that is, if Xn is a super martingale- and if W,. is nonnegative, bounded, and measurable � _ 1 , then the same argument shows that (35.17) is again a supermartingale, is again unfavorable. Betting systems are thus of no avail in unfavorable games. The stopping-time arguments of Section 7 also extend. Suppose that { Xn } is a martingale relative to { � } ; it may have come from another martingale via transformation by a betting system. Let T be a random variable taking on nonnegative integers as values, and suppose that
( 35 .18 ) T is the time the gambler stops, [ T = n] is the event he stops just after the n th play, and (35.18) requires that his decision is to depend only on the information � available to him at that time. His fortune at time n for this stopping rule is If
{
X* = Xn n X
( 35 .19)
T
if n < T, if n > T .
Here X,. (which has value X,.<w>( w) at w) is the gambler' s ultimate fortune, and it is his fortune for all times subsequent to T. The problem is to show that X0* , X1* , . . . is a martingale relative to $0 , $1 , First, n-1 n * Xn I dP < E E [ I xk I < 00 . E [I Xn l 1 = E I xk I dP + I [ 1' > n ] k - 0 [ 1' - k ] k -0 •
•
•
•
1
Since [ T > n ] = 0
[ Xn*
Moreover,
and
E
H]
[ T < n] e � ' n = u [ T = k ' xk k-0
1
1
-
E
H] u [ T > n '
xn
E
H] E � .
Because of (35.3), the right sides here coincide if A ∈ ℱ_n; this establishes (35.3) for the sequence X_1*, X_2*, …, which is thus a martingale. The same kind of argument works for supermartingales.

Since X_n* = X_τ for n ≥ τ, X_n* → X_τ. As pointed out in Section 7, it is not always possible to integrate to the limit here. Let X_n = a + Δ_1 + ⋯ + Δ_n, where the Δ_n are independent and assume the values ±1 with probability ½ (X_0 = a), and let τ be the smallest n for which Δ_1 + ⋯ + Δ_n = 1. Then E[X_n*] = a and X_τ = a + 1. On the other hand, if the X_n* are uniformly bounded or uniformly integrable, it is possible to integrate to the limit: E[X_τ] = E[X_1].
Functions of Martingales
Convex functions of martingales are submartingales:
Theorem 35.1. (i) If X_1, X_2, … is a martingale relative to ℱ_1, ℱ_2, …, if φ is convex, and if the φ(X_n) are integrable, then φ(X_1), φ(X_2), … is a submartingale relative to ℱ_1, ℱ_2, ….
(ii) If X_1, X_2, … is a submartingale relative to ℱ_1, ℱ_2, …, if φ is nondecreasing and convex, and if the φ(X_n) are integrable, then φ(X_1), φ(X_2), … is a submartingale relative to ℱ_1, ℱ_2, ….

PROOF. In the submartingale case, X_n ≤ E[X_{n+1} ‖ ℱ_n], and if φ is nondecreasing, then φ(X_n) ≤ φ(E[X_{n+1} ‖ ℱ_n]). In the martingale case, X_n = E[X_{n+1} ‖ ℱ_n], and so φ(X_n) = φ(E[X_{n+1} ‖ ℱ_n]). If φ is convex, then by Jensen's inequality (34.7) for conditional expectations φ(E[X_{n+1} ‖ ℱ_n]) ≤ E[φ(X_{n+1}) ‖ ℱ_n]. ∎

Example 35.8 is the case of part (i) for φ(x) = |x|.
Inequalities
There are two inequalities that are fundamental to the theory of martingales.

Theorem 35.2. If X_1, …, X_n is a submartingale, then for α > 0,

(35.20)    P[max_{i≤n} X_i ≥ α] ≤ α^{−1} E[|X_n|].

This extends Kolmogorov's inequality: If S_1, S_2, … are partial sums of independent random variables with mean 0, they form a martingale; if the variances are finite, then S_1^2, S_2^2, … is a submartingale by part (i)
of the preceding theorem, and (35.20) for this submartingale is exactly Kolmogorov's inequality (22.9).
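Inequality (35.20) can be checked exhaustively for the submartingale X_k = S_k^2 of the Kolmogorov case, since all 2^n sign paths of a symmetric ±1 walk are equally likely and can be enumerated; the following check is ours, not part of the text.

```python
from itertools import product

# Check (35.20) for the submartingale X_k = S_k^2, where S_k is a
# symmetric +-1 random walk: P[max_k S_k^2 >= a] <= E[S_n^2]/a.
# All 2^n equally likely sign paths are enumerated, so the
# probabilities below are exact (up to float rounding).

n = 10
paths = list(product([-1, 1], repeat=n))

def check(a):
    hits = 0
    second_moment = 0.0
    for path in paths:
        s, mx = 0, 0
        for step in path:
            s += step
            mx = max(mx, s * s)
        if mx >= a:
            hits += 1
        second_moment += s * s
    p_max = hits / len(paths)
    bound = (second_moment / len(paths)) / a   # E[S_n^2]/a = n/a
    return p_max, bound

for a in (4.0, 9.0, 16.0, 25.0):
    p_max, bound = check(a)
    print(a, p_max, bound)
```

For n = 10 the right side is 10/a, which by (35.20) must dominate the exact maximal probability at every level a.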
PROOF OF THE THEOREM. For arbitrary α, let A_1 = [X_1 ≥ α], A_k = [max_{i<k} X_i < α ≤ X_k], and A = ⋃_{k=1}^n A_k = [max_{i≤n} X_i ≥ α]. Since A_k lies in σ(X_1, …, X_k) and hence in ℱ_k, it follows (see (35.15)) that

    ∫_A X_n dP = Σ_{k=1}^n ∫_{A_k} X_n dP = Σ_{k=1}^n ∫_{A_k} E[X_n ‖ ℱ_k] dP ≥ Σ_{k=1}^n ∫_{A_k} X_k dP ≥ α Σ_{k=1}^n P(A_k) = αP(A).

Thus

    αP[max_{i≤n} X_i ≥ α] ≤ ∫_{[max_{i≤n} X_i ≥ α]} X_n dP ≤ E[X_n^+] ≤ E[|X_n|]. ∎
If X_1, …, X_n is a martingale, then |X_1|, …, |X_n| is a submartingale, and so (35.20) gives P[max_{i≤n} |X_i| ≥ α] ≤ α^{−1} E[|X_n|].

The second fundamental inequality requires the notion of an upcrossing. Let [α, β] be an interval (α < β), and let X_1, …, X_n be random variables. The number of upcrossings of [α, β] by X_1(ω), …, X_n(ω) is the number of times the sequence passes from below α to above β. More specifically, the number of upcrossings is the number of pairs u and v of integers such that 1 ≤ u < v ≤ n, X_u ≤ α, α < X_i < β for u < i < v (a vacuous condition if v = u + 1), and X_v ≥ β. In the diagram, n = 20 and there are three upcrossings; the corresponding pairs u, v are shown at the bottom.
[Diagram: a 20-point sample path with the levels α and β drawn as horizontal lines; the three upcrossing intervals, each running from a u with X_u ≤ α to a v with X_v ≥ β, are marked beneath the path.]
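The upcrossing count just defined can be computed by a single left-to-right scan; the following sketch is ours, not part of the text.

```python
def upcrossings(xs, alpha, beta):
    """Number of upcrossings of [alpha, beta] by the sequence xs:
    passages from a value <= alpha to a later value >= beta."""
    count = 0
    below = False   # touched level alpha since the last upcrossing?
    for x in xs:
        if x <= alpha:
            below = True
        elif x >= beta and below:
            count += 1
            below = False
    return count

# A path that rises to beta = 1 three times after touching alpha = 0:
path = [0, 2, -1, 0.5, 3, 1, -2, 4]
print(upcrossings(path, 0.0, 1.0))   # 3
```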
Theorem 35.3. For a submartingale X_1, …, X_n, the number U of upcrossings of [α, β] satisfies

(35.21)    E[U] ≤ (E[|X_n|] + |α|)/(β − α).
• • •
Lemma 1.
satisfy 1
�
If xl ,
T1
�
' xn is a submartinga/e and the stopping times T1 and T2 T2 � n with probability 1, then E[ XT1 ] � E[ XT ]. 2 • • •
EZ- 1 E[ I Xk I ], the expected values exist. If � k = Xk - Xk _ 1, then
E[ I XT1 1 ]
Since PROOF.
i- 1
lies in � and [ Tk = n] = [ Tk < n - 1 ] c lies in � ' and so "k is also a stopping time. With a similar argument for even k, this shows that the "k are stopping times, since clearly "o is one. Since Yn = YT > YT - YT0,
Yn > i..J � ( YT - YT e
k
k-l
) + L ( YT
k
o
II
II
- YT ) , k- l
where Ee and Eo extend respectively over the even and the odd k in the range 1 < k < n. By Lemma 1, the last sum has nonnegative expected value, and so
Suppose now that k is even. If "k - 1 = n, then YT = Yn = YTk ' while if "k _ 1 < n, then YT = 0; in either case, YT - YT > 0. Suppose that 1 < u < v < n, Yu = 0, 0 < Y; < 8 for u < i < v, and Yv > 8-that is, suppose that the pair u, v contributes an upcrossing of [0, 8] by Yh . . . , Yn. Then there is an even k, 1 < k < n, such that "k - 1 < u < v = "k (see the diagram); it follows that YT = 0 and YT > 8 , and hence YT - YT > 8. Since different upcrossings give different values of k, Ee( YT - YT ) > U8, and therefore E[Yn1 > 8E[U]. In terms of the original random variables, this is k-l
k
k- l
( ,8
- a
1 j
)E[U
a ]
k
( xn
- a
k
-
l
k-l
k
) dP
a] < • a - 1E[E[ZII�]] = a - 1E[Z] < 8 for large enough a. Suppose that � are a-fields satisfying �1 c �2 c · · · . If the union u�- 1 � generates the a-field �oo ' this is expressed by � i �oo · The requirement is not that �oo coincide with the union but that it be generated by it. Theorem 35.5.
If � j �oo and Z is integrable, then
( 35 .23) with probability 1. According to Example 35.5, the random variables xn = E[ZII� ] form a martingale relati�e to the � - By Lemma 2 the Xn are uniformly integrable. Since E[ I Xn11 s E[ I ZI], by Theorem 35.4 the Xn converge to an integrable X. The problem is to identify X with E[Z II �00 ]. Because of the uniform integrability, it is possible (Theorem 16.13) to integrate to the limit: fA XdP = limn fA Xn dP. If A E �k and n > k, then fA Xn dP = fA E[ZII�] . dP = fAZ dP. Therefore, fA XdP = fA ZdP for all A in the w-system Uk_ 1 �k ; since X is measurable �00 , it follows by Theorem 34.1 that X is a version of E[ZII �00 ]. • PROOF.
Reversed Martingales
A left-infinite sequence . . . , X _ 2 , X _ 1 is a martingale relative to a-fields . . . , :F_ 2 , �- 1 if conditions (ii) and (iii) in the definition of martingale are satisfied for n s - 1 and conditions (i) and (iv) are satisfied for n < - 1 . Such a sequence is a reversed or backward martingale.
For a reversed martingale, integrable, and E[ X] = E[X_n] for all n.
Theorem 35.6.
lim n � oo X - n = X exists
and is
SECTION
35.
493
MARTINGALES
The proof is almost the same as that for Theorem 35.4. Let X * and x. be the limits superior and inferior of the sequence X_ 1 , X_ 2 , Again (35.22) holds. Let un be the number of upcrossings of [a, P1 by X_ n , . . . , X_ 1 • By the upcrossing theorem, E[Un 1 S ( E[ I X 1 1 ] + l a i )/( P - a). Again E [Un ] is bounded, and so supnUn is finite with probability 1 and the sets in (35.22) have probability 0. Therefore, lim n __. 00 X_ n = X exists with probability 1. By the property (35.4) for martingales, X n = E [ X _ 1 11 � n ] for n = 1, 2, . . . . Lemma 2 now implies that the X_ n are- uniformly integrable. Therefore, X is integrable and E [ X] is the limit of the E [ X_ n ]; these all have the same value by • (35.5). PROOF.
•
•
•
•
-
-
=
If � are a-fields satisfying �1 ::> �2 • • • , then the intersection n:o_ 1 � �0 is also a a-field, and the relation is expressed by � � �0 •
Theorem 35.7. If F_n ↓ F_0 and Z is integrable, then

(35.24)    E[Z || F_n] → E[Z || F_0]

with probability 1.

PROOF. If X_{-n} = E[Z || F_n], then ..., X_{-2}, X_{-1} is a martingale relative to ..., F_2, F_1. By the preceding theorem, E[Z || F_n] converges as n → ∞ to an integrable X, and by Lemma 2, the E[Z || F_n] are uniformly integrable. As the limit of the E[Z || F_n] for n ≥ k, X is measurable F_k; k being arbitrary, X is measurable F_0. By uniform integrability, A ∈ F_0 implies that

∫_A X dP = lim_n ∫_A E[Z || F_n] dP = lim_n ∫_A E[E[Z || F_n] || F_0] dP = lim_n ∫_A E[Z || F_0] dP = ∫_A E[Z || F_0] dP.

Thus X is a version of E[Z || F_0]. ■
Theorems 35.5 and 35.7 are parallel. There is an essential difference between Theorems 35.4 and 35.6, however. In the latter, the martingale has a last random variable, namely X_{-1}, and so it is unnecessary in proving convergence to assume the E[|X_n|] bounded. On the other hand, the proof in Theorem 35.6 that X is integrable would not work for a submartingale.
DERIVATIVES AND CONDITIONAL PROBABILITY
In applications of Theorem 35.7, F_n is often σ(Y_n, Y_{n+1}, ...) for random variables Y_1, Y_2, .... In Theorem 35.5, F_n is often σ(Y_1, ..., Y_n); in this case F_∞ = σ(Y_1, Y_2, ...) is in general strictly larger than ∪_{n=1}^∞ F_n.
Applications: Derivatives
Theorem 35.8. Suppose that (Ω, F, P) is a probability space, μ is a finite measure on F, and F_n ↑ F_∞ ⊂ F. Suppose that P dominates μ on F_n, and let X_n be the corresponding Radon-Nikodym derivative. Then X_n → X with probability 1, where X is integrable.
(i) If P dominates μ on F_∞, then X is the corresponding Radon-Nikodym derivative.
(ii) If P and μ are mutually singular on F_∞, then X = 0 with probability 1.

PROOF. The situation is that of Example 35.2. The density X_n is measurable F_n and satisfies (35.9). Since X_n is nonnegative, E[|X_n|] = E[X_n] = μ(Ω), and it follows by Theorem 35.4 that X_n converges to an integrable X. The limit X is measurable F_∞.

Suppose that P dominates μ on F_∞, and let Z be the Radon-Nikodym derivative: Z is measurable F_∞, and ∫_A Z dP = μ(A) for A ∈ F_∞. It follows that ∫_A Z dP = ∫_A X_n dP for A in F_n, and so X_n = E[Z || F_n]. Now Theorem 35.5 implies that X_n → E[Z || F_∞] = Z.

Suppose, on the other hand, that P and μ are mutually singular on F_∞, so that there exists a set S in F_∞ such that μ(S) = 0 and P(S) = 1. By Fatou's lemma ∫_A X dP ≤ lim inf_n ∫_A X_n dP. If A ∈ F_k, then ∫_A X_n dP = μ(A) for n ≥ k, and so ∫_A X dP ≤ μ(A) for A in the field ∪_{k=1}^∞ F_k. It follows by the monotone class theorem that this holds for all A in F_∞. Therefore, ∫ X dP = ∫_S X dP ≤ μ(S) = 0, and X vanishes with probability 1. ■
Example 35.10. As in Example 35.3, let μ be a finite measure on the unit interval with Lebesgue measure (Ω, F, P). For F_n the σ-field generated by the dyadic intervals of rank n, (35.10) gives X_n. In this case F_n ↑ F_∞ = F. For each ω and n choose the dyadic rationals a_n(ω) = k2^{-n} and b_n(ω) = (k + 1)2^{-n} for which a_n(ω) < ω ≤ b_n(ω). By Theorem 35.8, if F is the distribution function for μ, then

(35.25)    [F(b_n(ω)) − F(a_n(ω))] / [b_n(ω) − a_n(ω)] → X(ω)

except on a set of Lebesgue measure 0.
According to Theorem 31.2, F has a derivative F′ except on a set of Lebesgue measure 0, and since the intervals (a_n(ω), b_n(ω)] contract to ω, the difference ratio (35.25) converges almost everywhere to F′(ω) (see (31.8)). This identifies X. Since (35.25) involves intervals (a_n(ω), b_n(ω)] of a special kind, it does not quite imply Theorem 31.2.

By Theorem 35.8, X = F′ is the density for μ in the absolutely continuous case, and X = F′ = 0 (except on a set of Lebesgue measure 0) in the singular case, facts proved in a different way in Section 31. The singular case gives another example where E[X_n] → E[X] fails in Theorem 35.4. ■
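The dyadic difference ratios of Example 35.10 are easy to compute directly. The following sketch (not from the text; the function names and the choice μ with density 2x, so F(x) = x², are assumptions made for illustration) evaluates (35.25) at a fixed ω and watches it converge to F′(ω) = 2ω.

```python
import math

def dyadic_ratio(omega, n, F):
    """(F(b) - F(a)) / (b - a) over the rank-n dyadic interval containing omega."""
    k = math.floor(omega * 2 ** n)
    a, b = k / 2 ** n, (k + 1) / 2 ** n
    return (F(b) - F(a)) / (b - a)

F = lambda x: x * x          # distribution function of the measure with density F'(x) = 2x
omega = 0.3
vals = [dyadic_ratio(omega, n, F) for n in range(1, 31)]
```

For this F the ratio equals a_n(ω) + b_n(ω), so the error at rank n is at most 2^{-n}, in line with the almost-everywhere differentiation statement.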
Likelihood Ratios
Return to Example 35.4: μ = Q is a probability measure, F_n = σ(Y_1, ..., Y_n) for random variables Y_n, and the Radon-Nikodym derivative or likelihood ratio X_n has the form (35.11) for densities p_n and q_n on R^n. By Theorem 35.8 the X_n converge to some X which is integrable and measurable F_∞ = σ(Y_1, Y_2, ...). If the Y_n are independent under P and under Q, and if the densities are different, then P and Q are mutually singular on σ(Y_1, Y_2, ...), as shown in Example 35.4. In this case X = 0 and the likelihood ratio converges to 0 on a set of P-measure 1.

The statistical relevance of this is that the smaller X_n is, the more strongly one prefers P over Q as an explanation of the observation (Y_1, ..., Y_n), and X_n goes to 0 with probability 1 if P is in fact the measure governing the Y_n.

It might be thought that a disingenuous experimenter could bias his results by stopping at an X_n he likes: a large value if his prejudices favor Q, a small value if they favor P. This is not so, as the following analysis shows. For this argument P must dominate Q on each F_n = σ(Y_1, ..., Y_n), but the likelihood ratio X_n need not have any special form.

Let τ be a positive-integer-valued random variable representing the experimenter's stopping time. Assume that [τ = n] ∈ F_n, which, like the analogous assumption (35.18) for the gambler's stopping time, excludes prevision on the part of the experimenter. Let F_τ be the class of sets G for which

(35.26)    G ∩ [τ = n] ∈ F_n,    n ≥ 1.

Now F_τ is a σ-field. It is to be interpreted as representing the information the experimenter has when he stops: to know (Y_1(ω), ..., Y_{τ(ω)}(ω)) is the same thing as to know for each G satisfying (35.26) whether or not it contains ω. In fact, F_τ is generated by the sets [τ = n, (Y_1, ..., Y_n) ∈ H] for n ≥ 1 and H ∈ ℛ^n.
What is to be shown is that X_τ is the likelihood ratio (Radon-Nikodym derivative) for Q with respect to P on F_τ. If H ∈ ℛ^1, then [X_τ ∈ H] = ∪_{n=1}^∞ ([τ = n] ∩ [X_n ∈ H]) lies in F_τ, so that X_τ is measurable F_τ. Moreover, if G satisfies (35.26), then

∫_G X_τ dP = Σ_{n=1}^∞ ∫_{G∩[τ=n]} X_n dP = Σ_{n=1}^∞ Q(G ∩ [τ = n]) = Q(G),

as required.
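The "no prevision" conclusion can be verified exactly in a small Bernoulli model. The sketch below is not from the text: P is Bernoulli(p) and Q is Bernoulli(q) with assumed values p = 0.5, q = 0.7, the horizon N and the stopping rule (stop the first time X_n ≥ 2, else at N) are arbitrary choices. Enumerating all 2^N paths shows E_P[X_τ] = 1 exactly: stopping when the ratio looks favorable does not bias it.

```python
from itertools import product

p, q, N = 0.5, 0.7, 10        # P: Bernoulli(p); Q: Bernoulli(q); horizon N

total = 0.0
for path in product((0, 1), repeat=N):
    x = 1.0                    # X_0 = 1: likelihood ratio before any data
    x_stop = None
    for n, y in enumerate(path, 1):
        x *= (q / p) if y else ((1 - q) / (1 - p))
        # stop the first time the ratio looks "favorable" to Q, else at N
        if x_stop is None and (x >= 2.0 or n == N):
            x_stop = x
    total += x_stop * 0.5 ** N  # P-probability of each path is 2^{-N}
```

This is optional stopping for the P-martingale X_n with the bounded stopping time τ ≤ N; any other rule satisfying [τ = n] ∈ F_n would give the same total.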
Bayes Estimation

Let θ, Y_1, Y_2, ... be random variables such that 0 ≤ θ ≤ 1 and, conditionally on θ, the Y_n are independent and assume the values 1 and 0 with probabilities θ and 1 − θ: for u_1, ..., u_k a sequence of 0's and 1's,

(35.27)    P[Y_i = u_i, i ≤ k || θ] = θ^s (1 − θ)^{k−s},

where s = u_1 + ··· + u_k.

To see that such sequences exist, let θ, Z_1, Z_2, ... be independent random variables, where θ has an arbitrarily prescribed distribution supported by (0, 1) and the Z_n are uniformly distributed over (0, 1). Put Y_n = I_{[Z_n ≤ θ]}. If, for instance, f(x) = x(1 − x) = P[Z_1 ≤ x, Z_2 > x], then P[Y_1 = 1, Y_2 = 0 || θ](ω) = f(θ(ω)) by (33.13). The obvious extension establishes (35.27).

From the Bayes point of view in statistics, θ is a parameter governed by some a priori distribution known to the statistician. For given Y_1, ..., Y_n the Bayes estimate of θ is E[θ || Y_1, ..., Y_n]. The problem is to show that this estimate is consistent in the sense that

(35.28)    E[θ || Y_1, ..., Y_n] → θ

with probability 1. By Theorem 35.5, E[θ || Y_1, ..., Y_n] → E[θ || F_∞], where F_∞ = σ(Y_1, Y_2, ...), and so what must be shown is that E[θ || F_∞] = θ with probability 1.

By an elementary argument that parallels the unconditional case, it follows from (35.27) that for S_n = Y_1 + ··· + Y_n, E[S_n || θ] = nθ and E[(S_n − nθ)² || θ] = nθ(1 − θ). Hence E[(n^{-1}S_n − θ)²] ≤ n^{-1}, and by Chebyshev's inequality n^{-1}S_n converges in probability to θ. Therefore (Theorem 20.5), lim_k n_k^{-1} S_{n_k} = θ with probability 1 for some subsequence. Thus θ = θ′ with probability 1 for a θ′ measurable F_∞, and E[θ || F_∞] = E[θ′ || F_∞] = θ′ = θ with probability 1. ■
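Consistency of the Bayes estimate is easy to see by simulation. The sketch below is illustrative and not from the text; it assumes a uniform prior, for which the posterior mean given S_n successes in n trials is the standard Beta(1,1)-update formula (S_n + 1)/(n + 2), and it conditions on an assumed parameter value θ = 0.3 with a fixed seed.

```python
import random

random.seed(42)
theta = 0.3                     # the parameter value we condition on
n = 200_000
s = sum(1 for _ in range(n) if random.random() < theta)  # S_n = Y_1 + ... + Y_n
bayes_est = (s + 1) / (n + 2)   # posterior mean E[theta || Y_1..Y_n] for a uniform prior
freq_est = s / n
```

For large n the Bayes estimate and the relative frequency differ by O(1/n), so (35.28) and the strong law make the same prediction here.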
A Central Limit Theorem*
Suppose X_1, X_2, ... is a martingale relative to F_1, F_2, ..., and put Y_n = X_n − X_{n−1}, so that

(35.29)    E[Y_n || F_{n−1}] = 0.

View Y_n as the gambler's gain on the n-th trial in a fair game. For example, if Δ_1, Δ_2, ... are independent and have mean 0, F_n = σ(Δ_1, ..., Δ_n), W_n is measurable F_{n−1}, and Y_n = W_nΔ_n, then (35.29) holds (see (35.17)). A specialization of this case shows that X_n = Σ_{k=1}^n Y_k need not be approximately normally distributed for large n.

Example 35.11. Suppose that Δ_n takes the values ±1 with probability ½ each and W_1 = 0, and suppose that W_n = 1 for n ≥ 2 if Δ_1 = +1, while W_n = 2 for n ≥ 2 if Δ_1 = −1. If S_n = Δ_1 + ··· + Δ_n, then X_n is S_n or 2S_n (apart from a bounded first term, which does not affect the limit) according as Δ_1 is +1 or −1. Since S_n/√n has approximately the standard normal distribution, the approximate distribution of X_n/√n is a mixture, with equal weights, of the centered normal distributions with standard deviations 1 and 2. ■
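Example 35.11 can be checked empirically. The sketch below simulates X_n/√n for the strategy just described; the sample sizes, seed, and function name are assumptions. A half-half mixture of centered normals with variances 1 and 4 has variance 2.5 and fourth-moment ratio m_4/m_2² = 25.5/6.25 = 4.08, strictly above the normal value 3, and the simulated statistics reflect this.

```python
import random

random.seed(1)

def x_over_sqrt_n(n):
    d1 = random.choice((-1, 1))        # Delta_1
    w = 1 if d1 == 1 else 2            # W_k for k >= 2 is fixed by the first toss
    x = 0.0                            # W_1 = 0, so the first play contributes nothing
    for _ in range(n - 1):
        x += w * random.choice((-1, 1))
    return x / n ** 0.5

reps, n = 10_000, 300
vals = [x_over_sqrt_n(n) for _ in range(reps)]
m2 = sum(v * v for v in vals) / reps
m4 = sum(v ** 4 for v in vals) / reps
kurt = m4 / m2 ** 2                    # 3 for a normal law; larger for this mixture
```

The excess kurtosis is exactly the signature of a scale mixture: no single normal law fits X_n/√n.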
To understand this phenomenon, consider conditional variances. Suppose for simplicity that the Y_n are bounded, and define

(35.30)    σ_n² = E[Y_n² || F_{n−1}].

For large t consider the random variables

(35.31)    ν_t = min[n: Σ_{k=1}^n σ_k² ≥ t].

Under appropriate conditions, X_{ν_t}/√t will be approximately normally distributed for large t. Consider the preceding example. Roughly: if Δ_1 = +1, then Σ_{k=1}^n σ_k² = n − 1, and so ν_t ≈ t and X_{ν_t}/√t ≈ S_t/√t; if Δ_1 = −1, then Σ_{k=1}^n σ_k² = 4(n − 1), and so ν_t ≈ t/4 and X_{ν_t}/√t ≈ 2S_{t/4}/√t = S_{t/4}/√(t/4). In either case, X_{ν_t}/√t approximately follows the normal law.

If the n-th play takes σ_n² units of time, then ν_t is essentially the number of plays that take place during the first t units of time. This change of the time scale stabilizes the rate at which money changes hands.

* This topic, which requires the limit theory of Chapter 5, may be omitted.
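The effect of the time change can be seen in the same example. The sketch below is illustrative and not from the text: it stops the play at ν_t as defined in (35.31), with assumed values of t, the number of replications, and the seed. Stopped at the random time ν_t, the normalized sum should look standard normal, with second moment near 1 and fourth-moment ratio near 3.

```python
import random

random.seed(2)

def x_time_changed(t):
    d1 = random.choice((-1, 1))
    w = 1 if d1 == 1 else 2
    x, s2 = 0.0, 0.0               # running sum and running conditional variance
    while s2 < t:                   # stop at nu_t = min{n: sum sigma_k^2 >= t}
        x += w * random.choice((-1, 1))
        s2 += w * w
    return x / t ** 0.5

reps, t = 10_000, 300.0
vals = [x_time_changed(t) for _ in range(reps)]
m2 = sum(v * v for v in vals) / reps
m4 = sum(v ** 4 for v in vals) / reps
kurt = m4 / m2 ** 2                 # should be near 3, the normal value
```

Compared with the fixed-n simulation above Example 35.11, the mixture disappears: measuring time in units of conditional variance makes both branches of the strategy run for the "same" amount of clock time.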
Theorem 35.9. Suppose the Y_n = X_n − X_{n−1} are uniformly bounded and satisfy (35.29), and assume that Σ_n σ_n² = ∞ with probability 1. Then X_{ν_t}/√t ⇒ N.

PROOF. Since Σ_n σ_n² = ∞, ν_t is well defined by (35.31). Let s_n² = Σ_{k=1}^n σ_k², and let c_t be that number such that 0 < c_t ≤ 1 and s²_{ν_t−1} + c_t²σ²_{ν_t} = t. Let K bound the |Y_n|. Since |Y_{ν_t} − c_tY_{ν_t}| ≤ K, it is enough to prove

(35.32)    t^{−1/2} [Y_1 + ··· + Y_{ν_t−1} + c_tY_{ν_t}] ⇒ N.

Fix t and define random variables Z_1, Z_2, ... by

Z_k = Y_k       if ν_t > k,
Z_k = c_tY_k    if ν_t = k,
Z_k = 0         if ν_t < k.

The left side of (35.32) is Σ_{k=1}^∞ Z_k/√t. The notation does not reflect the dependence of the Z_k on t.

Since σ_k² is measurable F_{k−1}, the events [ν_t > k], [ν_t = k], and [ν_t < k] all lie in F_{k−1}. If x > 0, the event [I_{[ν_t=k]}c_t > x] is exactly [ν_t = k] ∩ [c_t > x] = [ν_t = k] ∩ [s²_{k−1} + x²σ_k² < t], and hence I_{[ν_t=k]}c_t is measurable F_{k−1}. It follows that for A ∈ F_{k−1},

∫_A Z_k dP = ∫_{A∩[ν_t>k]} Y_k dP + ∫_{A∩[ν_t=k]} c_tY_k dP = 0.

Therefore E[Z_k || F_{k−1}] = 0, and the Z_k satisfy the same martingale-difference equation (35.29) that the Y_k do. Furthermore, Z_k is measurable F_k.

Now define τ_k² = E[Z_k² || F_{k−1}]. Suppose that A ∈ F_{k−1}. Since I_{[ν_t=k]}c_t is measurable F_{k−1},

∫_A Z_k² dP = ∫_{A∩[ν_t>k]} Y_k² dP + ∫_{A∩[ν_t=k]} c_t²Y_k² dP = ∫_{A∩[ν_t>k]} σ_k² dP + ∫_{A∩[ν_t=k]} c_t²σ_k² dP.

Further, ∫_{A∩[ν_t<k]} Z_k² dP = 0. It follows that

(35.33)    τ_k² = σ_k²       if ν_t > k,
           τ_k² = c_t²σ_k²   if ν_t = k,
           τ_k² = 0          if ν_t < k.

Introduce random variables N_1, N_2, ..., each with the standard normal distribution, that are independent of each other and of the σ-field F_∞ generated by ∪_n F_n. The introduction of these new random variables entails no loss of generality: define them on a second probability space, and replace the original probability space (the one on which the Y_k are defined) by its product with this second space. For n ≥ 0 define

T_n = Σ_{k=1}^n Z_k + Σ_{k=n+1}^∞ τ_kN_k.

The first step is to prove that the conditional characteristic function of T_0/√t is

(35.34)    E[exp(isT_0/√t) || F_∞] = exp(−(s²/2t) Σ_{k=1}^∞ τ_k²) = e^{−s²/2}.

Since the τ_k are measurable F_∞ and the N_k are independent of F_∞, the left-hand equality follows from (34.4): truncate the sum in T_0 to Σ_{k=1}^m τ_kN_k, apply (34.4), then let m → ∞ and use Theorem 34.2(v). The right-hand equation in (35.34) follows from (35.33) and the definitions of ν_t and c_t: Σ_{k=1}^∞ τ_k² = s²_{ν_t−1} + c_t²σ²_{ν_t} = t. Therefore

(35.35)    E[exp(isT_0/√t)] = e^{−s²/2}.

Since the left side of (35.32) is T_∞/√t, the problem is to prove that E[exp(isT_∞/√t)] → e^{−s²/2}, and this will follow by (35.35) if

|E[exp(isT_∞/√t)] − E[exp(isT_0/√t)]| → 0.
SECTION 36. KOLMOGOROV'S EXISTENCE THEOREM

Suppose that A_1 ⊃ A_2 ⊃ ··· and that P(A_n) ≥ ε > 0 for all n. The problem is to show that ∩_n A_n must be nonempty. Since the A_n are cylinders, and since the index set involved in the specification of a cylinder can always be permuted and expanded, there exists a sequence t_1, t_2, ... of points in T for which

A_n = [x ∈ R^T: (x_{t_1}, ..., x_{t_n}) ∈ H_n],

where† H_n ∈ ℛ^n. Of course, P(A_n) = μ_{t_1···t_n}(H_n). By Theorem 12.3 (regularity), there exists inside H_n a compact set K_n such that μ_{t_1···t_n}(H_n − K_n) < ε/2^{n+1}. If B_n = [x ∈ R^T: (x_{t_1}, ..., x_{t_n}) ∈ K_n], then P(A_n − B_n) < ε/2^{n+1}. Put C_n = ∩_{k=1}^n B_k. Then C_n ⊂ B_n ⊂ A_n and P(A_n − C_n) < ε/2, so that P(C_n) ≥ ε/2 > 0. Therefore, C_n is nonempty. Choose a point x^{(n)} of R^T in C_n. If n ≥ k, then x^{(n)} ∈ C_n ⊂ C_k ⊂ B_k and hence (x^{(n)}_{t_1}, ..., x^{(n)}_{t_k}) ∈ K_k. Since K_k is bounded, the sequence {x^{(1)}_{t_k}, x^{(2)}_{t_k}, ...} is bounded for each k. By the diagonal method [A14] select an increasing sequence n_1, n_2, ... of integers such that lim_i x^{(n_i)}_{t_k} exists for each k. There is in R^T some point x whose t_k-th coordinate is this limit for each k. But then, for each k, (x_{t_1}, ..., x_{t_k}) is the limit as i → ∞ of (x^{(n_i)}_{t_1}, ..., x^{(n_i)}_{t_k}) and hence lies in K_k. But that means that x itself lies in B_k and hence in A_k. Thus x ∈ ∩_{k=1}^∞ A_k, which completes the proof.‡ ■

† In general, A_n will involve indices t_1, t_2, ..., t_{a_n}, where a_1 ≤ a_2 ≤ ···. For notational simplicity a_n is taken as n; as a matter of fact, this can be arranged: start off by repeating A_1 a_1 − 1 times and then for n ≥ 1 repeat A_n a_{n+1} − a_n times.
The second proof of Kolmogorov's theorem goes in two stages, first for countable T, then for general T.*
SECOND PROOF FOR COUNTABLE T. The result for countable T will be proved in its second formulation, Theorem 36.2. It is no restriction to enumerate T as {t_1, t_2, ...} and then to identify t_n with n; in other words, it is no restriction to assume that T = {1, 2, ...}. Write μ_n in place of μ_{12···n}. By Theorem 20.4 there exists on a probability space (Ω, F, P) (which can be taken to be the unit interval) an independent sequence U_1, U_2, ... of random variables each uniformly distributed over (0, 1). Let F_1 be the distribution function corresponding to μ_1. If the "inverse" g_1 of F_1 is defined over (0, 1) by g_1(u) = inf[x: u ≤ F_1(x)], then X_1 = g_1(U_1) has distribution μ_1 by the usual argument: P[g_1(U_1) ≤ x] = P[U_1 ≤ F_1(x)] = F_1(x). The problem is to construct X_2, X_3, ... inductively in such a way that

(36.23)    X_k = h_k(U_1, ..., U_k)

for a Borel function h_k and (X_1, ..., X_n) has the distribution μ_n.

‡ The last part of the argument is, in effect, the proof that a countable product of compact sets is compact.
* This second proof, which may be omitted, uses the conditional probability theory of Section 33.

STOCHASTIC PROCESSES

Assume that X_1, ..., X_{n−1} have been defined (n ≥ 2): they have joint distribution μ_{n−1} and (36.23) holds for k ≤ n − 1. The idea now is to construct an appropriate conditional distribution function F_n(x|x_1, ..., x_{n−1}); here F_n(x|X_1(ω), ..., X_{n−1}(ω)) will have the value P[X_n ≤ x || X_1, ..., X_{n−1}]_ω would have if X_n were already defined. If g_n(·|x_1, ..., x_{n−1}) is the "inverse" function, then X_n(ω) = g_n(U_n(ω)|X_1(ω), ..., X_{n−1}(ω)) will by the usual argument have the right conditional distribution given X_1, ..., X_{n−1}, so that (X_1, ..., X_{n−1}, X_n) will have the right distribution over R^n.

To construct the conditional distribution function, apply Theorem 33.3 in (R^n, ℛ^n, μ_n) to get a conditional distribution of the last coordinate of (x_1, ..., x_n) given the first n − 1 of them. This will have (Theorem 20.1) the form μ(H; x_1, ..., x_{n−1}); it is a probability measure as H varies over ℛ^1, it is a Borel function as (x_1, ..., x_{n−1}) varies over R^{n−1}, and

μ_n(M × H) = ∫_{M×R^1} μ(H; x_1, ..., x_{n−1}) μ_n(d(x_1, ..., x_n)),    M ∈ ℛ^{n−1}.

Since the integrand involves only x_1, ..., x_{n−1}, and μ_n by consistency projects to μ_{n−1} under the map (x_1, ..., x_n) → (x_1, ..., x_{n−1}), a change of variable gives

μ_n(M × H) = ∫_M μ(H; x_1, ..., x_{n−1}) μ_{n−1}(d(x_1, ..., x_{n−1})).

Define F_n(x|x_1, ..., x_{n−1}) = μ((−∞, x]; x_1, ..., x_{n−1}). Then F_n(·|x_1, ..., x_{n−1}) is a probability distribution function over the line, F_n(x|·) is a Borel function over R^{n−1}, and

μ_n(M × (−∞, x]) = ∫_M F_n(x|x_1, ..., x_{n−1}) μ_{n−1}(d(x_1, ..., x_{n−1})).

Put g_n(u|x_1, ..., x_{n−1}) = inf[x: u ≤ F_n(x|x_1, ..., x_{n−1})] for 0 < u < 1. Since F_n(x|x_1, ..., x_{n−1}) is nondecreasing and right-continuous in x, g_n(u|x_1, ..., x_{n−1}) ≤ x if and only if u ≤ F_n(x|x_1, ..., x_{n−1}). Set X_n = g_n(U_n|X_1, ..., X_{n−1}). Since (X_1, ..., X_{n−1}) has distribution μ_{n−1} and by (36.23) is independent of U_n, an application of (20.30) gives

P[(X_1, ..., X_{n−1}) ∈ M, X_n ≤ x] = P[(X_1, ..., X_{n−1}) ∈ M, U_n ≤ F_n(x|X_1, ..., X_{n−1})]
    = ∫_M F_n(x|x_1, ..., x_{n−1}) μ_{n−1}(d(x_1, ..., x_{n−1})) = μ_n(M × (−∞, x]).
Thus (X_1, ..., X_n) has distribution μ_n. Note that X_n, as a function of X_1, ..., X_{n−1} and U_n, is a function of U_1, ..., U_n, because (36.23) was assumed to hold for k < n. Hence (36.23) holds for k = n as well. ■

SECOND PROOF FOR GENERAL T. Consider (R^T, ℛ^T) once again. If S ⊂ T, let F_S = σ[Z_t: t ∈ S]. Then F_S ⊂ F_T = ℛ^T. Suppose that S is countable. By the case just treated, there exists a process [X_t: t ∈ S] on some (Ω, F, P) (the process depends on S) such that (X_{t_1}, ..., X_{t_k}) has distribution μ_{t_1···t_k} for every k-tuple (t_1, ..., t_k) from S. Define a map ξ: Ω → R^T by requiring that

Z_t(ξ(ω)) = X_t(ω) if t ∈ S,    Z_t(ξ(ω)) = 0 if t ∉ S.

Now (36.15) holds as before if t_1, ..., t_k all lie in S, and so ξ is measurable F/F_S. Further, (36.16) holds for t_1, ..., t_k in S. Put P_S = Pξ^{−1} on F_S. Then P_S is a probability measure on (R^T, F_S), and

(36.24)    P_S[x ∈ R^T: (x_{t_1}, ..., x_{t_k}) ∈ H] = μ_{t_1···t_k}(H)

if H ∈ ℛ^k and t_1, ..., t_k all lie in S. (The various spaces (Ω, F, P) and processes [X_t: t ∈ S] now become irrelevant.)

If S_0 ⊂ S, and if A is a cylinder (36.9) for which the t_1, ..., t_k lie in S_0, then P_{S_0}(A) and P_S(A) coincide, their common value being μ_{t_1···t_k}(H). Since these cylinders generate F_{S_0}, P_{S_0}(A) = P_S(A) for all A in F_{S_0}. If A lies both in F_{S_1} and F_{S_2}, then P_{S_1}(A) = P_{S_1∪S_2}(A) = P_{S_2}(A). Thus P(A) = P_S(A) consistently defines a set function on the class ∪_S F_S, the union extending over the countable subsets S of T. If A_n lies in this union and A_n ∈ F_{S_n} (S_n countable), then S = ∪_n S_n is countable and ∪_n A_n lies in F_S. Thus ∪_S F_S is a σ-field and so must coincide with ℛ^T. Therefore, P is a probability measure on ℛ^T, and by (36.24) the coordinate process has under P the required finite-dimensional distributions. ■
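The inductive construction in the countable case is just the quantile transform applied coordinate by coordinate. The sketch below is not from the text: it assumes a two-dimensional example in which X_1 is exponential (F_1(x) = 1 − e^{-x}, inverse g_1(u) = −log(1 − u)) and, given X_1 = x_1, X_2 is uniform on (0, x_1) (inverse g_2(u|x_1) = u·x_1); names, sample size, and seed are all illustrative choices.

```python
import math, random

random.seed(3)

def g1(u):                   # inverse of F_1(x) = 1 - e^{-x}: exponential marginal
    return -math.log(1.0 - u)

def g2(u, x1):               # inverse of the conditional d.f. F_2(y | x1) = y / x1
    return u * x1

n = 100_000
xs, ys = [], []
for _ in range(n):
    x1 = g1(random.random())     # X_1 = g_1(U_1)
    xs.append(x1)
    ys.append(g2(random.random(), x1))   # X_2 = g_2(U_2 | X_1)

mean_x = sum(xs) / n
mean_y = sum(ys) / n
```

As in the proof, each X_n is a Borel function of (U_1, ..., U_n), and the pair (X_1, X_2) has the prescribed joint law: here E[X_1] = 1, E[X_2] = E[U]E[X_1] = 1/2, and X_2 ≤ X_1 on every sample point.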
The Inadequacy of ℛ^T

Theorem 36.3. Let [X_t: t ∈ T] be a family of real functions on Ω.
(i) If A ∈ σ[X_t: t ∈ T] and ω ∈ A, and if X_t(ω) = X_t(ω′) for all t ∈ T, then ω′ ∈ A.
(ii) If A ∈ σ[X_t: t ∈ T], then A ∈ σ[X_t: t ∈ S] for some countable subset S of T.
PROOF. Define ξ: Ω → R^T by Z_t(ξ(ω)) = X_t(ω). Let G = σ[X_t: t ∈ T]. By (36.15), ξ is measurable G/ℛ^T, and hence G contains the class [ξ^{−1}M: M ∈ ℛ^T]. The latter class is a σ-field, however, and by (36.15) it contains the sets [ω ∈ Ω: (X_{t_1}(ω), ..., X_{t_k}(ω)) ∈ H], H ∈ ℛ^k, and hence contains the σ-field G they generate. Therefore

(36.25)    σ[X_t: t ∈ T] = [ξ^{−1}M: M ∈ ℛ^T].

This is an infinite-dimensional analogue of Theorem 20.1(i). As for (i), the hypotheses imply that ω ∈ A = ξ^{−1}M and ξ(ω) = ξ(ω′), so that ω′ ∈ A certainly follows.

For S ⊂ T, let G_S = σ[X_t: t ∈ S]; (ii) says that G = G_T coincides with G′ = ∪_S G_S, the union extending over the countable subsets S of T. If A_1, A_2, ... lie in G′, then A_n lies in G_{S_n} for some countable S_n, and so ∪_n A_n lies in G′ because it lies in G_S for S = ∪_n S_n. Thus G′ is a σ-field, and since it contains the sets [X_t ∈ H], it contains the σ-field G they generate. (This part of the argument was used in the second proof of the existence theorem.) ■
From this theorem it follows that various important sets lie outside the class ℛ^T. Suppose that T = [0, ∞). Of obvious interest is the subset C of R^T consisting of the functions continuous over [0, ∞). But C is not in ℛ^T. For suppose it were. By part (ii) of the theorem (let Ω = R^T and put [Z_t: t ∈ T] in the role of [X_t: t ∈ T]), C would lie in σ[Z_t: t ∈ S] for some countable S ⊂ [0, ∞). But then by part (i) of the theorem (let Ω = R^T and put [Z_t: t ∈ S] in the role of [X_t: t ∈ T]), if x ∈ C and Z_t(x) = Z_t(y) for all t ∈ S, then y ∈ C. From the assumption that C lies in ℛ^T thus follows the existence of a countable set S such that, if x ∈ C and x(t) = y(t) for all t in S, then y ∈ C. But whatever countable set S may be, for every continuous x there obviously exist functions y which have discontinuities but which agree with x on S. Therefore, C cannot lie in ℛ^T.

What the argument shows is this: A set A in R^T cannot lie in ℛ^T unless there exists a countable subset S of T with the property that, if x ∈ A and x(t) = y(t) for all t in S, then y ∈ A. Thus A cannot lie in ℛ^T if it effectively involves all the points t in the sense that, for each x in A and each t in T, it is possible to move x out of A by changing its value at t alone. And C is such a set. For another, consider the set of functions x over T = [0, ∞) that are nondecreasing and assume as values x(t) only nonnegative integers:

(36.26)    [x ∈ R^{[0,∞)}: x(s) ≤ x(t), s ≤ t; x(t) ∈ {0, 1, ...}, t ≥ 0].
This, too, lies outside ℛ^T.

In Section 23 the Poisson process was defined as follows: Let X_1, X_2, ... be independent and identically distributed with the exponential distribution (the probability space Ω on which they are defined may by Theorem 20.4 be taken to be the unit interval with Lebesgue measure). Put S_0 = 0 and S_n = X_1 + ··· + X_n. If S_n(ω) < S_{n+1}(ω) for n ≥ 0 and S_n(ω) → ∞, put N(t, ω) = N_t(ω) = max[n: S_n(ω) ≤ t] for t ≥ 0; otherwise, put N(t, ω) = N_t(ω) = 0 for t ≥ 0. Then the stochastic process [N_t: t ≥ 0] has the finite-dimensional distributions described by the equations (23.27). The function N(·, ω) is the path function or sample function† corresponding to ω, and by the construction every path function lies in the set (36.26). This is a good thing if the process is to be a model for, say, calls arriving at a telephone exchange: The sample path represents the history of the calls, its value at t being the number of arrivals up to time t, and so it ought to be nondecreasing and integer-valued.

† Other terms are realization of the process and trajectory.

According to Theorem 36.1, there exists a measure P on R^T for T = [0, ∞) such that the coordinate process [Z_t: t ≥ 0] on (R^T, ℛ^T, P) has the finite-dimensional distributions of the Poisson process. This time does the path function Z(·, x) lie in the set (36.26) with probability 1? Since Z(·, x) is just x itself, the question is whether the set (36.26) has P-measure 1. But this set does not lie in ℛ^T, and so it has no measure at all.

An application of Theorem 36.1 will always yield a stochastic process with prescribed finite-dimensional distributions, but the process may lack certain path-function properties which it is reasonable to require of it as a model for some natural phenomenon. The special construction of Section 23 gets around this difficulty for the Poisson process, and in the next section a special construction will yield a model for Brownian motion with continuous paths. Section 38 treats a general method for producing stochastic processes that have prescribed finite-dimensional distributions and at the same time have path functions with desirable regularity properties.
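The Section 23 construction is concrete enough to run. The sketch below is illustrative and not from the text: the rate λ = 2, horizon, replication count, seed, and function names are assumptions. It builds the partial sums S_n from exponential interarrival times and reads off N_t = max[n: S_n ≤ t], so that every simulated path automatically lies in the set (36.26).

```python
import random

random.seed(4)
lam = 2.0                       # rate: interarrival times X_i are exponential with mean 1/lam

def arrival_times(t_max):
    """Partial sums S_n = X_1 + ... + X_n that fall in [0, t_max]."""
    times, s = [], 0.0
    while True:
        s += random.expovariate(lam)
        if s > t_max:
            return times
        times.append(s)

def N(t, times):                # N_t = max{n: S_n <= t}
    return sum(1 for s in times if s <= t)

t_max = 10.0
times = arrival_times(t_max)
path = [N(0.5 * k, times) for k in range(21)]   # one sample path on a grid

reps = 2000
mean_count = sum(len(arrival_times(t_max)) for _ in range(reps)) / reps
```

By construction the path is nondecreasing and integer-valued with N_0 = 0, and the mean count over many runs is close to λt, consistent with (23.27).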
PROBLEMS

36.1. 13.6 ↑ Generalize Theorem 20.1(ii), replacing (X_1, ..., X_k) by [X_t: t ∈ T] for an arbitrary T.

36.2. A process (..., X_{-1}, X_0, X_1, ...) (here T is the set of all integers) is stationary if the distribution of (X_n, X_{n+1}, ..., X_{n+k-1}) over R^k is the same for all n = 0, ±1, ±2, .... Define T: R^T → R^T by Z_n(Tx) = Z_{n+1}(x); thus T moves a doubly infinite sequence (that is, an element of R^T) one place left: T(..., x_{-1}, x_0, x_1, ...) = (..., x_0, x_1, x_2, ...). Show that T is measurable ℛ^T/ℛ^T, and show that the coordinate process (..., Z_{-1}, Z_0, Z_1, ...) on (R^T, ℛ^T, P) is stationary if and only if T preserves the measure P in the sense that PT^{-1} = P.

36.3. Show that, if X is measurable σ[X_t: t ∈ T], then X is measurable σ[X_t: t ∈ S] for some countable subset S of T.

36.4. Suppose that [X_t: t ∈ T] is a stochastic process on (Ω, F, P) and A ∈ F. Show that there is a countable subset S of T for which P[A || X_t, t ∈ T] = P[A || X_t, t ∈ S] with probability 1. Replace A by a random variable and prove a similar result.
36.5. Let T be arbitrary and let K(s, t) be a real function over T × T. Suppose that K is symmetric in the sense that K(s, t) = K(t, s) and nonnegative definite in the sense that Σ_{i,j=1}^k K(t_i, t_j)x_ix_j ≥ 0 for k ≥ 1, t_1, ..., t_k in T, x_1, ..., x_k real. Show that there exists a process [X_t: t ∈ T] for which (X_{t_1}, ..., X_{t_k}) has the centered normal distribution with covariances K(t_i, t_j), i, j = 1, ..., k.

36.6. 8.4 ↑ Suppose that p_n(u_1, ..., u_n) is a nonnegative real number for each n and each n-long sequence u_1, ..., u_n of 0's and 1's. Suppose that p_1(0) + p_1(1) = 1 and p_{n+1}(u_1, ..., u_n, 0) + p_{n+1}(u_1, ..., u_n, 1) = p_n(u_1, ..., u_n). Prove that on the σ-field generated by the cylinders in sequence space there exists a probability measure P such that P[ω: z_i(ω) = u_i, i ≤ n] = p_n(u_1, ..., u_n). Problems 2.16, 4.18, and 8.4 cover the coin-tossing, independent, and Markov cases, respectively.
36.7. Let L be a Borel set on the line, let ℒ consist of the Borel subsets of L, and let L^T consist of all maps from T into L. Define the appropriate notion of cylinder, and let ℒ^T be the σ-field generated by the cylinders. State a version of Theorem 36.1 for (L^T, ℒ^T). Assume T countable and prove this theorem not by imitating the previous proof but by observing that L^T is a subset of R^T and lies in ℛ^T. If L consists of 0 and 1, and if T is the set of positive integers, then L^T is the space considered in the preceding problem.

36.8. Suppose that the random variables X_1, X_2, ... assume the values 0 and 1 and P[X_n = 1 i.o.] = 1. Let μ be the distribution over (0, 1] of Σ_{n=1}^∞ X_n/2^n. Show that on the unit interval with the measure μ the digits of the nonterminating dyadic expansion form a stochastic process with the same finite-dimensional distributions as X_1, X_2, ....
36.9. 36.7 ↑ There is an infinite-dimensional version of Fubini's theorem. In the construction in Problem 36.7, let L = I = (0, 1), T = {1, 2, ...}, let 𝒥 consist of the Borel subsets of I, and suppose that each k-dimensional distribution is the k-fold product of Lebesgue measure over the unit interval. Then I^T is a countable product of copies of (0, 1), its elements are sequences x = (x_1, x_2, ...) of points of (0, 1), and Kolmogorov's theorem ensures the existence on (I^T, 𝒥^T) of a product probability measure π: π[x: x_i ≤ a_i, i ≤ n] = a_1 ··· a_n for 0 ≤ a_i ≤ 1. Let I^n denote the n-dimensional unit cube.

(a) Define ψ: I^n × I^T → I^T by ψ((x_1, ..., x_n), (y_1, y_2, ...)) = (x_1, ..., x_n, y_1, y_2, ...). Show that ψ is measurable 𝒥^n × 𝒥^T/𝒥^T and ψ^{-1} is measurable 𝒥^T/𝒥^n × 𝒥^T. Show that (λ_n × π)ψ^{-1} = π, where λ_n is n-dimensional Lebesgue measure restricted to I^n.

(b) Let f be a function measurable 𝒥^T and, for simplicity, bounded. Define f_n(x_{n+1}, x_{n+2}, ...) by integrating the first n coordinates out of f with respect to λ_n; in other words, integrate out the coordinates one by one. Show by Problem 34.18, martingale theory, and the zero-one law that

(36.27)    f_n(x_{n+1}, x_{n+2}, ...) → ∫_{I^T} f dπ

except for x in a set of π-measure 0.

(c) Adopting the point of view of part (a), let g_n(x_1, ..., x_n) be the result of integrating (y_{n+1}, y_{n+2}, ...) out (with respect to π) from f(x_1, ..., x_n, y_{n+1}, ...). This may suggestively be written as

g_n(x_1, ..., x_n) = ∫_{I^T} f(x_1, ..., x_n, y_{n+1}, y_{n+2}, ...) dπ(y).
Suppose that ., is a probability measure on the line, and suppose that for n > 2, �'n ( H; x 1 , , xn _ 1 ) is a probability measurenas H varies over �1 and is a Borel function as (X 1 ' ' X 1 ) varies over R - 1 Show that there exists a stochastic process { X1 , X2 , } such that X1 has distribution ., and P[ Xn e H II X1 , Xn - t 1 w �'n ( H; X1 ( w ), . . . , Xn _ 1 ( w )). Show that E[/( Xn ) II X1 , , Xn - l 1w /�CXJJ(x) vn ( dx; X1 (w), . . . , Xn - l (w)) if /( Xn ) is integrable. • . .
• • •
36. 10.
==
36.11.
36. 12.
521
==
==
==
==
36. 13.
• • •
• • .
•
•
•
• • •
,
==
==
n -
. • .
==
==
•
36.14. Here is an application of the existence theorem in which T is not a subset of
the line. Let ( N, .IV, ., ) be a measure space, and take T to consist of the %-sets of finite v-measure. The problem is to construct a generalized
522
STOCHASTIC PROCESSES
Poisson process, a stochastic process [ XA : A e T] such that (i) XA has the Poisson distribution with mean v( A ) and (ii) XA , . . . , XA are independent if A 1 , , An are disjoint. Hint: To define the finite-diniensional distribu tions, generalize this construction: For A , B in T, consider independent random variables fi , }2 , 1'3 having Poisson distributions with means v(A n Be), v ( A n B), v(Ac n B); take IL A, B to be the distribution of ( Yi + fi , Y2 •
•
•
+ Yj ).
SECTION 37. BROWNIAN MOTION

Definition
A Brownian motion or Wiener process is a stochastic process [W_t: t ≥ 0], on some (Ω, F, P), with these three properties:

(i) The process starts at 0:

(37.1)    P[W_0 = 0] = 1.

(ii) The increments are independent: If

(37.2)    0 ≤ t_0 ≤ t_1 ≤ ··· ≤ t_k,

then

(37.3)    P[W_{t_i} − W_{t_{i−1}} ∈ H_i, i ≤ k] = ∏_{i≤k} P[W_{t_i} − W_{t_{i−1}} ∈ H_i].

(iii) For 0 ≤ s < t the increment W_t − W_s is normally distributed with mean 0 and variance t − s:

(37.4)    P[W_t − W_s ∈ H] = [2π(t − s)]^{−1/2} ∫_H e^{−x²/2(t−s)} dx.
W, - W,
SECTION
37.
523
BROWNIAN MOTION
undergoes during $[t_{k-1}, t_k]$. Although the future behavior of the particle depends on its present position, it does not depend on how the particle got there. As for (iii), that $W_t - W_s$ has mean 0 reflects the fact that the particle is as likely to go up as to go down; there is no drift. The variance grows as the length of the interval $[s, t]$; the particle tends to wander away from its position at time $s$, and having done so suffers no force tending to restore it to that position. To Norbert Wiener are due the mathematical foundations of the theory of this kind of random motion.

The increments of the Brownian motion process are stationary in the sense that the distribution of $W_t - W_s$ depends only on the difference $t - s$. Since $W_0 = 0$, the distribution of the increments is described by saying that $W_t$ is normally distributed with mean 0 and variance $t$. This implies (37.1). If $0 \le s \le t$, then by the independence of the increments,

$E[W_s W_t] = E[W_s(W_t - W_s)] + E[W_s^2] = E[W_s]E[W_t - W_s] + E[W_s^2] = s.$

This specifies all the means, variances, and covariances:

(37.5) $E[W_t] = 0, \qquad E[W_s W_t] = \min\{s, t\}.$
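A quick Monte Carlo check of the covariance structure (37.5); the time grid and sample size below are arbitrary illustrative choices.

```python
import numpy as np

# Monte Carlo check of (37.5): simulate W at a few times by cumulating
# independent normal increments and compare E[W_s W_t] with min{s, t}.
# The time grid and sample size are arbitrary illustrative choices.
rng = np.random.default_rng(1)
times = np.array([0.5, 1.0, 2.5])
n_paths = 100_000
dt = np.diff(np.concatenate(([0.0], times)))       # increment lengths
incs = rng.normal(0.0, np.sqrt(dt), size=(n_paths, len(dt)))
W = np.cumsum(incs, axis=1)                        # W at the grid times
emp = (W.T @ W) / n_paths                          # empirical E[W_s W_t]
theory = np.minimum.outer(times, times)            # min{s, t}
print(np.max(np.abs(emp - theory)))                # small
```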
If $0 \le t_1 < \cdots < t_k$, the joint density of $(W_{t_1}, W_{t_2} - W_{t_1}, \ldots, W_{t_k} - W_{t_{k-1}})$ is by (20.25) the product of the corresponding normal densities. By the Jacobian formula (20.20), $(W_{t_1}, \ldots, W_{t_k})$ has density

(37.6) $f(x_1, \ldots, x_k) = \prod_{i=1}^{k} \dfrac{1}{\sqrt{2\pi(t_i - t_{i-1})}} \exp\!\left[-\dfrac{(x_i - x_{i-1})^2}{2(t_i - t_{i-1})}\right],$

where $t_0 = x_0 = 0$.

Sometimes $W_t$ will be denoted $W(t)$, and its value at $\omega$ will be $W(t, \omega)$. The nature of the path functions $W(\cdot, \omega)$ will be of great importance.

The existence of the Brownian motion process follows from Kolmogorov's theorem. For $0 < t_1 < \cdots < t_k$ let $\mu_{t_1 \cdots t_k}$ be the distribution in $R^k$ with density (37.6). To put it another way, let $\mu_{t_1 \cdots t_k}$ be the distribution of $(S_1, \ldots, S_k)$, where $S_i = X_1 + \cdots + X_i$ and where $X_1, \ldots, X_k$ are independent, normally distributed random variables with mean 0 and variances $t_1, t_2 - t_1, \ldots, t_k - t_{k-1}$. If $g(x_1, \ldots, x_k) = (x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_k)$, then $g(S_1, \ldots, S_k) = (S_1, \ldots, S_{i-1}, S_{i+1}, \ldots, S_k)$ has the distribution prescribed for $\mu_{t_1 \cdots t_{i-1} t_{i+1} \cdots t_k}$. This is because $X_i + X_{i+1}$ is normally distributed with mean 0 and variance $t_{i+1} - t_{i-1}$; see Example 20.6. Therefore,

(37.7) $\mu_{t_1 \cdots t_k} g^{-1} = \mu_{t_1 \cdots t_{i-1} t_{i+1} \cdots t_k}.$

The $\mu_{t_1 \cdots t_k}$ defined in this way for increasing, positive $t_1, \ldots, t_k$ thus satisfy the conditions for Kolmogorov's existence theorem as modified in Example 36.3; (37.7) is the same thing as (36.19). Therefore, there does exist a process $[W_t\colon t > 0]$ corresponding to the $\mu_{t_1 \cdots t_k}$. Taking $W_t = 0$ for $t = 0$ shows that there exists on some $(\Omega, \mathscr{F}, P)$ a process $[W_t\colon t \ge 0]$ with the finite-dimensional distributions specified by the conditions (i), (ii), and (iii).
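The finite-dimensional distributions can also be sampled directly as a centered normal vector with the covariance $\min\{t_i, t_j\}$ of (37.5); that the Cholesky factorization succeeds is a numerical confirmation that this covariance matrix is nonnegative definite. The time grid below is an arbitrary illustrative choice.

```python
import numpy as np

# Sample (W_{t_1}, ..., W_{t_k}) directly as a centered normal vector with
# covariance min{t_i, t_j} (cf. (37.5)); the Cholesky factorization
# succeeding confirms the matrix is nonnegative definite.  The time grid
# is an arbitrary illustrative choice.
times = np.array([0.5, 1.0, 2.0, 3.5])
sigma = np.minimum.outer(times, times)       # covariance matrix min{t_i, t_j}
L = np.linalg.cholesky(sigma)                # succeeds: sigma is positive definite
rng = np.random.default_rng(2)
sample = L @ rng.normal(size=len(times))     # one draw of (W_{t_1},...,W_{t_4})
print(np.allclose(L @ L.T, sigma), sample.shape)
```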
Alternatively, define $\mu_{t_1 \cdots t_k}$ as the centered normal distribution (Section 29) in $R^k$ with covariances $\rho_{ij} = \min\{t_i, t_j\}$. Since (take $t_0 = 0$)

$\sum_{i,j} \rho_{ij} x_i x_j = \sum_{i,j} x_i x_j \sum_{l=1}^{\min\{i,j\}} (t_l - t_{l-1}) = \sum_{l=1}^{k} (t_l - t_{l-1}) \Bigl(\sum_{i=l}^{k} x_i\Bigr)^2 \ge 0,$

the matrix $(\rho_{ij})$ is nonnegative definite. Consistency is now obvious.

Continuity of Paths
If the Brownian motion process is to represent the motion of a particle, it is natural to require that the path functions $W(\cdot, \omega)$ be continuous. But Kolmogorov's theorem does not guarantee continuity. Indeed, for $T = [0, \infty)$ the space $(\Omega, \mathscr{F})$ in the proof of Kolmogorov's theorem is $(R^T, \mathscr{R}^T)$, and as shown in the last section, the set of continuous functions does not lie in $\mathscr{R}^T$. A special construction gets around this difficulty. The idea is to use for dyadic rational $t$ the random variables $W_t$ as already defined and then to redefine the other $W_t$ in such a way as to ensure continuity. To carry this through requires proving that with probability 1 the sample path is uniformly continuous for dyadic rational arguments in bounded intervals.

Fix a space $(\Omega, \mathscr{F}, P)$ and on it a process $[W_t\colon t \ge 0]$ having the finite-dimensional distributions prescribed for Brownian motion. Let $D$ be the set of nonnegative dyadic rationals, let $I_{nk} = [k2^{-n}, (k+1)2^{-n}]$, and put

(37.8) $B_n = \Bigl[\omega\colon \max_{0 \le k < n2^n}\ \sup_{r \in I_{nk} \cap D} |W(r, \omega) - W(k2^{-n}, \omega)| > \dfrac{1}{n}\Bigr].$
Suppose it is shown that $\sum_n P(B_n)$ converges. The first Borel–Cantelli lemma will then imply that $B = \limsup_n B_n$ has probability 0. But suppose that $\omega$ lies outside $B$. Then for every $t$ and every $\epsilon$ there exists an $n$ such that $t < n$, $3n^{-1} < \epsilon$, and $\omega \in B_n^c$. Fix such an $n$ and suppose that $r$ and $r'$ are dyadic rationals in $[0, t)$ and $|r - r'| < 2^{-n}$. Then $r$ and $r'$ must lie in the same or in adjacent dyadic intervals $I_{nk}$. If $r, r' \in I_{nk}$, then

$|W(r, \omega) - W(r', \omega)| \le |W(r, \omega) - W(k2^{-n}, \omega)| + |W(k2^{-n}, \omega) - W(r', \omega)| < 2n^{-1} < \epsilon.$

If $r \in I_{nk}$ and $r' \in I_{n,k+1}$ lie in adjacent intervals, then

$|W(r, \omega) - W(r', \omega)| \le |W(r, \omega) - W(k2^{-n}, \omega)| + |W(k2^{-n}, \omega) - W((k+1)2^{-n}, \omega)| + |W((k+1)2^{-n}, \omega) - W(r', \omega)| < 3n^{-1} < \epsilon.$

Therefore, $\omega \in B^c$ implies that $W(r, \omega)$ is for every $t$ uniformly continuous as $r$ ranges over the dyadic rationals in $[0, t)$, and hence $W(\cdot, \omega)$ will have a continuous extension to $[0, \infty)$.

To prove that $\sum_n P(B_n)$ converges requires a maximal inequality.

Lemma 1.
Suppose that $X_1, \ldots, X_n$ are independent, normally distributed random variables with mean 0, and put $S_k = X_1 + \cdots + X_k$. Then for positive $\alpha$ and $\epsilon$,

(37.9) $P[S_n > \alpha + 2\epsilon] - \sum_{k=1}^{n} P[X_k > \epsilon] \le P\Bigl[\max_{k \le n} S_k > \alpha\Bigr] \le 2P[S_n > \alpha].$

PROOF. Clearly,

(37.10) $P\Bigl[\max_{k \le n} S_k > \alpha,\ S_n > \alpha\Bigr] = P[S_n > \alpha].$
Let $A_k = [\max_{i < k} S_i \le \alpha < S_k]$. Since $S_n - S_k$ is independent of $A_k$ and has distribution symmetric about the origin,

$P\Bigl[\max_{k \le n} S_k > \alpha,\ S_n \le \alpha\Bigr] = \sum_{k=1}^{n-1} P(A_k \cap [S_n \le \alpha]) \le \sum_{k=1}^{n-1} P(A_k \cap [S_n - S_k \le 0]) = \sum_{k=1}^{n-1} P(A_k \cap [S_n - S_k \ge 0]) \le \sum_{k=1}^{n-1} P(A_k \cap [S_n > \alpha]) \le P[S_n > \alpha].$

Adding this to (37.10) gives the right-hand inequality in (37.9).

The other inequality is needed only for the calculation (37.16) and its consequence (37.51). Now $S_{k-1} \le \alpha$, $X_k \le \epsilon$, $S_n - S_k \le -\epsilon$ together imply that $S_n \le \alpha$, and $S_{k-1} \le \alpha$, $X_k \le \epsilon$, $S_n > \alpha + 2\epsilon$ together imply that $S_n - S_k > \epsilon$. Therefore,

$\sum_{k=1}^{n-1} P(A_k \cap [S_n \le \alpha]) \ge \sum_{k=1}^{n-1} P(A_k \cap [X_k \le \epsilon,\ S_n \le \alpha]) \ge \sum_{k=1}^{n-1} P(A_k \cap [X_k \le \epsilon,\ S_n - S_k \le -\epsilon]) = \sum_{k=1}^{n-1} P(A_k \cap [X_k \le \epsilon,\ S_n - S_k \ge \epsilon]) \ge \sum_{k=1}^{n-1} P(A_k \cap [X_k \le \epsilon,\ S_n > \alpha + 2\epsilon]) \ge \sum_{k=1}^{n-1} \bigl\{P(A_k \cap [S_n > \alpha + 2\epsilon]) - P[X_k > \epsilon]\bigr\} \ge P[S_n > \alpha + 2\epsilon] - \sum_{k=1}^{n} P[X_k > \epsilon].$

Adding this to (37.10) gives the other inequality in (37.9). ∎
The argument goes through if the $X_k$ are independent and each $S_n - S_k$ is symmetrically distributed about 0. Since $S_1, \ldots, S_n$ have the same joint distribution as $-S_1, \ldots, -S_n$, the right-hand inequality in (37.9) implies that

(37.11) $P\Bigl[\max_{k \le n} |S_k| > \alpha\Bigr] \le 2P[|S_n| > \alpha].$

To analyze the probability of the event (37.8), fix $\delta$ and $t$ for the moment. Since the increments of Brownian motion are independent and normally distributed with mean 0, (37.11) implies that

$P\Bigl[\max_{i \le 2^m} |W(t + i\delta 2^{-m}) - W(t)| > \alpha\Bigr] \le 2P\bigl[|W(t + \delta) - W(t)| > \alpha\bigr] \le \dfrac{2}{\alpha^4} E\bigl[(W(t + \delta) - W(t))^4\bigr] = \dfrac{6\delta^2}{\alpha^4}$

(see (21.7) for the moments of the normal distribution). The sets on the left here increase with $m$, and letting $m \to \infty$ leads to

(37.12) $P\Bigl[\sup_{0 \le r \le 1,\, r \in D} |W(t + r\delta) - W(t)| > \alpha\Bigr] \le \dfrac{6\delta^2}{\alpha^4}.$

Therefore,

(37.13) $P(B_n) \le n2^n \cdot 6 \cdot 2^{-2n} n^4 = 6n^5 2^{-n},$

and $\sum_n P(B_n)$ does converge.

Therefore, there exists a measurable set $B$ such that $P(B) = 0$ and such that for $\omega$ outside $B$, $W(r, \omega)$ is uniformly continuous as $r$ ranges over the dyadic rationals in any bounded interval. If $\omega \notin B$ and $r$ decreases to $t$ through dyadic rational values, then $W(r, \omega)$ has the Cauchy property and hence converges. Put

(37.14)
$W_t'(\omega) = W'(t, \omega) = \begin{cases} \lim_{r \downarrow t} W(r, \omega) & \text{if } \omega \notin B, \\ 0 & \text{if } \omega \in B, \end{cases}$

where $r$ decreases to $t$ through the set $D$ of dyadic rationals. By construction $W'(t, \omega)$ is continuous in $t$ for each $\omega$. If $\omega \notin B$, then $W(r, \omega) = W'(r, \omega)$ for dyadic rationals $r$, and $W'(\cdot, \omega)$ is the continuous extension to all of $[0, \infty)$.

The next thing is to show that the $W_t'$ have the same joint distributions as the $W_t$. It is convenient to prove this by a lemma that will be used again further on.

Lemma 2. Let $X_n$, $Y_n$, $X$, $Y$ be $k$-dimensional random vectors; let $F$ be a distribution function and let $F_n$ be the distribution function of $X_n$.

(i) If $X_n \to X$ with probability 1 and $F_n(x) \to F(x)$ for all $x$, then $F$ is the distribution function of $X$.

(ii) If $X_n$ and $Y_n$ have the same distribution, and if $X_n \to X$ and $Y_n \to Y$ with probability 1, then $X$ and $Y$ have the same distribution.

PROOF. For (i), let $X$ have distribution function $H$. By Theorem 4.1, if $h > 0$, then

$F(x_1, \ldots, x_k) = \limsup_n F_n(x_1, \ldots, x_k) \le H(x_1 + h, \ldots, x_k + h).$
Then $[W_t''\colon t \ge 0]$ is a Brownian motion and (37.19) holds with probability 1.

The behavior of $W(\cdot, \omega)$ near 0 can be studied through the behavior of $W''(\cdot, \omega)$ near $\infty$ and vice versa. Since $W_t''/t = W_{1/t}$, $W''(\cdot, \omega)$ cannot have a derivative at 0 if $W(\cdot, \omega)$ has no limit at $\infty$. Now, in fact,

(37.20) $\inf_n W_n = -\infty, \qquad \sup_n W_n = +\infty$

with probability 1. To prove this, note that $W_n = X_1 + \cdots + X_n$, where the $X_k = W_k - W_{k-1}$ are independent. Put

(37.21) $\Bigl[\sup_n W_n < \infty\Bigr] = \bigcup_{u=1}^{\infty} \bigcap_{m=1}^{\infty} \Bigl[\max_{i \le m} W_i \le u\Bigr];$

this is a tail set and hence by the zero-one law has probability 0 or 1. Now $-X_1, -X_2, \ldots$ have the same joint distributions as $X_1, X_2, \ldots$, and so (37.21) has the same probability as

(37.22) $\Bigl[\inf_n W_n > -\infty\Bigr] = \bigcup_{u=1}^{\infty} \bigcap_{m=1}^{\infty} \Bigl[\max_{i \le m} (-W_i) \le u\Bigr].$

If these two sets have probability 1, so has $[\sup_n |W_n| < \infty]$, so that $P[\sup_n |W_n| \le x] > 0$ for some $x$. But $P[|W_n| \le x] = P[|W_1| \le x/n^{1/2}] \to 0$. This proves (37.20).
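The last step is computable in closed form: $P[|W_n| \le x] = P[|W_1| \le x/\sqrt{n}] = \operatorname{erf}(x/\sqrt{2n})$, which visibly tends to 0. The value of $x$ below is an arbitrary illustrative choice.

```python
import math

# P[|W_n| <= x] = P[|W_1| <= x / sqrt(n)] = erf(x / sqrt(2n)) -> 0,
# the step that finishes the proof of (37.20); x is an arbitrary choice.
def prob_abs_below(x, n):
    return math.erf(x / math.sqrt(2 * n))

x = 5.0
probs = [prob_abs_below(x, n) for n in (1, 10, 100, 10_000)]
print(probs)   # strictly decreasing toward 0
```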
Since (37.20) holds with probability 1, $W''(\cdot, \omega)$ has with probability 1 upper and lower right derivatives of $+\infty$ and $-\infty$ at $t = 0$. The same must be true of every Brownian motion. A similar argument shows that, for each fixed $t$, $W(\cdot, \omega)$ is nondifferentiable at $t$ with probability 1. In fact, $W(\cdot, \omega)$ is nowhere differentiable:

Theorem 37.3. For $\omega$ outside a set of probability 0, $W(\cdot, \omega)$ is nowhere differentiable.

PROOF. The proof is direct; it makes no use of the transformations (37.18) and (37.19). Let

(37.23) $X_{nk} = \max\Bigl\{ \Bigl|W\Bigl(\dfrac{k+1}{2^n}\Bigr) - W\Bigl(\dfrac{k}{2^n}\Bigr)\Bigr|,\ \Bigl|W\Bigl(\dfrac{k+2}{2^n}\Bigr) - W\Bigl(\dfrac{k+1}{2^n}\Bigr)\Bigr|,\ \Bigl|W\Bigl(\dfrac{k+3}{2^n}\Bigr) - W\Bigl(\dfrac{k+2}{2^n}\Bigr)\Bigr| \Bigr\}.$

By independence and the fact that the differences here have the distribution of $2^{-n/2} W_1$, $P[X_{nk} \le \epsilon] = P^3[|W_1| \le 2^{n/2}\epsilon]$; since the standard normal density is bounded by 1, $P[X_{nk} \le \epsilon] \le (2 \cdot 2^{n/2}\epsilon)^3$. If $Y_n = \min_{k \le n2^n} X_{nk}$, then

(37.24) $P[Y_n \le \epsilon] \le n2^n (2 \cdot 2^{n/2}\epsilon)^3.$

Consider now the upper and lower right-hand derivatives

$\overline{D}W(t, \omega) = \limsup_{h \downarrow 0}\, (W(t + h, \omega) - W(t, \omega))/h,$

$\underline{D}W(t, \omega) = \liminf_{h \downarrow 0}\, (W(t + h, \omega) - W(t, \omega))/h.$

Define $E$ as the set of $\omega$ such that $\overline{D}W(t, \omega)$ and $\underline{D}W(t, \omega)$ are both finite for some value of $t$. Suppose that $\omega$ lies in $E$, and suppose specifically that $-K < \underline{D}W(t, \omega) \le \overline{D}W(t, \omega) < K$ for some $t$ and some integer $K$. There exists a positive $\delta$ (depending on $\omega$, $t$, and $K$) such that $t \le s < t + \delta$ implies $|W(s, \omega) - W(t, \omega)| \le K|s - t|$. If $n$ exceeds some $n_0$ (depending on $\delta$, $K$, and $t$), then $8K < n$, $2^{-n+2} < \delta$, and $n > t$. Given such an $n$, choose $k$ so that $(k-1)2^{-n} \le t < k2^{-n}$. Then $|i2^{-n} - t| < \delta$ for $i = k, k+1, k+2, k+3$, and therefore $X_{nk}(\omega) \le 2K(4 \cdot 2^{-n}) < n2^{-n}$. Since $k - 1 \le t2^n < n2^n$, $Y_n(\omega) \le n2^{-n}$. What has been shown is that if $\omega$ lies in $E$, then $\omega$ lies in $A_n = [Y_n \le n2^{-n}]$ for all sufficiently large $n$: $E \subset \liminf_n A_n$. By (37.24),

$P(A_n) \le n2^n (2 \cdot 2^{n/2} \cdot n2^{-n})^3 = 8n^4 2^{-n/2} \to 0.$
By Theorem 4.1, $\liminf_n A_n$ has probability 0, and outside this set $W(\cdot, \omega)$ is nowhere differentiable; in fact, nowhere does it have finite upper and lower right-hand derivatives. (Similarly, outside a set of probability 0, nowhere does $W(\cdot, \omega)$ have finite upper and lower left-hand derivatives.) ∎

If $A$ is the set of $\omega$ for which $W(\cdot, \omega)$ has a derivative somewhere, what has been shown is that $A \subset B$ for a measurable $B$ such that $P(B) = 0$; $P(A) = 0$ if $A$ is measurable, but this has not been proved. To avoid such problems in the study of continuous-time processes, it is convenient to work in a complete probability space. The space $(\Omega, \mathscr{F}, P)$ is complete (see p. 39) if $A \subset B$, $B \in \mathscr{F}$, and $P(B) = 0$ together imply that $A \in \mathscr{F}$ (and then, of course, $P(A) = 0$). If the space is not already complete, it is possible to enlarge $\mathscr{F}$ to a new $\sigma$-field and extend $P$ to it in such a way that the new space is complete. The following assumption therefore entails no loss of generality:

For the rest of this section the space $(\Omega, \mathscr{F}, P)$ on which the Brownian motion is defined is assumed complete.

Theorem 37.3 now becomes: $W(\cdot, \omega)$ is with probability 1 nowhere differentiable.

A nowhere-differentiable path represents the motion of a particle that at no time has a velocity. Since a function of bounded variation is differentiable almost everywhere (Section 31), $W(\cdot, \omega)$ is of unbounded variation with probability 1. Such a path represents the motion of a particle that in its wanderings back and forth travels an infinite distance. The Brownian motion model thus does not in its fine structure represent physical reality.

The irregularity of the Brownian motion paths is of considerable mathematical interest, however. A continuous, nowhere-differentiable function is regarded as pathological, or used to be, but from the Brownian-motion point of view such functions are the rule not the exception.† The set of zeros of the Brownian motion is also interesting.
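Theorem 37.3 has a simple numerical face: since increments over a lag $h$ have standard deviation $\sqrt{h}$, difference quotients $(W(t+h) - W(t))/h$ blow up like $h^{-1/2}$ as $h \downarrow 0$. The sketch below simulates a path on a fine grid (grid size and seed are arbitrary choices) and measures this growth.

```python
import numpy as np

# On a simulated path the root-mean-square difference quotient over lag h
# grows like h**-0.5, one numerical face of Theorem 37.3 (no derivative
# can exist anywhere).  Grid size and seed are arbitrary choices.
rng = np.random.default_rng(5)
n = 2**20
h0 = 1.0 / n
W = np.concatenate(([0.0], np.cumsum(rng.normal(0.0, np.sqrt(h0), n))))

def rms_quotient(k):
    """Root mean square of (W(t+h) - W(t)) / h over the grid, h = k * h0."""
    q = (W[k:] - W[:-k]) / (k * h0)
    return float(np.sqrt(np.mean(q * q)))    # close to (k * h0) ** -0.5

for k in (4096, 256, 16):
    print(k * h0, rms_quotient(k))           # grows as the lag shrinks
```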
By property (iv), $t = 0$ is a zero of $W(\cdot, \omega)$ for each $\omega$. Now $[W_t''\colon t \ge 0]$ as defined by (37.19) is another Brownian motion, and so by (37.20) the sequence $\{W_n''\colon n = 1, 2, \ldots\} = \{nW_{1/n}\colon n = 1, 2, \ldots\}$ has supremum $+\infty$ and infimum $-\infty$ for $\omega$ outside a set of probability 0; for such an $\omega$, $W(\cdot, \omega)$ changes sign infinitely often near 0 and hence by continuity has zeros arbitrarily near 0. Let $Z(\omega)$ denote the set of zeros of $W(\cdot, \omega)$. What has just been shown is that $0 \in Z(\omega)$ for each $\omega$ and that 0 is with probability 1 a limit of positive points in $Z(\omega)$. More is true:

† For the construction of a specific example, see Problem 31.18.
Theorem 37.4. The set $Z(\omega)$ is with probability 1 perfect,† nowhere dense, and of Lebesgue measure 0.

PROOF. Since $W(\cdot, \omega)$ is continuous, $Z(\omega)$ is closed for every $\omega$. Let $\lambda$ denote Lebesgue measure. Since Brownian motion is measurable (Theorem 37.2), Fubini's theorem applies:

$\int_\Omega \lambda(Z(\omega))\, P(d\omega) = (\lambda \times P)[(t, \omega)\colon W(t, \omega) = 0] = \int_0^\infty P[W(t, \cdot) = 0]\, dt = 0.$

Thus $\lambda(Z(\omega)) = 0$ with probability 1.

If $W(\cdot, \omega)$ is nowhere differentiable, it cannot vanish throughout an interval $I$ and hence must by continuity be nonzero throughout some subinterval of $I$. By Theorem 37.3, then, $Z(\omega)$ is with probability 1 nowhere dense.

It remains to show that each point of $Z(\omega)$ is a limit of other points of $Z(\omega)$. As observed above, this is true of the point 0 of $Z(\omega)$. For the general point of $Z(\omega)$, a stopping-time argument is required.

Fix $r \ge 0$ and let $T(\omega) = \inf[t\colon t \ge r,\ W(t, \omega) = 0]$; note that this set is with probability 1 nonempty by (37.20). Thus $T(\omega)$ is the first zero following $r$. Now

$[\omega\colon T(\omega) \le t] = \Bigl[\omega\colon \inf_{r \le s \le t} |W(s, \omega)| = 0\Bigr],$

and by continuity the infimum here is unchanged if $s$ is restricted to rationals. This shows that $T$ is a random variable and that

$[\omega\colon T(\omega) \le t] \in \sigma[W_s\colon s \le t].$

A nonnegative random variable with this property is called a stopping time. To know the value of $T$ is to know at most the values of $W_u$ for $u \le T$. Since the increments are independent, it therefore seems intuitively clear

† See A15.
that the process

(37.25) $W_t^*(\omega) = W(T(\omega) + t, \omega) - W(T(\omega), \omega), \qquad t \ge 0,$

ought itself to be a Brownian motion. This is, in fact, true by the next result, Theorem 37.5 below. What is proved there is that the finite-dimensional distributions of $[W_t^*\colon t \ge 0]$ are the right ones for Brownian motion. The other properties are obvious: $W^*(\cdot, \omega)$ is continuous and vanishes at 0 by construction, and the space on which $[W_t^*\colon t \ge 0]$ is defined is complete because it is the original space $(\Omega, \mathscr{F}, P)$, assumed complete.

If $[W_t^*\colon t \ge 0]$ is indeed a Brownian motion, then, as observed above, for $\omega$ outside a set $B_r$ of probability 0 there is a positive sequence $\{t_n\}$ such that $t_n \to 0$ and $W^*(t_n, \omega) = 0$. But then $W(T(\omega) + t_n, \omega) = 0$, so that $T(\omega)$, a zero of $W(\cdot, \omega)$, is the limit of other larger zeros of $W(\cdot, \omega)$. Now $T(\omega)$ was the first zero following $r$. (There is a different stopping time $T$ for each $r$, but the notation does not show this.) If $B$ is the union of the $B_r$ for rational $r$, then $P(B) = 0$. Moreover, if $\omega$ lies outside $B$, then for every rational $r$, the first point of $Z(\omega)$ following $r$ is a limit of other larger points of $Z(\omega)$.

Suppose that $\omega \notin B$ and $t \in Z(\omega)$, where $t > 0$; it is to be shown that $t$ is a limit of other points of $Z(\omega)$. If $t$ is the limit of smaller points of $Z(\omega)$, there is of course nothing to prove. Otherwise, there is a rational $r$ such that $r < t$ and $W(\cdot, \omega)$ does not vanish in $[r, t)$; but then, since $\omega \notin B_r$, $t$ is a limit of larger points $s$ that lie in $Z(\omega)$. This completes the proof of Theorem 37.4 under the provisional assumption that (37.25) is a Brownian motion. ∎

The Strong Markov Property
Fix $t_0 \ge 0$ and put

(37.26) $W_t' = W_{t_0 + t} - W_{t_0}, \qquad t \ge 0.$

It is easily checked that $[W_t'\colon t \ge 0]$ has the finite-dimensional distributions appropriate to Brownian motion. As the other properties are obvious, it is in fact a Brownian motion. Let

(37.27) $\mathscr{F}_t = \sigma[W_s\colon s \le t].$

The random variables (37.26) are independent of $\mathscr{F}_{t_0}$. To see this, suppose that $0 \le s_1 \le \cdots \le s_j \le t_0$ and $0 \le t_1 \le \cdots \le t_k$. Put $u_i = t_0 + t_i$. Since the increments are independent, $(W_{t_1}', W_{t_2}' - W_{t_1}', \ldots, W_{t_k}' - W_{t_{k-1}}') = (W_{u_1} - W_{t_0}, W_{u_2} - W_{u_1}, \ldots, W_{u_k} - W_{u_{k-1}})$ is independent of $(W_{s_1}, W_{s_2} - W_{s_1}, \ldots, W_{s_j} - W_{s_{j-1}})$. But then $(W_{t_1}', W_{t_2}', \ldots, W_{t_k}')$ is independent of $(W_{s_1}, W_{s_2}, \ldots, W_{s_j})$. By Theorem 4.2, $(W_{t_1}', \ldots, W_{t_k}')$ is independent of $\mathscr{F}_{t_0}$. Thus

(37.28) $P([(W_{t_1}', \ldots, W_{t_k}') \in H] \cap A) = P[(W_{t_1}', \ldots, W_{t_k}') \in H]P(A) = P[(W_{t_1}, \ldots, W_{t_k}) \in H]P(A), \qquad A \in \mathscr{F}_{t_0},$
where the second equality follows because (37.26) is a Brownian motion. This holds for all $H$ in $\mathscr{R}^k$.

The problem now is to prove all this when $t_0$ is replaced by a stopping time $T$, a nonnegative random variable for which

(37.29) $[\omega\colon T(\omega) \le t] \in \mathscr{F}_t, \qquad t \ge 0.$

It will be assumed that $T$ is finite, at least with probability 1. Since $[T = t] = [T \le t] - \bigcup_n [T \le t - n^{-1}]$, (37.29) implies that

(37.30) $[\omega\colon T(\omega) = t] \in \mathscr{F}_t, \qquad t \ge 0.$

The conditions (37.29) and (37.30) are analogous to the conditions (7.18) and (35.18), which prevent prevision on the part of the gambler. Now $\mathscr{F}_{t_0}$ contains the information on the past of the Brownian motion up to time $t_0$, and the analogue for $T$ is needed. Let $\mathscr{F}_T$ consist of all measurable sets $M$ for which

(37.31) $M \cap [\omega\colon T(\omega) \le t] \in \mathscr{F}_t$

for all $t$. (See (35.26) for an analogue in discrete time.) Note that $\mathscr{F}_T$ is a $\sigma$-field and $T$ is measurable $\mathscr{F}_T$. Since $M \cap [T = t] = (M \cap [T \le t]) \cap [T = t]$,

(37.32) $M \cap [\omega\colon T(\omega) = t] \in \mathscr{F}_t$

for $M$ in $\mathscr{F}_T$. For example, $T = \inf[t\colon W_t = 1]$ is a stopping time, and $[\inf_{s \le T} W_s > -1]$ is in $\mathscr{F}_T$.
Theorem 37.5. Let $T$ be a stopping time and put

(37.33) $W_t^* = W_{T + t} - W_T, \qquad t \ge 0.$

Then $[W_t^*\colon t \ge 0]$ is a Brownian motion, and it is independent of $\mathscr{F}_T$; that is, $\sigma[W_t^*\colon t \ge 0]$ is independent of $\mathscr{F}_T$:

(37.34) $P([(W_{t_1}^*, \ldots, W_{t_k}^*) \in H] \cap M) = P[(W_{t_1}^*, \ldots, W_{t_k}^*) \in H]P(M) = P[(W_{t_1}, \ldots, W_{t_k}) \in H]P(M)$

for $H$ in $\mathscr{R}^k$ and $M$ in $\mathscr{F}_T$. That the transformation (37.33) preserves Brownian motion is the strong Markov property.† Part of the conclusion is that the $W_t^*$ are random variables.
PROOF. Suppose first that $T$ has countable range $V$ and let $t_0$ be the general point of $V$. Since

$[\omega\colon W_t^*(\omega) \in H] = \bigcup_{t_0 \in V} [\omega\colon W_{t_0 + t}(\omega) - W_{t_0}(\omega) \in H,\ T(\omega) = t_0],$

$W_t^*$ is a random variable. Also,

$P([(W_{t_1}^*, \ldots, W_{t_k}^*) \in H] \cap M) = \sum_{t_0 \in V} P([(W_{t_1}', \ldots, W_{t_k}') \in H] \cap M \cap [T = t_0]).$

If $M \in \mathscr{F}_T$, then $M \cap [T = t_0] \in \mathscr{F}_{t_0}$ by (37.32). Further, if $T = t_0$, then $W_t^*$ coincides with $W_t'$ as defined by (37.26). Therefore, (37.28) reduces this last sum to

$\sum_{t_0 \in V} P[(W_{t_1}, \ldots, W_{t_k}) \in H]\, P(M \cap [T = t_0]) = P[(W_{t_1}, \ldots, W_{t_k}) \in H]P(M).$

This proves the first and third terms in (37.34) equal; to prove equality with the middle term, simply consider the case $M = \Omega$.

† Since the Brownian motion has independent increments, it is a Markov process (see Examples 33.9 and 33.10); hence the terminology.
Thus the theorem holds if $T$ has countable range. For the general $T$, put

(37.35) $T_n = \begin{cases} k2^{-n} & \text{if } (k-1)2^{-n} < T \le k2^{-n},\ k = 1, 2, \ldots, \\ 0 & \text{if } T = 0. \end{cases}$

If $k2^{-n} \le t < (k+1)2^{-n}$, then $[T_n \le t] = [T \le k2^{-n}] \in \mathscr{F}_{k2^{-n}} \subset \mathscr{F}_t$. Thus each $T_n$ is a stopping time. Suppose that $M \in \mathscr{F}_T$ and $k2^{-n} \le t < (k+1)2^{-n}$. Then $M \cap [T_n \le t] = M \cap [T \le k2^{-n}] \in \mathscr{F}_{k2^{-n}} \subset \mathscr{F}_t$. Thus $\mathscr{F}_T \subset \mathscr{F}_{T_n}$.

Let $W_t^{(n)}(\omega) = W_{T_n(\omega) + t}(\omega) - W_{T_n(\omega)}(\omega)$; that is, let $W_t^{(n)}$ be the $W_t^*$ corresponding to the stopping time $T_n$. If $M \in \mathscr{F}_T$, then $M \in \mathscr{F}_{T_n}$, and by an application of (37.34) to the discrete case already treated,

(37.36) $P([(W_{t_1}^{(n)}, \ldots, W_{t_k}^{(n)}) \in H] \cap M) = P[(W_{t_1}, \ldots, W_{t_k}) \in H]P(M).$

But $T_n(\omega) \to T(\omega)$ for each $\omega$, and by continuity of the sample paths $W_t^{(n)}(\omega) \to W_t^*(\omega)$ for each $\omega$. Condition on $M$ and apply Lemma 2(i) with $(W_{t_1}^{(n)}, \ldots, W_{t_k}^{(n)})$ for $X_n$, $(W_{t_1}^*, \ldots, W_{t_k}^*)$ for $X$, and the distribution function of $(W_{t_1}, \ldots, W_{t_k})$ for $F = F_n$. Therefore, (37.34) follows from (37.36). ∎
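Theorem 37.5 can be illustrated numerically: after the hitting time $T$ of a level $c$, the increment $W_{T+s} - W_T$ should again be centered normal with variance $s$. In the sketch below the level, lag, grid, and sample size are arbitrary choices; discarding paths for which $T + s$ exceeds the horizon is conditioning on an $\mathscr{F}_T$-event, so it does not disturb the conditional law.

```python
import numpy as np

# Monte Carlo sketch of Theorem 37.5: after T = inf[t: W_t = c], the
# increment W_{T+s} - W_T should again be N(0, s).  Level, lag, grid, and
# sample size are arbitrary choices; paths where T + s exceeds the horizon
# are discarded (an F_T event, so the conditional law is unchanged).
rng = np.random.default_rng(8)
c, s, dt = 0.5, 0.25, 1e-3
n_paths, steps = 4_000, 4_000
lag = int(s / dt)
vals = []
for _ in range(n_paths):
    W = np.cumsum(rng.normal(0.0, np.sqrt(dt), steps))
    i = int(np.argmax(W >= c))          # index of the (discrete) hitting time
    if W[i] >= c and i + lag < steps:
        vals.append(W[i + lag] - W[i])
vals = np.array(vals)
print(vals.mean(), vals.var())          # near 0 and near s
```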
The $T$ in the proof of Theorem 37.4 is a stopping time, and so (37.25) is a Brownian motion, as required in that proof. Further applications will be given below.

If $\mathscr{F}^* = \sigma[W_t^*\colon t \ge 0]$, then according to (37.34) (and Theorem 4.2) the $\sigma$-fields $\mathscr{F}_T$ and $\mathscr{F}^*$ are independent:

(37.37) $P(A \cap B) = P(A)P(B), \qquad A \in \mathscr{F}_T,\ B \in \mathscr{F}^*.$

For fixed $t$ define $T_n$ by (37.35) but with $t2^{-n}$ in place of $2^{-n}$ at each occurrence. Then $[W_T \le x] \cap [T \le t]$ is the limit superior of the sets $[W_{T_n} \le x] \cap [T \le t]$, each of which lies in $\mathscr{F}_t$. This proves that $[W_T \le x]$ lies in $\mathscr{F}_T$ and hence that $W_T$ is measurable $\mathscr{F}_T$. Since $T$ is measurable $\mathscr{F}_T$,

(37.38) $P([(T, W_T) \in H] \cap B) = P[(T, W_T) \in H]P(B), \qquad B \in \mathscr{F}^*,$

for planar Borel sets $H$.
Skorohod Embedding*

Suppose that $X_1, X_2, \ldots$ are independent and identically distributed random variables with mean 0 and variance $\sigma^2$. A powerful method, due to Skorohod, of studying the partial sums $S_n = X_1 + \cdots + X_n$ is to construct an increasing sequence $T_0 = 0, T_1, T_2, \ldots$ of stopping times such that $W(T_n)$ has the same distribution as $S_n$. The differences $T_k - T_{k-1}$ will turn out to be independent and identically distributed with mean $\sigma^2$, so that by the law of large numbers $n^{-1}T_n = n^{-1}\sum_{k=1}^{n}(T_k - T_{k-1})$ is likely to be near $\sigma^2$. But if $T_n$ is near $n\sigma^2$, then by the continuity of Brownian motion paths $W(T_n)$ will be near $W(n\sigma^2)$, and so the distribution of $S_n/\sigma\sqrt{n}$, which coincides with the distribution of $W(T_n)/\sigma\sqrt{n}$, will be near the distribution of $W(n\sigma^2)/\sigma\sqrt{n}$; that is, it will be near the standard normal distribution. The method will thus yield another proof of the central limit theorem, one independent of the characteristic-function arguments of Section 27. But it will also give more. For example, the distribution of $\max_{k \le n} S_k/\sigma\sqrt{n}$ is exactly the distribution of $\max_{k \le n} W(T_k)/\sigma\sqrt{n}$, and this in turn is near the distribution of $\sup_{t \le n\sigma^2} W(t)/\sigma\sqrt{n}$, which can be written down explicitly because of (37.16). It will thus be possible to derive the limiting distribution of $\max_{k \le n} S_k$. The joint behavior of the partial sums is closely related to the behavior of Brownian motion paths.

The Skorohod construction involves the class $\mathscr{T}$ of stopping times $T$ for which

(37.39) $E[W_T] = 0,$

(37.40) $E[T] = E[W_T^2],$

and

(37.41) $E[T^2] \le 4E[W_T^4].$

Lemma 3. All bounded stopping times are members of $\mathscr{T}$.

PROOF. Define $Y_{\theta,t} = \exp(\theta W_t - \tfrac{1}{2}\theta^2 t)$ for all $\theta$ and for $t \ge 0$. Suppose that $s \le t$ and $A \in \mathscr{F}_s$ (see (37.27)). Since Brownian motion has independent increments,

* The rest of this section, which requires martingale theory, may be omitted.
STOCHASTIC PROCESSES
and a calculation with moment generating functions (see Example 21.2) shows that
fA Y, ' s dP fA Y, , , dP, =
(37 .42)
s < t,
A E �.
This says that for (J fixed, [ Y9, ,: t > 0) is a continuous-time martingale adapted to the a-fields §',. It is the moment generating function martingale associated with the Brownian motion. Let /( 8, t) denote the right side of (37.42). By Theorem 16.8,
£ Y,, , ( w, - Ot ) dP , ; ::2 /( 6, t) L Y, , , [ ( w, - 6t ) 2 - t ] dP, o, t ) 6 t(
=
=
Differentiate the other side of the equation (37.42) the same way and set (J = 0. The result is s < t, A E �' s S t, A E �'
J (�4 - 6 �2s + 3S 2 ) dP JA ( W,4 - 6 W, 2 t + 3t 2 ) dP , =
A
s :S: t, A e � .
This gives three more martingales: If Z, is any of the three random variables (37 .43) then Z0 (37.44)
w, , =
0, Z, is integrable and measurable §',, and
JA z. dP jA z, dP,
In particular, E[Z,]
=
=
[Z0]
=
0.
s
< t,
A E �.
If $T$ is a stopping time with finite range $\{t_1, \ldots, t_m\}$ bounded by $t$, then (37.44) implies that

$E[Z_T] = \sum_i \int_{[T = t_i]} Z_{t_i}\, dP = \sum_i \int_{[T = t_i]} Z_t\, dP = E[Z_t] = 0.$

Suppose that $T$ is bounded by $t$ but does not necessarily have finite range. Put $T_n = k2^{-n}t$ if $(k-1)2^{-n}t < T \le k2^{-n}t$, $1 \le k \le 2^n$, and put $T_n = 0$ if $T = 0$. Then $T_n$ is a stopping time and $E[Z_{T_n}] = 0$. For each of the three possibilities (37.43) for $Z_t$, $\sup_{s \le t} |Z_s|$ is integrable because of (37.15). It therefore follows by the dominated convergence theorem that $E[Z_T] = \lim_n E[Z_{T_n}] = 0$.

Thus $E[Z_T] = 0$ for every bounded stopping time $T$. The three cases (37.43) give

$E[W_T] = E[W_T^2 - T] = E[W_T^4 - 6W_T^2 T + 3T^2] = 0.$

This implies (37.39), (37.40), and

$0 = E[W_T^4] - 6E[W_T^2 T] + 3E[T^2] \ge E[W_T^4] - 6E^{1/2}[W_T^4]E^{1/2}[T^2] + 3E[T^2].$

If $C = E^{1/2}[W_T^4]$ and $x = E^{1/2}[T^2]$, the inequality is $0 \ge q(x) = 3x^2 - 6Cx + C^2$. Each zero of $q$ is at most $2C$, and $q$ is negative only between these two zeros. Therefore, $x \le 2C$, which implies (37.41). ∎
Lemma 4. Suppose that $T$ and $T_n$ are stopping times, that each $T_n$ is a member of $\mathscr{T}$, and that $T_n \to T$ with probability 1. Then $T$ is a member of $\mathscr{T}$ if (i) $E[W_{T_n}^4] \le E[W_T^4] < \infty$ for all $n$, or if (ii) the $W_{T_n}^4$ are uniformly integrable.

PROOF. Since Brownian motion paths are continuous, $W_{T_n} \to W_T$ with probability 1. Each of the two hypotheses (i) and (ii) implies that $E[W_{T_n}^4]$ is bounded and hence that $E[T_n^2]$ is bounded, and it follows (see (16.23)) that the sequences $\{T_n\}$, $\{W_{T_n}\}$, and $\{W_{T_n}^2\}$ are uniformly integrable. Hence (37.39) and (37.40) for $T$ follow by Theorem 16.13 from the same relations for the $T_n$. The first hypothesis implies that $\liminf_n E[W_{T_n}^4] \le E[W_T^4]$, and the second implies that $\lim_n E[W_{T_n}^4] = E[W_T^4]$. In either case it follows by Fatou's lemma that $E[T^2] \le \liminf_n E[T_n^2] \le 4\liminf_n E[W_{T_n}^4] \le 4E[W_T^4]$. ∎
Suppose that $a, b \ge 0$ and $a + b > 0$, and let $T(a, b)$ be the hitting time for the set $\{-a, b\}$: $T(a, b) = \inf[t\colon W_t \in \{-a, b\}]$. By (37.20), $T(a, b)$ is finite with probability 1, and it is a stopping time because $T(a, b) \le t$ if and only if for every $m$ there is a rational $r \le t$ for which $W_r$ is within $m^{-1}$ of $-a$ or of $b$. From $|W(\min\{T(a, b), n\})| \le \max\{a, b\}$ it follows by Lemma 4(ii) that $T(a, b)$ is a member of $\mathscr{T}$. Since $W_{T(a,b)}$ assumes only the values $-a$ and $b$, $E[W_{T(a,b)}] = 0$ implies that

(37.45) $P[W_{T(a,b)} = -a] = \dfrac{b}{a + b}, \qquad P[W_{T(a,b)} = b] = \dfrac{a}{a + b}.$

This is obvious on grounds of symmetry in the case $a = b$.
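A simulation sketch of (37.45), together with the identity $E[T(a, b)] = E[W_{T(a,b)}^2] = ab$ that follows from (37.40). The barriers, grid, horizon, and sample size below are arbitrary choices, and the discrete grid introduces a small positive bias in both estimates.

```python
import numpy as np

# Monte Carlo sketch of (37.45) and of E[T(a,b)] = E[W_T^2] = ab (from
# (37.40)); barriers, grid, horizon, and sample size are arbitrary choices,
# and the grid introduces a small positive bias in both estimates.
rng = np.random.default_rng(7)
a, b = 1.0, 2.0
dt, t_max, n_paths = 2e-3, 40.0, 3_000
steps = int(t_max / dt)
hits_b, times = 0, np.empty(n_paths)
for p in range(n_paths):
    W = np.cumsum(rng.normal(0.0, np.sqrt(dt), steps))
    i = int(np.argmax((W <= -a) | (W >= b)))   # first exit index
    times[p] = (i + 1) * dt
    hits_b += W[i] >= b
print(hits_b / n_paths)       # near a/(a+b) = 1/3
print(times.mean())           # near ab = 2
```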
Let $\mu$ be a probability measure on the line with mean 0. The program is to construct a stopping time $T$ for which $W_T$ has distribution $\mu$. Assume that $\mu\{0\} < 1$, since otherwise $T \equiv 0$ obviously works. If $\mu$ consists of two point masses, they must for some positive $a$ and $b$ be a mass of $b/(a+b)$ at $-a$ and a mass of $a/(a+b)$ at $b$; in this case $T(a, b)$ is by (37.45) the required stopping time. The general case will be treated by adding together stopping times of this sort.

Consider a random variable $X$ having distribution $\mu$. (The probability space for $X$ has nothing to do with the space the given Brownian motion is defined on.) The technique will be to represent $X$ as the limit of a martingale $X_1, X_2, \ldots$ of a simple form and then to duplicate the martingale by $W_{T_1}, W_{T_2}, \ldots$ for stopping times $T_n$; the $T_n$ will have a limit $T$ such that $W_T$ has the same distribution as $X$.

The first step is to construct, for each $n$, a finite set $\mathscr{E}_n$ of reals together with the decomposition $\mathscr{P}_n$ of $R^1 = (-\infty, \infty)$ into intervals $I_k^n$ having endpoints in $\mathscr{E}_n$.
Let $M(H)$ be the conditional mean:

$M(H) = \dfrac{1}{\mu(H)} \int_H x\, \mu(dx) \qquad \text{if } \mu(H) > 0.$

Let $\mathscr{E}_1$ consist of the single point $M(R^1) = E[X] = 0$, so that $\mathscr{P}_1$ consists of $I_1^1 = (-\infty, 0]$ and $I_2^1 = (0, \infty)$. Suppose that $\mathscr{E}_n$ and $\mathscr{P}_n$ are given. If $\mu((I_k^n)^\circ) > 0$, split $I_k^n$ by adding to $\mathscr{E}_n$ the point $M(I_k^n)$, which lies in $(I_k^n)^\circ$; if $\mu((I_k^n)^\circ) = 0$, $I_k^n$ appears again in $\mathscr{P}_{n+1}$.

Let $\mathscr{G}_n$ be the $\sigma$-field generated by the sets $[X \in I_k^n]$ and put $X_n = E[X \| \mathscr{G}_n]$. Then $X_1, X_2, \ldots$ is a martingale and $X_n = M(I_k^n)$ on $[X \in I_k^n]$.

The $X_n$ have finite range and their joint distributions can be written out explicitly. In fact, $[X_1 = M(I_{k_1}^1), \ldots, X_n = M(I_{k_n}^n)] = [X \in I_{k_1}^1, \ldots, X \in I_{k_n}^n]$, and this set is empty unless $I_{k_1}^1 \supset \cdots \supset I_{k_n}^n$, in which case it is $[X_n = M(I_{k_n}^n)] = [X \in I_{k_n}^n]$. Therefore, if $k_{n-1} = j$ and $I_j^{n-1} = I_{k-1}^n \cup I_k^n$,
$P\bigl[X_n = M(I_{k-1}^n) \,\big|\, X_1 = M(I_{k_1}^1), \ldots, X_{n-1} = M(I_j^{n-1})\bigr] = \dfrac{\mu(I_{k-1}^n)}{\mu(I_j^{n-1})}$

and

$P\bigl[X_n = M(I_k^n) \,\big|\, X_1 = M(I_{k_1}^1), \ldots, X_{n-1} = M(I_j^{n-1})\bigr] = \dfrac{\mu(I_k^n)}{\mu(I_j^{n-1})},$

provided the conditioning event has positive probability. Thus the martingale $\{X_n\}$ has the Markov property, and if $x = M(I_j^{n-1})$, $u = M(I_{k-1}^n)$, and $v = M(I_k^n)$, then the conditional distribution of $X_n$ given $X_{n-1} = x$ is concentrated at the two points $u$ and $v$ and has mean $x$. The structure of $\{X_n\}$ is determined by these conditional probabilities together with the distribution of $X_1$.

If $\mathscr{G} = \sigma(\bigcup_n \mathscr{G}_n)$, then $X_n \to E[X \| \mathscr{G}]$ with probability 1 by the martingale theorem (Theorem 35.5). But, in fact, $X_n \to X$ with probability 1, as the following argument shows. Let $B$ be the union of all open sets of $\mu$-measure 0. Then $B$ is a countable disjoint union of open intervals; enlarge $B$ by adding to it any endpoints of $\mu$-measure 0 these intervals may have. Then $\mu(B) = 0$, and $x \notin B$ implies that $\mu(x - \epsilon, x] > 0$ and $\mu[x, x + \epsilon) > 0$ for all positive $\epsilon$.

Suppose that $x = X(\omega) \notin B$ and let $x_n = X_n(\omega)$. Let $I_{k_{n+1}}^{n+1}$ be the element of $\mathscr{P}_{n+1}$ containing $x$; then $x_{n+1} = M(I_{k_{n+1}}^{n+1})$ and $I_{k_{n+1}}^{n+1} \downarrow I$ for some interval $I$. Suppose that $x_{n+1} \le x - \epsilon$ for $n$ in an infinite sequence $N$ of integers. Then for $n$ in $N$ the point $x_{n+1}$ lies to the left of $x$ and converges along $N$ to the left endpoint, say $a$, of $I$, and $(x - \epsilon, x] \subset I$. Further, $x_{n+1} = M(I_{k_{n+1}}^{n+1}) \to M(I)$ along $N$, so that $M(I) = a$. But this is impossible because $\mu(x - \epsilon, x] > 0$. Therefore, $x_n > x - \epsilon$ for large $n$. Similarly, $x_n < x + \epsilon$ for large $n$, and so $x_n \to x$. Thus $X_n(\omega) \to X(\omega)$ if $X(\omega) \notin B$, an event of probability 1.
544
STOCHASTIC PROCESSES
Now $X_1 = E[X\|\mathscr{G}_1]$ has mean 0 and its distribution consists of point masses at $-a = M(I_1^1)$ and $b = M(I_1^2)$. If $\tau_1 = T(a, b)$ is the hitting time for $\{-a, b\}$, then (see (37.45)) $\tau_1$ is a stopping time, a member of $\mathscr{T}$, and $W_{\tau_1}$ has the same distribution as $X_1$. Let $\tau_2$ be the infimum of those $t$ for which $t > \tau_1$ and $W_t$ is one of the points $M(I_2^k)$. By (37.20), $\tau_2$ is finite with probability 1; it is a stopping time because $\tau_2 < t$ if and only if for every $m$ there are rationals $r$ and $s$ such that $r < s + m^{-1}$, $r < t$, $s < t$, $W_r$ is within $m^{-1}$ of one of the points $M(I_1^k)$, and $W_s$ is within $m^{-1}$ of one of the points $M(I_2^k)$. Since $|W(\min\{\tau_2, n\})|$ is at most the maximum of the values $|M(I_2^k)|$, it follows by Lemma 4(ii) that $\tau_2$ is a member of $\mathscr{T}$.

Define $W_t^*$ by (37.33) with $\tau_1$ for $T$. If $x = M(I_1^k)$, then $x$ is an endpoint common to two adjacent intervals $I_2^{k-1}$ and $I_2^k$; put $u = M(I_2^{k-1})$ and $v = M(I_2^k)$. If $W_{\tau_1} = x$, then $u$ and $v$ are the only possible values of $W_{\tau_2}$. If $\tau^*$ is the first time the Brownian motion $[W_t^*\colon t \ge 0]$ hits $u - x$ or $v - x$, then by (37.45),

$$P[W_{\tau^*}^* = u - x] = \frac{v - x}{v - u}, \qquad P[W_{\tau^*}^* = v - x] = \frac{x - u}{v - u}.$$

On the set $[W_{\tau_1} = x]$, $\tau_2$ coincides with $\tau_1 + \tau^*$, and it follows by (37.37) that

$$P\big[W_{\tau_2} = v \,\big|\, W_{\tau_1} = x\big] = \frac{x - u}{v - u}.$$
This, together with the same computation with $u$ in place of $v$, shows that for $W_{\tau_1} = x$ the conditional distribution of $W_{\tau_2}$ is concentrated at the two points $u$ and $v$ and has mean $x$. Thus the conditional distribution of $W_{\tau_2}$ given $W_{\tau_1}$ coincides with the conditional distribution of $X_2$ given $X_1$. Since $W_{\tau_1}$ and $X_1$ have the same distribution, the random vectors $(W_{\tau_1}, W_{\tau_2})$ and $(X_1, X_2)$ also have the same distribution. An inductive extension of this argument proves the existence of a sequence of stopping times $\tau_n$ such that $\tau_1 \le \tau_2 \le \cdots$, each $\tau_n$ is a member of $\mathscr{T}$, and for each $n$, $W_{\tau_1}, \ldots, W_{\tau_n}$ have the same joint distribution as $X_1, \ldots, X_n$.

Now suppose that $X$ has finite variance. Since $\tau_n$ is a member of $\mathscr{T}$, $E[\tau_n] = E[W_{\tau_n}^2] = E[X_n^2] = E[E^2[X\|\mathscr{G}_n]] \le E[X^2]$ by Jensen's inequality (34.5). Thus $\tau = \lim_n \tau_n$ is finite with probability 1. Obviously it is a stopping time, and by path continuity, $W_{\tau_n} \to W_\tau$ with probability 1. Since $X_n \to X$ with probability 1, it follows by Lemma 2(ii) that $W_\tau$ has the
distribution of $X$. Since $X_n^2 \le E[X^2\|\mathscr{G}_n]$, the $X_n^2$ are uniformly integrable by the lemma preceding Theorem 35.5. By the monotone convergence theorem and Theorem 16.13, $E[\tau] = \lim_n E[\tau_n] = \lim_n E[W_{\tau_n}^2] = \lim_n E[X_n^2] = E[X^2] = E[W_\tau^2]$. If $E[X^4] < \infty$, then $E[W_{\tau_n}^4] = E[X_n^4] \le E[X^4] = E[W_\tau^4]$ (Jensen's inequality again), and so $\tau$ is a member of $\mathscr{T}$. Hence $E[\tau^2] \le 4E[W_\tau^4]$.

This construction establishes the first of Skorohod's embedding theorems:
Theorem 37.6. Suppose that $X$ is a random variable with mean 0 and finite variance. There is a stopping time $\tau$ such that $W_\tau$ has the same distribution as $X$, $E[\tau] = E[X^2]$, and $E[\tau^2] \le 4E[X^4]$.

Of course, the last inequality is trivial unless $E[X^4]$ is finite. The theorem could be stated in terms not of $X$ but of its distribution, the point being that the probability space $X$ is defined on is irrelevant. Skorohod's second embedding theorem is this:
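The two-point case of this embedding can be sketched numerically. The following is our own illustration, not part of the text: Brownian motion is replaced by a simple random walk on a lattice of mesh $1/m$ (so each step advances time by $1/m^2$), and the embedded distribution is $X = -1$ or $X = 2$ with probabilities $2/3$ and $1/3$. All names are ours.

```python
import random

# Monte Carlo sketch of the Skorohod embedding in Theorem 37.6 (our own
# illustration).  Target distribution: X = -1 with probability 2/3,
# X = +2 with probability 1/3 (mean 0, E[X^2] = 2).  Brownian motion is
# approximated by a simple random walk on the lattice (1/m)Z, so the
# spatial barriers -1 and +2 become -m and +2m lattice units, and each
# step advances time by 1/m^2.

def embed_once(rng, a_units, b_units):
    """Walk until hitting -a_units or +b_units; return (endpoint, steps)."""
    s, steps = 0, 0
    while -a_units < s < b_units:
        s += 1 if rng.random() < 0.5 else -1
        steps += 1
    return s, steps

rng = random.Random(7)
m = 10
a_units, b_units = 1 * m, 2 * m      # barriers at -1 and +2 in original units
n_paths = 4000

hits_b = 0
total_tau = 0.0
for _ in range(n_paths):
    s, steps = embed_once(rng, a_units, b_units)
    if s == b_units:
        hits_b += 1
    total_tau += steps / m**2        # lattice steps -> Brownian time

frac_b = hits_b / n_paths            # should be near a/(a+b) = 1/3
mean_tau = total_tau / n_paths       # should be near E[X^2] = 2
print(frac_b, mean_tau)
```

For the lattice walk the identities are exact: the walk exits $(-a, b)$ through $b$ with probability $a/(a+b)$, and the expected exit time is $ab = E[X^2]$, mirroring $E[\tau] = E[X^2]$ in Theorem 37.6.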
Theorem 37.7. Suppose that $X_1, X_2, \ldots$ are independent and identically distributed random variables with mean 0 and finite variance, and put $S_n = X_1 + \cdots + X_n$. There is a nondecreasing sequence $\tau_1, \tau_2, \ldots$ of stopping times such that the $W_{\tau_n}$ have the same joint distributions as the $S_n$ and $\tau_1, \tau_2 - \tau_1, \tau_3 - \tau_2, \ldots$ are independent and identically distributed random variables satisfying $E[\tau_n - \tau_{n-1}] = E[X_1^2]$ and $E[(\tau_n - \tau_{n-1})^2] \le 4E[X_1^4]$.
PROOF. The method is to repeat the construction above inductively. For notational clarity write $W_t = W_t^{(1)}$ and put $\mathscr{F}_t^{(1)} = \sigma[W_s^{(1)}\colon 0 \le s \le t]$ and $\mathscr{G}^{(1)} = \sigma[W_t^{(1)}\colon t \ge 0]$. Let $\theta_1$ be the stopping time of Theorem 37.6, so that $W_{\theta_1}^{(1)}$ and $X_1$ have the same distribution. Let $\mathscr{F}_{\theta_1}^{(1)}$ be the class of $M$ such that $M \cap [\theta_1 \le t] \in \mathscr{F}_t^{(1)}$ for all $t$. Now put $W_s^{(2)} = W_{\theta_1 + s} - W_{\theta_1}$, $\mathscr{F}_t^{(2)} = \sigma[W_s^{(2)}\colon 0 \le s \le t]$, and $\mathscr{G}^{(2)} = \sigma[W_t^{(2)}\colon t \ge 0]$. By another application of Theorem 37.6, construct a stopping time $\theta_2$ for the Brownian motion $[W_t^{(2)}\colon t \ge 0]$ in such a way that $W_{\theta_2}^{(2)}$ has the same distribution as $X_1$. In fact, use for $\theta_2$ the very same martingale construction as for $\theta_1$, so that $(\theta_1, W_{\theta_1}^{(1)})$ and $(\theta_2, W_{\theta_2}^{(2)})$ have the same distribution. Since $\mathscr{F}_{\theta_1}^{(1)}$ and $\mathscr{G}^{(2)}$ are independent (see (37.37)), it follows (see (37.38)) that $(\theta_1, W_{\theta_1}^{(1)})$ and $(\theta_2, W_{\theta_2}^{(2)})$ are independent. Let $\mathscr{F}_{\theta_2}^{(2)}$ be the class of $M$ such that $M \cap [\theta_2 \le t] \in \mathscr{F}_t^{(2)}$ for all $t$. If $W_s^{(3)} = W_{\theta_1 + \theta_2 + s} - W_{\theta_1 + \theta_2}$ and $\mathscr{G}^{(3)}$ is the $\sigma$-field generated by these random variables, then again $\mathscr{F}_{\theta_2}^{(2)}$ and $\mathscr{G}^{(3)}$ are independent. These two $\sigma$-fields are contained in $\mathscr{G}^{(2)}$, which is independent of $\mathscr{F}_{\theta_1}^{(1)}$. Therefore, the three $\sigma$-fields $\mathscr{F}_{\theta_1}^{(1)}$, $\mathscr{F}_{\theta_2}^{(2)}$, $\mathscr{G}^{(3)}$ are independent. The procedure therefore extends
inductively to give independent, identically distributed random vectors $(\theta_n, W_{\theta_n}^{(n)})$. If $\tau_n = \theta_1 + \cdots + \theta_n$, then $W_{\tau_n}^{(1)} = W_{\theta_1}^{(1)} + \cdots + W_{\theta_n}^{(n)}$ has the distribution of $X_1 + \cdots + X_n$. ∎
Invariance *
If $E[X_1^2] = \sigma^2$, then, since the random variables $\tau_n - \tau_{n-1}$ of Theorem 37.7 are independent and identically distributed, the strong law of large numbers (Theorem 22.1) applies and hence so does the weak one:

(37.46)  $n^{-1}\tau_n \to_P \sigma^2.$

(If $E[X_1^4] < \infty$, so that the $\tau_n - \tau_{n-1}$ have second moments, this follows immediately by Chebyshev's inequality.) Now $S_n$ has the distribution of $W(\tau_n)$, and $\tau_n$ is near $n\sigma^2$ by (37.46); hence $S_n$ should have nearly the distribution of $W(n\sigma^2)$, namely the normal distribution with mean 0 and variance $n\sigma^2$. To prove this, choose an increasing sequence of integers $N_k$ such that $P[|n^{-1}\tau_n - \sigma^2| \ge k^{-1}] \le k^{-1}$ for $n \ge N_k$, and put $\epsilon_n = k^{-1}$ for $N_k \le n < N_{k+1}$. Then $\epsilon_n \to 0$ and $P[|n^{-1}\tau_n - \sigma^2| \ge \epsilon_n] \le \epsilon_n$. By two applications of
(37.15),

$$\delta_n(\epsilon) = P\left[\frac{|W(n\sigma^2) - W(\tau_n)|}{\sigma\sqrt{n}} \ge \epsilon\right] \le P\left[|n^{-1}\tau_n - \sigma^2| \ge \epsilon_n\right] + P\left[\sup_{|t - n\sigma^2| \le \epsilon_n n} |W(t) - W(n\sigma^2)| \ge \epsilon\sigma\sqrt{n}\right],$$

and it follows by Chebyshev's inequality that $\lim_n \delta_n(\epsilon) = 0$. Since $S_n$ is distributed as $W(\tau_n)$,

$$P\left[\frac{W(n\sigma^2)}{\sigma\sqrt{n}} \le x - \epsilon\right] - \delta_n(\epsilon) \le P\left[\frac{S_n}{\sigma\sqrt{n}} \le x\right] \le P\left[\frac{W(n\sigma^2)}{\sigma\sqrt{n}} \le x + \epsilon\right] + \delta_n(\epsilon),$$

and since $W(n\sigma^2)/\sigma\sqrt{n}$ is a standard normal variable, $S_n/\sigma\sqrt{n}$ converges in distribution to the standard normal law.
on $(B_n(\delta))^c$. Since the distribution of this last random variable is unchanged if the $W_n(t)$ are replaced by $W(t)$,

$$P\left[\sup_{t \le 1} |Z_n(t) - W_n(t)| \ge \epsilon\right] \le P(B_n(\delta)) + P\left[\sup_{t \le 1}\ \sup_{|s - t| \le 2\delta} |W(s) - W(t)| \ge \epsilon\right].$$
Let $n \to \infty$ and then $\delta \to 0$; it follows by (37.49) and the continuity of Brownian motion paths that

(37.50)  $\lim_n P\left[\sup_{t \le 1} |Z_n(t) - W_n(t)| \ge \epsilon\right] = 0$

for positive $\epsilon$. Since the processes (37.47) and (37.48) have the same finite-dimensional distributions, this proves the following general invariance principle or functional central limit theorem.
Theorem 37.8. Suppose that $X_1, X_2, \ldots$ are independent, identically distributed random variables with mean 0, variance $\sigma^2$, and finite fourth moments, and define $Y_n(t)$ by (37.47). There exist (on another probability space) for each $n$ processes $[Z_n(t)\colon 0 \le t \le 1]$ and $[W_n(t)\colon 0 \le t \le 1]$ such that the first has the same finite-dimensional distributions as $[Y_n(t)\colon 0 \le t \le 1]$, the second is a Brownian motion, and $P[\sup_{t \le 1} |Z_n(t) - W_n(t)| \ge \epsilon] \to 0$ for positive $\epsilon$.
As an application, consider the maximum $M_n = \max_{k \le n} S_k$. Now $M_n/\sigma\sqrt{n} = \sup_t Y_n(t)$ has the same distribution as $\sup_t Z_n(t)$, and it follows by (37.50) that

$$P\left[\left|\sup_{t \le 1} Z_n(t) - \sup_{t \le 1} W_n(t)\right| \ge \epsilon\right] \to 0.$$

Since $\sup_t W_n(t)$ is distributed as $\sup_{t \le 1} W(t)$, and $P[\sup_{t \le 1} W(t) > x] = 2P[N > x]$ for $x > 0$ by (37.16), it follows that

(37.51)  $\lim_n P\left[M_n/\sigma\sqrt{n} \le x\right] = P[|N| \le x], \qquad x > 0.$
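The limit law (37.51) can be checked by simulation. This sketch is ours, not the book's, using a $\pm 1$ walk (so $\sigma = 1$) and computing $P[|N| \le x]$ via the error function.

```python
import math
import random

# Monte Carlo sketch of (37.51), our own illustration: for a +-1 random
# walk (sigma = 1), P[M_n <= x*sqrt(n)] should approach
# P[|N| <= x] = erf(x / sqrt(2)) as n grows.

rng = random.Random(11)
n, trials, x = 400, 4000, 1.0
threshold = x * math.sqrt(n)           # compare M_n against x*sqrt(n)

count = 0
for _ in range(trials):
    s, m_max = 0, 0
    for _ in range(n):
        s += 1 if rng.random() < 0.5 else -1
        if s > m_max:
            m_max = s                  # running maximum of the walk
    if m_max <= threshold:
        count += 1

estimate = count / trials
limit = math.erf(x / math.sqrt(2.0))   # P[|N| <= 1], about 0.6827
print(estimate, limit)
```

At $n = 400$ the discrete walk still carries a small lattice bias, so the agreement is approximate rather than exact.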
PROBLEMS

37.1.
Let $X(t)$ be independent, standard normal variables, one for each dyadic rational $t$ (Theorem 20.4; the unit interval can be used as the probability space). Let $W(0) = 0$ and $W(1) = X(1)$. Suppose that $W(t)$ is already defined for dyadic rationals in $[0, 1]$ of rank $n$, and put

$$W\left(\frac{2k + 1}{2^{n+1}}\right) = \frac{1}{2} W\left(\frac{k}{2^n}\right) + \frac{1}{2} W\left(\frac{k + 1}{2^n}\right) + \frac{1}{2^{(n/2)+1}} X\left(\frac{2k + 1}{2^{n+1}}\right).$$
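The interpolation step above can be sketched in code (an illustration under our own naming, not part of the text): each refinement adds an independent normal correction at the dyadic midpoints, scaled by $2^{-(n/2)-1}$, which is exactly the conditional standard deviation of the midpoint given its neighbours.

```python
import math
import random

# Sketch of the dyadic midpoint construction of Problem 37.1 (names ours).
# Each level-n midpoint is the average of its dyadic neighbours plus an
# independent normal perturbation scaled by 2^{-(n/2)-1}.

def levy_path(n_levels, rng):
    """Return W at the dyadic points k / 2**n_levels, as a list."""
    n_pts = 2**n_levels + 1
    w = [0.0] * n_pts
    w[-1] = rng.gauss(0.0, 1.0)               # W(1) = X(1)
    step = n_pts - 1                           # current spacing in grid units
    for n in range(n_levels):                  # refine rank n -> rank n+1
        half = step // 2
        scale = 2.0 ** (-(n / 2.0) - 1.0)      # conditional std of midpoint
        for k in range(0, n_pts - 1, step):
            mid = k + half
            w[mid] = 0.5 * (w[k] + w[k + step]) + scale * rng.gauss(0.0, 1.0)
        step = half
    return w

rng = random.Random(3)
samples = [levy_path(4, rng) for _ in range(20000)]
# Check Var W(1/2) ~ 1/2 and Cov(W(1/2), W(1)) ~ 1/2 (= min(s, t)).
mid = [w[8] for w in samples]     # t = 1/2 on the rank-4 grid
end = [w[16] for w in samples]    # t = 1
var_mid = sum(v * v for v in mid) / len(mid)
cov = sum(u * v for u, v in zip(mid, end)) / len(mid)
print(var_mid, cov)
```

The empirical variance and covariance should both be close to $1/2$, matching the Brownian covariance $E[W_s W_t] = \min(s, t)$.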
Prove by induction that the $W(t)$ for dyadic $t$ have the finite-dimensional distributions prescribed for Brownian motion. Now construct a Brownian motion for $0 \le t \le 1$ with continuous paths by the method of Theorem 37.1. This avoids an appeal to Kolmogorov's existence theorem.

37.2. 36.10 ↑ Let $T = [0, \infty)$ and let $P$ be a probability measure on $(R^T, \mathscr{R}^T)$ having the finite-dimensional distributions prescribed for Brownian motion. Let $C$ consist of the continuous elements of $R^T$.
(a) Show that $P_*(C) = 0$, or $P^*(R^T - C) = 1$ (see (3.9) and (3.10)). Thus completing $(R^T, \mathscr{R}^T, P)$ will not give $C$ probability 1.
(b) Show that $P^*(C) = 1$.

37.3. Suppose that $[W_t\colon t \ge 0]$ is some stochastic process having independent, stationary increments satisfying $E[W_t] = 0$ and $E[W_t^2] = t$. Show that if the finite-dimensional distributions are preserved by the transformation (37.18), then they must be those of Brownian motion.

37.4. Show that $\bigcap_{t > 0} \sigma[W_s\colon s \ge t]$ contains only sets of probability 0 and 1. Do the same for $\bigcap_{\epsilon > 0} \sigma[W_t\colon 0 \le t \le \epsilon]$; give examples of sets in this $\sigma$-field.

37.5. Show by a direct argument that $W(\cdot, \omega)$ is with probability 1 of unbounded variation on $[0, 1]$: Let $Y_n = \sum_{i=1}^{2^n} |W(i 2^{-n}) - W((i - 1) 2^{-n})|$. Show that $Y_n$ has mean $2^{n/2} E[|W_1|]$ and variance $\mathrm{Var}[|W_1|]$. Conclude that $\sum_n P[Y_n \le n] < \infty$.
37.6. 37.7. 37.8.
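The growth of $Y_n$ in Problem 37.5 is easy to see numerically. In this sketch (ours, not from the text), the increments for each $n$ are drawn independently rather than from one common path, which does not affect the distributional claim.

```python
import math
import random

# Numerical sketch of Problem 37.5 (our code): the sum of absolute
# increments over the dyadic grid of mesh 2^{-n} has mean
# 2^{n/2} * E|W_1| = 2^{n/2} * sqrt(2/pi) but constant variance
# Var|W_1|, so it blows up: the path has unbounded variation.

def dyadic_abs_variation(n, rng):
    """Y_n = sum of |W(i/2^n) - W((i-1)/2^n)| for one set of increments."""
    sd = math.sqrt(2.0 ** (-n))                # each increment is N(0, 2^{-n})
    return sum(abs(rng.gauss(0.0, sd)) for _ in range(2**n))

rng = random.Random(5)
y8 = dyadic_abs_variation(8, rng)
y12 = dyadic_abs_variation(12, rng)
expected8 = 2.0 ** 4 * math.sqrt(2.0 / math.pi)    # mean of Y_8,  ~ 12.77
expected12 = 2.0 ** 6 * math.sqrt(2.0 / math.pi)   # mean of Y_12, ~ 51.08
print(y8, y12)
```

Because the variance of $Y_n$ stays bounded while the mean grows like $2^{n/2}$, a single sample already sits close to its mean, and $Y_{12}$ is roughly four times $Y_8$.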
37.9. For $a > 0$, define $T_a = \inf[t\colon W_t \ge a]$. Show by (37.16) that

(37.52)  $P[T_a \le t] = 2P[W_t \ge a] = P[|W_t| \ge a],$

and hence that the distribution of $T_a$ has over $(0, \infty)$ the density

(37.53)  $h_a(t) = \dfrac{a}{\sqrt{2\pi}}\, t^{-3/2} e^{-a^2/(2t)}.$

Show that $E[T_a] = \infty$. Show that $T_a$ has the same distribution as $a^2/N^2$, where $N$ is a standard normal variable.

37.10. ↑ (a) Show by the strong Markov property that $T_\alpha$ and $T_{\alpha + \beta} - T_\alpha$ are independent and that the latter has the same distribution as $T_\beta$. Conclude that $h_\alpha * h_\beta = h_{\alpha + \beta}$. Show that $t T_\alpha$ has the same distribution as $T_{\alpha\sqrt{t}}$.
(b) Show that each $h_\alpha$ is stable (see Problem 28.10).

37.11. ↑ Suppose that $X_1, X_2, \ldots$ are independent and each has the distribution (37.53).
(a) Show that $(X_1 + \cdots + X_n)/n^2$ also has the distribution (37.53). Contrast this with the law of large numbers.
(b) Show that $P[n^{-2} \max_{k \le n} X_k \le x] \to \exp(-a\sqrt{2/(\pi x)})$ for $x > 0$. Relate this to Theorem 14.3.

37.12. 37.9 ↑ Let $p(s, t)$ be the probability that a Brownian path has at least one zero in $(s, t)$. From (37.52), (37.53), and the Markov property, deduce
(37.54)  $p(s, t) = \dfrac{2}{\pi} \arccos\sqrt{s/t}.$

Hint: Condition with respect to $W_s$.

37.13. ↑ (a) Show that the probability of no zero in $(t, 1)$ is $(2/\pi) \arcsin\sqrt{t}$ and hence that the position of the last zero preceding 1 is distributed over $(0, 1)$ with density $\pi^{-1}(t(1 - t))^{-1/2}$.
(b) Similarly calculate the distribution of the position of the first zero following time 1.
(c) Calculate the joint distribution of the two zeros in (a) and (b).
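The arcsine law in Problem 37.13 can be illustrated with a short simulation (our code, not the book's): the last zero $L$ of a symmetric walk of length $n$ satisfies $P[L/n \le t] \to (2/\pi)\arcsin\sqrt{t}$, which equals $1/2$ exactly at $t = 1/2$.

```python
import random

# Monte Carlo sketch (ours) of the arcsine law for the last zero of a
# symmetric +-1 random walk: the fraction of paths whose last zero falls
# in the first half of [0, n] should be close to
# (2/pi) * arcsin(sqrt(1/2)) = 1/2.

rng = random.Random(2)
n, trials = 200, 4000
count = 0
for _ in range(trials):
    s, last_zero = 0, 0
    for k in range(1, n + 1):
        s += 1 if rng.random() < 0.5 else -1
        if s == 0:
            last_zero = k              # record the most recent return to 0
    if last_zero <= n // 2:
        count += 1

frac = count / trials
print(frac)                            # should be near 1/2
```

That the answer is $1/2$ despite the walk being "even" on average reflects the U-shape of the arcsine density: the last zero is most likely either very early or very late.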
37.14. ↑ (a) Show by Theorem 37.8 that $\inf_{s \le u \le t} Y_n(u)$ and $\inf_{s \le u \le t} Z_n(u)$ both converge in distribution to $\inf_{s \le u \le t} W(u)$ for $0 \le s < t \le 1$. Prove a similar result for the supremum.
(b) Let $A_n(s, t)$ be the event that $S_k$, the position at time $k$ in a symmetric random walk, is 0 for at least one $k$ in the range $sn \le k \le tn$, and show that $P(A_n(s, t)) \to (2/\pi) \arccos\sqrt{s/t}$.
(c) Let $T_n$ be the maximum $k$ such that $k \le n$ and $S_k = 0$. Show that $T_n/n$ has asymptotically the distribution with density $\pi^{-1}(t(1 - t))^{-1/2}$ over $(0, 1)$. As this density is larger at the ends of the interval than in the
middle, the last time during a night's play a gambler was even is more likely to be either early or late than to be around midnight.

37.15. ↑ Show that $p(s, t) = p(t^{-1}, s^{-1}) = p(cs, ct)$. Check this by (37.54) and also by the fact that the transformations (37.18) and (37.19) preserve the properties of Brownian motion.

37.16. Show by means of the transformation (37.19) that for positive $a$ and $b$ the probability is 1 that the process is within the boundary $-at < W_t < bt$ for all sufficiently large $t$. Show that $a/(a + b)$ is the probability that it last touches above rather than below.

37.17. The martingale calculation used for (37.45) also works for slanting boundaries. For positive $a$, $b$, $r$, let $T$ be the smallest $t$ such that either $W_t = -a + rt$ or $W_t = b + rt$, and let $p(a, b, r)$ be the probability that the exit is through the upper barrier, that is, $W_T = b + rT$.
(a) For the martingale $Y_{\theta, t}$ in the proof of Lemma 3, show that $E[Y_{\theta, T}] = 1$. Operating formally at first, conclude that
(37.55)  $E\left[e^{\theta W_T - \theta^2 T/2}\right] = 1.$

Take $\theta = 2r$ and note that $\theta W_T - \frac{1}{2}\theta^2 T$ is then $2rb$ if the exit is above (probability $p(a, b, r)$) and $-2ra$ if the exit is below (probability $1 - p(a, b, r)$). Deduce

$$p(a, b, r) = \frac{1 - e^{-2ra}}{e^{2rb} - e^{-2ra}}.$$
(b) Show that $p(a, b, r) \to a/(a + b)$ as $r \to 0$, in agreement with (37.45).
(c) It remains to justify (37.55) for $\theta = 2r$. From $E[Y_{\theta, t}] = 1$ deduce

(37.56)  $E\left[e^{\theta W_\sigma - \theta^2 \sigma/2}\right] = 1$

for nonrandom $\sigma$. By the arguments in the proofs of Lemmas 3 and 4, show that (37.56) holds for simple stopping times $\sigma$, for bounded ones, for $\sigma = T \wedge n$, for $\sigma = T$.

SECTION 38. SEPARABILITY*

Introduction
As observed a number of times above, the finite-dimensional distributions do not suffice to determine the character of the sample paths of a process. To obtain paths with natural regularity properties, the Poisson and Brownian

* This section may be omitted.
motion processes were constructed by ad hoc methods. It is always possible to ensure that the paths have a certain very general regularity property called separability, and from this property will follow in appropriate circumstances various other desirable regularity properties. Section 4 dealt with "denumerable" probabilities; questions about path functions involve all the time points and hence concern "nondenumerable" probabilities.

Example 38.1. For a mathematically simple illustration of the fact that path properties are not entirely determined by the finite-dimensional distributions, consider a probability space $(\Omega, \mathscr{F}, P)$ on which is defined a positive random variable $V$ with continuous distribution: $P[V = x] = 0$ for each $x$. For $t \ge 0$, put $X(t, \omega) = 0$ for all $\omega$, and put
(38.1)  $Y(t, \omega) = \begin{cases} 1 & \text{if } V(\omega) = t, \\ 0 & \text{if } V(\omega) \ne t. \end{cases}$
Since $V$ has continuous distribution, $P[X_t = Y_t] = 1$ for each $t$, and so $[X_t\colon t \ge 0]$ and $[Y_t\colon t \ge 0]$ are stochastic processes with identical finite-dimensional distributions; for each $t_1, \ldots, t_k$, the distribution $\mu_{t_1 \cdots t_k}$ common to $(X_{t_1}, \ldots, X_{t_k})$ and $(Y_{t_1}, \ldots, Y_{t_k})$ concentrates all its mass at the origin of $R^k$. But what about the sample paths? Of course, $X(\cdot, \omega)$ is identically 0, but $Y(\cdot, \omega)$ has a discontinuity: it is 1 at $t = V(\omega)$ and 0 elsewhere. It is because the position of this discontinuity has a continuous distribution that the two processes have the same finite-dimensional distributions. ∎
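Example 38.1 can be made concrete with a small simulation (our code and names, not the book's): sampling $Y$ at any fixed finite set of times never detects the discontinuity, because $P[V = t] = 0$ for each fixed $t$.

```python
import random

# Illustration (our code) of Example 38.1: X(t, w) = 0 identically, while
# Y(t, w) = 1 exactly at t = V(w).  Sampling Y at any fixed finite set of
# times essentially never reveals the discontinuity, because V has a
# continuous distribution.

def y_path(v):
    """Return the path Y(., w) as a function, for V(w) = v."""
    return lambda t: 1.0 if t == v else 0.0

rng = random.Random(1)
times = [0.1, 0.25, 0.5, 0.9]           # any fixed finite set of time points
observations = []
for _ in range(1000):
    v = rng.random()                     # V uniform on (0, 1)
    path = y_path(v)
    observations.extend(path(t) for t in times)

# Every sampled value is 0, matching the finite-dimensional distributions
# of X -- yet each path y_path(v) still has a jump at t = v.
print(all(obs == 0.0 for obs in observations))
```

The finite-dimensional samples are indistinguishable from those of the identically zero process, even though no path of $Y$ is continuous.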
Definitions
The idea of separability is to make a countable set of time points serve to determine the properties of the process. In all that follows, the time set $T$ will for definiteness be taken as $[0, \infty)$. Most of the results hold with an arbitrary subset of the line in the role of $T$.

As in Section 36, let $R^T$ be the set of all real functions over $T = [0, \infty)$. Let $D$ be a countable, dense subset of $T$. A function $x$, an element of $R^T$, is separable $D$, or separable with respect to $D$, if for each $t$ in $T$ there exists a sequence $t_1, t_2, \ldots$ of points such that

(38.2)  $t_n \in D, \qquad t_n \to t, \qquad x(t_n) \to x(t).$

(Because of the middle condition here, it was redundant to require $D$ dense at the outset.) For $t$ in $D$, (38.2) imposes no condition on $x$, since $t_n$ may be taken as $t$. An $x$ separable with respect to $D$ is determined by its values at
the points of $D$. Note, however, that separability requires that (38.2) hold for every $t$, an uncountable set of conditions. It is not hard to show that the set of functions separable with respect to $D$ lies outside $\mathscr{R}^T$.
Example 38.2. If $x$ is everywhere continuous or right-continuous, then it is separable with respect to every countable, dense $D$. Suppose that $x(t)$ is 0 for $t \ne v$ and 1 for $t = v$, where $v > 0$. Then $x$ is not separable with respect to $D$ unless $v$ lies in $D$. The paths $Y(\cdot, \omega)$ in Example 38.1 are of this form. ∎
The condition for separability can be stated another way: $x$ is separable $D$ if and only if for every $t$ and every open interval $I$ containing $t$, $x(t)$ lies in the closure of $[x(s)\colon s \in I \cap D]$.

Suppose that $x$ is separable $D$ and that $I$ is an open interval in $T$. If $\epsilon > 0$, then $x(t_0) + \epsilon > \sup_{t \in I} x(t) = u$ for some $t_0$ in $I$. By separability $|x(s_0) - x(t_0)| < \epsilon$ for some $s_0$ in $I \cap D$. But then $x(s_0) + 2\epsilon > u$, so that

(38.3)  $\sup_{t \in I} x(t) = \sup_{t \in I \cap D} x(t).$

Similarly,

(38.4)  $\inf_{t \in I} x(t) = \inf_{t \in I \cap D} x(t),$

and

(38.5)  $\sup_{t_0 \le t < t_0 + \delta} |x(t) - x(t_0)| = \sup_{t_0 \le t < t_0 + \delta,\; t \in D} |x(t) - x(t_0)|.$
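A small numerical sketch (ours, with an assumed step function) of Example 38.2 and the identities (38.3)-(38.5): for a right-continuous step function, suprema over an interval are already attained over the dyadic rationals, and $x(t)$ is recovered as a limit along dyadic points approaching $t$ from the right.

```python
# Numerical sketch (our code) of separability with respect to the dyadic
# rationals for a right-continuous step function with a jump at an
# irrational point c (so c itself never lies in the dyadic set D).

c = 2 ** 0.5 / 2                        # jump point, not a dyadic rational

def x(t):
    """Right-continuous step: 0 before c, 1 from c on."""
    return 1.0 if t >= c else 0.0

# Dyadic rationals of rank 12 in [0, 1] stand in for the countable dense D.
D = [k / 2**12 for k in range(2**12 + 1)]

sup_over_D = max(x(t) for t in D)       # sup over D, cf. (38.3)
sup_true = 1.0                          # sup of x over (0, 1)
print(sup_over_D == sup_true)

# Separability at t = c: dyadic points t_n decreasing to c from the right,
# along which x(t_n) -> x(c); cf. (38.2).
t_n = [(int(c * 2**n) + 1) / 2**n for n in range(4, 20)]
print(all(abs(x(t) - x(c)) == 0.0 for t in t_n))
```

Had the function instead been left-continuous at $c$, the approximating dyadic sequence would have to be taken from the left; separability requires only that some dyadic sequence work.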
A stochastic process $[X_t\colon t \ge 0]$ on $(\Omega, \mathscr{F}, P)$ is separable $D$ if $D$ is a countable, dense subset of $T = [0, \infty)$ and there is an $\mathscr{F}$-set $N$ such that $P(N) = 0$ and such that the sample path $X(\cdot, \omega)$ is separable with respect to $D$ for $\omega$ outside $N$. Finally, the process is separable if it is separable with respect to some $D$; this $D$ is sometimes called a separant. In these definitions it is assumed for the moment that $X(t, \omega)$ is a finite real number for each $t$ and $\omega$.
Suppose that [ W, : t > 0] has the finite-dimensional dis tributions of Brownian motion, but do not assume as in the preceding Example 38.4.
554
STOCHASTIC PROCESSES
section that the paths are necessarily continuous. Assume, however, that [W,: t > 0] is separable with respect to D. Fix t0 and B. Choose sets Dm = { t m1 , , t mm } of D-points such that t0 < t m1 < · · · < t mm < t 0 + 8 and Dm j D n (t 0 , t 0 + 8). By (37.11) and Markov' s inequality, • • •
[
P 'l!:a.! I W( t m ;) - W( t0 ) l Letting
m
--+ oo
(3 8 .6 )
gives p
]
>a
a S
68 2 a4
> a]
a S
68 2 a4
•
Define Bn by (37.8) but with r ranging over all the reals (not just over the dyadic rationals) in [k2 - n , (k + 1)2 - n ]. Then P( Bn ) < 6n 5j2 n follows just as (37.13) does. But for w outside B = lim supn Bn , W( · , w ) is continuous. Since P( B ) = 0, W( · , w ) is continuous for w outside an �set of probabil ity 0. If (D, $, P) is complete, then the set of w for which W( · , w ) is continuous is an �set of probability 1. Thus paths are continuous with probability 1 for any separable process having the finite-dimensional distri butions of Brownian motion-provided that the underlying space is com • plete, which can of course always be arranged. As it will be shown below that there exists a separable process with any consistently prescribed set of finite-dimensional distributions, Example 38.4 provides another approach to the construction of continuous Brownian motion. The value of the method lies in its generality. It must not, however, be imagined that separability automatically ensures smooth sample paths: Suppose that the random variables X,, t > 0, are inde pendent, each having the standard normal distribution. Let D be any countable set dense in T = [0, oo ). Suppose that I and J are open interval s with rational endpoints. Since the random variables X, with t E D n I are independent, and since the value common to the P[ X, e J ] is positive, the Example 38.5.
second Borel–Cantelli lemma implies that with probability 1, $X_t \in J$ for some $t$ in $D \cap I$. Since there are only countably many pairs $I$ and $J$ with rational endpoints, there is an $\mathscr{F}$-set $N$ such that $P(N) = 0$ and such that for $\omega$ outside $N$ the set $[X(t, \omega)\colon t \in D \cap I]$ is everywhere dense on the line for every open interval $I$ in $T$. This implies that $[X_t\colon t \ge 0]$ is separable with respect to $D$. But also of course it implies that the paths are highly irregular. This irregularity is not a shortcoming of the concept of separability; it is a necessary consequence of the properties of the finite-dimensional distributions specified in this example. ∎
Example 38.6. The process $[Y_t\colon t \ge 0]$ in Example 38.1 is not separable: The path $Y(\cdot, \omega)$ is not separable $D$ unless $D$ contains the point $V(\omega)$. The set of $\omega$ for which $Y(\cdot, \omega)$ is separable $D$ is thus contained in $[\omega\colon V(\omega) \in D]$, a set of probability 0 since $D$ is countable and $V$ has a continuous distribution. ∎
Existence Theorems
It will be proved in stages that for every consistent system of finite-dimensional distributions there exists a separable process having those distributions. Define $x$ to be separable $D$ at the point $t$ if there exist points $t_n$ in $D$ such that $t_n \to t$ and $x(t_n) \to x(t)$. Note that this is no restriction on $x$ if $t$ lies in $D$, and note that separability is the same thing as separability at every $t$.
Lemma 1. Let $[X_t\colon t \ge 0]$ be a stochastic process on $(\Omega, \mathscr{F}, P)$. There exists a countable, dense set $D$ in $[0, \infty)$, and there exists for each $t$ an $\mathscr{F}$-set $N(t)$, such that $P(N(t)) = 0$ and such that for $\omega$ outside $N(t)$ the path function $X(\cdot, \omega)$ is separable $D$ at $t$.

PROOF. Fix open intervals $I$ and $J$ and consider the probability

$$p(U) = P\left(\bigcap_{s \in U} [X_s \in J]\right)$$
for countable subsets $U$ of $I \cap T$. As $U$ increases, the intersection here decreases and so does $p(U)$. Choose $U_n$ so that $p(U_n) \to \inf_U p(U)$. If $U(I, J) = \bigcup_n U_n$, then $U(I, J)$ is a countable subset of $I \cap T$ making $p(U)$ minimal:

(38.7)  $P\left(\bigcap_{s \in U(I, J)} [X_s \in J]\right) \le P\left(\bigcap_{s \in U} [X_s \in J]\right)$
for every countable subset $U$ of $I \cap T$. If $t \in I \cap T$, then

(38.8)  $P\left([X_t \in J] \cap \bigcap_{s \in U(I, J)} [X_s \notin J]\right) = 0,$

because otherwise (38.7) would fail for $U = U(I, J) \cup \{t\}$. Let $D = \bigcup U(I, J)$, where the union extends over all open intervals $I$ and $J$ with rational endpoints. Then $D$ is a countable, dense subset of $T$. For each $t$ let

(38.9)  $N(t) = \bigcup \left\{ [X_t \in J] \cap \bigcap_{s \in U(I, J)} [X_s \notin J] \right\},$

where the union extends over all open intervals $J$ that have rational endpoints and over all open intervals $I$ that have rational endpoints and contain $t$. Then $N(t)$ is by (38.8) an $\mathscr{F}$-set such that $P(N(t)) = 0$.

Fix $t$ and $\omega \notin N(t)$. The problem is to show that $X(\cdot, \omega)$ is separable with respect to $D$ at $t$. Given $n$, choose open intervals $I$ and $J$ that have rational endpoints and lengths less than $n^{-1}$ and satisfy $t \in I$ and $X(t, \omega) \in J$. Since $\omega$ lies outside (38.9), there must be an $s_n$ in $U(I, J)$ such that $X(s_n, \omega) \in J$. But then $s_n \in D$, $|s_n - t| < n^{-1}$, and $|X(s_n, \omega) - X(t, \omega)| < n^{-1}$. Thus $s_n \to t$ and $X(s_n, \omega) \to X(t, \omega)$ for a sequence $s_1, s_2, \ldots$ in $D$. ∎

For any countable $D$, the set of $\omega$ for which $X(\cdot, \omega)$ is separable with respect to $D$ at $t$ is
(38.10)  $\bigcap_{n=1}^{\infty} \bigcup_{|s - t| < n^{-1},\, s \in D} \left[\omega\colon |X(t, \omega) - X(s, \omega)| < n^{-1}\right].$

This set lies in $\mathscr{F}$ for each $t$, and the point of the lemma is that it is possible to choose $D$ in such a way that each of these sets has probability 1.
Lemma 2. Let $[X_t\colon t \ge 0]$ be a stochastic process on $(\Omega, \mathscr{F}, P)$. Suppose that for all $t$ and $\omega$

(38.11)  $a \le X(t, \omega) \le b.$

Then there exists on $(\Omega, \mathscr{F}, P)$ a process $[X_t'\colon t \ge 0]$ having these three properties:
(i) $P[X_t' = X_t] = 1$ for each $t$.
(ii) For some countable, dense subset $D$ of $[0, \infty)$, $X'(\cdot, \omega)$ is separable $D$ for every $\omega$ in $\Omega$.
(iii) For all $t$ and $\omega$,

(38.12)  $a \le X'(t, \omega) \le b.$
PROOF. Choose a countable, dense set $D$ and $\mathscr{F}$-sets $N(t)$ of probability 0 as in Lemma 1. If $t \in D$ or if $\omega \notin N(t)$, define $X'(t, \omega) = X(t, \omega)$. If $t \notin D$, fix some sequence $\{s_n^{(t)}\}$ in $D$ for which $\lim_n s_n^{(t)} = t$ and define $X'(t, \omega) = \limsup_n X(s_n^{(t)}, \omega)$ for $\omega \in N(t)$. To sum up,

(38.13)  $X'(t, \omega) = \begin{cases} X(t, \omega) & \text{if } t \in D \text{ or } \omega \notin N(t), \\ \limsup_n X(s_n^{(t)}, \omega) & \text{if } t \notin D \text{ and } \omega \in N(t). \end{cases}$

Since $N(t) \in \mathscr{F}$, $X_t'$ is measurable $\mathscr{F}$ for each $t$. Since $P(N(t)) = 0$, $P[X_t = X_t'] = 1$ for each $t$.

Fix $t$ and $\omega$. If $t \in D$, then certainly $X'(\cdot, \omega)$ is separable $D$ at $t$, and so assume $t \notin D$. If $\omega \notin N(t)$, then by the construction of $N(t)$, $X(\cdot, \omega)$ is separable with respect to $D$ at $t$, so that there exist points $s_n$ in $D$ such that $s_n \to t$ and $X(s_n, \omega) \to X(t, \omega)$. But $X(s_n, \omega) = X'(s_n, \omega)$ because $s_n \in D$, and $X(t, \omega) = X'(t, \omega)$ because $\omega \notin N(t)$. Hence $X'(s_n, \omega) \to X'(t, \omega)$, and so $X'(\cdot, \omega)$ is separable with respect to $D$ at $t$. Finally, suppose that $t \notin D$ and $\omega \in N(t)$. Then $X'(t, \omega) = \lim_k X(s_{n_k}^{(t)}, \omega)$ for some sequence $\{n_k\}$ of integers. As $k \to \infty$, $s_{n_k}^{(t)} \to t$, and $X'(\cdot, \omega)$ is separable with respect to $D$ at $t$ in this case as well. ∎

If $[X_t'\colon t \ge 0]$ is any separable process with the same finite-dimensional distributions as $[X_t\colon t \ge 0]$, then $X'(\cdot, \omega)$ must with probability 1 assume the value 1 somewhere. In this case (38.11) holds for $a \le 0$ and $b = 1$, and equality in (38.12) cannot be avoided. If
sup I X( t, w ) I < I,
w
oo ,
558
STOCHASTIC PROCESSES
then (38.11) holds for some a and b. To treat the case in which (38.14) fails, it is necessary to allow for the possibility of infinite values. If x( t) is oo or - oo . - oo , replace the third condition in ( 38.2) by x(t n ) --+ oo or x( t n ) This extends the definition of separability to functions x that may assume infinite values and to processes [ X,: t > 0] for which X( t, w) = ± oo is a possibility. --+
If [ X,: t > 0] is a finite-va/uedprocess on (O, �, P), there exists on the same space a separable process [ x; : t > 0] such that P[ x: = X,] = 1 for each t.
Theorem 38. 1.
It is for convenience assumed here that X( t, w) is finite for all t and w, although this is not really necessary. But in some cases infinite values for certain X'( t, w ) cannot be avoided-see Example 38.8. If (38.14) holds, the result is an immediate consequence of Lemma 2. The definition of separability allows an exceptional set N of probability 0; in the construction of Lemma 2 this set is actually empty, but it is clear from the definition this could be arranged anyway. The case in which (38.14) may fail could be treated by tracing through the preceding proofs, making slight changes to allow for infinite values. A simple argument makes this unnecessary. Let g be a continuous, strictly increasing mapping of R 1 onto (0, 1). Let Y( t, w) = g( X( t, w )) . Lemma 2 applies to [ Y,: t � 0]; there exists a separable process [ Y,': t > 0] such that P[ Y,' = Y, ] = 1. Since 0 < Y(t, w) < 1, Lemma 2 ensures 0 < Y'( t, w) s 1. Define PROOF.
- oo
X '( t , w )
=
g - 1 ( Y' ( t , w ) ) + 00
if Y ' ( t, w) = 0, if O < Y' ( t, w ) < 1 , if y I ( t' (A) ) = 1 .
Then [ X,': t > 0 ] satisfies the requirements. Note that P[ X; = ± oo] = 0 • for each t. Suppose that V( w) > 0 for all w and V has a continu ous distribution. Define EX111ple 11 38.8.
if t ¢ 0, if t = 0, and put X( t, w) = h(t - V( w)). This is analogous to Example 38.7. If [ X/ :
SECTION
38.
559
SEPARABILITY
t > 0] is separable and has the finite-dimensional distributions of [ X,: t > 0], then X'( · , w ) must with probability 1 assume the value oo for • some t. Combining Theorem 38.1 with Kolmogorov' s existence theorem shows that for any consistent system of finite-dimensional distributions p. 11 1k there • • •
exists a separable process with the p. 11 1k as finite-dimensional distributions. As shown in Example 38.4, this leads to another construction of Brownian motion with continuous paths. • • •
Consequences of Separability
The next theorem implies in effect that, if the finite-dimensional distributions of a process are such that it "should" have continuous paths, then it will in fact have continuous paths if it is separable. Example 38.4 illustrates this. The same thing holds for properties other than continuity.

Let $\bar{R}^T$ be the set of functions on $T = [0, \infty)$ with values that are ordinary reals or else $\infty$ or $-\infty$. Thus $\bar{R}^T$ is an enlargement of the $R^T$ of Section 36, an enlargement necessary because separability forces infinite values. Define the function $Z_t$ on $\bar{R}^T$ by $Z_t(x) = Z(t, x) = x(t)$. This is just an extension of the coordinate function (36.8). Let $\bar{\mathscr{R}}^T$ be the $\sigma$-field in $\bar{R}^T$ generated by the $Z_t$, $t \ge 0$. Suppose that $A$ is a subset of $\bar{R}^T$, not necessarily in $\bar{\mathscr{R}}^T$. For $D \subset T = [0, \infty)$, let $A_D$ consist of those elements $x$ of $\bar{R}^T$ that agree on $D$ with some element $y$ of $A$:
(38.15)  $A_D = \bigcup_{y \in A} \bigcap_{t \in D} \left[x \in \bar{R}^T\colon x(t) = y(t)\right].$
Of course, $A \subset A_D$. Let $S_D$ denote the set of $x$ in $\bar{R}^T$ that are separable with respect to $D$. In the following theorem, $[X_t\colon t \ge 0]$ and $[X_t'\colon t \ge 0]$ are processes on spaces $(\Omega, \mathscr{F}, P)$ and $(\Omega', \mathscr{F}', P')$, which may be distinct; the path functions are $X(\cdot, \omega)$ and $X'(\cdot, \omega')$.
Theorem 38.2. Suppose of $A$ that for each countable, dense subset $D$ of $T = [0, \infty)$, the set (38.15) satisfies

(38.16)  $A_D \cap S_D \subset A.$

If $[X_t\colon t \ge 0]$ and $[X_t'\colon t \ge 0]$ have the same finite-dimensional distributions, if $[\omega\colon X(\cdot, \omega) \in A]$ lies in $\mathscr{F}$ and has $P$-measure 1, and if $[X_t'\colon t \ge 0]$ is separable, then $[\omega'\colon X'(\cdot, \omega') \in A]$ contains an $\mathscr{F}'$-set of $P'$-measure 1.
If $(\Omega', \mathscr{F}', P')$ is complete, then of course $[\omega'\colon X'(\cdot, \omega') \in A]$ is itself an $\mathscr{F}'$-set of $P'$-measure 1.

PROOF. Suppose that $[X_t'\colon t \ge 0]$ is separable with respect to $D$. The difference $[\omega'\colon X'(\cdot, \omega') \in A_D] - [\omega'\colon X'(\cdot, \omega') \in A]$ is by (38.16) a subset of $[\omega'\colon X'(\cdot, \omega') \in \bar{R}^T - S_D]$, which is contained in an $\mathscr{F}'$-set $N'$ of $P'$-measure 0. Since the two processes have the same finite-dimensional distributions and hence induce the same distribution on $(\bar{R}^T, \bar{\mathscr{R}}^T)$, and since $A_D$ lies in $\bar{\mathscr{R}}^T$, $P'[\omega'\colon X'(\cdot, \omega') \in A_D] = P[\omega\colon X(\cdot, \omega) \in A_D] \ge P[\omega\colon X(\cdot, \omega) \in A] = 1$. Thus the subset $[\omega'\colon X'(\cdot, \omega') \in A_D] - N'$ of $[\omega'\colon X'(\cdot, \omega') \in A]$ lies in $\mathscr{F}'$ and has $P'$-measure 1. ∎

Example 38.9. Consider the set $C$ of finite-valued, continuous functions on $T$. If $x \in S_D$ and $y \in C$, and if $x$ and $y$ agree on a dense $D$, then $x$ and $y$ agree everywhere: $x = y$. Therefore, $C_D \cap S_D \subset C$. Further,

$$C_D = \bigcap_{\epsilon} \bigcup_{\delta} \bigcap_{t} \bigcap_{s} \left[x \in \bar{R}^T\colon |x(s)| < \infty,\ |x(t)| < \infty,\ |x(s) - x(t)| < \epsilon\right],$$
where $\epsilon$ and $\delta$ range over the positive rationals, $t$ ranges over $D$, and the inner intersection extends over the $s$ in $D$ satisfying $|s - t| < \delta$. Hence $C_D \in \bar{\mathscr{R}}^T$. Thus $C$ satisfies the condition (38.16).

Theorem 38.2 now implies that if a process has continuous paths with probability 1, then any separable process having the same finite-dimensional distributions has continuous paths outside a set of probability 0. In particular, a Brownian motion with continuous paths was constructed in the preceding section, and so any separable process with the finite-dimensional distributions of Brownian motion has continuous paths outside a set of probability 0. The argument in Example 38.4 now becomes supererogatory. ∎
Example 38.10. There is a somewhat similar argument for the step functions of the Poisson process. Let $Z^+$ be the set of nonnegative integers, let $E$ consist of the nondecreasing functions $x$ in $\bar{R}^T$ such that $x(t) \in Z^+$ for all $t$ and such that for every $n \in Z^+$ there exists a nonempty interval $I$ such that $x(t) = n$ for $t \in I$. Then
$$E_D = \bigcap_{t \in D} [x\colon x(t) \in Z^+] \cap \bigcap_{s, t \in D,\, s < t} [x\colon x(s) \le x(t)] \cap \bigcap_{n=0}^{\infty} \bigcup_{I} \bigcap_{t \in D \cap I} [x\colon x(t) = n],$$
where $I$ ranges over the open intervals with rational endpoints. Thus $E_D \in \bar{\mathscr{R}}^T$. Clearly, $E_D \cap S_D \subset E$, and so Theorem 38.2 applies.
In Section 23 was constructed a Poisson process with paths in $E$, and therefore any separable process with the same finite-dimensional distributions will have paths in $E$ except for a set of probability 0. ∎
The class of sets A satisfying (38.16) is closed under the formation of countable unions and intersections but is not closed under complementation. Define X, and Y, as in Example 38.1 and let C be the set of continuous paths. Then [ Y,: t > 0] and [ X, : t > 0] have the same finite-dimensional distributions and the latter is separable; Y( · , w) is in • R_T - C for each w, and X( · , w) is in R_T - C for no w. Example 38. 12 ·
As a final example, consider the set J of functions with discontinuities of at most the first kind: x is in J if it is finite-valued, if x( t + ) = lim s .L t x(s) exists (finite) for t > 0 and x( t - ) = lims t ,s(s) exists (finite) for t > 0, and if x( t) lies between x( t + ) and x( t ..- ) for t > 0. Continuous and right-continuous functions are special cases. Let V denote the general system Emmple 38.13.
V: k ; r1 , . . . , rk ; s 1 , . . . , sk ; a1 , . . . , ak ,
{ 38.17)
where k is an integer, where the r; , s; , and a; are rational, and where
Define

J(D, V, ε) = ⋂_{i=1}^{k} [x: a_i ≤ x(t) ≤ a_i + ε, t ∈ (r_i, s_i) ∩ D]
        ∩ ⋂_{i=2}^{k} [x: min{a_{i−1}, a_i} ≤ x(t) ≤ max{a_{i−1}, a_i} + ε, t ∈ (s_{i−1}, r_i) ∩ D].
Let 𝒱_{m,k,δ} be the class of systems (38.17) that have a fixed value for k and satisfy r_i − s_{i−1} < δ and s_k > m. It will be shown that

(38.18)  J_0 = ⋂_{m=1}^{∞} ⋂_{ε} ⋃_{k=1}^{∞} ⋃_{δ} ⋃_{V ∈ 𝒱_{m,k,δ}} J(D, V, ε),
where ε and δ range over the positive rationals. From this it will follow that J_0 ∈ ℛ^T. It will also be shown that J_0 ∩ S_D ⊂ J, so that J satisfies the hypothesis of Theorem 38.2.

Suppose that y ∈ J. For fixed ε, let H be the set of nonnegative h for which there exist finitely many points t_i such that 0 = t_0 < t_1 < ··· < t_r = h and |y(t) − y(t′)| < ε for t and t′ in the same interval (t_{i−1}, t_i). If h_n ∈ H and h_n ↑ h, then from the existence of y(h−) follows h ∈ H. Hence H is closed. If h ∈ H, from the existence of y(h+) it follows that H contains points to the right of h. Therefore, H = [0, ∞). From this it follows that the right side of (38.18) contains J_0.

Suppose that x is a member of the right side of (38.18). It is not hard to deduce that for each t the limits

(38.19)  lim_{s↓t, s∈D} x(s),  lim_{s↑t, s∈D} x(s)
exist and that x(t) lies between them if t ∈ D. For t ∈ D take y(t) = x(t), and for t ∉ D take y(t) to be the first limit in (38.19). Then y ∈ J and hence x ∈ J_0. This argument also shows that J_0 ∩ S_D ⊂ J. •

Separability in Product Space
Kolmogorov's existence theorem applies in (R^T, ℛ^T): Construct on (R^T, ℛ^T) a probability measure P_0 with the specified finite-dimensional distributions. The map p: R^T → R̄^T defined by p(x) = x is measurable ℛ^T/ℛ̄^T; take P = P_0 p^{-1}. The process [Z_t: t ≥ 0] on (R̄^T, ℛ̄^T, P) then has the required finite-dimensional distributions. Apply Theorem 38.1 to [X_t: t ≥ 0] = [Z_t: t ≥ 0] and (Ω, ℱ, P) = (R̄^T, ℛ̄^T, P). Specifically, apply the construction in Lemma 2, but without the restriction (38.11), so that infinite values are allowed. Since the set (38.10) lies in ℱ = ℛ̄^T, the sets N(t) of Lemma 1 can be taken to be ℱ-sets. Thus each N(t) can be taken to be exactly the ω-set where X(·, ω) is not separable with respect to D at t. Let Ω′ be the set where the new path X′(·, ω) is the same as the old path X(·, ω):

(38.20)  Ω′ = ⋂_{t≥0} [ω: X′(t, ω) = X(t, ω)].
The proof of Lemma 2 shows that X′(·, ω) is separable with respect to D for every ω. Therefore, X(·, ω) is separable with respect to D if ω ∈ Ω′. It will be shown that Ω′ has outer measure 1: Ω′ ⊂ A and A ∈ ℱ = ℛ̄^T together imply that P(A) = 1. By Theorem 36.3(ii), A ∈ σ[X_t: t ∈ S] for some countable set S in T. If

(38.21)  ⋂_{t∈S} [ω: X′(t, ω) = X(t, ω)] ⊂ A,
then P(A) = 1 will follow because P[X′_t = X_t] = 1 for each t. Now X′(·, ω) is an element of Ω = R̄^T; denote it ψ(ω), so that X′(t, ω) = X(t, ψ(ω)). Since X′(·, ω) is by construction separable with respect to D for each ω, so is X(·, ψ(ω)) = ψ(ω). Thus ψ(ω) ∉ N(t) for each t. Then by the construction (38.13), X(t, ψ(ω)) = X′(t, ψ(ω)) for all t, so that ψ(ω) ∈ Ω′ and hence ψ(ω) ∈ A. But X(t, ω) = X′(t, ω) for all t ∈ S implies that X(t, ω) = X(t, ψ(ω)) for all t ∈ S; hence (Theorem 36.3(i)) A contains ω as well as ψ(ω). This proves (38.21).

Let ℱ′ consist of the sets of the form Ω′ ∩ A for A ∈ ℱ. Then ℱ′ is a σ-field in Ω′, and because Ω′ has outer measure 1, setting P′(Ω′ ∩ A) = P(A) consistently defines a probability measure on ℱ′. The coordinate functions X_t = Z_t restricted to Ω′ give a stochastic process on (Ω′, ℱ′, P′) for which all paths are separable and the finite-dimensional distributions are the same as before.
Appendix
Gathered here for easy reference are certain definitions and results from set theory and real analysis required in the text. Although there are many newer books, HAUSDORFF (the early sections) on set theory and HARDY on analysis are still excellent for the general background assumed here.

Set Theory

A1. In this book, the empty set is denoted by ∅. Sets are variable subsets of some space that is fixed in any one definition, argument, or discussion; this space is denoted either generically by Ω or by some special symbol (such as R^k for Euclidean k-space). A singleton is a set consisting of just one point or element. That A is a subset of B is expressed by A ⊂ B. In accordance with standard usage, A ⊂ B does not preclude A = B; A is a proper subset of B if A ⊂ B and A ≠ B. The complement of A is always relative to the overall space Ω; it consists of the points of Ω not contained in A and is denoted by A^c. The difference between A and B, denoted by A − B, is A ∩ B^c; here B need not be contained in A, and if it is, then A − B is a proper difference. The symmetric difference A △ B = (A ∩ B^c) ∪ (A^c ∩ B) consists of the points that lie in one of the sets A and B but not both.

A2. The set of ω that lie in A and satisfy a given property p(ω) is denoted
[ω ∈ A: p(ω)]; if A = Ω, this is usually shortened to [ω: p(ω)].
A3. In this book, to say that a collection [A_θ: θ ∈ Θ] is disjoint always means that it is pairwise disjoint: A_θ ∩ A_θ′ = ∅ if θ and θ′ are distinct elements of the index set Θ. To say that A meets B, or that B meets A, is to say that they are not disjoint: A ∩ B ≠ ∅. The collection [A_θ: θ ∈ Θ] covers B if B ⊂ ⋃_θ A_θ. The collection is a decomposition or partition of B if it is disjoint and B = ⋃_θ A_θ.

A4. By A_n ↑ A is meant A_1 ⊂ A_2 ⊂ ··· and A = ⋃_n A_n; by A_n ↓ A is meant A_1 ⊃ A_2 ⊃ ··· and A = ⋂_n A_n.
A5. The indicator, or indicator function, of a set A is the function on Ω that assumes the value 1 on A and 0 on A^c; it is denoted I_A. The alternative term "characteristic function" is reserved for the Fourier transform (see Section 26).
A6. DeMorgan's laws are (⋃_θ A_θ)^c = ⋂_θ A_θ^c and (⋂_θ A_θ)^c = ⋃_θ A_θ^c. These and the other facts of basic set theory are assumed known: a countable union of countable sets is countable, and so on.
A7. If T: Ω → Ω′ is a mapping of Ω into Ω′ and A′ is a set in Ω′, the inverse image of A′ is T^{-1}A′ = [ω ∈ Ω: Tω ∈ A′]. It is easily checked that each of these statements is equivalent to the next: ω ∈ Ω − T^{-1}A′, ω ∉ T^{-1}A′, Tω ∉ A′, Tω ∈ Ω′ − A′, ω ∈ T^{-1}(Ω′ − A′). Therefore Ω − T^{-1}A′ = T^{-1}(Ω′ − A′). Simple considerations of this kind show that ⋃_θ T^{-1}A_θ′ = T^{-1}(⋃_θ A_θ′) and ⋂_θ T^{-1}A_θ′ = T^{-1}(⋂_θ A_θ′), and that A′ ∩ B′ = ∅ implies T^{-1}A′ ∩ T^{-1}B′ = ∅ (the reverse implication is false unless TΩ = Ω′).

If f maps Ω into another space, f(ω) is the value of the function f at an unspecified value of the argument ω. The function f itself (the rule defining the mapping) is sometimes denoted f(·). This is especially convenient for a function f(ω, t) of two arguments: For each fixed t, f(·, t) denotes the function on Ω with value f(ω, t) at ω.
A8. The axiom of choice. Suppose that [A_θ: θ ∈ Θ] is a decomposition of Ω into nonempty sets. The axiom of choice says that there exists a set (at least one set) C that contains exactly one point from each A_θ: C ∩ A_θ is a singleton for each θ in Θ. The existence of such sets C is assumed in "everyday" mathematics, and the axiom of choice may even seem to be simply true. A careful treatment of set theory, however, is based on an explicit list of such axioms and a study of the relationships between them; see HALMOS or JECH. A few of the problems require Zorn's lemma, which is equivalent to the axiom of choice; see HALMOS.
The Real Line

A9. The real line is denoted by R¹. It is convenient to be explicit about open, closed, and half-open intervals:

(a, b) = [x: a < x < b],
[a, b] = [x: a ≤ x ≤ b],
(a, b] = [x: a < x ≤ b],
[a, b) = [x: a ≤ x < b].
A10. Of course x_n → x means lim_n x_n = x; x_n ↑ x means x_1 ≤ x_2 ≤ ··· and x_n → x; x_n ↓ x means x_1 ≥ x_2 ≥ ··· and x_n → x.

A sequence {x_n} is bounded if and only if every subsequence {x_{n_k}} contains a further subsequence {x_{n_{k(j)}}} that converges to some x: lim_j x_{n_{k(j)}} = x. If {x_n} is not bounded, then for each k there is an n_k for which |x_{n_k}| > k; no subsequence of {x_{n_k}} can converge. The implication in the other direction is a simple consequence of the fact that every bounded sequence contains a convergent subsequence.

If {x_n} is bounded, and if each subsequence that converges at all converges to x, then lim_n x_n = x. If x_n does not converge to x, then |x_n − x| ≥ ε for some positive ε and some increasing sequence {n_k} of integers; some subsequence of {x_{n_k}} converges, but the limit cannot be x.
A11. A set G is defined as open if for each x in G there is an open interval I such that x ∈ I ⊂ G. A set F is defined as closed if F^c is open. The interior of A, denoted A°, consists of the x in A for which there exists an open interval I such that x ∈ I ⊂ A. The closure of A, denoted A⁻, consists of the x for which there exists a sequence {x_n} in A with x_n → x. The boundary of A is ∂A = A⁻ − A°. The basic facts of real analysis are assumed known: A is open if and only if A = A°; A is closed if and only if A = A⁻; A is closed if and only if it contains all limits of sequences in it; x lies in ∂A if and only if there is a sequence {x_n} in A and a sequence {y_n} in A^c such that x_n → x and y_n → x; and so on.
To see this, define points x and y of G to be equivalent if x < y and [ x, y] c G or y < x and [ y, x ] c G. This is an equivalence relation. Each equivalence class is an interval, and since G is open, each is in fact an open interval. Thus G is a disjoint union of open (nonempty) intervals, and there can be only countably many of them, since each contains a rational. Al3. The simplest form of the Heine-Bore/ theorem says that if [a, b)
c
Uf.... 1 (a k , bk ), then [ a , b) c Uk _ 1 (a k , bk ) for some n. A set A is defined to be compact if each cover of it by open sets has a finite subcover-that is, if [ G8 : 8 e 9] covers A and each G8 is open, then some finite subcollection { G8 1 , , G8 } • • •
covers A . Equivalent to the Heine-Borel theorem is the assertion that a bounded , closed set is compact. Also equivalent is the assertion that every bounded sequence of real numbers has a convergent subsequence.
A14. The diagonal method. From this last fact follows one of the basic principles of analysis.

Theorem. Suppose that each row of the array

(1)  x_{1,1}, x_{1,2}, x_{1,3}, ...
     x_{2,1}, x_{2,2}, x_{2,3}, ...
     .........................

is a bounded sequence of real numbers. Then there exists an increasing sequence n_1, n_2, ... of integers such that the limit lim_k x_{r,n_k} exists for r = 1, 2, ....
PROOF. From the first row, select a convergent subsequence

(2)  x_{1,n_{1,1}}, x_{1,n_{1,2}}, x_{1,n_{1,3}}, ...;

here {n_{1,k}} is an increasing sequence of integers and lim_k x_{1,n_{1,k}} exists. Look next at the second row of (1) along the sequence n_{1,1}, n_{1,2}, ...:

(3)  x_{2,n_{1,1}}, x_{2,n_{1,2}}, x_{2,n_{1,3}}, ....

As a subsequence of the second row of (1), (3) is bounded. Select from it a convergent subsequence

x_{2,n_{2,1}}, x_{2,n_{2,2}}, x_{2,n_{2,3}}, ...;

here {n_{2,k}} is an increasing sequence of integers, a subsequence of {n_{1,k}}, and lim_k x_{2,n_{2,k}} exists.
Continue inductively in the same way. This gives an array

(4)  n_{1,1}, n_{1,2}, n_{1,3}, ...
     n_{2,1}, n_{2,2}, n_{2,3}, ...
     .........................
with three properties. (i) Each row of (4) is an increasing sequence of integers. (ii) The rth row is a subsequence of the (r − 1)st. (iii) For each r, lim_k x_{r,n_{r,k}} exists. Thus

(5)  x_{r,n_{r,1}}, x_{r,n_{r,2}}, x_{r,n_{r,3}}, ...

converges for each r. Put n_k = n_{k,k}; for k ≥ r the n_k all lie in the rth row of (4) and increase, and so lim_k x_{r,n_k} exists for each r. •

A20. Cauchy's equation.

Theorem. Suppose that f satisfies Cauchy's equation f(x + y) = f(x) + f(y) for x, y > 0. If there is some interval on which f is bounded above, then f(x) = xf(1) for x > 0.
PROOF. The problem is to prove that g(x) = f(x) − xf(1) vanishes identically. Clearly, g(1) = 0, and g satisfies Cauchy's equation and on some interval is bounded above. By induction, g(nx) = ng(x); hence ng(m/n) = g(m) = mg(1) = 0, so that g(r) = 0 for positive rational r.

Suppose that g(x_0) ≠ 0 for some x_0. If g(x_0) < 0, then g(r_0 − x_0) = −g(x_0) > 0 for rational r_0 > x_0. It is thus no restriction to assume that g(x_0) > 0. Let I be an open interval in which g is bounded above. Given a number M, choose n so that ng(x_0) > M, and then choose a rational r so that nx_0 + r lies in I. If r > 0, then g(r + nx_0) = g(r) + g(nx_0) = g(nx_0) = ng(x_0). If r < 0, then ng(x_0) = g(nx_0) = g((−r) + (nx_0 + r)) = g(−r) + g(nx_0 + r) = g(nx_0 + r). In either case, g(nx_0 + r) = ng(x_0); of course this is trivial if r = 0. Since g(nx_0 + r) = ng(x_0) > M and M was arbitrary, g is not bounded above in I, a contradiction. •
Obviously, the same proof works if f is bounded below in some interval.

Corollary. Let U be a real function on (0, ∞), and suppose that U(x + y) = U(x)U(y) for x, y > 0. Suppose further that there is some interval on which U is bounded above. Then either U(x) = 0 for x > 0, or else there is an A such that U(x) = e^{Ax} for x > 0.
PROOF. Since U(x) = U²(x/2), U is nonnegative. If U(x) = 0, then U(x/2ⁿ) = 0, and so U vanishes at points arbitrarily near 0. If U vanishes at a point, it must by the functional equation vanish everywhere to the right of that point. Hence U is identically 0 or else everywhere positive. In the latter case, the theorem applies to f(x) = log U(x), this function being bounded above in some interval, and so f(x) = Ax for A = log U(1). •
A21. A number-theoretic fact.

Theorem. Suppose that M is a set of positive integers closed under addition and that M has greatest common divisor 1. Then M contains all integers exceeding some n_0.

PROOF. Let M_1 consist of all the integers m, −m, and m − m′ with m and m′ in M. Then M_1 is closed under addition and subtraction (it is a subgroup of the group of integers). Let d be the smallest positive element of M_1. If n ∈ M_1, write n = qd + r, where 0 ≤ r < d. Since r = n − qd lies in M_1, r must actually be 0. Thus M_1 consists of the multiples of d. Since d divides all the integers in M_1 and hence all the integers in M, and since M has greatest common divisor 1, d = 1. Thus M_1 contains all the integers.

Write 1 = m − m′ with m and m′ in M (if 1 itself is in M, the proof is easy), and take n_0 = (m + m′)². Given n > n_0, write n = q(m + m′) + r, where 0 ≤ r < m + m′. From n > n_0 ≥ (r + 1)(m + m′) follows q = (n − r)/(m + m′) > r. But n = q(m + m′) + r(m − m′) = (q + r)m + (q − r)m′, and since q + r > q − r > 0, this lies in M. •
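The theorem can be checked numerically by treating M as the additive closure of a finite set of generators with greatest common divisor 1 and searching for the largest integer that is not a sum of generators. The sketch below is our own illustration, not part of the text; the search bound comes from Schur's classical estimate rather than the cruder (m + m′)² of the proof.

```python
from functools import reduce
from math import gcd

def largest_gap(gens):
    # Largest integer NOT representable as a sum of the generators (with
    # repetition).  Schur's bound says every integer beyond
    # (min(gens) - 1)(max(gens) - 1) - 1 is representable, so searching
    # up to max(gens)**2 is more than enough.
    assert reduce(gcd, gens) == 1, "the theorem needs gcd 1"
    limit = max(gens) ** 2
    ok = [False] * (limit + 1)
    ok[0] = True                      # the empty sum
    for n in range(1, limit + 1):
        ok[n] = any(n >= g and ok[n - g] for g in gens)
    return max(n for n in range(limit + 1) if not ok[n])

print(largest_gap([3, 5]))   # 7: every integer > 7 is a sum of 3's and 5's
```

For two coprime generators a and b the answer is the classical ab − a − b; with more generators (even none of them pairwise coprime, as with 6, 10, 15) the same search applies.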
A22. One- and two-sided derivatives.

Theorem. Suppose that f and g are continuous on [0, ∞) and g is the right-hand derivative of f on (0, ∞): f⁺(t) = g(t) for t > 0. Then f⁺(0) = g(0) as well, and g is the two-sided derivative of f on (0, ∞).
PROOF. It suffices to show that F(t) = f(t) − f(0) − ∫₀ᵗ g(s) ds vanishes for t ≥ 0. By assumption, F is continuous on [0, ∞) and F⁺(t) = 0 for t > 0. Suppose that F(t_0) > F(t_1), where 0 < t_0 < t_1. Then G(t) = F(t) − (t − t_0)(F(t_1) − F(t_0))/(t_1 − t_0) is continuous on [0, ∞), G(t_0) = G(t_1), and G⁺(t) > 0 on (0, ∞). But then the maximum of G over [t_0, t_1] must occur at some interior point; since G⁺ ≤ 0 at a local maximum, this is impossible. Similarly F(t_0) < F(t_1) is impossible. Thus F is constant over (0, ∞) and by continuity is constant over [0, ∞). Since F(0) = 0, F vanishes on [0, ∞). •
A23. A differential equation. The equation f′(t) = Af(t) + g(t) (t ≥ 0; g continuous) has the particular solution f_0(t) = e^{At} ∫₀ᵗ g(s)e^{−As} ds; for an arbitrary solution f, (f(t) − f_0(t))e^{−At} has derivative 0 and hence equals f(0) identically. All solutions thus have the form f(t) = e^{At}[f(0) + ∫₀ᵗ g(s)e^{−As} ds].
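The solution formula is easy to compare against a direct numerical integration of the equation. The sketch below is our own illustration: the names (`closed_form`, `rk4`) and the demo choices A = −0.5, g = cos, f(0) = 2 are not from the text.

```python
import math

A, f_init = -0.5, 2.0      # arbitrary demo choices
g = math.cos               # a continuous forcing term

def integral(h, a, b, n=2000):
    # composite Simpson rule for the integral of h over [a, b]
    w = (b - a) / n
    s = h(a) + h(b) + sum((4 if i % 2 else 2) * h(a + i * w) for i in range(1, n))
    return s * w / 3

def closed_form(t):
    # f(t) = e^{At} [ f(0) + integral_0^t g(s) e^{-As} ds ]
    return math.exp(A * t) * (f_init + integral(lambda s: g(s) * math.exp(-A * s), 0, t))

def rk4(t_end, steps=4000):
    # integrate f'(t) = A f(t) + g(t) directly from f(0) = f_init
    dt = t_end / steps
    t, f = 0.0, f_init
    d = lambda t, f: A * f + g(t)
    for _ in range(steps):
        k1 = d(t, f)
        k2 = d(t + dt / 2, f + dt * k1 / 2)
        k3 = d(t + dt / 2, f + dt * k2 / 2)
        k4 = d(t + dt, f + dt * k3)
        f += dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6
        t += dt
    return f

print(abs(closed_form(3.0) - rk4(3.0)))   # tiny: the two computations agree
```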
A24. A trigonometric identity. If z ≠ 1 and z ≠ 0, then

∑_{k=−l}^{l} z^k = z^{−l} ∑_{k=0}^{2l} z^k = (z^{−l} − z^{l+1}) / (1 − z),

and hence

∑_{l=0}^{m−1} ∑_{k=−l}^{l} z^k = ∑_{l=0}^{m−1} (z^{−l} − z^{l+1}) / (1 − z)
    = (1/(1 − z)) [ (1 − z^{−m})/(1 − z^{−1}) − z(1 − z^m)/(1 − z) ]
    = (2 − z^{−m} − z^m) / ((1 − z)(1 − z^{−1}))
    = (z^{m/2} − z^{−m/2})² / (z^{1/2} − z^{−1/2})².

Take z = e^{ix}. If x is not an integral multiple of 2π, then

∑_{l=0}^{m−1} ∑_{k=−l}^{l} e^{ikx} = ( sin ½mx / sin ½x )².

If x = 2πn, the left-hand side here is m², which is the limit of the right-hand side as x → 2πn.
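The closing identity can be verified numerically; the two helper names below are our own.

```python
import cmath, math

def double_sum(m, x):
    # sum_{l=0}^{m-1} sum_{k=-l}^{l} e^{ikx}
    return sum(cmath.exp(1j * k * x) for l in range(m) for k in range(-l, l + 1))

def sine_ratio(m, x):
    # (sin(mx/2) / sin(x/2))^2, for x not a multiple of 2*pi
    return (math.sin(m * x / 2) / math.sin(x / 2)) ** 2

for m in (1, 2, 5, 13):
    for x in (0.3, 1.7, -0.8):
        assert abs(double_sum(m, x) - sine_ratio(m, x)) < 1e-9

# at x = 0 (a multiple of 2*pi) the double sum is 1 + 3 + ... + (2m - 1) = m^2
assert abs(double_sum(7, 0.0) - 49) < 1e-12
print("identity verified")
```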
Infinite Series

A25. Nonnegative series. Suppose x_1, x_2, ... are nonnegative. If E is a finite set of integers, then E ⊂ {1, 2, ..., n} for some n, so that by nonnegativity ∑_{k∈E} x_k ≤ ∑_{k=1}^{n} x_k. The set of partial sums ∑_{k=1}^{n} x_k thus has the same supremum as the larger set of sums ∑_{k∈E} x_k (E finite). Therefore, the nonnegative series ∑_{k=1}^{∞} x_k converges if and only if the sums ∑_{k∈E} x_k for finite E are bounded, in which case the sum is the supremum: ∑_{k=1}^{∞} x_k = sup_E ∑_{k∈E} x_k.
A26. Dirichlet's theorem. Since the supremum in A25 is invariant under permutations, so is ∑_{k=1}^{∞} x_k: If the x_k are nonnegative and y_k = x_{f(k)} for some one-to-one map f of the positive integers onto themselves, then ∑_k x_k and ∑_k y_k diverge or converge together and in the latter case have the same sum.
A27. Double series. Suppose that x_{ij}, i, j = 1, 2, ..., are nonnegative. The ith row gives a series ∑_j x_{ij}, and if each of these converges, one can form the series ∑_i ∑_j x_{ij}. Let the terms x_{ij} be arranged in some order as a single infinite series ∑_{ij} x_{ij}; by Dirichlet's theorem, the result is the same whatever order is used.

Suppose each ∑_j x_{ij} converges and ∑_i ∑_j x_{ij} converges. If E is a finite set of the pairs (i, j), there is an n for which ∑_{(i,j)∈E} x_{ij} ≤ ∑_{i≤n} ∑_{j≤n} x_{ij} ≤ ∑_{i≤n} ∑_j x_{ij} ≤ ∑_i ∑_j x_{ij}; hence ∑_{ij} x_{ij} converges and has sum at most ∑_i ∑_j x_{ij}. On the other hand, if ∑_{ij} x_{ij} converges, then ∑_{i≤m} ∑_{j≤n} x_{ij} ≤ ∑_{ij} x_{ij}; letting n → ∞ and then m → ∞ shows that each ∑_j x_{ij} converges and that ∑_i ∑_j x_{ij} ≤ ∑_{ij} x_{ij}. Therefore, in the nonnegative case, ∑_{ij} x_{ij} converges if and only if the ∑_j x_{ij} all converge and ∑_i ∑_j x_{ij} converges, in which case ∑_{ij} x_{ij} = ∑_i ∑_j x_{ij}. By symmetry, ∑_{ij} x_{ij} = ∑_j ∑_i x_{ij}. Thus the order of summation can be reversed in a nonnegative double series: ∑_i ∑_j x_{ij} = ∑_j ∑_i x_{ij}.
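For a quick numerical look at the order-reversal, take the nonnegative terms x_{ij} = 2^{−i} 3^{−j} (an arbitrary example of ours); both iterated sums equal (∑_i 2^{−i})(∑_j 3^{−j}) = 1 · ½.

```python
N = 60  # enough terms for double precision

x = lambda i, j: 2.0 ** -i * 3.0 ** -j   # nonnegative double array

rows_then_cols = sum(sum(x(i, j) for j in range(1, N)) for i in range(1, N))
cols_then_rows = sum(sum(x(i, j) for i in range(1, N)) for j in range(1, N))

assert abs(rows_then_cols - cols_then_rows) < 1e-15
assert abs(rows_then_cols - 0.5) < 1e-12   # (sum 2^-i)(sum 3^-j) = 1 * (1/2)
print(rows_then_cols)
```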
A28.
The Weierstrass M-test.
Theorem. Suppose that limn xn k == xk for each k E k Mk < oo . Then Ek xk and all the Ek xn k converge,
and that l xn k I < Mk , where and limnE k xn k Ek xA . ==
The series of course converge absolutely since Ek Mk < oo . Now IE k xn k E k xk I < E k < k l xn k - xk I + 2Ek k Mk . Gj91en £ , choose k0 so that E k k 0 Mk < £/ 3, and then �boose n 0 so that n � n 0 implies l xn k - xk I < £/3k0 for k < k0 . Then n > n 0 implies IEk xn k - Ek xk l < £. • PROOF.
>
>
Power series. The principal fact needed is this: If f( x ) verges in the range l x l < r, then it is differentiable there and A29.
( 8)
/'( x )
==
==
k Ek' 0a k x con
00
k E ka k x - I .
k-1
For a simple proof, choose r0 and ri so that l x l < r0 < ri < r. If I h i < r0 - l x l , so that l x + h I < r0 , then the mean-value theorem gives (here 0 < (Jh < 1) ( 9)
k
( x + h) - x
h
k
- kx
k_I
==
I
k 1 k k ( x + 8h h ) - - kx
-
1
1 < 2 kr0k -
1
.
Since 2 krt - I jrf goes to 0, it is bounded by some M, and if M" l a k I · Mrf , then Ek Mk < oo and I a k I times the left member of (9) is at most Mk for I h I < r0 - I x I · By the M-test [A28] (applied with h --+ 0 instead of n --+ oo ) =
,
Hence (8).
Repeated application of (8) gives

f^{(j)}(x) = ∑_{k=j}^{∞} k(k − 1) ··· (k − j + 1) a_k x^{k−j}.

For x = 0, this is a_j = f^{(j)}(0)/j!, the formula for the coefficients in a Taylor series. This shows in particular that the values of f(x) for |x| < r determine the coefficients a_k.
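As a check on (8), differentiate the geometric series ∑_k x^k = 1/(1 − x) term by term; the code is our own illustration.

```python
x, terms = 0.4, 200   # well inside the radius of convergence r = 1

f = sum(x ** k for k in range(terms))                     # sum_k x^k
f_prime = sum(k * x ** (k - 1) for k in range(1, terms))  # formula (8)

assert abs(f - 1 / (1 - x)) < 1e-12             # 1/(1 - x)
assert abs(f_prime - 1 / (1 - x) ** 2) < 1e-12  # d/dx 1/(1 - x) = 1/(1 - x)^2
print(f_prime)
```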
A30. Cesàro averages. If x_n → x, then n^{−1} ∑_{k=1}^{n} x_k → x. To prove this, let M bound |x_k| and |x|, and given ε, choose k_0 so that |x − x_k| < ε/2 for k ≥ k_0. If n ≥ k_0 and n ≥ 4k_0M/ε, then

| x − (1/n) ∑_{k=1}^{n} x_k | ≤ (1/n) ∑_{k=1}^{k_0−1} 2M + (1/n) ∑_{k=k_0}^{n} ε/2 ≤ ε.
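A quick numerical look at Cesàro averaging, with the arbitrary example x_n = 1 + 1/n (so x_n → 1):

```python
def cesaro(seq):
    # running averages n^{-1} sum_{k=1}^n x_k
    total, out = 0.0, []
    for n, val in enumerate(seq, start=1):
        total += val
        out.append(total / n)
    return out

xs = [1 + 1 / n for n in range(1, 100001)]
avgs = cesaro(xs)

assert abs(xs[-1] - 1) < 1e-4     # x_n -> 1
assert abs(avgs[-1] - 1) < 2e-4   # the averages converge to the same limit
print(avgs[-1])
```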
A31. Dyadic expansions. Define a mapping T of the unit interval Ω = (0, 1] into itself by

Tω = 2ω if 0 < ω ≤ ½,  Tω = 2ω − 1 if ½ < ω ≤ 1.

Define a function d_1 on Ω by

(10)  d_1(ω) = 0 if 0 < ω ≤ ½,  d_1(ω) = 1 if ½ < ω ≤ 1,

and put d_n(ω) = d_1(T^{n−1}ω). Then ∑_{i=1}^{∞} d_i(ω)/2^i is the nonterminating dyadic expansion of ω.
( X ) + ( 1 - t) fP( y)
573
APPENDIX
for x, y e I and 0 � t
( b ) - q> ( a ) b-a
==
-
bc; ds s) ( J b-a a 1
�
q> '( b )
( c) - q> ( a ))/( c - a), which is the same as (13). As B moves to A from the right, slope A B is thus nonincreasing and hence has a limit. In other words, q>
has a right-hand derivative fP + . Figure (ii) shows that slope DE < slope EF < slope FG. Let E move to D from the right and let G move to F from the right: The right-hand derivative at D is at most that at F, and fP + is nondecreasing. Since the slope of XY in Figure (iii) is at least as great as the right-hand derivative at X, the curve to the right of X lies on or above the line through X with slope q> + ( x ) : ( 14) Figure (iii) also makes it clear that
y � X.
fP is right-continuous.
574 APPENDIX Similarly, q> has a nondecreasing left-hand derivative q>- and is left-continuous. Since slope A B < slope BC in Figure (i), q>- ( b) < q> + ( b). Since clearly q> + ( b) < oo and - oo < q>- ( b), q> + and q>- are finite. Finally, (14) and its right-sided analogue show that the curve lies on or above each line through Z in Figure (iv) with slope between q> - ( z ) and " + ( z ):
{ 15)
q> ( x ) � q> ( z ) + m( x - z) ,
This is a support line.
Notes on the Problems
These notes consist of hints, solutions, and references to the literature. As a rule a solution is complete in proportion to the frequency with which it is needed for the solution of subsequent problems. Section 1 1.1.
(a) Each point of the discrete space lies in one of the four sets A 1 n A 2 ,
A f n A 2 , A 1 n A2 , Af n A2 and hence 2- 2 ; continue. (b) If, for each i, B; is A ; or Af, then most n;_ 1 (1 a; ) < exp[ - E?.. 1 a; ).
would have probability at most
B1
n · · · n Bn
has probability at
-
1.4.
1.5.
1.6.
(b)
Suppose A is trifling and let A - be its closure. Given £ choose intervals (a k , bk ], k = 1, . . . , n , such that A c Ui: - 1 (a k , bk ] and EZ- 1 (bk - a k ) < e/2. If xk = a k - £j2n, then A - c uz_ l (x k , bk ] and E Z - I ( bk xk ) < £. For the other parts of the problem, consider the set of rationals. n n (a) Cover A b ( /l) by (b - 1) intervals of length b- . (c) Go to the base bk . Identify the digits in the base b with the keys of the typewriter. The monkey is certain eventually to reproduce the eleventh edition of the Britannica and even, unhappily, the fifteenth. (a) The se t A 1 (3) is itself uncountable, since a point in it is specified by a sequence of O' s and 2's (excluding the countably many that end in O's). (b) For sequences u 1 , , un of O' s, 1 ' s, and 2' s, let Mu . . . u consist of the points in (0, 1 ] whose nonterminating base-3 expansions �tart" out with those digits. Then A 1 (3) = (0, 1] - U Mu . . . u , where the union extends over n > 1 and sequences u1 , , un containi�g afleast one 1. The set described in part (b) is [0, 1] - U Nu1 u,. ' where the union is as before and Nu1 u,. is the interior of Mu . . . u , and this is the closure of A 1 (3). From this �epresentation of C, it is not hard to deduce that it can be defined as the set of points in [0, 1] that can be written in base 3 without any • • •
• • •
• • •
•
•
575
576
NOTES ON THE PROBLEMS
1 's if terminating expansions are also allowed. For example, C contains 2/3 . = .1 222 · · · = .2000 · · · because it is possible to avoid 1 in the expanSIOn. (c) Given an £ and an w in C, choose w' in A 1 (3) within £/2 of w; now define w" by changing from 2 to 0 some digit of w' far enough out that w" differs from w' by at most £/2.
1.8.
The interchange of limit and integral is justified because the series k E k rk ( w )2 - converges uniformly in w (integration to the limit is studied systematically in Section 16). There is a direct derivation of (1.36) : sin t = n n 2 sin 2 - tn k - 1 cos 2- k t ' as follows inductively by the double-angle for mula; use the fact that x - 1 sin x --+ 1 as x --+ 0.
1.9.
Within each b-adic interval ofnrank n - 1, rn ( w) assumes the value b - 1
length b - and the value - 1 on subintervals of total on a subinterval of n length ( b - 1) b - . The proofs therefore parallel those for the case b = 2, /l = 1 .
1.10.
Choose q0 so that Eq "?:. q0qf( q) < £; cover A1 by the intervals ( pq- 1 /( q), pq- 1 + /( q)] for q � q0 and 0 < p < q. See HARDY & WRIGHT, p. 1 58, for a proof that infinitely many pjq come within q- 2 of an irrational w. If q 2f( q) is nonincreasing and E q qf( q ) == oo , it can be shown that A1 has negligible complement, but this is harder; see p. 69 of A. Ya. Khinchirie: Continued Fractions (University of Chicago Press, Chicago, 1964). For example, except for a negligible set of w, l w - p/ql < 1/( q 2 log q) has infinitely many solutions.
1. 13.
(a) Given m and a subinterval (a, b] of (0, 1], choose a dyadic interval I in (a, b ], and then choose in I a dyadic interval J of order n > m such that I n - 1 sn ( w) I > t for w E J. This is possible because to specify J is to specify the first n dyadic digits of the points in J; choose the first digits in such a way that J c I and take the following ones to be 1, with n so large that n - 1 sn ( w ) is near 1 for w e J.
(b) A countable union of sets of the first category is also of the first category; (0, 1] == N u Nc would be of the first category if Nc were. For Baire's theorem, see ROYDEN, p. 1 39.
Section 2 Let n consist of four points and let � consist of the empty set, n itself, and all six of the two-point sets.
2.3.
(b)
2.4.
(b) For example, take 0 to consist of the integers and let � be the
I
2.5.
a-field generated by the singletons { k } with k s n. As a matter of fact, any example in which � is a proper subclass of � + I for all n will do, because it can be shown that in this case Un � necessarily fails to be a a-field; see A. Broughton and B. W. Huff: A comment on unions of sigma-fields, Amer. Math . Monthly, 84 (1977), 553-554.
(b) The class in question is certainly contained in /( JJI) and is easily seen to be closed under the formation of finite intersections. But ( U;'!, 1 nj!.. 1 A;1 ) == n ;"- 1 Uj !.. 1 A r1 , and Uj!.. 1 A f1 == Uj!.. 1 [A f1 n ni:� A ; k ] has the required form.
c
577
NOTES ON THE PROBLEMS
2.7.
2.8. 2.9. 2.10.
If � is the smallest class over .JJI closed under the formation of countable unions and intersections, clearly fc a( .JJI ). To prove the reverse inclusion, first show that the class of A such that Ac e f is closed under the formation of countable unions and intersections and contains .JJI and hence contains f. Note that Un Bn E a ( U n .JJIB ). (a) Show that the class of A for which lA ( w ) lA ( w') is a a-field. See Example 4. 7. (b) Suppose that !F is the a-field of the countable and the co-countable sets in Q. Suppose that !F is countably generated and Q is uncountable. Show that !F is generated by a countable class of singletons; if 00 is the union of these, then !F must consist of the sets B and B u 00 with B c n0 , and these do not include the singletons in ng, which is uncountable because Q is. (c) Let � consist of the Borel sets in Q (0, 1] and let ,. consist of the countable and the co-countable sets there. II
==
==
2.1 1.
2.13. 2.15.
Suppose that A1 , A 2 , . is an infinite sequence of distinct sets in a a-field !F, and let 16 consist of the nonempty sets of the form n�.... 1 Bn , where Bn = A n or Bn = A � , n = 1, 2, . . . . Each An is the union of the �sets it contains, and since the An are distinct, t§ must be infinite. But there are uncountably many distinct countable unions of �sets, and they all lie in !F. If Bn = A n n A f n · · · n A � _ 1 , then A1 = Un Bn , the Bn are disjoint, and P( Bn ) = P(An ) because An � Bn c U :,:. 1 ( A m n A n ) · • •
For this and the subsequent problems on applications of probability theory to arithmetic, the only number theory required is the fundamental theorem of arithmetic and its immediate consequences. The other problems on stochastic arithmetic are 4.21, 5.16, 5.17, 6.17, 18.18, 25.15, 30.9, 30.10, 30.11, and 30.12. See also Theorem 30.3. (b) Let A consist of the even integers, let Ck = [ m : vk < m < vk + 1 ], and let B consist of the even integers in C1 u C3 u · · · together with the odd integers in c2 u c4 u . . . ; take vk to increase very rapidly with k and consider A n B. (c) If c is the least common multiple of a and b, then M0 n Mb = Me. From M0 e 1) conclude in succession that Ma n Mb e 1) , M01 n · · · n Ma n Mt n · · · n Mt e 1) , f(J/) c 1). By the same sequence of steps, show how1 D on Jl de termines D on f(J/). (d) If B1 = Ma - U p s / Map ' then a e B1 and (the inclusion-exclusion formula requires only finite additivity) 1 D ( B, ) = ! - E _!_ + E _ - . . . p s. l
a
(
p < q s. l
ap
)
=!n 1 - .!. < s. l a
p
P
apq
(
! exp - E .!. a
a Choose /0 so that, if Ca = B1 , then D( Ca ) < 2 ity measure on f(J/), D (n ) ��< ! would follow.
psi - •.
P
If
)
D
--+ o .
were a probabil
578 2. 16.
NOTES ON THE PROBLEMS
The essential fact is that if r < s and a cylinder of rank r is the disjoint union of cylinders all of rank s, then there must be 2s - r cylinders in the union, and so the P-values add. Further, a cylinder of rank r can be split into cylinders of rank s (r < s). In what follows, let C1 , , Ck be disjoint cylinders of ranks r1 , . . . , rk . (a) To see that (U, C; ) c lies in �0 , choose s greater than all the r; and represent C, as a disjoint union U1 D, 1 of cylinders of rank s; then (U, C; ) is the union of the cylinders of rank s not appearing among the DiJ . Since the intersection of two cylinders is either the empty set or a cylinder, �0 is also closed under the formation of finite intersections. (b) Suppose that C = U; C, is itself a cylinder. Choose s and the D, 1 as before. Then s exceeds the rank of C and C = U;1 D, 1 , and so P( C) = E; P( C; ) follows because P( C) = E;1 P( D, 1 ) and P( C, ) = E1 P( D, ). (c) Follow the pattern of the argument involving (2.12) and (2.1 3'). and (d) It suffices (Example 2. 1 0) to prove that An e �0 , A1 :::J A 2 :::J P( An) > £ > 0 together imply that n n A n is nonempty. But each A n is nonempty, and the diagonal method [A14] can be used to produce a sequence (an w ) common to the An : Choose '-'n in An . Choose a sequence { n k } along which the limits lim k a; ( '-'n �c > = x, all exist and put w = (x1 , x2 , ). Given m , choose s so that whether or not a point of Q lies in Am depends only on its first s components. Now choose k large enough that n k > m and (a . ( w ), . . . ' a s < w )) = (a . ( '-'n k ), . . . ' as < '-'n k )). Since '-'n k E An k c A m , it follows that w e A m . (This is the standard proof that sequence space, as a topological product, is compact; the sets in �0 are closed.) (e) The sequence (1 .7) cannot end in O' s, but the sequence ( a 1 ( w ), a 2 ( w ), . . . ) can. Dyadic intervals can contract to the empty set, but nonempty cylinders cannot. 
The diagonal method plays here the role the Heine-Borel theorem plays in the proof of Lemma 2 to Theorem 2.2. The argument for part (d) shows that if P is finitely additive on t or P2 ( B) < t . This corresponds to the scheme for dividing a piece of cake between two children: one divides and the other chooses. (b) If P1 , , Pn are nonatomic, Q partitions into sets A 1 , , A n such that P; ( A ; ) > 1 In . The cake version of the proof runs as follows. The n children line up. The first cuts off 1/n of the cake (according to his measure) and passes it down the line, and any child who thinks it too large shaves it down to 1/n , returning the shavings (somehow) to the body of the cake. The last child to shave the piece down (or the first child if none do) keeps it, and the remaining n - 1 children proceed to divide the remaining cake. • • •
NOTES ON THE PROBLEMS
For further information, see Walter Stromquist: How to cut a cake fairly, Amer. Math. Monthly, 87 (1980), 640–644. It can also be shown, although this is much harder, that $\{(P_1(A), \ldots, P_n(A))\colon A \in \mathscr{F}\}$ is a closed, convex set in $n$-space; see Paul R. Halmos: The range of a vector measure, Bull. Amer. Math. Soc., 54 (1948), 416–421.

2.21. (c) If $\bigcup_{\beta < \Omega} \mathscr{F}_\beta$ were a σ-field, $\mathscr{F} \subset \bigcup_{\beta < \Omega} \mathscr{F}_\beta$ would follow. Use the fact that, if $\alpha_1, \alpha_2, \ldots$ is a sequence of ordinals satisfying $\alpha_n < \Omega$, then there exists an ordinal $\alpha$ such that $\alpha < \Omega$ and $\alpha_n < \alpha$ for all $n$. Suppose that $B_j \in \bigcup_{\beta < \Omega} \mathscr{F}_\beta$, $j = 1, 2, \ldots$, and choose ordinals $\alpha_j$ in such a way that $B_j \in \mathscr{F}_{\alpha_j}$.

2.22. $\sum_n \lambda(I_n) \ge \sum_n \delta^{-1} \lambda(A \cap I_n) \ge \delta^{-1} \lambda(A)$.

2.13. Let $I_{n1}, I_{n2}, \ldots$ be the dyadic intervals of order $n$. Given $I_{nj}$ and $I_{nk}$, choose $2^n - 2$ of the $I_{ni}$ distinct from $I_{nj}$ and $I_{nk}$ and from each other. If $B$ is their union, then $P(B \cup I_{nj})$ and $P(B \cup I_{nk})$ have a common value, and so $P(I_{nj}) = P(I_{nk})$. Now apply Theorem 3.3 to the class of dyadic intervals (a π-system).
Section 4

4.2. Let $r$ be the quantity on the right in (4.33), assumed finite. Then $x < r$ implies that $x < \bigvee_{k \ge n} x_k$ for each $n \ge 1$, which in turn implies $x \le r$. Now $x < \bigvee_{k \ge n} x_k$ for each $n \ge 1$ if and only if for each $n \ge 1$ there is a $k \ge n$ such that $x < x_k$, and this holds if and only if $x < x_n$ infinitely often. Therefore, $r = \sup[x\colon x < x_n \text{ i.o.}]$, and this is easily seen to be the supremum of the limit points of the sequence. The argument for (4.34) is similar.

4.4. (a) $\liminf_n A_n = A_1 \cap A_2$, $\limsup_n A_n = A_1 \cup A_2$. (c) The limits inferior and superior are, respectively, the closed disc of radius 1 and the open disc of radius $\sqrt{2}$.

4.8. If $\{A_n\}$ is a Cauchy sequence, ensure by passing to a subsequence that $P(A_n \triangle A_{n+1}) < 2^{-n}$, take $A = \limsup_n A_n$, and apply the first Borel–Cantelli lemma and Problem 4.6(d).

4.11. First use the inclusion–exclusion formula to show that the equation holds for $A = \bigcup_{i=1}^r A_i$ and then pass to the limit to show that it holds for $A = \Omega$. Now use the π–λ theorem, as in the proof of Theorem 4.2.

4.13. If $(H \cap G_1) \cup (H^c \cap G_2) = (H \cap G_1') \cup (H^c \cap G_2')$, then $G_1 \triangle G_1' \subset H^c$ and $G_2 \triangle G_2' \subset H$; consistency now follows because $\lambda_*(H) = \lambda_*(H^c) = 0$. If the $A_n = (H \cap G_1^{(n)}) \cup (H^c \cap G_2^{(n)})$ are disjoint, then $G_1^{(m)} \cap G_1^{(n)} \subset H^c$ and $G_2^{(m)} \cap G_2^{(n)} \subset H$ for $m \ne n$, and therefore (see Problem 2.13) $P(\bigcup_n A_n) = \tfrac12\lambda(\bigcup_n G_1^{(n)}) + \tfrac12\lambda(\bigcup_n G_2^{(n)}) = \sum_n \bigl(\tfrac12\lambda(G_1^{(n)}) + \tfrac12\lambda(G_2^{(n)})\bigr) = \sum_n P(A_n)$.

4.14. The intervals with rational endpoints generate $\mathscr{B}^1$.

4.19. (c) Consider disjoint events $A_1, A_2, \ldots$ for which $P(A_n) = 1/2^n$.

4.20. The problem being symmetric in $A_n$ and $A_n^c$, suppose that $a_n = p_n$ and define $A_n$ as in Problem 4.18. Define $B = \liminf_n A_n^c$, apply the first Borel–Cantelli lemma, and pass from sequence space to $B$ and from $A_n$ to $B \cap A_n$. Show as in Problem 1.1(b) that the maximum of $P(B_1 \cap \cdots \cap B_n)$, where $B_i$ is $A_i$ or $A_i^c$, goes to 0.

4.21. Let $A_x = [\omega\colon \sum_n I_{A_n}(\omega) 2^{-n} \le x]$, show that $P(A \cap A_x)$ is continuous in $x$, and proceed as in Problem 2.17(a).

4.23. Calculate $D(F_i)$ by (2.22) and the inclusion–exclusion formula, and estimate $P_n(F_i - F)$ by subadditivity; now use $0 \le P_n(F_i) - P_n(F) = P_n(F_i - F)$. For the calculation of the infinite product, see HARDY & WRIGHT, p. 245.
If $A_n$ is the union of the intervals removed at the $n$th stage, then $P(A_n \mid A_1^c \cap \cdots \cap A_{n-1}^c) = a_n$, where $P$ refers to Lebesgue measure.
Section 5

5.5. (a) If $m = 0$, $\alpha \ge 0$, and $x > 0$, then $P[X \ge \alpha] \le P[(X + x)^2 \ge (\alpha + x)^2] \le E[(X + x)^2]/(\alpha + x)^2 = (\sigma^2 + x^2)/(\alpha + x)^2$; minimize over $x$.

5.8. (b) It is enough to prove that $\varphi(t) = f(t(x', y') + (1 - t)(x, y))$ is convex in $t$ ($0 \le t \le 1$) for $(x, y)$ and $(x', y')$ in $C$. If $\alpha = x' - x$ and $\beta = y' - y$, then (if $f_{11} > 0$)
$$\varphi'' = f_{11}\alpha^2 + 2f_{12}\alpha\beta + f_{22}\beta^2 = f_{11}^{-1}\bigl[(f_{11}\alpha + f_{12}\beta)^2 + (f_{11}f_{22} - f_{12}^2)\beta^2\bigr] \ge 0.$$
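The minimization in 5.5(a) can be checked numerically: for mean 0 the minimum of $(\sigma^2 + x^2)/(\alpha + x)^2$ over $x > 0$ is $\sigma^2/(\sigma^2 + \alpha^2)$, attained at $x = \sigma^2/\alpha$. A plain grid search (values of $\sigma^2$ and $\alpha$ chosen arbitrarily for the sketch):

```python
def bound(sigma2, alpha, x):
    # the one-sided Chebyshev-type bound before minimization
    return (sigma2 + x * x) / (alpha + x) ** 2

sigma2, alpha = 2.0, 3.0
xs = [i * 1e-4 for i in range(1, 200000)]      # grid over (0, 20)
best = min(bound(sigma2, alpha, x) for x in xs)
closed_form = sigma2 / (sigma2 + alpha ** 2)   # value at x = sigma2/alpha
print(best, closed_form)
```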
5.9. Check (5.35) for $f(x, y) = x^{1/p} y^{1/q}$.

5.10. Check (5.35) for $f(x, y) = (x^{1/p} + y^{1/p})^p$.

5.16. For (5.39) use (2.22) and the fundamental theorem of arithmetic: since the $p_i$ are distinct, the $p_i^{k_i}$ individually divide $m$ if and only if their product does. For (5.40) use inclusion–exclusion. For (5.43), use (5.25) (see Problem 5.14).

5.17. (a) By (5.43), $E_n[\alpha_p] \le \sum_{k=1}^{\infty} p^{-k} \le 2/p$. And, of course, $n^{-1}\log n! \sim E_n[\log] = \sum_p E_n[\alpha_p] \log p$. (See Problem 18.21 for one proof of Stirling's formula.) (b) Use (5.44) and the fact that $E_n[\alpha_p - \delta_p] \le \sum_{k=2}^{\infty} p^{-k}$. (c) Use (5.45) and a similar argument.

Transfer from $N$ to $P$ any $i$ for which $\mu_i = 0$ and use a similar argument.
8.18. Denote the sets (8.32) and (8.52) by $P$ and by $F$, respectively. Since $F \subset P$, $\gcd P \le \gcd F$. The reverse inequality follows from the fact that each integer in $P$ is a sum of integers in $F$.
8.19. Consider the chain with states $0, 1, \ldots$ and $a$ and transition probabilities $p_{0j} = f_{j+1}$ for $j \ge 0$, $p_{0a} = 1 - f$, $p_{i,i-1} = 1$ for $i \ge 1$, and $p_{aa} = 1$ ($a$ is an absorbing state). The transition matrix is
$$\begin{pmatrix} f_1 & f_2 & f_3 & \cdots & 1 - f \\ 1 & 0 & 0 & \cdots & 0 \\ 0 & 1 & 0 & \cdots & 0 \\ \vdots & & & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \end{pmatrix}$$
Show that $f_{00}^{(n)} = f_n$ and $p_{00}^{(n)} = u_n$, and apply Theorem 8.1. Then assume $f = 1$, discard the state $a$ and any states $j$ such that $f_k = 0$ for $k \ge j$, and apply Theorem 8.8. In FELLER, Volume 1, the renewal theorem is proved by purely analytic means and is then used as the starting point for the theory of Markov chains. Here the procedure is the reverse.
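The identification $p_{00}^{(n)} = u_n$ can be checked on a concrete example. A minimal sketch, with $f_1, f_2, f_3$ chosen arbitrarily (so $f = 1$ and the chain lives on three states): $u_n$ solves the renewal equation $u_n = \sum_k f_k u_{n-k}$, matrix powers give $p_{00}^{(n)}$, and by the renewal theorem $u_n \to 1/\mu$ with $\mu = \sum_k k f_k$.

```python
f = [0.2, 0.5, 0.3]                      # f_1, f_2, f_3 (illustrative values)

# transition matrix on states 0, 1, 2: p_{0,j} = f_{j+1}, p_{i,i-1} = 1
P = [[f[0], f[1], f[2]],
     [1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

u = [1.0]                                # u_0 = 1, then the renewal equation
for n in range(1, 61):
    u.append(sum(f[k - 1] * u[n - k] for k in range(1, min(n, 3) + 1)))

Pn = [row[:] for row in P]               # P^1
p00 = [1.0, Pn[0][0]]
for n in range(2, 61):
    Pn = matmul(Pn, P)
    p00.append(Pn[0][0])                 # p00[n] = p_{00}^{(n)}

mu = sum((k + 1) * fk for k, fk in enumerate(f))   # mean recurrence time
print(u[60], p00[60], 1.0 / mu)
```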
8.21. The transition probabilities are $p_{0r} = 1$ and $p_{i,r-i+1} = p$, $p_{i,r-i} = q$ for $1 \le i \le r$; the stationary probabilities are $u_i = (r + q)^{-1}$ for $1 \le i \le r$ and $u_0 = q(r + q)^{-1}$. The chance of getting wet is $u_0 p$, of which the maximum is $2r + 1 - 2\sqrt{r(r+1)}$. For $r = 5$ this is .046, the pessimal value of $p$ being .523. Of course, $u_0 p \le 1/4r$. In more reasonable climates fewer umbrellas suffice: if $p = .25$ and $r = 3$, then $u_0 p = .050$; if $p = .1$ and $r = 2$, then $u_0 p = .031$. At the other end of the scale, if $p = .8$ and $r = 3$, then $u_0 p = .050$; and if $p = .9$ and $r = 2$, then $u_0 p = .043$.
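The umbrella-chain figures above can be reproduced by power iteration on the transition matrix (a sketch only; state $i$ = number of umbrellas at the walker's current location, and `stationary_wet` is a made-up helper name):

```python
from math import sqrt

def stationary_wet(r, p, iters=20000):
    q = 1.0 - p
    P = [[0.0] * (r + 1) for _ in range(r + 1)]
    P[0][r] = 1.0                       # no umbrella here: all r are there
    for i in range(1, r + 1):
        P[i][r - i + 1] += p            # rain: carry one umbrella along
        P[i][r - i] += q                # dry: leave them all behind
    u = [1.0 / (r + 1)] * (r + 1)
    for _ in range(iters):              # power iteration to stationarity
        u = [sum(u[i] * P[i][j] for i in range(r + 1)) for j in range(r + 1)]
    return u, u[0] * p                  # wet iff in state 0 and it rains

r = 5
p_star = (r + 1) - sqrt(r * (r + 1))    # pessimal rain probability
u, wet = stationary_wet(r, p_star)
print(round(p_star, 3), round(wet, 3))
```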
8.24. For the last part, consider the chain with state space $C_m$ and transition probabilities $p_{ij}$ for $i, j \in C_m$ (show that they do add to 1).
8.25. Let $C' = S - (T \cup C)$ and take $U = T \cup C'$ in (8.51). The probability of absorption in $C$ is the probability of ever entering it, and for initial states $i$ in $T \cup C'$ these probabilities are the minimal solution of
$$y_i = \sum_{j \in T} p_{ij} y_j + \sum_{j \in C'} p_{ij} y_j + \sum_{j \in C} p_{ij}, \qquad i \in T \cup C',$$
$$0 \le y_i \le 1, \qquad i \in T \cup C'.$$
Since the states in $C'$ ($C' = \varnothing$ is possible) are persistent and $C$ is closed, it is impossible to move from $C'$ to $C$. Therefore, in the minimal solution of the system above, $y_i = 0$ for $i \in C'$. This gives the system (8.55). It also gives, for the minimal solution, $\sum_{j \in T} p_{ij} y_j + \sum_{j \in C} p_{ij} = 0$ for $i \in C'$. This makes probabilistic sense: from an $i$ in $C'$, not only is it impossible to move to a $j$ in $C$, it is impossible to move to a $j$ in $T$ for which there is positive probability of absorption in $C$.
8.26. Fix on a state $i$ and let $S_\nu$ consist of those $j$ for which $p_{ij}^{(n)} > 0$ for some $n$ congruent to $\nu$ modulo $t$. Choose $k$ so that $p_{ji}^{(k)} > 0$; if $p_{ij}^{(m)}$ and $p_{ij}^{(n)}$ are positive, then $t$ divides $m + k$ and $n + k$, so that $m$ and $n$ are congruent modulo $t$. The $S_\nu$ are thus well defined.
8.27. Show that Theorem 8.6 applies to the chain with transition probabilities $p_{ij}^{(t)}$.

8.29. Show that $a_n = \pi_i + O\bigl\{n^{-1}\sum_{k=1}^{n} \rho^k\bigr\} = \pi_i + O(n^{-1})$, where $\rho$ is as in Theorem 8.9.
The definitions give E;
[ I ( Xa,.)) == P; [ an < n ] I( 0) + P; [ an == n ] I( i + n ) == 1 - P; [ an == n ] + P; [ an == n ] ( 1
- h + n ,o)
== 1 - P; · · · P; + n - t + Pt · · · Pi + n - t ( P; + n Pi + n + t · · · ) ,
586
8.35.
NOTES ON THE PROBLEMS
and this goes to 1. Since P; [ T < n == an ] > P;([ T < n ] n [ Xk > 0, k > 1]) � 1 - /;0 > 0, there is an n of the kind required in the last part of the problem. And now
If i >
P;[T ==
1, n 1 < n 2 , (i, . . . , i + n 1 ) E ln 1 , and (i, . . . , i + n 2 ) E /n 2 , then n 1 , ,. == n 2 ] > P; [ Xk == i + k, k < n 2 ] > 0, which is impossible.
Section 9
9.4.
See BAHADUR.
9.9.
Because of Theorem 9.6 there are for P[ Mn > a] bounds of the same order as the ones for P[ Sn > a] used in the proof of (9. 3 6).
Section 10
10.7.
(a) Order by inclusion the family of all disjoint collections of J[sets of
finite positive measure; each of them is countable by hypothesis. Use Zorn's lemma to find a maximal collection { Ak } ; show that p.((Uk Ak ) c ) is finite and is in fact 0.
10.8.
Suppose p. 1 and p. 2 are probability measures on a(B') and agree on the w-system B'. Since p.1 ( Q ) == p. 2 ( Q ), Q can be added to B', and then the hypotheses of Theorems 10.3 and 10.4 are both satisfied.
10.9.
Let p. 1 be counting measure on the a-field of all subsets of a countably infinite Q , let 11 2 == 2 1£ 1 , and let B' consist of the cofinite sets. Granted the existence of Lebesgue measure A on �1 , one can construct another exam ple: let p. 1 == A and p. 2 == 2 A , and let 9' consist of the half-infinite intervals ( - 00 , X ). There are similar examples with a field Fo in place of 9'. Let Q consist of the rationals in (0, 1], let p. 1 be counting measure, let p. 2 == 2 p. 1 , and let Fo consist of finite disjoint unions of " intervals" [ r e Q: a < r < b ].
10.10
(a) Modify the proof of Theorem 10.3, using the monotone class theorem in place of the w-A theorem.
Section 11
11.4.
(a) See the special cases (12.16) and (12.17).
11.5
(d) Suppose that A e f(JJI); if A can be represented as a finite or
11.6
countable disjoint union of �sets An, put p.(A) == En p.(An) (prove con sistency); otherwise, put p.(A ) = oo . This extension need not be unique-suppose JJI = {0} for example.
(c) See HALMOS, p. 31, for the proof that pairwise additivity implies
finite additivity in strong semirings. For a counterexample in the case of
587
NOTES ON THE PROBLEMS
semirings, take .JJI to consist of 0, { 1 } , { 2 } , { 3 } , and Q = { 1, 2, 3 } and make sure that p. { 1 } + p. { 2 } + p. { 3 } < p. ( Q ). Accounts treating only strong semi rings omit the qualifier " strong."
11.7.
Take preliminary -Fo-sets Bk satisfying p.(Bk �A k ) < (/n 3 ; add U1 . k ( � n Bk ) to one of these sets and subtract it from the others and let B be the union of the new Bk . If A lies in -Fo , subtract B n A c from all the Bk and add A n Be to one of them.
11.10. Complementation is easy. Suppose G = UnGn , Gn e t§. Choose -Fo-sets An k
and Bn k such that n1 An c Gn c U k Bn k and p.((U k Bn k ) - (n k An k )) < n (;2 . Now choose N so that p.(G - Un s N Gn ) < (. Approximate G from without by Un k Bn k and from within by n k (Un s NAn k ).
U k ( lk , gk ], then ( I( w ), g( w )] c U k ( lk ( w ), lk ( w )] for all w, and Lemma 2 to Theorem 2. 2 gives g( w ) - l(w) < Ek ( gk (w) - lk (w)). If h m (g - 1 - Ek s n (gk - lk )) V 0, then h n � 0 and g - I � E k s n(gk - fk ) + h n . The positivity and continuity of A now give v0 ( 1, g) < Ek v0 ( fk , gk ]. A similar, easier argument shows that E k v0( lk , gk ] � v0( 1, g] if ( lk , gk ] are disjoint subsets of ( I, g ]. 11.12. (b) From (1 1.11) it follows that [I > 1] e -Fo for I in .ftJ. Since .ftJ is linear, [ I > x ] and [ I < - x ] are in -Fo for I e .ftJ and x > 0. Since the sets ( x, oo ) and ( - oo , - x) for x > 0 generate 911 , each I in .ftJ is 11.11. (b) If ( I, g]
c
=
measurable a( -Fo ). Hence !F = a( -Fo ). It is easy to show that -Fo is a semiring and is in fact closed under the formation of proper differences. It can happen that Q ll. -Fo-for example, in the case where Q = { 1 , 2 } and .ftJ consists of the I with 1(1) = 0. (On the other hand, -Fo is a a-ring and JJ/0 is a strong semiring in the sense of Problems 11.5 and 1 1.6, and the arguments can be carried through in that context; see J\irgen Kindler: A simple proof of the Daniell-Stone represen tation theorem, Amer. Math . Monthly, 90 (198 3), 3 96- 3 97.)
Section 12
12. 1.
Take a = p.(O, 1]; for A an interval with rational endpoints, compare p.( A ) with a by exhibiting A and (0, 1] as disjoint unions of translates of a single interval. Such A 's form a w-system.
12.4.
(a) If (Jn = fJm , then (Jn - m = 0 and n = m because fJ is irrational. Split G
into finitely many intervals of length less than (; one of them must contain points 82 n and 82 m with 82 n < 82 m . If k = m n , then 0 < 82 m - 82 n = 82 m e 82 n = 82 k < ( , and the points 82 k 1 for 1 < I < [ 8ik1 ] form a chain in which the distance from each to the next is less than (, the first is to the left of ( , and the last is to the right of 1 - ( (c) If s 1 e s 2 = 82 k + 1 e 82 n 1 E9 82 n 2 lies in the subgroup, then s 1 = s2 and (J2 k + 1 = (J2( nl n 2 ) " -
.
12.5.
-
(a) The S E9 (Jm are disjoint, and (2 n + 1 ) v + k = (2 n + 1 ) v' + k' with
I k I , I k' I < n is impossible if v ::1= v'. (b) The A E9 8< 2 n + I > v are disjoint, contained in Lebesgue measure.
G, and have the same
588
NOTES ON THE PROBLEMS
12.6.
See Example 2.10 (which applies to any finite measure).
12.8.
By Theorem 12.3 and Problem 2.17(b), A contains two disjoint compact sets of arbitrarily small positive measure. Construct inductively compact sets n Ku1 .. . u,. (each U ; is 0 or 1) such that 0 < p.(Ku1 ... u,. ) < 3 - and Ku1 . . . u,. O disjoint subsets of Ku1 .. . u,. . Take K = nn U u1 . . . u,. Ku1 . .. u,. . and Ku1 . . . u,. 1 are . . The Cantor set ts a speCial case.
12.14. Suppose that An
E
closed en .Jf'. To show that A = Un An E .Jf', choose n
and open Gn so that C, c An c Gn and p.(Gn - C, ) < f./2 , choose N so that p.(A - Un s N An) < f., and consider c = Un s N Cn and G = UnGn. Complementation is easy. Approximate a closed rectangle [x: a; < X; < b; , i < k ] by open rectangles [ x: a; - m - 1 < X; < b; + m - 1 , i s k ].
Section 13 13.6.
E ; x; IA ' and A ; e 1 1!F', take Aj in !F' so that A ; = 1 1Aj and set q> = E ; x ; J.��· For the general1 1 measurable r- 1/F', there exist simple functions Jn , measurable 1 /F', such that ln ( w ) --+ l( w ) for each w. Choose fl>n , measurable !F', so that In = fl>n T. Let C' be the set of w' for which 'Pn ( w') has a finite limit, and define q>(w') = lim n'Pn ( w') for w' e C' and q>( w') 0 for w' � C'. Theorem 20.1(ii) is a special case. If 1 ==
==
13.10. The class of Borel functions contains the continuous functions and is closed
under pointwise passages to the limit and hence contains !I. By imitating the proof of the fT-A theorem, show that, if I and g lie in !I, then so do I + g, lg, I - g, I v g (note that, for example, [g: I + g e !I] is closed under passages to the limit). If {n ( x) is 1 or 1 - n(x - a) or 0 as x s a or a s x s a + n - 1 or a + n - < x, then In is continuous and In ( X ) -+ /( - crJ� X). Show that [A : lA E !£] is a A-system. Conclude that !!I contains all indicators of Borel sets, all simple Borel functions, all Borel functions. oo ,
13.1 1. The set [ x: l(x) < a ] is open. 13.14. On the integers with counting measure, consider the indicator of { n , n + 1, . . . } .
, bk } , where k < n , E; = C - b;- 1A , and E = Uf.... 1 E; . Then E = C - Uf_ 1 b;- 1A. Since p. is invariant under rotations, p.( E; ) = 1 - p.(A) < n - 1 , and hence p.( E ) < 1. Therefore C - E == nf_ 1 b;- 1A is non empty. Use any fJ in C - E.
13.20. Let B = { b1 ,
• • •
Section 14 14.4. 14.5.
(b) Since u < F(x) is equivalent to q>( u) s x, u < F( q> ( u)). Since F( x)
< u is equivalent to x < q>( u), F(q>(u) - f.) < u for positive f.. (a) If 0 < u < v < 1, then P[ u s F( X) < v, X e C) = P[q> ( u) < X < q>(v), X e C). If q>(u) e C, this is at most P[ q> ( u ) s X < q>(v)] F( q> ( v ) - ) - F( q> ( u) - ) = F( q> ( v) - ) - F( q> ( u)) < v - u; if q> ( u) ll. C , it is at most P[ q> ( u ) < X < q>(v)] = F( q>( v ) - ) - F( q>( u)) < v - u. Thus P[ F( X) e [ u, v), X e C) s A[ u, v) if 0 < u < v < 1. This is true also for =
589
NOTES ON THE PROBLEMS
(let u J, 0 and note that P[ F( X) = 0] == 0), and for v = 1 (let v f 1 ) . The finite disjoint unions of intervals [ u , v) in [0, 1) form a field there, and by addition P[ F( X) e A , X e C) < A(A ) for A in this field. By the monotone class theorem, the inequality holds for all Borel sets in [0 1 ). Since P[ F( X) = 1, X e C) == 0, this holds also for A == { 1 }. The sufficiency is easy. To prove necessity, choose continuity points X; of F in such a way that x0 < x1 < · · · < xk , F(x0) < £, F(xk ) > 1 - £ , and X; - X; _ 1 < £. If n exceeds some n 0 , I F(x; ) - F, ( x; ) l < £/2 for all i. Suppose that X;_ 1 < x s X; . Then F,(x) < F, (x; ) < F(x; ) + £/2 < F( x + £ ) + £/2. Establish a similar inequality going the other direction, and give special arguments for the cases x < x0 and x > xk . u
=-
0
,
14.9.
Section 16. 16.6.
(a) By Fatou's lemma,
and
f b dtL - ffdtL f li� ( b, - /, ) dtL ==
j
< lim inf ( b, - /, ) d11 n
Therefore
16.9.
o s In
...
j b dtL - lim sup j/, d11 . n
- It t I - It ·
16.11. (a) Take !F == {0, Q ) , p.(O) == oo , I = 2, and g = 1 .
(b) Suppose An f [ g < oo , I > 0] and p.(An ) < oo. If Bn == [0 S g < f, g � n ], then p.(An n Bn ) == 0 and An n Bn f [g < /]. (c) Suppose en f g and P (Cn) < 00 . Then n- 1p.([/ > n- 1 ] n Cn ) < P[/ > n - 1 ] n C, ) < oo , and [ / > n - 1 ] n C, f [/ > 0]. 16.13. Whatever A may be,
and there is equality here if A
==
[ B > Bn 1 ·
590
NOTES ON THE PROBLEMS
16.16. By the result in Problem 16.15, f lin I dp. --+ f 1/1 dp. implies f 11 /1 - 1/n l l dp. --+ 0. If An == lfn < 0 < / ] U l/ < 0 < fn 1 and Bn == l l/n l < 1/1 ], then o
a] < a - 1f Ifn i dp. < B and hence f,1� l � a d/n l dp. < £ for all n. For the reverse implication adapt the argument m" the preceding note. 16.20. (b) Suppose that In are nonnegative and satisfy condition (ii) and p. is nonatomic. Choose B so that p.(A) < B implies fA in dp. < 1 for all n. If p.lfn == oo] > 0, there is an A such that A c l in == oo] and 0 < p.( A ) < B; but then fAin dp. == oo. Since p. lfn == oo] == 0, there is an a such that p. l fn > a] < B < p. l fn > a]. Choose B c l fn == a] in such a way that A == l in > a) U B satisfies p.(A ) == B. Then aB == ap.(A ) < fA in dp. < 1 and I In dp. s 1 + ap.(A c) < 1 + B- 1p.(O) . 16.21. By Theorem 16.13, (i) implies (ii), and the equivalence of (ii) and (iii) was
established in Problem 16.16. Suppose (ii) and (iii) hold. Then certainly f lfn I dp. is bounded. Given £, choose n0 so that f lin - /1 dp. < £ for n > n 0 and then choose B so that p. ( A ) < B implies fA 1/1 dp. < £ and fA lfn I dp. < £ for n < n0 • If p.(A) < B and n > n0, then fA lfn I dp. < / lfn - /1 dp. + fA 1/1 dp. < 2 £. 16.25. (b) Suppose that f e .ftJ and f > 0. If In == (1 - n- 1 )/ V 0, then In e !R and In i /, so that v( fn, / 1 == A ( / - fn ) � O. Since v( /1 ,/] < oo , it follows that v l( w, t): /( w) == t] == 0. The disjoint union B,
n 2"
-
il}l
([ ;
;+1 < I < 2" 2"
] X ( 0, 2" ] ) i
increases to B, where B c (0, /] and (0, / ] - B c l(w, t): /(w) == fore
t).
There
. [ . .+ ] A ( / ) ... ..( 0, .1] ... lim 11 ( B,) ... lim .E ;, p. ;, < I < 2 , ... jfdp. . n 2"
n
n
1
•-1
Section 17 17. 1 1. The binomial formula is (1
a
00
+ x) == L
n-o
( � ) xn ,
1
591
NOTES ON THE PROBLEMS
valid for arbitrary a and l x l < 1; here ( : ) == a( a - 1) · · · ( a - n + 1)/n ! A simple way to verify the formula without considering the analyticity of the left-hand side is to show by the ratio test that the series converges for 1 x 1 < 1 to /a (x), say, and then by a binomial identity to derive (1 + x )/;(x ) == af01 (x) [A29], whence it follows that the derivative of /a (x )/(1 + x ) 01 vanishes. Note that the series in (17 .17) is nonnegative. 17.14. Part (a) can be elaborated: If f is Lebesgue integrable over each [ t, 1] and
(17.20) holds, then f has Kunweil integral I over [0, 1]. This integral was introduced in J. Kunweil: Generalized ordinary differential equations and continuous dependence on a parameter, Czechoslovak Math . Jour., 7 (82) (1957), 418-446. 17.15. If A == A, n S, 1 , (17.21) follows from (12.19). But these A 's form a w-system generatihg � 2 • And now (17 .22) follows by the usual argument starting with indicators for f. 17.16. If g(x ) is the distance from x to [ a , b), then In == (1 - ng) V 0 � I[ a . h J and In e 9' ; since the continuous functions are measurable �1 , it follows that !F == 911 • If In ( x) � 0 for each x, then the compact sets [ x: In ( x) � £] decrease to 0 and hence one of them is 0; thus the convergence is uniform. 17.17. The linearity and positivity of A are certainly elementary facts, and for the continuity property, note that if 0 s f s £ and f vanishes outside [ a , b ], then elementary considerations show that 0 s A( / ) � £(b - a). ,2
Section 18
If E e !I X tty , then E lies in the a- field generated by some countable class of measurable rectangles, which by the definition of !I== tty may be taken as But then E the singletons A m n == {(xm , xn )} for some sequence x1 , x 2 , must be the union of certain of the A m n ' or such a union together with ( U m n A m n )"; either case is impossible. 18.3. Consider A X B, where A consists of a single point and B lies outside the completion of �1 with respect to A . 18.11. See DOOB, p. 63 18.18. Put /p == p - 1log p and put In == 0 if n is not a prime. In the notation of (18.1 8), F(x) == log x + cp (x), where cp is bounded because of (5.47). If G(x) == - 1jlog x, then 18.2.
• • •
L
psx
_! p
..,
F( X ) + log x
l
x
2
F( I ) dt t log 2 t
cp ( X ) == 1 + log + x
lx 2
dt + t log t
.
loo cp( t ) dt - oo cp ( t ) 2dt t log 2 t Jx t log t
0
2
18.21. Take the point of view of Problem 18.6 and use the translation invariance of Lebesgue measure in the plane. The Tk do not overlap because the slope log 1 (1 + k - 1 ) of the lower boundary of T is less than the slope k - of a
k
tangent to the upper boundary of Tk 1 at its right-hand endpoint. _
592
NOTES ON THE PROBLEMS
Section 19
m
19.2.
Choose sets Bn such that diam Bn < £ and cm En (diam Bn ) < h m .f ( A ) + £ . Replace Bn by the open set [ x: dist( x, Bn ) < Bn ]; if the Bn are small enough, the inequalities are preserved. Now use compactness.
19.3.
(a) Inserting the extra point /( t) cannot make a polygon shorter.
oo , /( · ) is not well defined, but in this case there is nothing to prove. Apply Lemma 4 to the maps f: [a, b ) --+ R k and /: [ a, b) --+ R1 .
(b) If L(/) ==
(c) Make sure that the trace is swept out several times as t runs from a to b. (d) Let tn be the supremum of the t in [a, b) for which /( t) e B;; . Now /(a) lies in some Bn o ' /( tn 0 ) lies in some Bn 1 , /(tn 1 ) lies in some Bn 2 , and so on. The process must terminate with a tn of b, because n 0, n 1 , are distinct until then. Therefore, 1/(b) - /( a ) l s Ef.:J 1/ ( tn . ) - /( tn. ) l < "'N � n - 1 diam Bn < h 1 ., (/ [a, b ]) + £. (e) If a == t0 < · · · < t, == b, then, since f is one-to-one and h 1 is 0 for singletons, Ej - 1 1/(ti ) - /(ti _ 1 ) 1 s Ej _ 1 h 1 (/[ti _ 1 , ti ]) == h 1 (/[a, b)). (f) Suppose, for example, that /(t + ) - /( t) == a > 0. For some B, t < s 1 s s2 < t + B implies /(s2 ) - /( s 1 ) < aj3. If t < s < t + B, there is a parti tion t t0 < · · · < t, == s such that E7- 1 1/(t; ) - /(1; _ 1 ) 1 > 2aj3; but since /[ t1 , s] < aj3, l/(t1 ) - /(t) l > aj3. Let s � t and get a contradiction • •
•+l
•
I
==
to the continuity of f.
19.4.
If the curve has no multiple points, the result follows from (19.14), but it is easy to derive it in the general case directly from the definition of L(/).
19. 7.
(a) 2 ,.fi.
19.8.
(b) 2'1T [1/ v'2 + ! log(1 + v'2 )]. In the diagram below the circle has radius 1. Take fJ == 'ITjn and x == 1/2 m , calculate the lengths of the numbered segments in order, and show that the triangles abc and abd have areas sin fJ [(1 - cos fJ ) 2 + x 2 ] 1 12 and x [2 d
NOTES ON THE PROBLEMS
593
2 cos fJ ] 1 12 • Since the polyhedron has 2 mn triangular faces of each kind, the formula for A m n follows. Since sin t - t as t --+ 0, if m and n go to infinity in such a way that m 2 jn 4 --+ p, then A m n --+ 'IT + w [ 1 + pw 4 ] 1 12 • See FEDERER and RADO for accounts of this unimaginably complicated subject. ,
19.9.
The rectangular coordinates are (x, y, z ) == (cos fJ cos 4J, sin fJ cos 4J , sin 4J ), and I D8• • 1 comes out to cos 4J. The sector therefore has area ( 82 8 1 )(sin 4J2 - sin 4J 1 ).
m
such that A C Un Bn , diam Bn < £, and En (diam Bn ) S h m ( A ) + 1, and note that hm +& ( (A ) < £3( h m (A) + 1). (b) If T is a map from R m to R k to which Theorem 19.3 applies, and if fA I Dx i Am ( dx ) is finite and positive, then dim TA == m. If T is not smooth, of course, anything can happen-consider a Peano curve.
19. 11. (a) Choose sets
Bn
19. 12. This definition of dimension is due to Felix Hausdorff: Dimension und
ausseres Mass, Math . Ann . , 79 (1919), 157- 179. He showed that the Cantor set has dimension log 2jlog 3. Eggleston showed that the set of points in the unit interval in whose dyadic expansion 1 appears with asymptotic relative frequency p has dimension -p log p - q log q; see BILLINGSLEY, Sec tion 14. For the connection between Hausdorff and topological dimension, see Withold Hurewicz and Henry Wallman: Dimension Theory (Princeton University Press, Princeton, New Jersey, 1948).
Section 20 20.4.
Suppose U1 , • • • , Uk are independent and uniformly distributed over the unit interval, put V; == 2 n lf; - n , and let l'n be (2 n) k times the distribution of ( V1 , • • • , Vk ). Then l'n is supported by Qn == ( - n , n ] X • · · X ( - n , n ], and if I == (a 1 , b1 ] X · · · X (ak , bk ] c Qn , then l'n (/) == n f_ 1 (b; - a; )· Fur ther, if A C Qn C Qm (n < m), then l'n ( A ) == l'm (A). Define A k (A) ==
lim nl'n ( A
n
Qn)·
n
n 2- nt>
20.8.
preceding (8.16), P; [ T1 == n 1 , • • • , Tk == n k ] == /;� t>jj� By then argument n • · · �� 1c - �c- a>. For the general initial distribution, average over i.
20.9.
Integrate rather than differentiate.
20. 10. Use the w-A theorem. 20.11. (a) See Problem 20.10.
n
(b) Use part (a) and the fact that Y, == r if and only if r,.c > == n. (c) If k s n , then Yk == r if and only if exactly r - 1 among the integers n 1 , . . . ' k 1 precede nk in the permutation r< >. n+ (d) Observe that r< > == (11 , . . . , In ) and Yn + t == r if and only if r< l > == (11 , • •n • , t,_ 1 , n + 1, t,, . . . , tn ), and conclude that a( Y, + 1 ) is independent of a(T< > ) and hence of a ( Y1 , • • • , Y, )-see Problem 20.7. -
20. 16. If X and Y are independent, then P ( I ( X + Y) - (x + y) l < £] � P( I X x l < ! £] P[ I Y - Y l < ! £] and P[ X + Y == x + y] � P[ X == x ] P [ Y == y].
594
NOTES ON THE PROBLEMS
20.20. The partial-fraction expansion gives
2y( y - x ) B == 2 ' u + ( y - x )2
After the fact this can of course be checked mechanically. Integrate over [ - t, t ] and let t --+ oo : f '_, D dx == 0, f '_ , B dx --+ 0, and /� 90 ( A + C) dx
== ( y 2 + v2 - u 2 ) u - 1 w + (y 2 - v2 + u 2 ) v - 1 w == u - 1v - 1w 2 Rcu + t,(y). There is a very simple proof by characteristic functions; see Problem 26.9. 20.22. See Example 20.1 for the case n == 1, prove by inductive convolution and a n < 2 change of variable that the density must have the form Kn x 1 > - • e - x l2 , and then from the fact that the density must integrate to 1 deduce the form of Kn .
20.13. Show by (20.38) and a change of variable that the left side of (20.48) is some constant times the right side; then show that the constant must be 1.
20.27. (a) Given £ choose M so that P[ 1 X I > M] < £ and P[ 1 Yl > M] < £ , and then choose B so that lx l , IY I < M, l x - x' I < B, and IY - y' I < B imply that 1/( x', y') - /( x, y) l < £ . Note that P( lf( Xn , Yn) £ ] < 2 £ + P[ I Xn - X I > B ) + P[ l Yn - Yl > B).
- f( X, Y) l >
20.30. Take, for example, independent Xn assuming the values 0 and n with
probabilities 1 - n - 1 and n - 1 • Estimate the probability that xk = k for some k in the range n/2 < k < n . 20.31. (b) For each m split A into 2m sets A m k of probability P( A )/2 m . Arrange all the A m k in one infinite sequence and let Xn be the indicator of the n th set in it. Section 21
21.6.
Consider E IA . A random variable is finite with probability 1 if (but not only if) it is integrable.
J:x dF(x) = J:Jt dy dF(x) = 1:1� dF( x) dy. 21. 10. (a) Write E[ Y - X] == fx < Y fx < 1 � y dt dP - fY < x fY < t � x dt dP. 21.12. Show that 1 - F( t) < - log F( t) < (1 - F( t))/F(a) for t > a and apply 21.8.
Calculate
(21.10).
595
NOTES ON THE PROBLEMS
21.13. Calculate
21.14. Use (21 .12). 21.15. (a) The most important dependent uncorrelated random variables are the trigonometric functions-the random variables sin 2 , n w and cos 2 , n w on the unit interval with Lebesgue measure.
21. 19. Use Fubini' s theorem; see (20.29) and (20.30).
21.20. Even if X == - Y is not integrable, X + Y == 0 is. Since 1 Yl < l x l + l x +
Yl , E[ I Yl ] ==
oo implies that
E[ I x + Yl ] ==
21 .19. See also the lemma in Section 28.
oo for each
x; use Problem
21.21. Use Theorem 20.5. 21.25. (b) Suppose { Xn } is d -fundamenta1. By Markov' s inequality, the se
quence is fundamental in Pprobability and hence converges in probability to some X. By Fatou's lemma it follows from xn E LP that X E L P . Another application of Fatou's lemma gives dp ( X, Xn ) < lim infm dp ( Xm , Xn), and so dp ( X, Xn ) --+ 0. 21.28. If x < ess sup i XI , then E1 1P [ I X I P ] > xP1 1P [ I X I > x] --+ x as p --+ oo .
Section 22 22.3. 22.9.
E [l: I X! c > I ] == EE( I X!c> I ]. (a) Put U == E k Il k s T J x: and V == Ek Ilk s T] X"A , so that ST == U - V. Since [ T > k ] == g - [ T < k - 1] lies in a( x1 ' . . . ' xk - 1 ), E[ /[ T � k ] xk+ ] == E( �T � k 1 ) E[ X: ) == P(T > k ] E( Xi ). Hence E(U) == Ek 1 E( Xi ] P( T > k ) == E( X( ] E[ T ]. Treat V the same way. n (b) To prove E[T] < oo , show that P[T > ( a + b) n ] < (1 - p a + h ) . By (7.7), ST is b with probability (1 - p0 )/(1 - pa + h ) and - a with the opposite probability. Since E[ X1 ] == p - q. For sufficiency, use
E[ T J -
a
_
q
_
_
p
a + b q
_
p
1 - pa 1 pa + b ' _
p
q
== p
::1=
1.
8 n has the same probabilistic behavior as the original series because the xn + n 8 reduced modulo 2w are independent
22. 12. For each 8, Ene ;x,. ( e; z)
and uniformly distributed. Therefore, the rotation idea in the proof of Theorem 22.9 carries over. See KAHANE for further results.
22. 16. (b) Let A == r 1 B and suppose p is a period of f. Let n
= [1/p ]. By periodicity,
P( A n
m
= [ x/p ] and
[y, y + p ]) is the same for all
y;
there-
NOTES ON THE PROBLEMS
596
fore, I P( A n [0, x]) - mP( A n [0, p]) I S p, I P( A ) p, and I P( A n [O, x]) - P( A)xl � 2 p + l x - m/n l taken arbitrarily small, P( A n [0, x]) P( A ) x.
- nP( A n [0, p ]) 1 < S 3p. Since p can be
=
Section l3 13.3. 13.4.
Note that A, cannot exceed
v] =
P[ N, + v - N, _ u
== 0] =
t. If 0 s u < t and v > 0, then P[A, > u, B, > e- a ue- a v .
(a) Use (20.37) and the distributions of A, and B, .
(b) A long interarrival interval has a better chance of covering t than a short one does.
23.6. The probability that N_{s,t_k} − N_{s,t_{k−1}} = j is …

23.7. Since N(A) is Σ_n I_A(S_n), it is a random variable. Restrict attention to subsets of a fixed interval [0, a]. Show that the sets with the desired property include the finite disjoint unions of intervals and form a monotone class. If A and B are disjoint, then N(A) and N(B) are certainly independent if A and B are finite disjoint unions of intervals; now consider in sequence the cases where A and B are (i) open (countable disjoint unions of intervals), (ii) closed (hence compact: use disjoint, open A_n and B_n satisfying A_n ↓ A and B_n ↓ B), (iii) countable unions of closed sets, and (iv) general (use Problem 12.7).

23.9. Let M_t be the given process and put φ(t) = E[M_t]. Since there are no fixed discontinuities, φ(t) is continuous. Let ψ(u) = inf[t: u ≤ φ(t)] and show that N_u = M_{ψ(u)} is an ordinary Poisson process and M_t = N_{φ(t)}.

23.10. Let t → ∞ in

S_{N_t}/N_t ≤ t/N_t ≤ (S_{N_t+1}/(N_t + 1)) · ((N_t + 1)/N_t).

23.12. Restrict t in Problem 23.11 to integers. The waiting times are the Z_n of Problem 20.8, and account must be taken of the fact that the distribution of Z_1 may differ from that of the other Z_n.
Section 25

25.1. (e) Let G be an open set that contains the rationals and satisfies λ(G) < ε. For k = 0, 1, …, n − 1, construct a triangle whose base contains k/n and is contained in G; make these bases so narrow that they do not overlap, and adjust the heights of the triangles so that each has area 1/n. For the nth
density, piece together these triangular functions, and for the limit density, use the function identically 1 over the unit interval.
25.2.
By Problem 14.12 it suffices to prove that F_n(·, ω) ⇒ F with probability 1, and for this it is enough that F_n(x, ω) → F(x) with probability 1 for each rational x.
25.3.
(b) It can be shown, for example, that (25.19) holds for x_n = n!. See Persi Diaconis: The distribution of leading digits and uniform distribution mod 1, Ann. Prob., 5 (1977), 72-81.
(c) The first significant digits of numbers drawn at random from empirical compilations such as almanacs and engineering handbooks seem approximately to follow the limiting distribution in (25.20) rather than the uniform distribution over 1, 2, …, 9. This is sometimes called Benford's law. One explanation is that the distribution of the observation X and hence of log10 X will be spread over a large interval; if log10 X has a reasonably smooth density, it then seems plausible that {log10 X} should be approximately uniformly distributed. See FELLER, Volume 2, p. 62, and R. Raimi: The first digit problem, Amer. Math. Monthly, 83 (1976), 521-538.
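The limiting distribution log10(1 + 1/d) is easy to exhibit concretely. The sketch below (an illustration of mine, not from the text) tabulates the leading digits of 2^n, n = 1, …, N, which obey the law because {n log10 2} is uniformly distributed mod 1.

```python
# Hedged sketch: Benford frequencies for the powers of 2.
import math
from collections import Counter

def leading_digit_freqs(N):
    counts = Counter()
    p = 1
    for _ in range(N):
        p *= 2
        counts[int(str(p)[0])] += 1      # first decimal digit of 2^n
    return {d: counts[d] / N for d in range(1, 10)}

if __name__ == "__main__":
    freqs = leading_digit_freqs(5000)
    for d in range(1, 10):
        print(d, round(freqs[d], 4), round(math.log10(1 + 1 / d), 4))
```

With N = 5000 the empirical frequencies already match log10(1 + 1/d) to about two decimal places.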
25.9.
Use Scheffé's theorem.
25.10. Put f_n(x) = P[X_n = y_n + kδ_n]/δ_n for y_n + kδ_n ≤ x < y_n + (k + 1)δ_n. Construct random variables Y_n with densities f_n, and first prove Y_n ⇒ X. Show that Z_n = y_n + [(Y_n − y_n)/δ_n]δ_n has the distribution of X_n and that Y_n − Z_n ⇒ 0.
25.11. For a proof of (25.21) see FELLER, Volume 1, Chapter 7.
25.13. (b) Follow the proof of Theorem 25.8, but approximate I_{(x,y]} instead of I_{(−∞,x]}.
25.22. Let X_n assume the values n and 0 with probabilities p_n = 1/(n log n) and 1 − p_n.
Section 26
26.1. (b) Let μ be the distribution of X. If |φ(t)| = 1 and t ≠ 0, then φ(t) = e^{ita} for some a, and 0 = ∫_{−∞}^{∞} (1 − e^{it(x−a)}) μ(dx) = ∫_{−∞}^{∞} (1 − cos t(x − a)) μ(dx). Since the integral vanishes, μ must confine its mass to the points where the nonnegative integrand vanishes, namely to the points x for which t(x − a) = 2πn for some integer n.
(c) The mass of μ concentrates at points of the form a + 2πn/t and also at points of the form a′ + 2πn/t′. If μ is positive at two distinct points, it follows that t/t′ is rational.

26.3. (a) Let f_0(x) = π^{−1} x^{−2} (1 − cos x) be the density corresponding to φ_0(t). If p_k = (s_k − s_{k+1})t_k, then Σ_{k=1}^∞ p_k = 1; since Σ_{k=1}^∞ p_k φ_0(t/t_k) = φ(t) (check the points t = t_1), φ(t) is the characteristic function of the continuous density Σ_{k=1}^∞ p_k t_k f_0(t_k x).
(b) For example, take d_k = 1 and s_k = k^{−1}(k + 1)^{−1}.
(c) If lim_{t→∞} φ(t) = 0, approximate φ by functions of the kind in part
(a), pass to the limit, and use the first corollary to the continuity theorem. If φ does not vanish at infinity, mix in a unit mass at 0.
26.12. On the right in (26.30) replace φ(t) by the integral defining it and apply Fubini's theorem; the integral average comes to …
Now use the bounded convergence theorem.
26.15. (a) Use (26.40) to prove that |φ_n(t + h) − φ_n(t)| ≤ 2μ_n((−a, a)^c) + a|h|.
(b) Use part (a).
26.17. (a) Use the second corollary to the continuity theorem.

26.19. For the Weierstrass approximation theorem, see RUDIN, Theorem 7.32.

26.23. (a) If a_n goes to 0 along a subsequence, then |φ(t)| ≡ 1; use part (c) of Problem 26.1.
(c) Suppose two subsequences of {a_n} converge to a_0 and to a, where 0 < a_0 < a; put θ = a_0/a and show that |φ(t)| = |φ(θt)|.
(d) Observe that …
Section 27

27.8. By the same reasoning as in Example 27.3, (R_n − log n)/√(log n) ⇒ N.

27.9. The Lindeberg theorem applies: (S_n − n²/4)/√(n³/36) ⇒ N.

27.16. Write ∫_x^∞ e^{−u²/2} du = x^{−1} e^{−x²/2} − ∫_x^∞ u^{−2} e^{−u²/2} du.

27.17. For another approach to large deviation theory, see Mark Pinsky: An elementary derivation of Khintchine's estimate for large deviations, Proc. Amer. Math. Soc., 22 (1969), 288-290.
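The integration-by-parts identity in 27.16 can be verified numerically. The sketch below (added by me, not part of the text) computes the left side of ∫_x^∞ e^{−u²/2} du = x^{−1}e^{−x²/2} − ∫_x^∞ u^{−2}e^{−u²/2} du from math.erfc and the right side by Simpson integration; the integrand decays so fast that truncating at u = 40 costs nothing.

```python
# Hedged sketch: Mills-ratio identity check.
import math

def lhs(x):
    # integral_x^inf e^(-u^2/2) du = sqrt(pi/2) * erfc(x/sqrt(2))
    return math.sqrt(math.pi / 2) * math.erfc(x / math.sqrt(2))

def rhs(x, upper=40.0, steps=100_000):
    # Simpson's rule on u^(-2) e^(-u^2/2) over [x, upper]
    h = (upper - x) / steps
    f = lambda u: math.exp(-u * u / 2) / (u * u)
    s = f(x) + f(upper)
    for i in range(1, steps):
        s += (4 if i % 2 else 2) * f(x + i * h)
    return math.exp(-x * x / 2) / x - s * h / 3

if __name__ == "__main__":
    for x in (0.5, 1.0, 2.0):
        print(x, lhs(x), rhs(x))
```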
27.19. (a) Everything comes from (4.7). If A = [(l_1, …, l_k) ∈ H] and B ∈ σ(l_{k+n}, l_{k+n+1}, …), then

|P(A ∩ B) − P(A)P(B)| ≤ …,

where the sum extends over the k-tuples (i_1, …, i_k) of nonnegative integers in H. The summand vanishes if i_u < k + n for u ≤ k; the remaining terms add to at most 2Σ_{u=1}^k P[l_u ≥ k + n].
(b) To show that σ² = 6 (see (27.26)), show that l_1 has mean 1 and variance 2, and that … if i ≤ n and … if i > n.
Section 28

28.2. Pass to a subsequence along which μ_n(R¹) → ∞, choose ε_n so that it decreases to 0 and ε_n μ_n(R¹) → ∞, and choose x_n so that it increases to ∞ and μ_n(−x_n, x_n) ≥ ½μ_n(R¹); consider the f that satisfies f(±x_n) = ε_n for all n and is defined by linear interpolation in between these points.
28.4. (a) If all functions (28.12) are characteristic functions, they are all certainly infinitely divisible. Since (28.12) is continuous at 0, it need only be exhibited as a limit of characteristic functions. If μ_n has density I_{[−n,n]}(1 + x²) with respect to μ, then

exp[ iγt + ∫_{−∞}^{∞} ( e^{itx} − 1 − itx/(1 + x²) ) ((1 + x²)/x²) μ_n(dx) ]

is a characteristic function and converges to (28.12). It can also be shown that every infinitely divisible distribution (no moments required) has characteristic function of the form (28.12); see GNEDENKO & KOLMOGOROV, p. 76.
(b) Use (see Problem 18.23) −|t| = π^{−1} ∫_{−∞}^{∞} (cos tx − 1) x^{−2} dx.
28.14. If X_1, X_2, … are independent and have distribution function F, then (X_1 + ⋯ + X_n)/√n also has distribution function F. Apply the central limit theorem.
28.15. The characteristic function of Z_n is

exp[ c (1/n) Σ_k (e^{itk/n} − 1)/(|k|/n)^{1+α} ] → exp[ c ∫_{−∞}^{∞} (e^{itx} − 1)/|x|^{1+α} dx ] = exp[ −c|t|^α ∫_{−∞}^{∞} (1 − cos x)/|x|^{1+α} dx ].
Section 29

29.1. (a) If f is lower semicontinuous, [x: f(x) > t] is open. If f is positive, which is no restriction, then ∫ f dμ = ∫_0^∞ μ[f > t] dt ≤ ∫_0^∞ lim inf_n μ_n[f > t] dt ≤ lim inf_n ∫_0^∞ μ_n[f > t] dt = lim inf_n ∫ f dμ_n.
(b) If G is open, then I_G is lower semicontinuous.
29.13. Let Σ be the covariance matrix. Let M be an orthogonal matrix such that the entries of MΣM′ are 0 except for the first r diagonal entries, which are 1. If Y = MX, then Y has covariance matrix MΣM′, and so Y = (Y_1, …, Y_r, 0, …, 0), where Y_1, …, Y_r are independent and have the standard normal distribution. But |X|² = Σ_{i=1}^r Y_i².
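A concrete instance of the fact in 29.13 is easy to simulate. In the sketch below (mine, not from the text) X is centered normal in R³ with covariance the rank-2 orthogonal projection onto the plane perpendicular to (1, 1, 1), so |X|² should be chi-squared with 2 degrees of freedom (mean 2, variance 4).

```python
# Hedged sketch: |X|^2 for a rank-2 projected Gaussian is chi-squared(2).
import random

def sample_sq_norm(rng):
    z = [rng.gauss(0, 1) for _ in range(3)]
    m = sum(z) / 3.0
    x = [zi - m for zi in z]        # project z orthogonally to (1,1,1)
    return sum(xi * xi for xi in x)

if __name__ == "__main__":
    rng = random.Random(42)
    n = 100_000
    vals = [sample_sq_norm(rng) for _ in range(n)]
    mean = sum(vals) / n
    var = sum((v - mean) ** 2 for v in vals) / n
    print(mean, var)                # should be near 2 and 4
```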
29.14. By Theorem 29.5, X_n has asymptotically the centered normal distribution with covariances σ_{ij}. Put x = (p_1^{1/2}, …, p_k^{1/2}) and show that Σx′ = 0, so that 0 is an eigenvalue of Σ. Show that Σy′ = y′ if y is perpendicular to x, so that Σ has 1 as an eigenvalue of multiplicity k − 1. Use Problem 29.13 together with Theorem 29.2 (h(x) = |x|²).

29.15. (a) Note that n^{−1} Σ_{i=1}^n Y_i² ⇒ 1 and that (X_{n1}, …, X_{nr}) has the same distribution as (Y_1, …, Y_r)/(n^{−1} Σ_{i=1}^n Y_i²)^{1/2}.
29.16. Express the probability as a ratio of quantities (19.21), change variables x_i = y_i/n, show that the new integrands converge to exp(−½ Σ_{i=1}^k y_i²), and use the dominated convergence theorem.
Section 30

30.1. Put L_n(ε) = Σ_k ∫_{[|X_{nk}| ≥ ε]} X_{nk}² dP. Choose n_u increasing so that L_n(u^{−1}) ≤ u^{−3} for n ≥ n_u, and put M_n = u^{−1} for n_u ≤ n < n_{u+1}. Put Y_{nk} = X_{nk} I_{[|X_{nk}| < M_n]}. Show that M_n → 0, that Σ_k E[Y_{nk}] → 0 and Σ_k E[Y_{nk}²] → 1, and apply to Σ_k Y_{nk} the central limit theorem under (30.5). Show that Σ_k P[X_{nk} ≠ Y_{nk}] → 0.

30.4. Suppose that the moment generating function M_n of μ_n converges to the moment generating function M of μ in some interval about s. Let ν_n have density e^{sx}/M_n(s) with respect to μ_n, and let ν have density e^{sx}/M(s) with respect to μ. Then the moment generating function of ν_n converges to that of ν in some interval about 0, and hence ν_n ⇒ ν. Show that ∫_{−∞}^{∞} f(x) μ_n(dx) → ∫_{−∞}^{∞} f(x) μ(dx) if f is continuous and has bounded support; see Problem 25.13(b).
30.5. (a) By Hölder's inequality |Σ_{j=1}^k t_j x_j|^r ≤ k^{r−1} Σ_{j=1}^k |t_j x_j|^r, and so Σ_r s^r ∫ |Σ_j t_j x_j|^r μ(dx)/r! has positive radius of convergence. Now

∫_{R^k} (Σ_j t_j x_j)^r μ(dx) = Σ t_1^{r_1} ⋯ t_k^{r_k} a(r_1, …, r_k),

where the summation extends over k-tuples that add to r. Project μ to the line by the mapping Σ_j t_j x_j, apply Theorem 30.1, and use the fact that μ is determined by its values on half-spaces.
30.6.
Use the Cramér-Wold idea.
30.8. Suppose that k = 2 in (30.30). Then

M[(cos λ_1 x)^{r_1} (cos λ_2 x)^{r_2}] = M[ ((e^{iλ_1 x} + e^{−iλ_1 x})/2)^{r_1} ((e^{iλ_2 x} + e^{−iλ_2 x})/2)^{r_2} ].

By (26.33) and the independence of λ_1 and λ_2, the last mean here is 1 if 2j_1 − r_1 = 2j_2 − r_2 = 0 and is 0 otherwise. A similar calculation for k = 1 gives (30.28), and a similar calculation for general k gives (30.30). The actual form of the distribution in (30.29) is unimportant. For (30.31) use the multidimensional method of moments (Problem 30.6) and the mapping theorem. For (30.32) use the central limit theorem; by (30.28), X_1 has mean 0 and variance ½.

30.10. If n^{1/2} < m ≤ n and the inequality in (30.33) holds, then …, which implies log log n ≤ ε^{−2} log² 2. For large n the probability in (30.33) is thus at most 1/√n.
Section 31

31.1. Consider the argument in Example 31.1. Suppose that F has a nonzero derivative at x and let I_n be the set of numbers whose base-r expansions agree in the first n places with that of x. The analogue of (31.16) is P[X ∈ I_{n+1}]/P[X ∈ I_n] → r^{−1}, and the ratio here is one of p_0, …, p_{r−1}. If p_i ≠ r^{−1} for some i, use the second Borel-Cantelli lemma to show that the ratio is p_i infinitely often except on a set of Lebesgue measure 0. (This last part of the argument is unnecessary if r = 2.) The argument in Example 31.3 needs no essential change. The analogue of (31.11) is

F(x) = p_0 + ⋯ + p_{i−1} + p_i F(rx − i),  i/r ≤ x ≤ (i + 1)/r,  0 ≤ i ≤ r − 1.

31.3. (b) Take f_1 = I_{B^{−1}H_0} and f_2 = F; (f_1 f_2)^{−1}{1} = H_0 is not a Lebesgue set.

31.9. Suppose that A is bounded, define μ by μ(B) = λ(B ∩ A), and let F be the corresponding distribution function. It suffices to show that F′(x) = 1 for x in A, apart from a set of Lebesgue measure 0. Let A_ε be the set of x in A for which F′(x) < 1 − ε. From Theorem 31.4(i) deduce that λ(A_ε) = μ(A_ε) ≤ (1 − ε)λ(A_ε) and hence λ(A_ε) = 0. Thus F′(x) ≥ 1 − ε almost everywhere on A. Obviously, F′(x) ≤ 1.

31.11. Let A be the set of x in the unit interval for which F′(x) = 0, take a = 0, and define A_n as in the first part of the proof of Theorem 31.4. Choose n so that λ(A_n) > 1 − ε. Split {1, 2, …, n} into the set M of k for which ((k − 1)/n, k/n] meets A_n and the opposite set N. Prove successively that Σ_{k∈M}[F(k/n) − F((k − 1)/n)] < ε, Σ_{k∈N}[F(k/n) − F((k − 1)/n)] > 1 − ε, Σ_{k∈M} 1/n ≥ λ(A_n) > 1 − ε, Σ_{k=1}^n |F(k/n) − F((k − 1)/n)| > 2 − 2ε.
31.15. Π_{n=1}^∞ (½ + ½ e^{2it/3^n}).

31.18. For x fixed, let u_n and v_n be the pair of successive dyadic rationals of order n (v_n − u_n = 2^{−n}) for which u_n ≤ x < v_n. Show that …, where a_k is the left-hand derivative. Since a_k(x) = ±1 for all x and k, the difference ratio cannot have a finite limit.

31.22. Let A be the x-set where (31.35) fails if f is replaced by fφ; then A has Lebesgue measure 0. Let G be the union of all open sets of μ-measure 0; represent G as a countable disjoint union of open intervals and let B be G together with any endpoints of zero μ-measure of these intervals. Let D be the set of discontinuity points of F. Then (as h ↓ 0)

… → f(φ(F(x))).

Now x − ε < φ(F(x)) ≤ x follows from F(x − ε) < F(x), and hence φ(F(x)) = x. If λ is Lebesgue measure restricted to (0, 1), then μ = λφ^{−1}, and (31.36) follows by change of variable. But (31.36) is easy if x ∈ D, and hence it holds outside B ∪ (D^c ∩ F^{−1}A). But μ(B) = 0 by construction and μ(D^c ∩ F^{−1}A) = 0 by Problem 14.5.
Section 32

32.7. Define μ_n and ν_n as in (32.7), and write ν_n = ν_ac^{(n)} + ν_s^{(n)}, where ν_ac^{(n)} is absolutely continuous with respect to μ_n and ν_s^{(n)} is singular with respect to μ_n. Take ν_ac = Σ_n ν_ac^{(n)} and ν_s = Σ_n ν_s^{(n)}. Suppose that ν(E) = ν′_ac(E) + ν′_s(E) for all E in 𝔉 is another such decomposition. Choose an S in 𝔉 that supports ν′_s and satisfies μ(S) = 0. Then ν_ac(E) = ν_ac(E ∩ S^c) + ν_s(E ∩ S^c) = ν(E ∩ S^c) = ν′_ac(E ∩ S^c) + ν′_s(E ∩ S^c) = ν′_ac(E ∩ S^c) = ν′_ac(E). A similar argument shows that ν_s(E) = ν′_s(E).

32.9. Define f and ν_s as in (32.8), and let f⁰ and ν_s⁰ be the corresponding function and measure for 𝔉₀: ν(E) = ∫_E f⁰ dμ + ν_s⁰(E) for E ∈ 𝔉₀, and there is an 𝔉₀-set S₀ such that ν_s⁰(Ω − S₀) = 0 and μ(S₀) = 0. If E ∈ 𝔉₀, then ∫_{E−S₀} f⁰ dμ = ∫_E f⁰ dμ and ∫_{E−S₀} f dμ = ∫_E f dμ. It is instructive to consider the extreme case 𝔉₀ = {∅, Ω}, in which ν is absolutely continuous with respect to μ (provided μ(Ω) > 0) and hence ν_s⁰ vanishes.
Section 33

33.2. (a) To prove independence, check the covariance. Now use Example 33.7.
(b) Use the fact that R and θ are independent (Example 20.2).
(c) As the single event [θ = π/4] ∪ [θ = 5π/4] = [X = Y] = [X − Y = 0] has probability 0, the conditional probabilities have no meaning, and strictly speaking there is nothing to resolve. But whether it is natural to regard the degrees of freedom as one or as two depends on whether the 45° line through the origin is regarded as an element of the decomposition of the plane into lines or whether it is regarded as the union of two elements of the decomposition of the plane into rays from the origin. Borel's paradox can be explained the same way: the equator is an element of the decomposition of the sphere into lines of constant latitude; the Greenwich meridian is an element of the decomposition of the sphere into great circles with common poles. The decomposition matters, which is to say the σ-field matters.
33.3. (a) If the guard says, "1 is to be executed," then the conditional probability that 3 is also to be executed is 1/(1 + p). The "paradox" comes from assuming that p must be 1, in which case the conditional probability is indeed ½. If p = ½, then the guard gives no information.
(b) Here "one" and "other" are undefined, and the problem ignores the possibility that you have been introduced to a girl. Let the sample space consist of the points

bb_o: α/4, bb_y: (1 − α)/4, bg_o: β/4, bg_y: (1 − β)/4, gb_o: γ/4, gb_y: (1 − γ)/4, gg_o: δ/4, gg_y: (1 − δ)/4.

For example, bg_o is the event (probability β/4) that the older child is a boy, the younger is a girl, and the child you have been introduced to is the older; and gg_y is the event (probability (1 − δ)/4) that both children are girls and the one you have been introduced to is the younger. Note that the four sex distributions do have probability ¼. If the child you have been introduced to is a boy, then the conditional probability that the other child is also a boy is p = 1/(2 + β − γ). If β = 1 and γ = 0 (the parents present a son if they have one), then p = ⅓. If β = γ (the parents are indifferent), then p = ½. Any p between ⅓ and 1 is possible. This problem shows again that one must keep in mind the entire experiment the sub-σ-field 𝒢 represents, not just one of the possible outcomes of the experiment.
33.6.
There is no problem, unless the notation gives rise to the illusion that p(A|x) is P(A ∩ [X = x])/P[X = x].
33.18. If N is a standard normal variable, then

p_ν(x + y/√n) = (2π)^{−1/2} e^{−x²/2} f(y + x/√n) / E[f(y + N/√n)].

Section 34
34.3.
If (X, Y) takes the values (0, 0), (1, −1), and (1, 1) with probability ⅓ each, then X and Y are dependent but E[Y‖X] = E[Y] = 0. If (X, Y) takes the values (−1, 1), (0, −2), and (1, 1) with probability ⅓ each, then E[X] = E[Y] = E[XY] = 0 and so E[XY] = E[X]E[Y], but E[Y‖X] = Y ≠ 0 = E[Y]. Of course, this is another example of dependent but uncorrelated random variables.
34.4.
First show that ∫ f dP_0 = ∫_B f dP/P(B) and that P[B‖𝒢] > 0 on a set of P_0-measure 1. Let G be the general set in 𝒢.
(a) Since

∫_G P_0[A‖𝒢] P[B‖𝒢] dP = ∫_G P_0[A‖𝒢] I_B dP = ∫_B I_G P_0[A‖𝒢] dP,

it follows that

P_0[A‖𝒢] P[B‖𝒢] = P[A ∩ B‖𝒢]

holds on a set of P-measure 1.
(b) If P_i(A) = P(A|B_i), then

… = ∫_{G∩B_i} P[A‖𝒢 ∨ ℋ] dP.
34.9.
34.9. All such results can be proved by imitating the proofs for the unconditional case or else by using Theorem 34.5 (for part (c), as generalized in Problem 34.1).
34.10. (a) If Y = X − E[X‖𝒢_1], then X − E[X‖𝒢_2] = Y − E[Y‖𝒢_2], and E[(Y − E[Y‖𝒢_2])²‖𝒢_2] = E[Y²‖𝒢_2] − E²[Y‖𝒢_2] ≤ E[Y²‖𝒢_2]. Take expected values.
34.11. First prove that …. From this and (i) deduce (ii). From (ii) and the preceding equation deduce ∫_{A_1∩A_2} P[A_3‖𝒢_2] dP = ∫_{A_1∩A_2} P[A_3‖𝒢_{12}] dP. The sets A_1 ∩ A_2 form a π-system generating 𝒢_{12}.
34.16. (a) Obviously (34.15) implies (34.14). If (34.14) holds, then clearly (34.15) holds for X simple. For the general X, choose simple X_k such that lim_k X_k = X and |X_k| ≤ |X|. Note that

∫_{A_n} X dP − α ∫ X dP = …;

let n → ∞ and then let k → ∞.
(b) If Ω ∈ ℬ, then the class of E satisfying (34.14) is a λ-system, and so by the π-λ theorem and part (a), (34.15) holds if X is measurable σ(ℬ). Since A_n ∈ σ(ℬ), it follows that

∫_{A_n} X dP = ∫_{A_n} E[X‖σ(ℬ)] dP → α ∫ E[X‖σ(ℬ)] dP = α ∫ X dP.

(c) Replace X by X dP_0/dP in (34.15).

34.17. (a) The Lindeberg-Lévy theorem.
(b) Chebyshev's inequality.
(c) Theorem 25.4.
(d) Independence of the X_n.
(e) Problem 34.16(b).
(f) Problem 34.16(c).
(g) Part (b) here and the ε-δ definition of absolute continuity.
(h) Theorem 25.4 again.
34.19. See S. M. Samuels: The Radon-Nikodym theorem as a theorem in probability, Amer. Math. Monthly, 85 (1978), 155-165, and SAKS, p. 34.
Section 35

35.4. (b) Let S_n be the number of k such that 1 ≤ k ≤ n and Y_k = 1. Then X_n = 3^{S_n}/2^n. Take logarithms and use the strong law of large numbers.
35.9. Let K bound |X_1| and the |X_n − X_{n−1}|. Bound |X_τ| by Kτ. Write ∫_{[τ≤k]} X_τ dP = Σ_{i=1}^k ∫_{[τ=i]} X_i dP = Σ_{i=1}^k ( ∫_{[τ≥i]} X_i dP − ∫_{[τ≥i+1]} X_i dP ). Transform the last integral by the martingale property and reduce the expression to E[X_1] − ∫_{[τ>k]} X_{k+1} dP. Now

∫_{[τ>k]} X_{k+1} dP ≤ K(k + 1) P[τ > k] ≤ K(k + 1) k^{−1} ∫_{[τ>k]} τ dP → 0.
35.13. (a) By the result in Problem 32.9, X_1, X_2, … is a supermartingale. Since E[|X_n|] = E[X_n] ≤ ν(Ω), Theorem 35.4 applies.
(b) If A ∈ 𝔉_k, then ∫_A (Y_n + Z_n) dP + α_n(A) = ∫_A X_∞ dP + α_∞(A) = ν(A) = ∫_A X_n dP + α_n(A). Since the Lebesgue decomposition is unique (Problem 32.1), Y_n + Z_n = X_n with probability 1. Since X_n and Y_n converge, so does Z_n. If A ∈ 𝔉_k and n ≥ k, then ∫_A Z_n dP ≤ α_∞(A), and by Fatou's lemma the limit Z satisfies ∫_A Z dP ≤ α_∞(A). This holds for A in ⋃_k 𝔉_k and hence (monotone class theorem) for A in 𝔉. Choose A so that P(A) = 1 and α_∞(A) = 0: E[Z] = ∫_A Z dP ≤ α_∞(A) = 0. It can happen that α_n(Ω) = 0 and α_∞(Ω) = ν_s(Ω) > 0, in which case α_n does not converge to α_∞ and the X_n cannot be integrated to the limit.

35.19. For a very general result, see J. L. Doob: Application of the theory of martingales, Le Calcul des Probabilités et ses Applications (Colloques Internationaux du Centre de la Recherche Scientifique, Paris, 1949).

35.22. (a) To prove that two sequences must have the same limit, interlace them. To prove the last assertion choose n and K so that |E[X_τ]| ≤ K for τ ∈ T, and note that τ ∨ n ∈ T and

|E[X_{τ∨n}]| ≤ Σ_{k=1}^n E[|X_k|] + |E[X_τ]|.

35.24. (a) Use Problem 13.17.
(b) If A_{mq} is the set where |X_n − Y| < ε for some n in the range m ≤ n ≤ q, then P(⋃_q A_{mq}) = 1 for each ε and m because Y is a cluster variable, and hence P(A_{mq}) ↑ 1 as q ↑ ∞.
(c) By parts (a) and (b) there is probability exceeding 1 − 2/k² that |X_n − Y_k| < 2/k for some n in the range n_k ≤ n ≤ q_k, in which case |X_{τ_k} − Y_k| < 2/k. Thus P[|X_{τ_k} − Y| ≥ 3/k] ≤ 3/k².
35.25. (a) Of course, {X_n⁺} is L¹-bounded. To show that it is an amart, it suffices (Problem 35.22(b)) to show that E[X_{τ_n}⁺] converges if τ_n ∈ T and τ_n ≥ τ_{n−1}. Suppose that τ_n ≤ q and put σ_{n,q} = τ_n on [X_{τ_n} ≥ 0] and σ_{n,q} = q on [X_{τ_n} < 0]. Check that σ_{n,q} ∈ T and

E[X_{τ_n}⁺] = ∫_{[X_{τ_n} ≥ 0]} X_{τ_n} dP = E[X_{σ_{n,q}}] − ∫_{[X_{τ_n} < 0]} X_q dP.

Now deduce (Problem 35.22(a)) that lim sup_n E[X_{τ_n}⁺] ≤ lim inf_q E[X_q⁺]. Thus lim_n E[X_{τ_n}⁺] exists, although it might be infinite. Choose integers p_n such that τ_n ≤ p_n and put σ_n = τ_n on [X_{τ_n} ≥ 0] and σ_n = p_n on [X_{τ_n} < 0].
Check that σ_n ∈ T and use the estimate E[X_{τ_n}⁺] ≤ E[X_{σ_n}] + E[|X_{p_n}|] along with Problem 35.22(a) and the assumption that {X_n} is L¹-bounded.
(b) Obviously, sums and differences of L¹-bounded amarts are again L¹-bounded amarts; note that X_n ∨ a = X_n + (a − X_n)⁺ and X_n ∧ a = X_n − (X_n − a)⁺.

35.26. If σ_{a,p} = p ∧ inf[n: |X_n| ≥ a], then σ_{a,p} ∈ T and aP[max_{n≤p} |X_n| ≥ a] ≤ aP[|X_{σ_{a,p}}| ≥ a] ≤ E[|X_{σ_{a,p}}|].

35.27. (a) Use Problem 35.25(b) and the fact that X_n^{(a)} = −a ∨ (X_n ∧ a).
(b) Use Problem 35.26.
(c) Combine parts (a) and (b) with Problem 35.24.

35.28. (a) Since E[X_n] increases to some limit, it suffices to show that n ≤ τ ≤ p implies that E[X_n] ≤ E[X_τ] ≤ E[X_p]. To prove the left-hand inequality, note that

E[X_n] = ∫_{[τ≥n]} X_n dP = ∫_{[τ=n]} X_τ dP + ∫_{[τ>n]} X_n dP.

(b) For example, let X be a bounded, positive random variable and put X_n = (−1)ⁿ X/n.
Section 36

36.6. The analogue in sequence space of a cylinder in R^T is [ω: (d_{i_1}(ω), …, d_{i_k}(ω)) ∈ H], where H is a set of k-long sequences of 0's and 1's. The special cylinders considered in this problem, where H consists of a single sequence, might then be called "thin" cylinders. Assign thin cylinders probabilities P[ω: d_i(ω) = u_i, i ≤ n] = p_n(u_1, …, u_n). Show that if r < s, then

p_r(u_1, …, u_r) = Σ p_s(u_1, …, u_r, v_{r+1}, …, v_s),

where the sum extends over sequences (v_{r+1}, …, v_s) of 0's and 1's. Deduce from this that if r < s and a thin cylinder of rank r is the disjoint union of thin cylinders of rank s, then the P-values add. Following the arguments in the notes on Problem 2.16, extend P to the field 𝔉_0 of finite disjoint unions of thin cylinders, and prove consistency and finite additivity. Since P is automatically countably additive on 𝔉_0, it can be extended to 𝔉 = σ(𝔉_0). This proof parallels that of Kolmogorov's theorem but is simpler, since regularity of measures is irrelevant and finite additivity on 𝔉_0 implies countable additivity.
36.9.
(b) Show by part (a) and Problem 34.18 that f_n is the conditional expected value of f with respect to the σ-field 𝒢_n generated by the coordinates x_{n+1}, x_{n+2}, …. By Theorem 35.7 (see Problem 35.17), (36.27) will follow if each set in ⋂_n 𝒢_n has π-measure either 0 or 1, and here the zero-one law applies.
(c) Show that g_n is the conditional expected value of f with respect to the σ-field generated by the coordinates x_1, …, x_n, and apply Theorem 35.5.
36.11. Let 𝔏 be the countable set of simple functions Σ_i a_i I_{A_i} for a_i rational and {A_i} a finite decomposition of the unit interval into subintervals with rational endpoints. Suppose that the X_t exist and choose (Theorem 17.1) Y_t in 𝔏 so that E[|X_t − Y_t|] < ⅓. From E[|X_s − X_t|] = 1, conclude that E[|Y_s − Y_t|] > 0 for s ≠ t. But there are only countably many of the Y_t. It does no good to replace Lebesgue measure by some other measure on the unit interval.
36.12. That over D the path function for ω is integer-valued and nondecreasing with probability 1 is easy to prove. Suppose that ω has this property and that the path restricted to D has somewhere a jump exceeding 1. Then for some t_0 and for every sufficiently large n there is a k ≤ 2ⁿ such that N_{t_0 + k/2ⁿ}(ω) − N_{t_0 + (k−1)/2ⁿ}(ω) ≥ 2. To prove that this has probability 0, use the fact that P[N_{t+h} − N_t ≥ 2] ≤ Kh², where K is independent of t and h.
36.13. Put μ_1 = ν and inductively define μ_n(M × H) = ∫_M ν_n(H; x_1, …, x_{n−1}) dμ_{n−1}(x_1, …, x_{n−1}) for M ∈ ℛ^{n−1} and H ∈ ℛ¹; see Problem 18.25. Prove consistency and let {X_1, X_2, …} have finite-dimensional distributions μ_1, μ_2, …. Let P′_n(H, ω) = ν_n(H; X_1(ω), …, X_{n−1}(ω)) and prove by a change of variable that ∫_G P′_n(H, ω) P(dω) = P(G ∩ [X_n ∈ H]) for G = [(X_1, …, X_{n−1}) ∈ M]. In the sense of Theorem 33.3, P′_n(H, ω) is thus a conditional distribution for X_n given σ[X_1, …, X_{n−1}]. For the part concerning integrals use Theorem 34.5. If A supports ν and supports ν_n(·; x_1, …, x_{n−1}) for (x_1, …, x_{n−1}) ∈ A × ⋯ × A, then P(X_1 ∈ A, …, X_n ∈ A) = μ_n(A × ⋯ × A) = 1 for all n. In such a case it makes no difference how ν_n(H; x_1, …, x_{n−1}) is defined outside A × ⋯ × A.
Section 37

37.2. (a) Use Problem 36.10(b).
(b) Let [W_t: t ≥ 0] be a Brownian motion on (Ω_0, 𝔉_0, P_0), where W(·, ω) ∈ C for every ω. Define ξ: Ω_0 → R^T by Z_t(ξ(ω)) = W_t(ω). Show that ξ is measurable 𝔉_0/ℛ^T and P = P_0 ξ^{−1}. If C ⊂ A ∈ ℛ^T, then P(A) = P_0(ξ^{−1}A) = P_0(Ω_0) = 1.
37.3. Consider W(1) = Σ_{k=1}^n (W(k/n) − W((k − 1)/n)) for notational convenience. Since

n ∫_{[|W(1/n)| ≥ ε]} W²(1/n) dP = ∫_{[|W(1)| ≥ ε√n]} W²(1) dP → 0,

the Lindeberg theorem applies.
37.12. By symmetry,

p(s, t) = 2P[ W_s > 0, inf_{s≤u≤t} (W_u − W_s) ≤ −W_s ];

W_s and the infimum here are independent because of the Markov property, and so by (20.30) (and symmetry again)

p(s, t) = 2 ∫_0^∞ P[τ_x ≤ t − s] (2πs)^{−1/2} e^{−x²/2s} dx
        = 2 ∫_0^∞ ∫_0^{t−s} (x/√(2π)) u^{−3/2} e^{−x²/2u} du (2πs)^{−1/2} e^{−x²/2s} dx.

Reverse the order of integration, use ∫_0^∞ x e^{−x²r/2} dx = 1/r, and put v = (s/(s + u))^{1/2}:

p(s, t) = (1/π) ∫_0^{t−s} ( s^{1/2}/((u + s) u^{1/2}) ) du = (2/π) ∫_{√(s/t)}^1 dv/√(1 − v²) = (2/π) arccos √(s/t).
Bibliography

HALMOS and SAKS have been the strongest measure-theoretic and DOOB and FELLER the strongest probabilistic influences on this book, and the spirit of KAC's small volume has been very important.

AUBREY: Brief Lives, John Aubrey, ed. O. L. Dick. Secker and Warburg, London, 1949.
BAHADUR: Some Limit Theorems in Statistics, R. R. Bahadur. SIAM, Philadelphia, 1971.
BILLINGSLEY: Ergodic Theory and Information, Patrick Billingsley. Wiley, New York, 1965.
BILLINGSLEY': Convergence of Probability Measures, Patrick Billingsley. Wiley, New York, 1968.
BIRKHOFF: Lattice Theory, rev. ed., Garrett Birkhoff. American Mathematical Society, Providence, Rhode Island, 1961.
BIRKHOFF & MAC LANE: A Survey of Modern Algebra, 4th ed., Garrett Birkhoff and Saunders Mac Lane. Macmillan, New York, 1977.
BREIMAN: Probability, Leo Breiman. Addison-Wesley, Reading, Massachusetts, 1968.
CHUNG: A Course in Probability Theory, 2nd ed., Kai Lai Chung. Academic, New York, 1974.
CHUNG': Markov Chains with Stationary Transition Probabilities, 2nd ed., Kai Lai Chung. Springer-Verlag, New York, 1967.
ÇINLAR: Introduction to Stochastic Processes, Erhan Çinlar. Prentice-Hall, Englewood Cliffs, New Jersey, 1975.
CRAMÉR: Mathematical Methods of Statistics, Harald Cramér. Princeton University Press, Princeton, New Jersey, 1946.
DOOB: Stochastic Processes, J. L. Doob. Wiley, New York, 1953.
DUBINS & SAVAGE: How to Gamble If You Must, Lester E. Dubins and Leonard J. Savage. McGraw-Hill, New York, 1965.
DYNKIN & YUSHKEVICH: Markov Processes, English ed., Evgenii B. Dynkin and Aleksandr A. Yushkevich. Plenum Press, New York, 1969.
FEDERER: Geometric Measure Theory, Herbert Federer. Springer-Verlag, New York, 1969.
FELLER: An Introduction to Probability Theory and Its Applications, Vol. I, 3rd ed., Vol. II, 2nd ed., William Feller. Wiley, New York, 1968, 1971.
GALAMBOS: The Asymptotic Theory of Extreme Order Statistics, Janos Galambos. Wiley, New York, 1978.
GNEDENKO & KOLMOGOROV: Limit Distributions for Sums of Independent Random Variables, English ed., B. V. Gnedenko and A. N. Kolmogorov. Addison-Wesley, Reading, Massachusetts, 1954.
HALMOS: Measure Theory, Paul R. Halmos. Van Nostrand, New York, 1950.
HALMOS': Naive Set Theory, Paul R. Halmos. Van Nostrand, Princeton, 1960.
HARDY: A Course of Pure Mathematics, 9th ed., G. H. Hardy. Macmillan, New York, 1946.
HARDY & WRIGHT: An Introduction to the Theory of Numbers, 4th ed., G. H. Hardy and E. M. Wright. Clarendon, Oxford, 1959.
HAUSDORFF: Set Theory, 2nd English ed., Felix Hausdorff. Chelsea, New York, 1962.
JECH: Set Theory, Thomas Jech. Academic Press, New York, 1978.
KAC: Statistical Independence in Probability, Analysis and Number Theory, Carus Math. Monogr. 12, Marc Kac. Wiley, New York, 1959.
KAHANE: Some Random Series of Functions, Jean-Pierre Kahane. Heath, Lexington, Massachusetts, 1968.
KARLIN & TAYLOR: A First Course in Stochastic Processes, 2nd ed., and A Second Course in Stochastic Processes, Samuel Karlin and Howard M. Taylor. Academic Press, New York, 1975 and 1981.
KINGMAN & TAYLOR: Introduction to Measure and Probability, J. F. C. Kingman and S. J. Taylor. Cambridge University Press, Cambridge, 1966.
KOLMOGOROV: Grundbegriffe der Wahrscheinlichkeitsrechnung, Erg. Math., Vol. 2, No. 3, A. N. Kolmogorov. Springer-Verlag, Berlin, 1933.
LÉVY: Théorie de l'Addition des Variables Aléatoires, Paul Lévy. Gauthier-Villars, Paris, 1937.
LOÈVE: Probability Theory I, 4th ed., M. Loève. Springer-Verlag, New York, 1977.
LUKACS: Characteristic Functions, Eugene Lukacs. Griffin, London, 1960.
NEVEU: Mathematical Foundations of the Calculus of Probability, English ed., J. Neveu. Holden-Day, San Francisco, 1965.
RADO: Length and Area, Tibor Radó. American Mathematical Society, Providence, Rhode Island, 1948.
RÉNYI: Probability Theory, A. Rényi. North-Holland, Amsterdam, 1970.
RÉNYI': Foundations of Probability, Alfréd Rényi. Holden-Day, San Francisco, 1970.
RIESZ & SZ.-NAGY: Functional Analysis, English ed., Frigyes Riesz and Béla Sz.-Nagy. Ungar, New York, 1955.
ROYDEN: Real Analysis, 2nd ed., H. L. Royden. Macmillan, New York, 1968.
RUDIN: Principles of Mathematical Analysis, 3rd ed., Walter Rudin. McGraw-Hill, New York, 1976.
RUDIN': Real and Complex Analysis, 2nd ed., Walter Rudin. McGraw-Hill, New York, 1974.
SAKS: Theory of the Integral, 2nd rev. ed., Stanislaw Saks. Hafner, New York, 1937.
SKOROKHOD: Studies in the Theory of Random Processes, English ed., A. V. Skorokhod. Addison-Wesley, Reading, Massachusetts, 1965.
SPITZER: Principles of Random Walk, Frank Spitzer. Van Nostrand, Princeton, New Jersey, 1964.
SPIVAK: Calculus on Manifolds, Michael Spivak. W. A. Benjamin, New York, 1965.
List of Symbols

Ω, 2, 16
(a, b], 5
I_A, 564
P, 2, 20
d_n(ω), 3
r_n(ω), 5
s_n(ω), 6
N, 8, 366
A − B, 564
A^c, 564
A △ B, 564
A ⊂ B, 564
ℱ-set, 18
ℱ₀, 18
σ(𝒜), 19
𝒥, 20
ℱ, 20
(Ω, ℱ, P), 21
A_n ↑ A, 564
A_n ↓ A, 564
λ, 24, 40, 166
∧, 59
∨, 59
l(·), 29
P_n(A), 30
D(A), 30
P*, 33, 42
P_*, 33
ℒ, 36
A*, 41
P(B|A), 46
lim sup_n A_n, 46
lim inf_n A_n, 46
lim_n A_n, 47
i.o., 47
𝒯, 57, 295
R¹, 565
R^k, 567
[X = x], 63
σ(X), 64, 260
μ, 68, 157, 261
E[X], 70, 280
Var[X], 73, 282
E_n[f], 78
s_c(a), 89
N_n, 91
τ, 96, 130, 486
p_ij, 107
S, 107
π_i, 108
M(t), 142, 285, 293
c(t_k), 144
ℛ^k, 155
x_k ↑ x, 565
x = Σ_k x_k, 157
μ*, 162
ℳ(μ*), 163
λ₁, 166
dist, 168
λ_k, 172
F, 175, 177
Δ_A F, 177
T⁻¹A′, 565
ℱ/ℱ′, 182
F_n ⇒ F, 192, 335, 390
dF(x), 230
X × Y, 234
𝒳 × 𝒴, 234
μ × ν, 236
diam, 247
h_m, 248
e_m, 249
|D|, 252
∗, 272
X_n →_P X, 274, 340
L^p(Ω, ℱ, P), 289
N_t, 308
μ_n ⇒ μ, 336, 390
X_n ⇒ X, 338, 390
X_n ⇒ a, 341
∂A, 566
φ(t), 351
μ_n → μ, 382
F_s, 434
F_ac, 434
ν ≪ μ, 442
dν/dμ, 443
μ_s, 445
ν_ac, 445
P[A‖𝒢], 451
P[A‖X_t, t ∈ T], 454
E[X‖𝒢], 466
R^T, 508
ℛ^T, 509
W_t, 522
Index

Here An refers to paragraph n of the Appendix (pp. 564-574); u.v refers to Problem v in Section u, or else to a note on it (Notes on the Problems, pp. 575-609); the other references are to pages. Greek letters are alphabetized by their Roman equivalents (m for μ, and so on). Names in the bibliography are not indexed separately.

Abel, 244
Absolute continuity, 434, 436, 31.7, 442, 443, 32.3
Absolutely continuous part, 446
Absolute moment, 281
Absorbing state, 109
Adapted σ-fields, 480
Additive set function, 440, 32.11
Additivity:
countable, 158
finite, 21, 23, 158
Algebra, 17
Almost always, 48
Almost everywhere, 54, 420
Almost surely, 54
Amart, 504
α-mixing, 375, 29.17
Aperiodic, 122
Approximation of measure, 166
Arc, 19.3
Arc length, 19.3, 31.11
Area over the curve, 74
Area under the curve, 206
Arrival times, 309
Asymptotic equipartition property, 6.14, 8.30
Atom, 20.16
Axiom of choice, A8, 18
Bahadur, 9.4
Baire category, A15, 1.13
Baire function, 13.10
Bayes, 59, 33.9, 33.17, 496
Benford's law, 25.3
Beppo Levi's theorem, 16.9
Bernoulli-Laplace model of diffusion, 108
Bernoulli trials, 80, 143, 148, 9.3, 368
Bernstein polynomial, 82
Bernstein's theorem, 82
Beta function, 20.24
Betting system, 95, 485
Billingsley, 19.12, 392
Binary digit, 3
Binary expansion, 3
Binomial distribution, 262
Binomial series, 17.11
Birkhoff, 42, 173
Blackwell, 476
Bold play, 98
Boole's inequality, 23
Borel, A13, 9
Closure, A11
Borel-Cantelli lemmas, 53, 55, 58, 83, 21.6, 22.4
Cluster variable, 35.23
Co-countable set, 18
Borel function, 184, 13.10, 31.3
Cofinite set, 18
Borel set, 20, 3.14, 155, 433
Borel's normal number theorem, 9, 6.9
Borel's paradox, 33.1
Collective, 7.3
Complement, A1
Boundary, A11, 265
Completely normal number, 6.13
Bounded convergence theorem, 214
Complete space or measure, 39, 10.6, 533
Bounded gambling policy, 97
Completion, 3.5, 10.6
Bounded variation, 435
Complex functions, integration of, 221
Branching process, 483
Compound Poisson distribution, 28.3
Britannica, 1.5
Compound Poisson process, 23.8
Brown, 314
Concentrated, 158
Brownian motion, 522, 529, 553, 559, 560
Burstin's theorem, 22.16
Conditional distribution, 460, 471, 36.13
Compact, A13
Conditional expected value, 131, 466, 467, 34.19
Canonical measure, 384
Conditional probability, 46, 448, 451, 454, 33.5, 33.13
Canonical representation, 384
Cantelli's inequality, 5.5
Consistency conditions for finite-dimensional distributions, 507
Cantelli's theorem, 6.6
Content, 3.15
Cantor function, 31.2, 31.15
Continuity from above, 23, 159, 177, 265
Cantor set, 1.6, 1.9, 3.16, 4.23, 12.8
Continuity from below, 23, 159, 265
Carathéodory's condition, 168
Continuity of paths, 524
Cardinality of σ-fields, 2.11
Continuity set, 344, 390
Cartesian product, 234
Continuity theorem for characteristic functions, 359, 396
Category, A15
Cauchy distribution, 20.20, 21.5, 22.6, 358, 26.9, 28.5, 28.13
Conventions involving ∞, 157, 202
Cauchy's equation, A20, 14.11
Convergence in distribution, 338, 390
Cavalieri's principle, 18.8
Convergence in measure, 274
Central limit theorem, 300, 366, 27.14, 27.15, 387, 398, 408, 34.17, 497
Convergence in probability, 274, 299, 340
Central symmetrization, 251
Convergence with probability 1, 274, 299
Cesàro averages, A30, 25.16
Convergence of types, 195
Change of variable, 218, 228, 229, 252, 264, 280
Convergence in mean, 21.22
Convergence of random series, 298
Convex functions, A32, 75
Convolution, 272
Characteristic function, A5, 351, 395
Coordinate function, 509
Chebyshev's inequality, 5, 75, 5.5, 81, 283, 34.9
Coordinate variable, 509
Coordinate-variable process, 510
Chernoff's theorem, 147
Countable, 8
Chi-squared distribution, 20.22, 21.35, 29.14, 462
Countable additivity, 20, 158
Countable subadditivity, 23, 159, 162, 164
Chi-squared statistic, 29.14
Countably additive, 20
Circular Lebesgue measure, 13.19
Countably generated σ-field, 2.10, 20.1
Class of sets, 16
Countably infinite, 8
Closed set, A11
Counting measure, 158
Closed set of states, 8.23
Coupled chain, 123, 8.16
Closed support, 12.9
Coupon problem, 372
Covariance, 284
Double-or-nothing, 95, 7.6
Cover, A3
Double series, A27
Cramér-Wold theorem, 397, 30.6
Doubly stochastic matrix, 8.22
Cumulant, 144
Dubins-Savage theorem, 99
Cumulant generating function, 144
Dyadic expansion, A31, 3
Cylinder, 2.16, 509, 36.6
Dyadic interval, 4
Dyadic rational, 4
Daniell-Stone theorem, 11.12, 16.25
Dynkin, 110
Dynkin's π-λ theorem, 37
Darboux-Young definition, 208
δ-distribution, 194
Decomposition, A3
Definite integral, 202, 203
ε-δ definition of absolute continuity, 443, 32.3
Egorov's theorem, 13.13
Degenerate distribution function, 195
Empirical distribution function, 275
Delta method, 368, 27.10, 27.11, 402, 29.8
Empty set, A1
DeMoivre-Laplace theorem, 25.11, 368
DeMorgan's law, A6
Entropy, 52, 6.14, 8.30, 31.17
Equicontinuous, 26.15
Dense, A15
Equivalence class, 52
Density of measure, 216, 440
Equivalent measures, 442, 472
Density point, 31.9
Erdős-Kac central limit theorem, 413
Density of random variable or distribution, 262, 264, 266
Erlang density, 23.2
Essential supremum, 5.11
Density of set of integers, 2.15
Estimation, 475
Denumerable probability, 45, 552
Etemadi, 290, 297
Dependent random variables, 375
Euclidean distance, A16
Derivates, 423
Euclidean space, A1, A16
Derivatives of integrals, 421
Diagonal method, A14
Euler, 18.17, 18.22
Euler's function, 2.15
Difference, A1
Event, 16
Difference equation, A19
Excessive function, 131, 8.33
Differential equations and the Poisson process, 317
Existence of independent sequences, 68, 271
Existence of Markov chains, 112
Dirichlet's theorem, A26, 20
Expected value, 70, 280
Discontinuity of the first kind, 561
Exponential convergence, 128
Discrete measure, 21, 158
Exponential distribution, 191, 263, 20.25, 286, 307, 327, 331, 358, 28.5
Discrete random variable, 261
Discrete space, 1.1, 21
Extension of measure, 32, 164, 11.1, 11.5
Disjoint, A3
Extremal distribution, 197
Disjoint supports, 430, 442
Distribution:
of random variable, 68, 189, 261
of random vector, 265
Factorization and sufficiency, 472
Distribution function, 176, 189, 261, 265, 429
Fair game, 88, 480, 485
Fatou's lemma, 212, 16.4, 34.8
Dividing cake, 2.18
Federer, 19.8
Dominated convergence theorem, 72, 213, 214, 16.6, 21.21, 348
Feller, 1, 8.19, 108, 25.3, 25.11, 610
Feller's theorem, 374
Dominated measure, 442, 472, 494
Field, 17, 19, 2.5
Doob, 18.11, 490, 35.19
Double exponential distribution, 358
Finite additivity, 21, 23, 158
Double integral, 237
Finite-dimensional distributions, 319, 506
Finite or countable, 8
Finite-dimensional sets, 509
Independent classes, 50
Finitely additive field, 18
Independent events, 48, 453
Finite measure, 157
Independent increments, 309, 522, 37.3
Finite subadditivity, 22, 159
First Borel-Cantelli lemma, 53
Independent random variables, 66, 73, 192, 267, 284, 287, 290, 355, 454
First category, A15, 1.13
Independent random vectors, 268, 269, 454
First passage, 115
Indicator, A5
Fisher, 199
Infinite convolutions, 26.22
Fixed discontinuity, 314
Infinitely divisible distributions, 382, 385
Fourier series, 361, 26.26, 27.21
Infinitely often, 47
Fourier transform, 352
Infinite series, A25
Fréchet, 199
Infinitesimal array, 373
Frequency, 5
Fubini's theorem, 238, 36.9
Information, 52
Functional central limit theorem, 548
Information source, 6.14, 8.30
Fundamental in probability, 20.28
Initial digit problem, 25.3
Initial probabilities, 108
Fundamental theorem of calculus, 227, 419
Inner boundary, 58, 85
Inner measure, 33, 3.2
Integrable, 203
Galambos, 199
Integral, 202
Gambling policy, 96
Integral with respect to Lebesgue measure, 224
Gamma distribution, 20.23, 26.8, 28.5
Integrals of derivatives, 433
Gamma function, 18.22
Integration by parts, 239
Generated σ-field, 19, 64, 13.5, 260
Integration over sets, 215
Glivenko-Cantelli theorem, 275, 25.2
Integration with respect to a density, 217
Gnedenko, 199, 28.4
Goncharov's theorem, 371
Interarrival time, 322
Interior, A11
Interval, A9
Invariance principle, 547, 37.14
Hahn decomposition, 441, 32.2, 34.19
Inventory, 323
Halmos, A8, 39, 11.6, 610
Inverse image, A7, 182
Hamel basis, 14.11
Inversion formula, 355
Hardy, 1.10, 413, 564
Irreducible chain, 117
Hardy-Ramanujan theorem, 6.17, 30.9, 30.12
Irregular paths, 530
Hausdorff, 174, 564
Iterated integral, 237
Hausdorff dimension, 19.11
Hausdorff measure, 247
Heine-Borel theorem, A13, A17
Jacobian, 229, 253, 266
Helly selection theorem, 345, 392
Jensen's inequality, 75, 5.7, 5.8, 283, 21.7, 470
Hewitt-Savage zero-one law, 304
Hilbert space, 21.27, 34.13, 34.14, 35.18
Jordan, 3.15
Hitting time, 134, 541, 37.9
Hölder's inequality, 75, 5.9, 5.11, 283, 34.9
Jordan decomposition, 442
Hypothesis testing, 148
Kac, 413, 610
Inadequacy of ℛ^T, 517, 36.10
Inclusion-exclusion formula, 22, 159
Indefinite integral, 419
Independent array, 51
Kahane, 22.12
k-dimensional Borel set, 155
k-dimensional Lebesgue measure, 171, 178, 20.4
Khinchine, 1.10, 27.17
Khinchine-Pollaczek formula, 332, 333
Lower semicontinuous, 29.1
Kindler, 11.12
Lower variation, 442
Kolmogorov, 28.4
L^p-space, 21.25
Kolmogorov's existence theorem, 272, 510, 523, 37.1
λ-system, 36, 3.8, 3.10
Lusin's theorem, 17.8
Kolmogorov's inequality, 296, 487
Lyapounov's condition, 371, 27.7
Kolmogorov's zero-one law, 57, 294, 22.13
Lyapounov's inequality, 76, 283
Kronecker's theorem, 338
Lyapounov's theorem, 371
Kurzweil integral, 17.13
Mac Lane, 173
Mapping theorem, 343, 391
Ladder height, 325
Marginal distribution, 266
Ladder index, 325
Landau notation, A18
Markov chain, 107, 20.8, 375, 379, 29.18, 450
Laplace distribution, 358
Laplace transform, 285, 293
Markov process, 107, 456, 537
Large deviations, 145, 149, 27.17
Markov's inequality, 74, 283, 34.9
Lattice distribution, 26.1
Law of the iterated logarithm, 151
Law of large numbers:
strong, 9, 11, 58, 80, 6.8, 127, 290, 27.20
weak, 5, 11, 58, 81, 292
Markov time, 130
Martingale, 98, 7.8, 445, 480, 487, 540, 37.17
Martingale central limit theorem, 497
Martingale convergence theorem, 490
Law of random variable, 189, 261
Maximal inequality, 296, 525
L¹-bounded, 504
Maximal solution, 119
Lebesgue, 15.5, 213
μ-continuity set, 344, 390
Lebesgue decomposition, 435, 31.20, 446, 32.7
m-dependent, 6.11, 376
Mean value, 70, 26.17
Lebesgue function, 31.3
Measurable function, 183
Lebesgue integrable, 224
Measurable mapping, 182
Lebesgue integral, 224, 225, 229
Measurable process, 529, 37.6, 37.7
Lebesgue measure, 24, 33, 40, 166, 171, 178, 17.17, 241, 20.4
Lebesgue's density theorem, 31.9, 35.15
Lebesgue set, 40, 3.14, 433
Measurable random variable, 64
Measurable rectangle, 234
Measurable with respect to a σ-field, 64, 260
Measurable set, 18, 34, 163
Leibniz's formula, 17.10
Measurable space, 158
Lévy distance, 14.9, 25.4
Measure, 20, 157
Likelihood ratio, 483, 495
Measure space, 21, 158
Limit inferior, 46, 4.2
Meets, A3
Limit of sets, 47
Method of moments, 405, 30.5, 30.6
Limit superior, 46, 4.2
Minimal solution, 8.13
Lindeberg condition, 369, 27.7
Minimum-variance estimation, 475
Lindeberg-Lévy theorem, 366
Minkowski's inequality, 5.10, 5.11, 21.24
Lindeberg theorem, 368, 30.1
Mixing, 375, 29.17, 34.16
Linear Borel set, 155
μ*-measurable, 163
Linearity of expected value, 71
Moment, 71, 281
Linearity of the integral, 209
Moment generating function, 1.7, 142, 285, 292, 408
Linearly independent reals, 14.11, 30.8
Lipschitz condition, 31.17
Monotone, 22, 159, 162, 209
log-normal distribution, 407
Monotone class, 39, 3.10
Lower integral, 207
Monotone class theorem, 39
Monotone convergence theorem, 211, 214, 34.8
Pascal distribution, 199
Path function, 320, 518, 525
Monotonicity of the integral, 209
Payoff function, 131
M-test, A28, 213
Perfect set, A15
Multidimensional central limit theorem, 398, 29.18
Period, 122, 8.27
Permutation, 67, 6.3, 371
Multidimensional characteristic function, 395
Persistent, 114, 117, 118
Multidimensional distribution, 265
Phase space, 108
Multidimensional normal distribution, 397, 29.5, 29.10, 33.8
Pinsky, 27.17
Multinomial sampling, 29.14
π-λ theorem, 36, 3.9
P*-measurable, 34
Poincaré, 29.15
Point, 16
Negative part, 203, 259
Point of increase, 12.9, 20.16
Negligible set, 8, 1.4, 1.12, 41
Poisson approximation, 312, 336
Newton, 17.11
Nonatomic, 2.17
Nondegenerate distribution function, 195
Nondenumerable probability, 552
Poisson distribution, 262, 20.18, 286, 310, 23.13, 27.1, 27.3, 387, 28.12
Poisson process, 307, 449, 457, 33.7, 34.12, 36.12, 36.14, 560
Nonmeasurable set, 41, 4.13, 12.4
Poisson's theorem, 6.5
Nonnegative series, A25
Policy, gambling, 96
Normal distribution, 263, 264, 273, 20.19, 20.22, 281, 286, 354, 358, 366, 28.13, 28.14, 397
Pólya's criterion, 26.3
Pólya's theorem, 115, 118
Positive part, 203, 259
Normal number, 8, 1.9, 1.11, 3.15, 81, 6.13
Positive persistent, 127
Normal number theorem, 9, 6.9
Power series, A29
Nowhere dense, A15
Prekopa's theorem, 314
Nowhere differentiable, 31.18, 532
Prime number theorem, 5.17
Null persistent, 127
Primitive, 227, 419
Number theory, 412
Probability measure, 20
Probability measure space, 21
Probability transformation, 14.4
Open set, A11
Product measure, 12.13, 234, 236
Optimal policy, 99
Product space, 234, 508, 37.2, 562
Optimal stopping, 130
Progress and pinch, 7.7
Optimal strategy, 134
Proper difference, A1
Optional stopping, 495, 35.20
Proper subset, A1
Order of dyadic interval, 4
π-system, 36, 3.9
Order statistic, 14.7
Orthogonal transformation, 173, 398
Outer boundary, 58, 85
Queueing model, 121, 124, 322
Outer measure, 33, 3.2, 162
Queue size, 333
Pairwise additive, 11.6
Rademacher functions, 5, 1.8, 63, 298, 301
Pairwise disjoint, A3
Rado, 19.8
Partial-fraction expansion, 20.20
Radon-Nikodym derivative, 443, 445, 32.6, 32.13, 482, 494, 35.13
Partial information, 52
Partial summation, 18.19
Radon-Nikodym theorem, 443, 32.4
Partition, A3, 422, 450
Random Taylor series, 301
Random variable, 63, 183, 259
Sequence space, 2.16, 4.18, 8.4, 36.6
Random vector, 184, 260
Service time, 322
Random walk, 109, 325
Set, A1
Rank of dyadic interval, 4
Set function, 20, 440, 32.11
Ranks and records, 20.11
σ-field, 17, 19
Rate of Poisson process, 310
σ-field generated by a class of sets, 19
Rational rectangle, 155
σ-field generated by a random variable, 64, 260
Realization of process, 518
Record values, 20.12, 21.4, 27.8
σ-field generated by a transformation, 13.5
Rectangle, A16, 155
σ-finite on a class, 158
Rectifiable curve, 19.3, 31.11
σ-finite measure, 157, 10.7
Recurrent event, 8.19
Shannon's theorem, 6.14, 8.30
Red-and-black, 88
Signed measure, 32.12
Regularity, 41, 174
Simple function, 185
Regular partition, 178
Simple random variable, 63, 185, 260
Relative measure, 25.17
Singleton, A1
Renewal theory, 8.19, 321
Singular function, 427, 431, 31.1
Renyi-Lamperti lemma, 6.15
Singularity, 442, 494
Reversed martingale, 492
Singular part, 446
Riemann integral, 2, 25, 202, 224, 17.6, 25.14
Skorohod embedding, 339, 545
Riemann-Lebesgue lemma, 354
Skorohod's theorem, 342, 392, 399
Right continuity, 175, 261
Source, 6.14, 8.30
Rigid transformation, 173
Southwest, 176
Ring, 11.5
Space, A1, 16
Royden, 1.13
Square-free integers, 4.21
Rudin, 229, 26.19, 421, 435
σ-ring, 11.5
Stable law, 389, 28.15
Standard normal distribution, 263
Saks, 610
State space, 108
Saltus, 189, 309
Stationary distribution, 121
Sample function, 518
Stationary increments, 523, 37.3
Sample path, 320, 518
Stationary probabilities, 121
Sample point, 16
Stationary process, 36.2
Sampling theory, 410
Stationary sequence of random variables, 375
Samuels, 34.19
Scheffé's theorem, 218, 16.15
Stationary transition probabilities, 108
Schwarz's inequality, 75, 5.6, 283
Statistical inference, 91, 148
Second Borel-Cantelli lemma, 55, 4.14, 83
Steiner symmetrization, 251
Second category, A15
Stieltjes integral, 230
Second-order Markov chain, 8.31
Stirling's formula, 5.17, 18.21, 27.18
Secretary problem, 111
Stochastic arithmetic, 2.15, 4.21, 5.16, 5.17, 6.17, 18.18, 25.15, 412, 30.10, 30.11, 30.12
Section, 235
Selection problem, 110, 135
Selection system, 91, 93, 7.3
Stochastic matrix, 109
Semicontinuous, 29.1
Stochastic process, 309, 319, 506
Semiring, 164
Separable function, 552
Separable process, 553, 558
Separable σ-field, 2.10
Separant, 553
Stopping time, 96, 130, 486, 534, 536, 539
Strong law of large numbers, 9, 11, 58, 80, 6.8, 127, 290, 27.20
Strong Markov property, 535, 537, 37.10
Strong semiring, 11.6
Subadditivity:
countable, 23, 159
finite, 22, 159
Subfair game, 88, 99
Subfield, 52, 260
Submartingale, 484, 35.6
Subset, A1
Sub-σ-field, 52, 260
Subsolution, 8.6
Substochastic matrix, 118
Sufficient σ-field, 471, 34.20
Sufficient statistic, 472
Superharmonic function, 131
Supermartingale, 485
Support line, A33
Support of measure, 21, 158, 12.9, 261, 430, 442
Surface area, 247, 19.8, 19.13
Symmetric difference, A1
Symmetric random walk, 109, 115, 135, 35.10
Symmetric stable law, 28.15
System, 108
Tail event, 57, 295, 304
Tail σ-field, 57, 295
Taylor series, A29, 301
Thin cylinders, 36.6
Three series theorem, 299
Tightness, 346, 392, 29.3
Timid play, 105
Tippett, 199
Tonelli's theorem, 238
Total variation, 442
Trace, 19.3
Traffic intensity, 325
Trajectory of process, 518
Transformation of measures, 186
Transient, 114, 117
Transition probabilities, 108, 111
Translation invariance, 172
Triangular array, 368
Triangular distribution, 358
Trifling set, 1.4, 1.12, 3.15
Type, 195
Unbiased estimate, 475
Uncorrelated random variables, 21.15
Uniform distribution, 263, 358, 27.21, 29.9
Uniform distribution modulo 1, 337, 362
Uniform integrability, 219, 21.21, 347
Uniformly asymptotically negligible array, 374
Uniformly bounded, 72, 6.8, 97, 504
Uniformly equicontinuous, 26.15
Uniqueness of extension, 36, 38, 160
Uniqueness theorem for characteristic functions, 355, 26.19, 396
Uniqueness theorem for moment generating functions, 143, 294, 26.7, 408, 30.3
Unitary transformation, 173
Unit interval, 45, 68, 271, 515, 37.1
Unit mass, 22, 194
Univalent, 252
Upcrossing, 488
Upper integral, 207
Upper semicontinuous, 13.11, 29.1
Upper variation, 442
Utility function, 7.10
Vague convergence, 382, 28.1, 28.2
Value of payoff function, 131
van der Waerden, 31.18
Variance, 73, 282, 285
Variation, 435, 442
Version of conditional expected value, 466
Version of conditional probability, 449, 451
Vieta's formula, 1.8
Vitali covering theorem, 250
Waiting time, 323
Wald's equation, 22.9
Wallis's formula, 18.20
Weak convergence, 192, 335, 390
Weak law of large numbers, 5, 11, 58, 81, 292
Weierstrass approximation theorem, 82, 26.19
Weierstrass M-test, A28, 213
Wiener, 523
Wiener process, 522
With probability 1, 54
Wright, 1.10, 413
Yushkevich, 110
Zeroes of Brownian motion, 534, 37.13
Zero-one law, 57, 294, 304, 22.13, 37.4
Zorn's lemma, A8