Davar Khoshnevisan
Graduate Studies in Mathematics Volume 80
American Mathematical Society
Probability
Probability...
410 downloads
2603 Views
4MB Size
Report
This content was uploaded by our users and we assume good faith they have the permission to share this book. If you own the copyright to this book and it is wrongfully on our website, we offer a simple DMCA procedure to remove your content from our site. Start by pressing the button below!
Report copyright / DMCA form
Davar Khoshnevisan
Graduate Studies in Mathematics Volume 80
American Mathematical Society
Probability
Probability Davar Khoshnevisan
Graduate Studies in Mathematics Volume 80
American Mathematical Society Providence, Rhode Island
Editorial Board David Cox Walter Craig Nikolai Ivanov Steven G. Krantz David Saltman (Chair) 2000 Mathematics Subject Classification. Primary 60- 01; Secondary 60-03, 28-01, 28-03.
For additional information and updates on this book, visit
www.ams.org/bookpages/gsm-80
Library of Congress Cataloging-in-Publication Data Khoshnevisan, Davar. Probability / Davar Khoshnevisan.
p. cm. - (Graduate studies in mathematics, ISSN 1065-7339 ; v. 80) Includes bibliographical references and index. ISBN-13: 978-0-8218-4215-7 (alk. paper) ISBN-10: 0-8218-4215-3 (alk. paper) 1. Probabilities. 1. Title. QA273.K488
2007 2006052603
Copying and reprinting. Individual readers of this publication, and nonprofit libraries acting for them, are permitted to make fair use of the material, such as to copy a chapter for use in teaching or research. Permission is granted to quote brief passages from this publication in reviews, provided the customary acknowledgment of the source is given. Republication, systematic copying, or multiple reproduction of any material in this publication is permitted only under license from the American Mathematical Society. Requests for such permission should be addressed to the Acquisitions Department, American Mathematical Society, 201 Charles Street, Providence, Rhode Island 02904-2294, USA. Requests can also he made by e-mail to reprint-peraissionaams.org. © 2007 by the American Mathematical Society. All rights reserved. The American Mathematical Society retains all rights except those granted to the United States Government. Printed in the United States of America. ® The paper used in this book is acid-free and falls within the guidelines established to ensure permanence and durability.
Visit the AMS home page at http://ww.am8.org/ 1211 10090807 10987654321
To my family
Contents Preface ...................................
xi
General Notation ..... ......................... xv Chapter 1.
Classical Probability .....................
1
§3.
Discrete Probability ........................ .. .. ...... . .. Conditional Probability . .... . Independence ........... . ....... ...... .
§4.
Discrete Distributions ... . .... .
§5.
Absolutely Continuous Distributions ..
§6.
Expectation and Variance ..................... 12
§1. §2.
.
.
.
1
.
4
.
6
.. .. ... . ..... . .
.. .. .. .. .
.
.
6
.
10
. .... ... ......... Notes ................................... 15 . .... .. . ..... ....... Chapter 2. Bernoulli Trials Problems .... .... .
.
. .... . .
13
.
17
.
The Classical Theorems ...................... 18 . ..... . . .... .. 21 Problems .... ...... .. §1.
.
.
.
..
.
.
.
.
.
.
Notes ................................... 22 Chapter 3. Measure Theory
.
..
.
.
. ... .
.
.
.
.
..
.
.
.
.
.
25
.
.
28
§2.
. ..... . .... .. .. ..... . Lebesgue Measure .... . . ... .
§3.
Completion .
§1.
Measure Spaces .... .. ..... .. .. .
.
.
. .... .
.
.
.
.
. .... . ... .
Proof of Caratheodory's Theorem .. .. . . . Problems . §4.
. .... .
.
.
.
.
.
.
..
. ...... ....
. ..... ... . . .... .
.
.
23
.
..
.
23
30 33
vii
Contents
viii
Notes ................................... 34
. .... . . .... . ... . .. .. .. 35 Measurable Functions ....................... 35 The Abstract Integral ..... .. . ... . ... ... .. .. 37 . ... 39 LP-Spaces ..... .. .... ... . .. ..... . Modes of Convergence .... ... . .. ....... . . ... Limit Theorems . ......... .. ....... . . ..... 45
Chapter 4. Integration §1. §2. §3.
§4. §5. §6.
.
.
.
.
.
.
.
.
4:3
.
The Radon-Nikodym Theorem .. ......... .. ..
.
.
.
47
Problems ........ ... . ... .. .......... . ..... 49
Notes ................................... 52 Chapter 5. Product Spaces ........................ 53 Finite Products . ....... .. ........... ..... §1.
53
§2.
Infinite Products ....... . ........... . . ..... 58
§3.
Complement: Proof of Kolmogorov's Extension Theorem
. .
.
60
Problems ................................. 62 Notes ................................... 64 . ...... .. 65 Chapter 6. Independence ..... . .. .. .. .
.
.
.
§1.
Random Variables and Distributions .... . .... . .... .
65
§2.
Independent Random Variables .. ...... ....... ..
67
§3.
An Instructive Example ..
§4. §5.
Khintchine's Weak Law of Large Numbers .... .. ..... 71 Kolmogorov's Strong Law of Large Numbers ..... ..... 73
§6.
Applications
.
.
. ............ . .....
71
.. ...... . ............. ..... Problems .... . .. ..... . . ........... .. ..... Notes ................................... 89 Chapter 7.
.
77
.
84
. ... .. .. .. .... . . ... . .... . ...... .
The Central Limit Theorem
.
91
.
91
§1.
Weak Convergence ..... .
§2.
Weak Convergence and Compact-Support Functions ..
§3.
Harmonic Analysis in Dimension One .... .. ....... .
.
. ...
94
96
.. .. ...... .. . .... ... 97 . ....... . . ... 100 .. ......... ...... ... 101 Problems ........ ....... .. .......... . .. ... 111 Notes .......... ....... .. .. ....... . . ... 117 The Plancherel Theorem . §5. The 1-D Central Limit Theorem §6. Complements to the CLT . §4.
.
.
.
.
.
.
Contents
ix
Martingales .. ............ ..... . ...... 119
Chapter 8. §1. Conditional Expectations §3.
. .... . ... ........... . Filtrations and Semi-Martingales ................. 126 Stopping Times and Optional Stopping . ..... ..... .
§4.
Applications to Random Walks .................. 131
§5.
Inequalities and Convergence .... .
§2.
119 129
.
. ...... . . .... . 134 §6. Further Applications ........................ 136 Problems ....................... . . ... . .... Notes
151
........... ....... .. ....... .. ......
Chapter 9.
157
Brownian Motion ...................... 159
§2.
Gaussian Processes ..... .... ... ...... . .... . 160 Wiener's Construction: Brownian Motion on [0.1) . . ..... 165
§3.
Nowhere-Differentiability ..................... 168
§4.
The Brownian Filtration and Stopping Times .... . .... .
§5.
The Strong Markov Property .. .. ... ...... . ..... 17.3
§6.
The Reflection Principle ... .... ....... .. ...... 175
§1.
170
Problems ................................. 176
. ....... . ...... .. ......... ..... . . .... . .... . Chapter 10. Terminus: Stochastic Integration .. ... .. The Indefinite Ito Integral ..... ... .. Notes ..
.
.
.
§1.
.
.
.
180 181 181
§2.
Continuous Martingales in LZ(P) .. ... ...... ...... 187
§3.
The Definite Ito Integral ......... ............ 189
§4.
Quadratic Variation ... ..... .... ..
§5.
Ito's Formula and Two Applications ... .. ....... ..
. ...... ... .
192 193
Problems ................... . ............ 199 Notes ..... .. .... ... .... . .... . . ....... ... 201 Appendix ........................ .......... 203
Hilbert Spaces ........................... 203 . 205 §2. Fourier Series .................. ....... . Bibliography ...................... .......... 209 §1.
.
Index
........ .......... ...... . . .... .
.
.
..
.
217
Preface
Say what you know, do what you must, come what may. -Sofya Kovalevskaya
To us probability is the very guide of life.
-Bishop Joseph Butler
A few years ago the University of Utah switched from the quarter system to the semester system. This change gave the faculty a chance to re-evaluate their course offerings. As part of this re-evaluation process we decided to replace the usual year-long graduate course in probability theory with one
that was a semester long. There was good reason to do so. The role of probability in mathematics, science, and engineering was, and still is, on the rise. There is increasing demand for a graduate course in probability. And yet the typical graduate student is not able to tackle a large number of year-long courses outside his or her own research area. Thus, we were presented with a non-trivial challenge: Can we offer a course that addresses
the needs of our own, as well as other, graduate students, all within the temporal confines of one semester? I believe that the answer to the preceding question is "yes." This book presents a cohesive graduate course in measure-theoretic prob-
ability that specifically has the one-semester student in mind. There is, in fact, ample material to cover an ordinary year-long course at a more leisurely
pace. See, for example, the many sections that are entitled complements, and those that are called applications. However, the primary goals of this book are to maintain brevity and conciseness, and to introduce probability quickly and at a modestly deep level. I have used as my model a standard one-semester undergraduate course in probability. In that setting, the xi
xii
Preface
instructional issues are well understood, and most experts agree on what should be taught. Giving a one-semester introduction to graduate probability necessarily involves making concessions. Mine form the contents of this book: No mention is made of Kolmogorov's theory of random series; Levy's continuity theorem of characteristic functions is sadly omitted; Markov chains are not treated at all; and the construction of Brownian motion is Fourier-analytic rather than "probabilistic." That is not to say that there is little coverage of the theory of stochastic processes. For example, included you will find an introduction to Ito's stochastic calculus and its connections to elliptic partial differential equations. This topic may seem ambitious, and it probably is for some readers. However, my experience in teaching this material has been that the reader who knows some measure theory can cover the book up to and including the last chapter in a single semester. Those who wish to learn measure theory from this book would probably aim to cover less stochastic processes.
Teaching Recommendations. In my own lectures I often begin with Chapter 2 and prove the De Moivre-Laplace central limit theorem in detail. Then, I spend two or three weeks going over basic results in analysis [Chap-
ters 3 through 5]. Only a handful of the said results are actually proved. Without exception, one of them is Caratheodory's monotone class theorem (p. 30). The fundamental notion of independence is introduced, and a number of important examples are worked out. Among them are the weak and the strong laws of large numbers [Chapter 6], respectively due to A. Ya. Khintchine and A. N. Kolmogorov. Next follow elements of harmonic analysis and the central limit theorem [Chapter 7]. A majority of the subsequent lectures concern J. L. Doob's theory of martingales (1940) and its various applications [Chapter 8]. After martingales, there may be enough time left to introduce Brownian motion [Chapter 9], construct stochastic integrals, and deduce a striking computation, due to Chung (1947), of the distribu-
tion of the exit time from [-1, 1] of Brownian motion (p. 197). If at all possible, the latter topic should not be missed. My personal teaching philosophy is to showcase the big ideas of probability by deriving very few, but central, theorems. Fran4ois Marie Arouet [Voltaire] once wrote that "the art of being a bore is to tell everything." Viewed in this light, a chief aim of this book is to not bore. I would like to leave the reader with one piece of advice on how to best use this book. Read it thoughtfully, and with pen and paper.
Preface
xiii
Acknowledgements: This book is based on the combined contents of several of my previous graduate courses in probability theory. Many of these were given at the University of Utah during the past decade or so. Also, I have used parts of some lectures that I gave during the formative stages of my career at MIT and the University of Washington. I wish to thank all three institutions for their hospitality and support, and the National Science Foundation, the National Security Agency, and the North-Atlantic Treaty Organization for their financial support of my research over the years. All scholars know about the merits of library research. Nevertheless, the role of this lore is underplayed in some academic texts. I, for one, found the following to be enlightening: Billingsley (1995), Breiman (1992), Chow and Teicher (1997), Chung (1974), Cramer (1936), Dudley (2002), Durrett (1996), Fristedt and Gray (1997), Gnedenko (1967), Karlin and Taylor (1975, 1981), Kolmogorov (1933, 1950), Krickeberg (1963, 1965), Lange (2003), Pollard (2002), Resnick (1999), Stroock (1993), Varadhan (2001), Williams
(1991), and Woodroofe (1975). Without doubt, there are other excellent references. The student is encouraged to consult other resources in addition to the present text. He or she would do well to remember that it may be nice to know facts, but it is vitally important to have a perspective. I am grateful to the following for their various contributions to the development of this book: Nelson Beebe, Robert Brooks, Pieter Bowman, Rex Butler, Edward Dunne, Stewart Ethier, Victor Gabrenas, Flank Gao, Ana
Meda Guardiola, Jan Hannig, Henryk Hecht, Lajos Horvath, Zsuzsanna Horvath, Adam Keenan, Karim Khader, Remigijus Leipus, An Le, David Levin, Michael Purcell, Pejman Mahboubi, Pedro Mendez, Jim Pitman, Natalya Pluzhnikov, Matthew Reimherr, Shang-Yuan Shiu, Josef Steinebach, James Turner, John Walsh, Jun Zhang, and Liang Zhang. Many of these people have helped find typographical errors, and even a few serious mistakes. All errors that remain are of course mine.
My family has been a stalwart pillar of patience. Their kindness and love were indispensable in completing this project. I thank them deeply. And last but certainly not the least, my eternal gratitude is extended to my teachers, past and present, for introducing me to the joys of mathematics. I hope only that some of their ingenuity and spirit persists throughout these pages.
Davar Khoshnevisan Salt Lake City, January 2007
General Notation
Here we set forth some of the general notation that is consistently used in the entire book. This is standard mathematical notation, and we may refer to it without further mention.
Logic and Set Theory. Throughout, U and fl respectively denote union and intersection, and C the subset relation. Occasionally we may write "a :_ b," where a and b could be sets, numbers, logical expressions, functions, etc. Depending on the context, this may mean either "define a to be b," or "define b to be a." We will not make a distinction between the two.
If A, B C X, then Ac denotes the complement of A [in X). The dependence on X is usually suppressed as it is clear from the context. Let A \ B := A fl B`, and A L B :_ (A \ B) U (B \ A). The latter is called the set difference of A and B. A set is denumerable if it is either countable or finite. We frequently write "iff " as short-hand for "if and only if." Finally, "Y' and "s" respectively stand for "for all" and "there exists."
Euclidean Spaces. Throughout, R = (-oo, oo) denotes the real line, Z = 10,±l,±2 .... } the integers, N = 11 , 2 .... } the natural numbers, and Q denotes the rationale. If X designates any one of these, then X+ denotes the non-negative elements of X, and X_ denotes the non-positive elements. If k E N then Xk denotes the collection of all k-tuples (x1, ... , xk) such that xl, ... , xk are in X. For instance, R. denotes the collection of all k-vectors that are non-negative coordinatewise. The complex plane is denoted by C.
xv
xvi
General Notation
If x, y E R, then we write x A y = min(x, y) for the minimum and x V y = max(x, y) for the maximum. Similarly, sup and inf respectively refer to supremum and infimum operations. Functions. If X and Y are two sets, then "f : X -' Y" stands for "f maps X into Y," and "x -+ f (x)" refers to the map from x to f (x). If f : X -+ Y and A C Y, then f (A) = {x E X : f (x) E A}. This is the inverse image of A.
The Big-O/Little-o Notation. Suppose al, a2.... , bl, b2, ... E R. We say 1. When the bi's are also that an - bn [as n -y oo] when non-negative, "limsupn_,,. lanl/bn < oo" is often written as "an = O(b")," Note that an = O(bn) if there and "limn-oo lanl/bn = 0" as "an = exists a constant C such that lanl < Cbn for all n > 1. The big-O/little-o notation is also applicable to functions: "f (x) - g(x) 0 we may a" means "limx-a(f (x)/g(x)) = 1"; and when g as x write "f (x) = O(g(x)) as x -' a" for "lim supx-a If (x)I = O(g(x))," and "f (x) = o(g(x)) as x a" in place of "limy.a(f (x)/g(x)) = 0."
Chapter 1
Classical Probability
Probability does not exist.
-Bruno de Finetti
How dare we speak of the laws of chance? Is not chance the antithesis of all law? -Bertrand Russell
The original development of probability theory took place during the seventeenth through nineteenth centuries. In those times the subject was mainly concerned with games of chance. Since then an increasing number of scientific applications, many in mathematics itself, have spurred the development of probability in other directions. Nonetheless, classical probability remains a most natural place to start the subject. In this way we can work
non-axiomatically and loosely in order to grasp a number of useful ideas without having to develop abstract machinery. That will be covered in later chapters.
1. Discrete Probability Consider a game that can lead to a fixed denumerable (i.e., at most countable) set {wi }°__1 of possible outcomes. Suppose in addition that the outcome w j occurs with probability p j (j = 1, 2, ... ), where we agree once and for all
that probabilities are real numbers in (0, 1], and that the w,'s really are the only possible outcomes; i.e., that Ei=1 pj = 1. 1
1. Classical Probability
2
Definition 1.1. We call Q:= {wj}j=1 the sample space of our experiment. Subsets of 1 are called events, and the probability of an event A is
P(A) :_ E pj. j>1: WJEA
One can check directly that P has the following properties:
Proposition 1.2. P(S2) = 1, P(Q) = 0, and (1) P(A1 U A2) = P(A1) + P(A2) - P(A1 n A2).
(2) P(Ac) = 1 - P(A). (3) If the Ai's are pairurise disjoint, then P(U°__1Ai) (4) If Al C A2, then P(A1) < P(A2).
P(Ai).
(5) (Boole's Inequality) P(U901Ai) < E 'l P(Ai). (6) (Inclusion-Exclusion Formula)
P
U Ai/ _ > P(Ai) t=1
i=1
EE P (Ai, n Ail ) 1 0, then
the corresponding xj is a possible value of X. Definition 1.18. The mass function of a discrete random variable X is the function p : R -+ [0, 1] defined by (1.16)
p(x) = P{X = x}
ex E R.
The mass function is a convenient way to code the entire distribution of X. If the distinct possible values of X are {xi}°_°1 and the corresponding probabilities are {pi }°1, then p(xi) =pi, whereas p(x) = 0 for other values of X.
One can always construct a random variable with a pre-described distribution (Problem 1.4). However, when we discuss a random variable X, we rarely need to specify the underlying probability space. Usually, all that matters is the distribution of X. Next we present three examples to illustrate this remark. The exercises contain a few more examples.
1. Classical Probability
8
4.1. The Binomial Distribution. Consider a random experiment that has two possible outcomes only: Success versus failure (or heads vs. tails, female vs. male, smoker vs. non-smoker, etc.). The probability of success is denoted by p E (0,1). We perform this experiment n times independently. The latter means that the outcomes of all the trials are independent from one another. Such experiments are called Bernoulli trials with parameter p. Let X denote the total number of successes. Then we say that X has the binomial distribution with parameters n and p, and X is sometimes also written as Bin(n, p). For instance, suppose the proportion of women in a certain population
is p. We sample n people at random and with replacement. Then, the total number of women in the sample has the binomial distribution with parameters n and p. Quite generally, if X has the binomial distribution with parameters n and p, then the possible values of X are 0, ... , n. It remains to find the mass function; i.e., p(k) = P{X = k} for k = 0, . . . , n. This is the probability of getting exactly k successes, and also n - k failures. Consider first the problem of finding the probability that the first k trials lead to successes, and the remaining n-k trials to failures. By independence, this probability is pk(1 - p)n-k. Now if we fix any k of the n trials, then the probability that those k succeed and all remaining trials fail is the same; i.e., pk(1 - p)n-k. The union of these events is the event {X = k}. Because the latter is a disjoint union, p(k) = Npk(1 - p)n-k where N is the number of ways to choose k spots for the successes among all n trials. Therefore, (1.17)
p(k) =
(n)pk(lp)n_k
dk = 0, ... , n.
This notation tacitly implies that P{X = k} = 0 for all values of k other than k = 0, ... , n. From now on, we adopt this way of writing a probability mass function.
Binomials have the following interpretation: Let X denote the total number of successes in n independent Bernoulli trials with success proba-
bility p E [0, 1J each. Let Ij := 1 if the jth trial leads to a success, and Ij := 0 otherwise (1 < j < n). Then, X = Il + + In. Among other things this means that Bin(n, p) is a sum of n independent, identically distributed random variables.
4.2. The Geometric Distribution. Consider independent Bernoulli trials, where each trial has the same success probability p E (0, 1). We perform these trials until the first success appears. Let X denote the number of trials required. The distribution of X is the so-called geometric distribution with
4. Discrete Distributions
9
parameter p, and X is sometimes written as Geom(p). Its mass function is (1.18)
dx = 1,2,....
p(x) = (1 - p)x-lp
4.3. The Poisson Distribution. Suppose A > 0 is fixed. Then a random variable X is said to have the Poisson distribution with parameter .A if the mass function of X is C-aA x
(1.19)
p(x) =
X!
vx = 0, 1, 2,...
Sometimes we write Poiss(A) for such X.
The Poisson distribution plays a natural role in a number of approximation theorems. Next is an instance in random permutations. We will see another example in Chapter 2; see the law of rare events (p. 19).
Example 1.19 (Example 1.8, Continued). Let X denote the total number of fixed points in a random permutation 1r of { 1, ... , n}. In Example 1.8 we found that if n is large, then P{X = 0} .:s e-1. Now we go one step further
and compute the mass function of X. Define FO = Go = S2, and for all non-empty A C {0, ... , n} let (1.20)
FA := n {ir(j) = j} and GA := n {ir(j) # j}. jEA
j1ZA
In words, FA denotes the event that all elements of A are fixed points, and GA the event that there are no fixed points in A'. If k is an integer between 0 and n, then we can write (1.21)
P{X = k}
P(FA) {1 - P (GA I FA)}
.
AC{1,...,nl: IAI=k
We observed in Example 1.8 that (1.22)
P(FA) = (n - IAI )! n!
To find the remaining conditional probability we first note that the conditioning reduces the sample space to all permutations for which 7r(i) = i for all i E A. Consequently, P(GA I FA) is the probability that there is at least objects [one for one fixed point in a random permutation of n - Idistinct Al each j ¢ A]. Equation (1.8) tells us that n-IAI (1.23)
P(GA I FA) i=1
1. Classical Probability
10
In summary, for all k = 0, ... , n,
n
P{X = k} _ AC {1,...,n}:
(1.24)
(i(_i) \ n-k
n. k)
ti-1
i+1
2.
IAI=k
1 n-: -1'
k!.0
i!
oo to find that P{X = k} e-1/k!. It follows that, when n is Let n large, the distribution of X is approximately that of a Poiss(1). Next we present another example of Poisson distributions. This example highlights some of the deep connections between Poisson and Binomial random variables.
Example 1.20 (Poissonization). Choose and fix p E [0, 1] and .\ > 0, and let N, Il, 12i ... be independent random variables such that: (i) P{I3 = 1} = 1- P{I; = 0} = p for all j > 1; and (ii) N = Poiss(A). We have seen already that S,, E', Ii = Bin(n, p) for all n > 1. Now consider the randomized sum SN. Its distribution can be computed as follows: For all k > 0,
P{SN=k}P{Sn=k}P{N=n} (1.25)
n=k e--Ip(Ap)k
k!
(Check!) This proves that Bin(N,p) = I1 +
+ IN = Poiss(Ap).
5. Absolutely Continuous Distributions It is not difficult to imagine random experiments that lead to numerical outcomes which can, in principle, take an arbitrary positive real value. One would like to call such outcomes absolutely continuous, or merely continuous, random variables. For example, the average weight of 100 randomly-selected individuals is probably best modeled by an absolutely continuous random variable.
Definition 1.21. A function f on R is a density function if it is nonnegative, Riemann-integrable, and f!. f (x) dx = 1. Define f := {x E R : f (x) > 0} and P(A) := fA f (x) dx for all sets A C S1 for which the Riemann integral exists. One can then check that
Proposition 1.2 continues to hold for this choice of P and Q. Similarly, one can describe conditional probabilities for this P.
5. Absolutely Continuous Distributions
11
Definition 1.22. An absolutely continuous random variable X with density function f is a real-valued function on ci that satisfies P({w E S2 : X (w) E A}) =
(1.26)
JA
f (x) dx.
This is valid for all sets A for which the integral is defined. The displayed probability is usually written as P{X E A}.
Frequently, one assumes further that f is piecewise continuous. This means that we can find points {ai}°°1 such that for all is (a) ai < ai+1; and (b) f is continuous on (ai , aj+1). Next we describe two illustrative examples. A few more are included in the exercises.
5.1. The Uniform Distribution. If a < b are two fixed numbers, then the following defines a density function: (1.27)
P X) = 6
1
dx E (a, b).
a
This notation implies tacitly that f (x) = 0 for all other values of x. The function f is called the uniform density function on (a, b). If the density function of X is f, then X is said to be distributed uniformly on (a, b). We might refer to X as Unif(a, b). Clearly, X = Unif(a, b) if and only if P{X E A} is proportional to the length of A.
5.2. The Normal Distribution. If p E R and a > 0 are fixed, then X is normally distributed with parameters p and a2 if its density function is / _ 1 (1.28) exp I - (x f (x) = ) vx E R. 2a2)2
Frequently, the symbol N(p,a2) `denotes a normal random variable with parameters p and a2, and N(0,1) is called a standard normal random variable. We define N(p, 0) to be the non-random quantity p. This represents the degenerate case. To complete this discussion we need to verify that S' := f f'. f (x) dx is
equal to one. A change of variables [y = (x - p)/a] reduces the problem to the standard normal case. We then appeal to a trick of J. Liouville, and compute r22 in polar coordinates, viz., (1.29)
f
92 = Jr00 oo
foo e-(x2+'2)/2
J
27r
2rr
dxdy =
foo a-r2/2
J J
27r
r dr dO = 1.
1. Classical Probability
12
6. Expectation and Variance Suppose that X is a discrete random variable with mass function p. Its expectation (or mean), when defined, is
EX :=
(1.30)
xp(x). zER
That is, EX is the weighted average of the possible values of X where the weights are the respective probabilities. If X denotes the amount of money that is to be gained/lost in a certain game of chance, then EX is a natural predictor of the as-yet-unrealized value of X.
Similarly, if X has density function f, then the expectation of X is defined as
EX :=
(1.31)
00
fxf(x)dx,
provided that the integral is well defined. In this regard see Problem 1.13. It is not too difficult to check that for any reasonable function g : R -+ R, (1.32)
F,.ERg(x)p(x)
f.
Eg(X) _
g(x) f (x) dx
if X is discrete, if X is absolutely continuous.
In particular, we do not need to work out the distribution of g(X) before we compute its expectation. Define the variance of X by VarX := E [(X - µ)2] , (1.33) where µ := EX denotes the mean of X. Then, (1.34)
VarX =
2xER(x - µ)2p(x)
If
if X is discrete. if X is absolutely continuous.
!0. (x - µ)2f (x) dx The following is an equivalent formulation: (1.35)
VarX = E[X2] - (EX)2.
The square-root of the variance is the so-called standard deviation SD(X) of X. It gauges the best bet for the "distance" between the as-yet-unrealized value X and its predictor EX.
Example 1.23. If X = Bin(n,p) then EX = np and VarX = np(1 - p).
Example 1.24. If X = Geom(1/a) then EX = a and VarX = a(l - a).
Example 1.25. If X = Unif(a , b) then EX = (b + a)/2 and VarX = (b - a)2/12.
Example 1.26. If X = N(µ , a2) then EX = µ and VarX = a2.
Problems
13
The exercises contain many more classical examples.
Problems 1.1. Prove Propositions 1.2, 1.10, as well as the law of total probability (p. 5). 1.2. Derive (1.32) carefully, and verify Examples 1.23 through 1.26.
1.3. Construct a sample space and three events A, B, and C in this space such that: (1) Any two of the three events are independent, but A, B, and C are not independent. (2) A, B, and C have strictly positive probabilites, and P(A fl B fl C) = P(A)P(B)P(C) even though A, B. and C are not independent.
1.4. Suppose {pi}j= l are non-negative numbers that add up to one. If {x,}j= l are fixed and distinct, then construct a probability space and a random variable X on this probability space such that for all integers j > 1, P{X = xi } = pj. (HINT: Example 1.16.) 1.5 (A Bonferroni Inequality). Use the inclusion-exclusion principle to deduce that for all events
Et., En, (1.36)
P I U EJ I ? >P(Ei) - FF P(E, n Ei) !=I
/
1 0, (1.37)
n
(x + y)" = k=0
k
xAy°-k
1.7 (Distribution Functions). Let X be a random variable with a density function f. For all a E R define F(a) = P(X < a). F is the distribution function of X. Prove that F'(a) = f(a) if f is continuous in a neighborhood of a.
1.8 (Hypergeometric Distribution). An urn contains r + b balls; r are red, and the other b are blue. With the exception of their colors, the balls are identical. We sample, at random and without replacement, n of the r + b balls. Let X denote the total number of red balls in the sample. Then, find the distribution of X, as well as EX and VarX. The distribution of X is called hypergeometrsc.
1.9. Compute EX and VarX, where X = Poiss(A) for some A > 0.
1.10 (Negative Binomial Distribution). Imagine that we perform independent Bernoulli trials with success-per-trial probability p E (0, 1). We do this until the kth success appears; here, k is a fixed positive integer. Let X denote the number of trials to the kth success. Compute the distribution, expectation, and variance of X. The distribution of X is called negative binomial
1.11 (Exponential Distribution). If A > 0 is fixed, then we can define f(x) = Ae-11 (x > 0). Check that this is a density function. A random variable X with density function f is said to have the exponential distribution with parameter A. Compute the mean and variance of X.
1.12 (Gamma Distribution). Given a, A > 0, define (1.38)
where r(a) := fo
f(x)
xa-te-As
vx > 0,
I(a)
x°-le-r dx is Euler's gamma function. Verify that f is a density function. If X has density function f, then it is said to have the gamma distribution with parameters (a , A). Compute the mean and variance of X in terms of I'. Verify also that r(n+ 1) = n! by first deriving the "duplication formula" I'(a + 1) = ar(a), valid for all a > 0.
1. Classical Probability
14
1.13 (Cauchy Distribution). Prove that f is a density function, where
f(x) :=
(1.39)
vx E R.
all 1x2)
If X has density function f, then it is said to have the Cauchy distribution. Prove that in this case EX is not well defined. Construct a random variable whose mean is well defined but infinite.
1.14 (Standardization). If X = N(µ,02), then prove that (X -µ)/o is standard normal. (HINT: First compute F(x) = P{(X - µ)/o < x}. Then compute F'.) 1.15 (Standard Normal Distribution). Let Z be a standard normal random variable. Compute E[Z'J, for all r > 0, using facts about the Gamma function r(t) = fo xt-Ie- dx (Problem 1.12). What happens if r is a positive integer? Also compute E[exp(tZ)] and E[exp(tZ2)] for t E R. 1.16 (Tails of the Normal Distribution). Prove that for all x > 0,
1x
(1.40)
e-:2/2 x3
<J a
a-°2/2 du < 1 x
(HINT: Integrate (1+x'2)exp(-x2/2) and (1 -3x'4)exp(-x2/2).) 1.17. Prove that if X = Unif(0,1) and a E R and b > 0 are fixed, then bX + a = Unif(a, a + b). (HINT: First compute F(x) = P{bX + a < x}. Then compute F'.) 1.18. You have n distinct keys, one of which unlocks a certain door. You select a key at random and try to unlock the door with that key. If the key works, then you are done. Else, you select another key at random and try the door again. You repeat this procedure until the door is unlocked. Let X denote the number of sampled keys needed to unlock the door. (1) Compute the mean and variance of X if the sampling is done with replacement. (2) Compute the mean and variance of X if the sampling is done without replacement. 1.19 (Hard). Prove that rif X = Bin(n,p), then (1.41)
EIC+X] L
1 Pt o
fDSr-I(s+1-P)^ds
ve>0.
For what values of a can you evaluate this integral? The following exercises explore some two-dimensional extensions of the one-dimensional theory of this chapter. The reader is encouraged to think independently about three-dimensional, or even higher-dimensional, generalizations. If X and Y are random variables, both defined on a common sample space 11, then (X, Y) is said to be a random vector. It is also known as a random variable in two dimensions. If X and Y are discrete, then (X, Y) is said to be discrete; in this case,
P(x, y) := P{X = x, Y = y}
(1.42)
is its mass function. On the other hand, suppose there exists an integrable function f of two variables such that (1.43)
P{(X,Y) E A} = f f f(x,y)dxdy, A
for all sets for which the integral can be defined. Then (X, Y) is said to be absolutely continuous and j is its density function. We say that X and Y are independent if
P{X < x,Y < y} = P{X < x}P{Y < y} vx, y E R. 1.20 (Discrete Random Vectors). Let (X, Y) denote a discrete random vector with mass function (1.44)
p, and define the respective mass functions of X and Y as px (x) = P{X = x} and py (y) = P{Y = y}. Prove that for all x, y E R, Px (x) p(z, y). Prove also p(x, z) and Py (y) that for all functions g : R2 - R, (1.45)
Eg(X,Y)=
g(x,y)P(x,y), x.YER
provided that the sum converges.
Notes
15
1.21 (Absolutely Continuous Random Vectors). Let (X, Y) denote an absolutely continuous ran-
dom vector with a continuous density function f. Prove that X and Y are both absolutely continuous random variables, and their respective densities fx and fy are defined as follows: fx (x) = f f. f (x, z) dz and fy (y) = ff. f (z, y) dz for all x, y E R. Prove also that for all bounded continuous functions g : R2 - R, oo
(1.46)
E9(X,Y) = I
00
f
m
9(x,y)f(x. y)dxdy. 00
1.22. Suppose (X, Y) is an absolutely continuous random vector with density function f. Suppose
f is continuous. Then prove that X and Y are independent if and only if we can write f as f(x,y) = fl(x)f2(y). Explore the case that f is piecewise continuous in each variable. 1.23 (Convolutions). Suppose (X, Y) is an absolutely continuous random vector with density function f. Prove that if f is piecewise continuous and X and Y are independent, then for all
zER,P{X+Y 1, (2.7)
f (n)
n! nn+2e-n Because f (1) = e, we write f (n) as a telescoping product, viz.,
(2.8)
f(j) =exp 1+E[lnf(j)-lnf(j-1)j f(n)=ell T(7:: 1) j=2
j=2
Evidently, as j -+ oo, (2.9)
lnf(j)-lnf(j-1)=1+Ij-2ln(1-
1
-12j2
The result follows from this, the summability of j--2 and (2.8).
0
2. Bernoulli Trials
20
We can now complete our proof of the de Moivre-Laplace central limit theorem.
1-p
Proof of the de Moivre-Laplace CLT. Throughout, we write q and seek to approximate the following for large values of n: P
a<Sn-np 0, (2.17)
13
>
27r
ra
J
O(x) dx >
a
Let a T oo to deduce that 3 =
1 -a21) ,27r- -
as was claimed.
During the course of our proof of the CLT, we verified that 3 = This leads to the following celebrated result of Stirling (1730):
Stirling's Formula. n! - n"+2 a-n 27r as n
O 27r.
oo.
Problems 2.1. Derive the law of rare events (p. 19).
2.2. de Moivre's formula (p. 19) has quantitative versions as well. For instance, derive the bounds (2.18)
nn+l a-n < n! < nn+i a-n+1
vn > 1.
2.3. Derive Bernoulli's law of large numbers (p. 18) from the de Moivre-Laplace central limit theorem (p. 19).
2.4. An urn contains N balls. They are all identical except for their colors: k are white and N - k are black. We take a random sample of n balls, without replacement, from this urn. Let X denote the number of white balls in the sample. Suppose that k/N - p E (0, 1) as N -. 00. Then, compute lim,v-- P{X = j} for all j. 2.5. Suppose Xp has the geometric distribution with parameter p E (0, 1) (§4.2, page 8). Then prove that limp-0 P{pXp > x} = P(Y > x} for all x > 0, where Y has the exponential distribution with parameter one.
2.6. Suppose: (a) limn-_ P{Xn < x} = P{X < x} for all x E R; and (b) x - P{X < x} is continuous. Prove that limn_,o [P{Xn < x} - P(X < x11 = 0. Use this to prove that (2.5) holds uniformly for all real numbers a < b.
2.7. Construct discrete random variables {Xn };° 1 such that p(x) := limn-- P{Xn = x} exists for all x E R, but p is not a mass function. 2.8 (Problem 2.7, continued). Suppose (Xn}°n°__1 are discrete random variables with respective
mass functions (pn)1j. Prove that if there exists a function p such that "Mn- EIER 1p^(z) p(z)[ = 0, then p is a mass function. 2.9. Derive the following "normal approximation": (2.19)
1
e--' dx = t - I t3 + o(t3)
as t - 0.
(Laplace, 1782, vol. 10, p. 230).
2.10 (Hard). Suppose X is a discrete, non-negative random variable with mass function p. Prove
that limn_,(E[Xn[)1/n = sup{x E R : p(x) > 0}.
2. Bernoulli vials
22
2.11 (Hard). Suppose X has the Poisson distribution with parameter A > 0. Prove that for all a < b,
AhmoP(a< f 1, Ej E ad and E C U00 En } n=1
n=1
.
JJJ
In the jargon of measure theory, p is a Caratheodory outer measure on
(11,This defines a natural extension of M. The proof proceeds in three steps. Step 1. Countable Subadditivity of d. First, we want to prove that µ is countably subadditive on .9(12). Indeed, we wish to show that p(Un__1An) < En°_1 µ(An) for all A1, A2.... C S2. To this end, consider any collection {A3,n} of elements of 0 such that An C UJ_1Aj,n for all n. By the definition of µ, (3.21)
An =1
< n=1 =
3. Measure Theory
32
A second appeal to the definition of µ implies that for any E > 0 we could choose the Aj,n's such that
0 (3.22) j=1
whence
UAn <E+Efi(An)
(3.23)
n=1
n=1
Because E > 0 is arbitrary, this yields the countable subadditivity of µ. Step 2. µ extends it. Next, we plan to prove that µ and µ agree on .rd so that µ is indeed an extension of p. Because µ(E) < µ(E) for all E E d, we seek to prove the converse inequality. Consider a collection E1, E2.... of elements of sd that cover E. For any E > 0 we can arrange things so that E' 1 p(En) < p(E) + E. Since p is countably additive on sd, (3.24)
p(EE) < µ(E) +c.
µ(E) < p C U 001
En/
n=1
Because E > 0 is arbitrary, Step 2 is completed. Step 3. Countable Additivity. We now complete our proof by showing that the restriction of µ to u(d) is countably additive. Thanks to Step 1, it suffices to show that F,0* 1 p(An) < µ(U°O==1An) for all disjoint A1, A2, ... E
a(d). With this in mind consider (3.25)
-,f = {E c f2 : vF E sd, µ(E) =µ(E n F) + p(E n Fc) } .
According to Step 2, . f' contains sd. Thus, thanks to the monotone class theorem, if . f' were a monotone class then a(O) C -0. This proves that µ is finitely additive on o(sd). Since the An's were disjoint, it follows that 0o
(3.26)
p U An n=1
for every N > 1.
>_
ja
N
N
n=1
n=1
( U An = E fl(An),
Step 3, whence the Caratheodory extension theorem,
follows from this upon letting N j oo. Define (3.27)
-,Y := {E c it : p(E) > µ(E n F) + µ(E n F`) vF E W}
.
Owing to Step 1, it suffices to show that .A' is a monotone class. This is proved by appealing to similar covering arguments that we used in Steps 1 and 2.
Problems
33
Problems 3.1. Prove Lemma 3.4.
of o-algebras such that o(U_,.!Fi) $ 3.2. Construct an example of a countable family U,_,9;. Can you do this so that .ST, C Y,+1 for all i? Typically, one writes v°_-t.lg, for o(U;_ 1 f, ).
3.3. Construct a a-algebra 9 of subsets of R such that no open interval is measurable with respect to 9, although any singleton {x} is (x E R). 3.4. Prove that.4(Rk) is generated by the collection of all balls whose center and radius are both rational. This implies that yd(Rk) is "countably generated," i.e., generated by a countable family of sets. Prove also that any singleton (x) is .A(Rk)-measurable. 3.5. Prove Lemma 3.11. 3.6. Prove Lemma 3.13.
3.7. Prove Lemma 3.23. 3.8 (Counting Measure). Suppose fl is a set. For any A C fl define µ(A) to be the cardinality of A. Prove that p is a measure on (12, Y(fl)), where .?(fl) denotes the power set of fl. 3.9 (Distribution Functions). A function F : R -- [0, 11 is a (cumulative) distribution function on R if: (i) It is non-decreasing and right-continuous [this means that F(at) := limb,. F(b) = F(a)];
(ii) lim., F(a) = 0; and (iii) lim._-W F(a) = 1. Prove that if µ is a probability measure on (R,.4f(R)), then F(a) := p((-oo,a)) defines a distribution function on R. Conversely, prove that if F is a distribution function on R, then there exists a unique probability measure µ on (R, 9(R)) such that µ((-oo,a]) := F(a) for all a E R.
3.10. If x, r E R and A C R, then consider the sets (3.28)
x+A:={x+a: aEA} and rA:={ra: aEA}.
Prove that Lebesgue measure on R is translation invariant. That is, the measure of x + A is the same as the measure of A, provided that x + A and A are Bore] measurable. Furthermore,
prove that if m. denotes the Lebesgue measure on ([0,a],R([O,a])) for a given a > 0, then a-1A E 2((0,11) for all measurable A C (0,a), and m.(A) = aml(a-'A). In other words, prove that Lebesgue measure is also scale invariant.
3.11 (Problem 3.10, Continued). Let p be a translation invariant a-finite measure on (R, td(R)). Prove that there exists a c E (0, oo) such that c-1 µ is Lebesgue measure.
3.12 (Lebesgue Measure on the Circle; Problem 3.10, Continued). Let S1 = (z E C : [z[ = 1} denote the unit circle in the plane. We say that A C S1 is an open subset of S1 if A is an open subset of C; this defines 5it(S') unambiguously. Prove that f (O) = exp(i2i9) defines a homeomorphism from (0, 1] onto S1; that is, f-1 exists, and f and f-1 are both continuous. Let m denote the Lebesgue measure on (0, 1) and define
µ(A) = m(f-1(A)) for all A E .£(S'). Prove that p is a probability measure on (S' ,R(Sl)). Prove also that p is "rotation invariant." That is, u(rA) = µ(A) for all A E R(S1) and r E C with [r[ = 1. Frequently, p is called the Lebesgue measure on V.
3.13. Suppose (f2 (fl) , µ) is a topological (Borel-) measure space. Define supp(p) to be the smallest closed set whose complement has p-measure zero; this is called the support of p. Prove that supp(p) is well defined. Prove also that a point x E fl is not in supp(p) if and only if there exists an open neighborhood U of x such that p(U) = 0. 3.14. Consider a finite measure space (fl, Jr, it), and suppose rA C Jr is a monotone class. Prove that -f is a monotone class, where ((
(3.29)
n=1 l
n1
3. Measure Theory
34
3.15 (Relative Measures). If µ is a a-finite measure on (R,.B(R)), then define d to be the collection of all A E -V(R) such that the following limit exists and is finite:
µ(A fl [-n, nJ) n Is d an algebra? What if R E d? Is Du countably additive on (R, d)?
(Dµ)(A) = nom= Iim
(3.30)
3.16. Let µ be a measure on (R,R(R)) such that the following limit exists for all x E R: µ([x - x + Tl) (3.31) (LP)(x) = Timo T Prove that Lp is a constant (Plancherel and P61ya, 1931).
3.17 (Hard). In this exercise we construct a set in the circle S' that is not Borel measurable. As usual, we can think of S1 as the unit circle in C. That is, S1 = {e'a : 9 E (0,2x[}.
(1) Given any z = e1°, w = e48 E S', we write z - w if a - $ is a rational number. Show that this defines an equivalence relation on S'. (2) Use the axiom of choice to construct a set A whose elements are one from each equivalence class of St. A is often written as S'/ -. (3) For any rational a E (0,2x[ let A. = et°A denote the rotation of A by angle or, and check that if a, 0 E (0, 2x1 fl Q are distinct, then A. fl AO = 0. (4) Let µ denote the Lebesgue measure on (S',2(S')) (Problem 3.12), and show that
µ(A) is not defined. (HINT: S1 = UOE(o,z,JnQAQ.) (5) Conclude that A is not Borel measurable.
3.18 (Harder). For any compact E C (0,11 and r, Q > 0 define H5 (E) := inf
(3.32)
I E; 1s,
:=1
where JAI denotes the Lebeague measure of A, and the infimum is computed over all sequences
(E;), of closed intervals such that sup, JE;I < e and U;_,E, J E. Prove that (3.33)
H9(E)
limeH5(E)
exists and defines a measure on x([0,11). The set function HO is called the dimensional Hausdorff measure. Can you identify Ht and H5 for 9 > 1? (HINT: You may wish to consult the book of Falconer (1986, §1.1 and §1.2).)
Notes (1) The theorem of Solovay (1970), referred to in the preamble of this chapter, states
that there are non-measurable subsets of the real line if and only if Cantor's axiom of denumerable choice [ADCJ holds. Note that ADC lies at the very heart of nearly all of real analysis. (2) Textbook expositions of Lemma 3.4 have a long tradition; see, for example, Hausdorff (1927, p. 85). Similarly, we can refer to Hausdorff (1927, pp. 177-181) for Definition 3.6.
Chapter 4
Integration
Nature laughs at the difculties of integration. -Pierre-Simon de Laplace
We are ready to define nearly household terms such as "random variables," "expectation," "standard deviation," and "correlation." Next follows a brief preview: A random variable X is a measurable function. The expectation EX is the integral f X dP of the function X with respect to the underlying probability measure P. The standard deviation is the distance, in L2(P), between X and its expectation. Correlation is related to an expectation of a certain function of two random variables.
Thus, in this chapter we describe measurable functions, as well as the abstract integral f X dP. Throughout, (Sl , .F, µ) denotes a measure space.
1. Measurable Functions Definition 4.1. A function f : S2 -+ Rn is (Borel) measurable if f -1(E) E .9' for all E E .V(R'). Measurable functions on probability spaces are often referred to as random variables, and written as X, Y... instead of f, g, ... . Measurable subsets of probability spaces are called events. Because f -1(E) = {w E S2 : f (w) E E}, f is measurable (equivalently, f is a random variable) if and only if the pre-images of measurable sets under f are themselves measurable. 35
36
4. Integration
Example 4.2. The indicator function of A C S2 is (4.1)
1A(W) =
J1
ifwEA,
10 if w E A` If A E 3, then 1A : 9 - {0, 1} is a measurable function. .
Checking the measurability of a function can be a painful chore. The following alleviates some of the pain most of the time.
Lemma 4.3. If sit is an algebra that generates . (R") and f-1(A) E . for all A E 0, then f : S2 R" is measurable. Proof. The lemma follows from the monotone class theorem (p. 30), because (A E .V(R") : f -1(A) E 9} is a monotone class that contains 0. The following shows how to use this to produce measurable functions.
Lemma 4.4. Consider functions f, f1, f2, ...: Sl - R" and g : R" - R'". (i) If g is continuous, then it is measurable. (ii) If f, fl, f2 are measurable, then so are a f and f, + f2 for all a E R. If n = 1, then f1 x f2 is measurable too. (iii) If n = 1 and f1, f2,... are measurable, then so are supk fk, infk fk, lira supk fk, and lim infk fk
(iv) If g and f are measurable, then so is their composition (go f)(x) _ g(f (x)).
Proof. By definition, if g is continuous then for all open sets G C R, g1(G) is open and hence Borel measurable. Because g-1(G`) = (g-1(G))` and g-'(G1 U G2) = g-1(G1) U g-1(G2), (i) follows from Lemma 4.3. The functions g(x) = ax and g(x, y) = x + y and g(x, y) = xy are all continuous on the appropriate Euclidean spaces. So if we proved (iv), then (ii) would follow from (i) and (iv). But (iv) is an elementary consequence of the identity
(g o f)-1(A) = f-1(g-1(A)). It remains to prove (iii). From now on, we assume that the values of the fk's are one-dimensional. Let S(w) = supk fk(w) and note that 00
(4.2)
S-1((-oo,x]) = n fk 1((-oo,x]) E Jr k=1
for all x E R. Because S-1((x,y]) = S-1((-oo,y]) \ S-1((oo,x]) for all reals x < y, it follows that S-1((x, y]) E 9. The collection of finite disjoint unions of sets of the form (x , y] is an algebra that generates .R(R). Therefore, supk fk is measurable by Lemma 4.3. Apply (iv) to
g(x) = -x to deduce that infk fk = - supk(- fk) is also measurable. But
2. The Abstract Integral
37
we have lim supk fk = inf,n supm>n fk = infk hk where hk = SUP >k f,n. Since denumerable suprema and infima preserve measurability, lim supk fk is measurable. Finally, the lim inf is measurable because lim infk fk = - lim supk (- fk ).
2. The Abstract Integral Throughout this section (SI, 5, p) is a finite measure space unless we explicitly specify that µ is a-finite. We now wish to define the integral f f du for measurable functions f : 11 - R. Much of what we do here works for a-finite measure spaces using the following localization method: Find disjoint measurable K1, K2.... such that UnK7z = S2 and u(K,,) < oo. Define µn to be the restriction of it to Kn; i.e., p,z(A) = uu(K. fl A) for all A E Jr. It is easy to see that µ1i is a finite measure on (1, 5). Apply the integration theory of this module to
µ z, and define f f du = >n f f dµ.. For us the details are not worth the effort. After all probability measures are finite! The abstract integral is derived in three steps.
2.1. Elementary and Simple Functions. When f is a nice function, f f du is easy to define. Indeed, suppose f = c1A where A E 9 and c E R. Such functions are called elementary functions. Then, we define f f dp = cp(A). More generally, suppose A,_., An E Jr are disjoint, al, ... , an E R, and f = E 1 aj1A . Then f is measurable by Lemma 4.4, and such functions are called simple functions. For them we define f f du = E, 1 apu(Aj). This notion is well defined; in other words, writing a simple function f in two different ways does not yield two different integrals. One proves this first in the case where f is an elementary function. Indeed, suppose f = a1A = b1B + c1c, where B, C are disjoint. It follows easily from this that a = b = c and A = B U C. Therefore, by the finite additivity of p, a,u(A) = bp(B) + cp(C). This is another way of saying that our integral is well defined in this case. The general case follows from this, the next lemma, and induction.
Lemma 4.5. If f is a simple function, then so is If 1. If f > 0 pointwise, then f f dp > 0. Furthermore, if f, g are simple functions, then for a, b E R, (4.3)
f(af+b)dtz=affd+bffdiz.
In other words, A(f) := f f dp defines a non-negative linear functional on simple functions. A consequence of this is that f f dµ < f g du whenever
38
4. Integration
f < g are simple functions. In particular, we have also the following important consequence: I f f dµI < f if I dµ. This is called the triangle inequality.
2.2. Bounded Measurable Functions. Suppose f : Il - R is bounded and measurable. To define f f dµ we use the following to approximate f by simple functions.
Lemma 4.6. If f : S2 -+ R is bounded and measurable, then we can find simple functions fn, Yn (n = 1, 2, ...) such that as n -+ oo: L n 1 1; f, 1 f ; and fn < fn + 2-n pointwise.
We can deduce the following by simply combining Lemmas 4.5 and 4.6:
f fn dµ < f fn dµ < f f n dµ + 2-nµ(52) for all n > 1; and f f dµ := limn-oc f fn dµ = limn_co f 7n dµ exists and is finite. This produces an integral f f dµ that inherits the properties of f f n dµ and f fn dµ that were described by Lemma 4.5. That is, Lemma 4.7. If f is a bounded measurable function, then so is If 1. If f is a pointwise-nonnegative measurable function, then f f dp > 0. Furthermore, if f, g are bounded and measurable functions, then for a, b E R, (4.4)
2.3. The General Case. Let R+ := [0, oo), and consider a non-negative measurable f : S2 -* R+. For all n > 1, the function fn(w) := min(f (w), n)
is measurable [Lemma 4.4] and 0 < fn < f. Because fn j f as n -+ oo, Lemma 4.7 insures that f fn dµ increases with n, and hence has a limit, which is denoted by f f dµ. This "integral" inherits the properties of the integrals for bounded measurable integrands, but may be infinite. In order to define the most general integral of this type let us consider an arbitrary measurable function f : 11 -+ R and write f = f+ - f -, where (4.5)
f(w) := max(f (w) , 0) and f(w) := - min(f (w) , 0).
The functions f + and f - are respectively called the positive and the negative parts of f. Both f :L are measurable (Lemma 4.5), and if f If I dµ < oo, then
we can define f f dµ = f f + dµ - f f - dµ. This integral has the following properties.
Proposition 4.8. Let f be a measurable function such that f If I dµ < oo. If f > 0 pointwise, then f f dµ > 0. If g is another measurable function such that f IgI dµ < oo, then for a, b E R, (4.6)
f(af+bg)diz=affdiz+bfdiL.
39
3. LP-Spaces
Our arduous construction is over and gives us an "indefinite integral." We can get "definite integrals" as follows: For all A E S2 define
ffdP=JflAd/i.
(4.7)
This is well defined as long as fA If I dp < oo. In particular, note that
ffdµ=ffdµ
Definition 4.9. We say that f is integrable (with respect to µ) if f If I dµ < oo. On occasion, we will write f f (w) µ(dw) for the integral ffdµ. This will be useful later when f will have other variables in its definition, and the µ(dw) reminds us to only "integrate out the variable w."
Definition 4.10. When (S2 , P) is a probability space and X : 11 -+ R is a random variable, we write EX = f X dP and call this integral the expectation or mean of X. When A E [i.e., when A is an event], we may write E[X; A] in place of the more cumbersome E[X 1A] or fA X dP.
3. LP-Spaces Throughout this section (D, .`9', P) is a probability space. We can define for all p E (0, oo) and all random variables X : 9 -+ R, (4.8)
IIXIIP:_ (E{IXIPl)11P,
provided that the integral exists; i.e., that I X I P is P-integrable.
Definition 4.11. The space LP(P) is the collection of all random variables
X : Il -+ R that are p times P-integrable. More precisely, these are the random variables X such that IIXIIP < oo.
Remark 4.12. More generally, if (S2 , .` , p) is a o-finite measure space, then LP(M) will denote the collection of all measurable functions f : SZ -+ R such that II.f IIP < oo. Occasionally we write respectively If II r.'(µ) and LP(i, .$, µ) in place of IIf IIP and LP(µ) in order to emphasize that the underlying measure space is (Q, 9,,u). Next we list some of the elementary properties of LP-spaces. Note that the following properties do not rely on the finiteness of µ.
Theorem 4.13. The following hold for a a-finite measure µ: (i) LP(µ) is a linear space. That is, Ilaf IIP = Ial . IIf IIP for all a E R and f E L%µ), and f+ g E LP(µ) if f, g E LP(µ). (ii) (Holder's Inequality) If p > 1 and p 1 + q-1 = 1, then IIf9II1 S
IIfIIPIIgIIq
df E LP(µ), g E L'(µ).
40
4. Integration
(iii) (Minkowski's Inequality) If p > 1, then
'f, g E L"(p).
Ill +911P:5 IIf lip + II9IIP
Proof. It is clear that Ilaf lip = lal IIfIIP, and Ix+yIP I f f dpl. A second noteworthy example is the inequality f eJ d1i > exp(f f dp), valid because ?P(x) = eZ is convex. These examples do not presuppose any integrability (why?).
Proof of Jensen's inequality. We will soon see that because 1P is convex, there are linear functions {LZ}ZER such that (4.15)
V) (x) = sup L,.(x) zER
dx E R+.
Therefore, by Proposition 4.8, (4.16)
l,b(f)dp ?
(JIdit).
L(f)du
= supLCJ f d) = 0 sERJ [Here is where we need p to be a probability measure.] It is easy to describe LZ pictorially: L. describes the line "tangent" to the graph of V) at the point (z, tl'(z)). Nonetheless (4.15) merits an honest proof. Consider three points x < z < y. We can write z = Ax + (1 - A)y where A _ (y - z)/(y - x). Because A E (0, 11 and tp is convex, J
(4.17)
ty(z)
t/'(z) + A, (x - z)
Vx
z, b(y) ? t'(z) + A2(y - z) where A2 := supw p > 1, then L'(µ) C L"(µ). In fact, (4.22)
IIf!I,
°
II! II,
df E L'(µ).
Proof. The proposition follows from the displayed inequality. Since this is a result that involves only the function |f|, we can assume without loss of generality that f ≥ 0. Consider simple functions S_n that converge upward to f. Suppose we could prove the proposition for each S_n. Then we can let n ↑ ∞ and appeal to our construction of integrals to derive the theorem for f. In particular, we can assume without loss of generality that f is bounded, and hence in L^v(µ) for all v > 0. The function φ(x) := |x|^s is convex for all s ≥ 1. Let s := r/p and apply Jensen's inequality to deduce that, when µ is a probability measure,
(4.23)  ||f||_p^r = φ(∫ |f|^p dµ) ≤ ∫ φ(|f|^p) dµ = ||f||_r^r.
This is the desired result. If µ(Ω) > 0 is finite but not equal to 1, then define µ̄(A) := µ(A)/µ(Ω). This is a probability measure, and according to what we have shown thus far, ||f||_{L^p(µ̄)} ≤ ||f||_{L^r(µ̄)}. Solve for ||f||_{L^p(µ)} and ||f||_{L^r(µ)} to finish. Finally, if µ(Ω) = 0 then the result holds vacuously.
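Proposition 4.16 is easy to probe numerically. The following short Python sketch (NumPy, and the particular function f(ω) = (1 − ω)^{-1/2}, are illustrative choices and not part of the text) draws a large sample from the uniform law on [0, 1) and checks that the empirical norms (E|f|^p)^{1/p} increase with p, as (4.22) predicts when µ(Ω) = 1.

```python
import numpy as np

# Monte Carlo sketch of Proposition 4.16 on a probability space:
# P is uniform on [0, 1) and f(w) = (1 - w)**(-1/2), which lies in L^p
# for every p < 2.  The empirical p-norms increase with p.
rng = np.random.default_rng(0)
w = rng.random(1_000_000)            # uniform sample from [0, 1)
f = (1.0 - w) ** -0.5

for p in (0.5, 1.0, 1.5, 1.9):
    norm_p = np.mean(np.abs(f) ** p) ** (1.0 / p)
    print(f"p = {p:3.1f}   ||f||_p ≈ {norm_p:.3f}")
```

(The exact value here is (2/(2 − p))^{1/p} for p < 2, so the monotonicity can also be checked by hand.)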
Fix some p ≥ 1 and define d(f, g) := ||f − g||_p for all f, g ∈ L^p(µ). According to Minkowski's inequality (Theorem 4.13), d has the following properties:
(1) d(f, f) = 0;
(2) d(f, g) ≤ d(f, h) + d(h, g); and
(3) d(f, g) = d(g, f).
In other words, if it were the case that "d(f, g) = 0 ⟹ f = g," then d(·, ·) would metrize L^p(µ). Unfortunately, the latter property does not hold in general. For an example consider g := f 1_A where A ≠ Ω and µ(A^c) = 0. Evidently then g ≠ f, but d(f, g) = 0. Nonetheless, if we can identify the elements of L^p(µ) that are equal to each other outside a null set, then the resulting collection of equivalence classes (endowed with the usual quotient topology and Borel σ-algebra) is indeed a metric space. It is also complete; i.e., every Cauchy sequence converges.
Theorem 4.17. Let (Ω, 𝔉, µ) denote a σ-finite measure space. For any f, g ∈ L^p(µ), write f ∼ g iff f = g µ-a.e.; that is, µ({ω : f(ω) ≠ g(ω)}) = 0. Then ∼ is an equivalence relation on L^p(µ). Let [f] denote the ∼-orbit of f; i.e., g ∈ [f] iff f ∼ g. Let ℒ^p(µ) := {[f] : f ∈ L^p(µ)} and define ||[f]||_p := ||f||_p. Then ℒ^p(µ) is a complete normed linear space. Moreover, ℒ^2(µ) is a Hilbert space.
We will prove this in Section 5 below; see page 46.
4. Modes of Convergence
There are many ways in which a sequence of functions can converge. We will be primarily concerned with the following. Throughout, (Ω, 𝔉, µ) is a measure space, and f, f_1, f_2, ... : Ω → R are measurable.
Definition 4.18. We say that f_n converges to f µ-almost everywhere (written µ-a.e., a.e. [µ], or even a.e.) if
(4.24)  µ({ω ∈ Ω : limsup_{n→∞} |f_n(ω) − f(ω)| > 0}) = 0.
Frequently, we write {f ∈ A} for {ω ∈ Ω : f(ω) ∈ A} and µ{f ∈ A} for µ({f ∈ A}). In this way, f_n converges to f a.e. iff µ{f_n ↛ f} = 0. When (Ω, 𝔉, P) is a probability space and X, X_1, X_2, ... are random variables on this space, we say instead that X_n converges to X almost surely (written a.s.).
Definition 4.19. We say that f_n → f in L^p(µ) if lim_{n→∞} ||f_n − f||_p = 0. Also, f_n → f in measure if lim_{n→∞} µ{|f_n − f| ≥ ε} = 0 for all ε > 0. If X, X_1, X_2, ... are random variables on the probability space (Ω, 𝔉, P), then we say that X_n converges to X in probability when X_n → X in P-measure; that is, if lim_{n→∞} P{|X_n − X| ≥ ε} = 0 for all ε > 0. We write this as X_n → X in probability.
Theorem 4.20. Either a.e.-convergence or L^p-convergence implies convergence in measure. Conversely, if sup_{i≥n} |f_i| → 0 in measure, then f_n → 0 almost everywhere.
The interesting portion of this relies on the following result:
Markov's Inequality. If f ∈ L^1(µ), then for all λ > 0,
(4.25)  µ{|f| ≥ λ} ≤ (1/λ) ∫_{{|f| ≥ λ}} |f| dµ ≤ ||f||_1 / λ.
Proof. Set A := {|f| ≥ λ} and note that ∫_A |f| dµ ≥ ∫_A λ dµ = λµ(A). This yields the first inequality. The second one is even more transparent.
We can apply the preceding to the function |f|^p to deduce the following:
Chebyshev's Inequality (1846; 1867). For all p, λ > 0 and f ∈ L^p(µ),
(4.26)  µ{|f| ≥ λ} ≤ ||f||_p^p / λ^p.
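Both bounds are easy to test by simulation. The small Python sketch below (NumPy is assumed, and exponential variables are an arbitrary illustrative choice) compares empirical tail probabilities with the Markov and Chebyshev bounds; the bounds are crude but never violated.

```python
import numpy as np

# Empirical tails versus the Markov and Chebyshev bounds for exponential(1)
# random variables (E X = 1, E X^2 = 2).  Purely an illustration.
rng = np.random.default_rng(1)
x = rng.exponential(scale=1.0, size=1_000_000)

for lam in (2.0, 4.0, 8.0):
    tail = (x >= lam).mean()             # empirical P{X >= lambda}
    markov = x.mean() / lam              # ||X||_1 / lambda
    cheby = (x ** 2).mean() / lam ** 2   # ||X||_2^2 / lambda^2
    print(f"lambda={lam}: tail≈{tail:.5f}  Markov={markov:.3f}  Chebyshev={cheby:.3f}")
```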
Proof of Theorem 4.20. By the Chebyshev inequality, L^p(µ)-convergence implies convergence in measure. In order to prove that a.e.-convergence implies convergence in measure we first need to understand a.e.-convergence a little better. Note that f_n → f a.e. if and only if µ(∩_{N=1}^∞ ∪_{n=N}^∞ {|f_n − f| ≥ ε}) = 0 for all ε > 0. Since µ is continuous from above,
(4.27)  f_n → f a.e.  iff  lim_{N→∞} µ(∪_{n=N}^∞ {|f_n − f| ≥ ε}) = 0 for all ε > 0.
Because µ{|f_N − f| ≥ ε} ≤ µ(∪_{n≥N} {|f_n − f| ≥ ε}), if f_N → f a.e. then f_N → f in measure. Finally, if sup_{i≥n} |f_i| → 0 in measure then
(4.28)  µ({ω : limsup_{n→∞} sup_{i≥n} |f_i(ω)| ≥ ε}) = 0   ∀ε > 0.
Thus, limsup_m |f_m| ≤ limsup_n sup_{i≥n} |f_i| ≤ ε a.e. If N(ε) denotes the set of ω's for which this inequality fails, then ∪_{ε∈Q⁺} N(ε) is a null set off which limsup_m |f_m| ≤ ε for every rational ε > 0; i.e., off ∪_{ε∈Q⁺} N(ε) we have lim_m |f_m| = 0. ∎
Here are two examples to test the strength of the relations between the various modes of convergence. The first involves the Steinhaus probability space which was a starting-point of modern probability theory.
Example 4.21 (The Steinhaus Probability Space). The Steinhaus probability space is the probability space (Ω, ℬ(Ω), P), where Ω is either (0, 1), [0, 1), (0, 1], or [0, 1]; P denotes the Lebesgue measure on Ω. On this space consider
(4.29)  X_n(ω) := n^a 1_{[0, 1/n)}(ω),
where a > 0 is fixed and ω ∈ Ω. Then X_n → 0 almost surely (in fact for all ω ∈ Ω). And yet if p ≥ a^{-1}, then ||X_n||_p^p = n^{ap−1} is bounded away from 0. Therefore, a.s.-convergence does not imply L^p-convergence. The trouble comes from the fact that sup_n |X_n| is not in L^p(P); compare with the dominated convergence theorem (p. 46).
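Example 4.21 is concrete enough to tabulate. The plain-Python sketch below (the parameters a = 1 and p = 1, and the sample point ω = 0.37, are illustrative choices only) shows the values X_n(ω) dying out while the norms ||X_n||_p stay constant.

```python
# Example 4.21 in numbers: X_n(w) = n**a on [0, 1/n) and 0 elsewhere.
a, p = 1.0, 1.0          # an illustrative choice with p >= 1/a

w = 0.37                 # a fixed sample point
values_at_w = [n ** a * (w < 1.0 / n) for n in (1, 2, 3, 10, 100, 1000)]
print(values_at_w)       # [1.0, 2.0, 0.0, 0.0, 0.0, 0.0]: pointwise limit is 0

norms = [(n ** (a * p) / n) ** (1.0 / p) for n in (1, 2, 3, 10, 100, 1000)]
print(norms)             # ||X_n||_p = n**(a - 1/p) = 1.0 for every n here
```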
Example 4.22. Let ((0, 1], ℬ((0, 1]), P) be the Steinhaus probability space of the previous example. Now we construct random variables {X_n}_{n=1}^∞ such that lim_n X_n(ω) does not exist for any ω ∈ (0, 1], and yet lim_n ||X_n||_p = 0 for all p > 0.
Define a "triangular array" of functions f_{i,j} (∀i ≥ 1, 1 ≤ j ≤ 2^{i−1}) as follows: First let f_{1,1}(ω) := 1 for all ω ∈ (0, 1]. Then define
(4.30)  f_{2,1}(ω) := 2 if ω ∈ (0, 1/2] and 0 otherwise;   f_{2,2}(ω) := 2 if ω ∈ (1/2, 1] and 0 otherwise.
In general, for all i ≥ 1 and j = 1, ..., 2^{i−1}, we can define f_{i,j} to be i on ((j − 1)2^{−(i−1)}, j2^{−(i−1)}], and zero elsewhere. Let us enumerate the f_{i,j}'s according to the dictionary ordering, and call the resulting relabeling (X_k)_{k=1}^∞; i.e., X_1 = f_{1,1}, X_2 = f_{2,1}, X_3 = f_{2,2}, X_4 = f_{3,1}, X_5 = f_{3,2}, .... Evidently, limsup_{k→∞} X_k(ω) = ∞ whereas liminf_{k→∞} X_k(ω) = 0 for all ω ∈ (0, 1]. In particular, the X_n(ω)'s do not converge for any ω. On the other hand, X_n converges to zero in L^p(P) for all p > 0 because ||f_{i,j}||_p^p = i^p 2^{−(i−1)}.
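The dictionary ordering can be written out explicitly. The following plain-Python sketch (the truncation at i = 5 and the point ω = 0.3 are for display only) lists X_k(ω) together with ||X_k||_p^p, which tends to zero while the values keep returning to larger and larger heights.

```python
# The dictionary ordering of Example 4.22, written out for small i.
p, w = 1.0, 0.3
pairs = [(i, j) for i in range(1, 6) for j in range(1, 2 ** (i - 1) + 1)]

for k, (i, j) in enumerate(pairs, start=1):
    left, right = (j - 1) * 2.0 ** (1 - i), j * 2.0 ** (1 - i)
    value = i if left < w <= right else 0          # X_k(w)
    norm_p_p = i ** p * 2.0 ** (1 - i)             # ||X_k||_p^p
    print(k, (i, j), value, round(norm_p_p, 4))
```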
5. Limit Theorems
Proposition 4.8 expresses two of the essential properties of the abstract integral: (i) Integration is a positive operation (i.e., if f ≥ 0 then ∫f dµ ≥ 0); and (ii) it is a linear operation (i.e., equation (4.6)). We now turn to some of the important properties that involve limiting operations. Throughout this section, we let (Ω, 𝔉, µ) denote a finite measure space, and address the following question: If f_n converges to f, then does ∫f_n dµ converge to ∫f dµ?
The Bounded Convergence Theorem. Suppose f_1, f_2, ... are measurable functions on (Ω, 𝔉) such that sup_n |f_n| is bounded by a constant K. If f_n → f in measure [µ], then lim_n ∫f_n dµ = ∫f dµ.
Proof. For all n ≥ 1, f_n is integrable because |f_n(ω)| ≤ K and µ is a finite measure. Now fix an ε > 0 and let E_n := {ω ∈ Ω : |f(ω) − f_n(ω)| ≥ ε}. According to Proposition 4.8,
(4.31)  |∫f dµ − ∫f_n dµ| ≤ ∫_{E_n^c} |f − f_n| dµ + ∫_{E_n} |f − f_n| dµ ≤ εµ(Ω) + 2Kµ(E_n).
Since lim_{n→∞} µ(E_n) = 0, we can then let ε ↓ 0 to finish. ∎
Fatou's Lemma. If {f_i}_{i=1}^∞ is a collection of non-negative integrable functions on (Ω, 𝔉, µ), then
(4.32)  ∫ liminf_{n→∞} f_n dµ ≤ liminf_{n→∞} ∫ f_n dµ.
Proof. Let g_n := inf_{j≥n} f_j and observe that g_n ↑ f := liminf_{k→∞} f_k as n → ∞. In particular, for any constant K > 0, (f ∧ K − g_n ∧ K) is a bounded measurable function that converges to 0 as n → ∞. Because g_n ≤ f_n, the bounded convergence theorem implies that
(4.33)  liminf_{n→∞} ∫ f_n dµ ≥ lim_{n→∞} ∫ (g_n ∧ K) dµ = ∫ (f ∧ K) dµ.
Therefore, it suffices to prove that
(4.34)  lim_{K↑∞} ∫ (f ∧ K) dµ = ∫ f dµ.
For all ε > 0 we can find a simple function S such that: (i) 0 ≤ S ≤ f; (ii) there exists C > 0 such that S(ω) ≤ C; and (iii) ∫ S dµ ≥ ∫ f dµ − ε. Now
∫ (f ∧ K) dµ ≥ ∫ (S ∧ K) dµ = ∫ S dµ ≥ ∫ f dµ − ε   if K ≥ C.
This proves (4.34), whence follows the result. ∎
The Monotone Convergence Theorem. Suppose {f_n}_{n=1}^∞ is a sequence of non-negative integrable functions on (Ω, 𝔉, µ) such that f_n(x) ≤ f_{n+1}(x) for all n ≥ 1 and x ∈ Ω, and f(x) := lim_{n→∞} f_n(x) exists for all x ∈ Ω. Then, lim_{n→∞} ∫f_n dµ = ∫f dµ.
Proof. By monotonicity, L := lim_{n→∞} ∫f_n dµ exists and is ≤ ∫f dµ. Apply Fatou's lemma to deduce the complementary inequality. ∎
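A quick numerical look at monotone convergence (a sketch only; NumPy and the particular function x^{-1/2} are illustrative assumptions, not part of the text): the truncations f_n := f ∧ n increase to f, and their integrals 2 − 1/n increase to ∫f dµ = 2.

```python
import numpy as np

# f(x) = x**(-1/2) on (0, 1] and its increasing truncations f_n = min(f, n).
x = (np.arange(1, 400001) - 0.5) / 400000      # midpoint grid on (0, 1]
f = x ** -0.5
for n in (2, 10, 100, 1000):
    print(n, np.minimum(f, n).mean())          # Riemann sum of f_n, -> 2
```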
The Dominated Convergence Theorem. Suppose {f_i}_{i=1}^∞ is a sequence of measurable functions on (Ω, 𝔉) such that sup_n |f_n| is integrable [dµ]. Then, lim_{n→∞} ∫f_n dµ = ∫ lim_{n→∞} f_n dµ, provided that f(x) := lim_{n→∞} f_n(x) exists for all x ∈ Ω.
Proof. Thanks to Fatou's lemma, f ∈ L^1(µ). Also, F := sup_{i≥1} |f_i| ∈ L^1(µ) by assumption. We can apply Fatou's lemma to the non-negative function g_n := 2F − |f_n − f| to deduce that lim_{n→∞} ∫|f_n − f| dµ = 0. The dominated convergence theorem follows from this and the bound
(4.35)  |∫f_n dµ − ∫f dµ| ≤ ∫|f_n − f| dµ,
which is merely the triangle inequality for integrals. ∎
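The domination hypothesis cannot be dropped. The Python sketch below (a crude Riemann-sum approximation; not part of the text) revisits f_n = n·1_{(0,1/n)} from Example 4.21: the functions tend to 0 pointwise, yet every integral equals 1, so Fatou's inequality is strict and dominated convergence does not apply, because sup_n f_n is not integrable.

```python
import numpy as np

# f_n = n on (0, 1/n) and 0 elsewhere: pointwise limit 0, integrals all 1.
x = (np.arange(1, 200001) - 0.5) / 200000      # midpoint grid on (0, 1]
for n in (10, 100, 1000):
    print(n, (n * (x < 1.0 / n)).mean())       # Riemann sum of f_n, ≈ 1
print("limit function integrates to", (0.0 * x).mean())
```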
We can now prove our Theorem 4.17 (p. 42) on completions of L^p-spaces.
Proof of Theorem 4.17. The fact that L^p(µ), and hence ℒ^p(µ), is a linear space has already been established in Theorem 4.13. As we argued a few paragraphs earlier, d(f, g) := ||f − g||_p is a norm (now on ℒ^p(µ)) as soon as we prove that d(f, g) = 0 ⟹ [f] = [g]; but this is obvious.
In order to establish completeness suppose {f_n}_{n=1}^∞ is a Cauchy sequence in L^p(µ). It suffices to show that f_n converges in L^p(µ). (Translate this to a statement about the [f_n]'s.) Recall that "{f_n}_{n=1}^∞ is Cauchy" means that lim_{n,m→∞} ||f_n − f_m||_p = 0. Thus, we can find a subsequence {n_k}_{k=1}^∞ such that ||f_{n_{k+1}} − f_{n_k}||_p ≤ 2^{−k}. Consequently, Σ_k ||f_{n_{k+1}} − f_{n_k}||_p < ∞. Thanks to Minkowski's inequality and the monotone convergence theorem, || Σ_k |f_{n_{k+1}} − f_{n_k}| ||_p < ∞. In particular, Σ_k (f_{n_{k+1}} − f_{n_k}) converges µ-almost everywhere (why?).
If f := Σ_k (f_{n_{k+1}} − f_{n_k}) (we may assume, without loss of generality, that f_{n_1} = 0), then f ∈ L^p(µ) by Fatou's lemma. By the triangle inequality for L^p-norms,
(4.36)  ||f − f_{n_k}||_p ≤ Σ_{j=k}^∞ ||f_{n_{j+1}} − f_{n_j}||_p → 0   as k → ∞.
Minkowski's inequality implies that
(4.37)  ||f − f_N||_p ≤ ||f − f_{n_k}||_p + ||f_{n_k} − f_N||_p   ∀N, k ≥ 1.
Therefore, we can let N and k tend to infinity to see that f_n → f in L^p(µ). Finally, we can recognize that by Hölder's inequality (f, g) := ∫ fg dµ is an inner product. Therefore, ℒ^2(µ) is a Hilbert space. This completes the proof. ∎
6. The Radon-Nikodym Theorem
Given two measures µ and ν one can ask, "When can we find a function π such that for all measurable sets A, ν(A) = ∫_A π dµ?" If µ denotes the Lebesgue measure, then the function π is a probability density function, and the prescription ν(A) := ∫_A π dµ defines a probability measure ν. For instance, the standard-normal distribution is precisely the measure ν when π(x) = (2π)^{−1/2} exp(−x²/2) and µ is the Lebesgue measure on the line.
Definition 4.23. Given two measures µ and ν on (Ω, 𝔉), we say that ν is absolutely continuous with respect to µ (written ν ≪ µ) if ν(A) = 0 for every A ∈ 𝔉 with µ(A) = 0.
The Radon-Nikodym Theorem. If µ and ν are σ-finite measures on (Ω, 𝔉) and ν ≪ µ, then there exists a non-negative measurable function π_{ν,µ} such that ν(A) = ∫_A π_{ν,µ} dµ for all A ∈ 𝔉. Moreover, π_{ν,µ} is unique up to a µ-null set.
Proof. Step 1. Dominated measures. Suppose first that µ is finite and ν(A) ≤ µ(A) for all A ∈ 𝔉. Then f ↦ ∫ f dν defines a bounded linear functional on the Hilbert space L^2(µ), so there exists a µ-a.e. unique π ∈ L^2(µ) such that ∫ f dν = ∫ fπ dµ for all f ∈ L^2(µ). Replacing f by suitable indicator functions shows that π ≥ 0, and the entire theorem follows, with π_{ν,µ} = π, in the case of dominated measures.
Step 2. General ν, µ. Because ν ≤ (µ + ν), Step 1 extracts a µ-a.e. unique (in fact, (µ + ν)-a.e. unique) and non-negative π ∈ L^2(µ + ν) such that ∫ f(1 − π) dν = ∫ fπ dµ for all f ∈ L^2(µ + ν). Replace f by the indicator of {π > 1} to deduce that µ{π > 1} = 0. Consequently,
(4.40)  ν(A) = ∫_A π_* dµ   ∀A ∈ 𝔉,  where π_* := π/(1 − π) on {π < 1} and π_* := 0 elsewhere.
This proves existence. To verify uniqueness, suppose π′ is another non-negative measurable function such that ν(A) = ∫_A π′ dµ for all A ∈ 𝔉. For ε > 0 define A(ε) := {ω : π′(ω) ≥ π_*(ω) + ε} and set f := 1_{A(ε)}. Then,
(4.42)  ∫ f π_* dµ = ∫ f π′ dµ ≥ ∫ f(π_* + ε) dµ = ∫ f π_* dµ + εµ(A(ε)).
Because ∫ f π_* dµ ≤ ||π_*||_{L^1(µ)} < ∞, this proves that µ(A(ε)) = 0 for all ε > 0. By the continuity properties of measures (Lemma 3.11, page 25),
(4.43)  0 = lim_{ε↓0} µ(A(ε)) = µ(∪_{ε∈Q, ε>0} A(ε)).
But the right-hand side is the µ-measure of the set where π′ > π_*. This proves that π′ ≤ π_* a.e. [µ]. Reverse the roles of π′ and π_* to find that they are equal almost everywhere [µ]. ∎
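The standard-normal example from the beginning of this section can be checked numerically. The sketch below (Python/NumPy with a simple Riemann-sum quadrature; none of this is part of the text) computes ν((a, b]) = ∫_a^b π(x) dx from the density and compares it with the closed-form normal distribution function.

```python
import numpy as np
from math import erf, sqrt

# nu(A) = integral over A of the standard-normal density pi(x).
def pi_density(x):
    return np.exp(-x ** 2 / 2.0) / np.sqrt(2.0 * np.pi)

def nu_interval(a, b, n=200_000):
    x = np.linspace(a, b, n)
    return pi_density(x).mean() * (b - a)   # Riemann-sum quadrature

def normal_cdf(x):                          # closed form via the error function
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

a, b = -1.0, 2.0
print(nu_interval(a, b), normal_cdf(b) - normal_cdf(a))   # both ≈ 0.8186
```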
Problems
4.1. Let Ω be a set and (A, 𝒜) a measurable space. For any function X : Ω → A define σ(X) to be the collection of all inverse images of X; i.e., {X^{-1}(B) : B ∈ 𝒜}. Prove that σ(X) is a σ-algebra, and is the smallest σ-algebra with respect to which X is measurable.
4.2. Prove Lemma 4.6. (HINT: If f (w) E ]j2-", (j + 1)2-"), then set 7,(w) := (j + 1)2-".) 4.3. Let p denote the counting measure on the measure space (f2,9); see Problem 3.8 on page 33. Prove, carefully, that f f dp = F=En AT) whenever f is absolutely summable. 4.4 (Distributions). Suppose (fl, 9, p) is a measure space. Let f2' be a set, and let f' denote a or-algebra of subsets of f2'. Prove that if f : f2 -, 12' is measurable, then p o f-1 is a measure on (12',.x'), where (po f-1)(A) = p({w E fl: f(w) E A)) is the so-called distribution of Jr. 4.5 (Riesz Representation Theorem). Let C(0,11 denote the collection of all continuous functions that map (0, 1] to R. A map T : C(0, 1] - R is a positive, bounded linear functional if: (i) For
all a, b E R and f,9 E C(0, 1], T(af + bg) = aT(f) + bT(g); (ii) there exists a finite constant A such that IT(f)I < Asup5E(o 1l If(v)I for all f E C(0, 1]; and (iii) T(f) > 0 whenever f 2 0. The smallest such A is the norm IITII of T.
(1) Given any finite measure p on .9((0, 1]), check that T(f) = f f dp defines a positive, bounded linear functional. Compute IITII. (2) Conversely, prove that for any positive, bounded, linear functional T on (0, 1] there exists a finite measure p such that for all f E C(0, 11, T(f) = f f dp. (HINT: For any closed set C define p(C) := inf{T(f) : f > 1c}. For a general set G define p(G) := sup{p(C) : C C G, C closed}.) This is due to F. Riesz. (HINT: Examine Caratheodory's theorem on page 27. Also, see the proof of Lemma 3.15, p. 26.) 4.6. Consider two o-finite measures p and v, both defined on (R k' R(Rk )). Prove that if f f dp = f f dv for all continuous functions f : Rk -+ R, then p = v.
4.7. Choose and f i x pl, ... , p > 0 such that p = 1. Prove that if x i, ... , x are positive, = Xx"" < then Prove also that the inequality is strict unless xl =
4.8. Define d = E 1 a,/n to be the arithmetic mean, 9 a,)1"" the geometric mean, and J" = n/ E,_t a-' the harmonic mean of {a,}!'1, where a, > 0. Prove that JF < i < r/. 4.9. C - A function f : I -. R is said to be convex on I if f (-\x + (1 - a)y) < \f(x) + (1 - \)f(y) for all x, y E I and \ E [0, 1). Prove that if f" > 0 on I, then f is convex on I. Use this to prove that:
(1) The Euler gamma function 1'(t) = fo xt-1e- dx is convex on (0, oo). (2) The function x-1 exp(-x2/2) is convex on R+.
4.10 (Problem 4.9, Continued). Suppose f : R - R is convex. Prove that f has right and left derivatives everywhere; i.e., prove that 8+f(x) = limelo(f(x+e) - f(x))/e and 8_ f(x) = lim,lo(f (x) - f (x - e))/e exist. Prove, in addition, that 8+ f and 0-f are both non-decreasing. 4.11. Prove that if f is convex, then it is continuous. Conversely, prove that if f is continuous and 2f((a+b)/2) < f(a) + f(b) for all a,b E R, then f is convex.
4.12. Prove that if sup IX I < Y, where Y E L' (P), then, lim sup- EX < E[lim
X].
Construct an example to show that the domination condition on the X,,'s cannot be altogether dropped. 4.13. Let m(a) := min(Ial, 1), and given any two random variables X and Y, define dp(X,Y) _
E4(X - Y). Prove that dp is a metric, and X. converges to X in probability if and only if dp(X,,, X) -. 0. That is, dp metrizes convergence in probability.
4.14. Prove that if X E LP(P) for some p > 0, then lime_- tPP{IXI > t} = 0, and the latter condition implies that X E L'(P) for all r E (O,p). (HINT: Apply Fubini-Tonelli to jo t'-iP{IXI > t}dt.) 4.15 (Slutsky's Theorem). Suppose
and (Y )°n°_-1 are two sequences of random variables
such that X -. X and Y. -. Y. Prove that if f is a continuous function of two variables, then
f(Xn,Y.) . f(X,Y). 4.16. Prove that X converges to X in probability if for any subsequence {nk}k 1 there exists a further sub-subsequence {nk(,)};° 1 such Xnkl,> -' X as. 4.17. Consider the measure space ([0, 1], R([0, 11),1&) where p is a finite measure. Prove that if f : [0, 1] -. R is continuous and is a sequence of numbers in [0, 11, then (4.44)
n*\') µ
nlimo o f j=1
n
(7
n
= I fdµ - f(0)/1({0}). J
Use this to prove that if p denotes Lebesgue measure, then the Riemann integral of f agrees with its Lebesgue integral. Can you extend this to o-finite measure spaces (Rd,.9(Rd),p) and integrable continuous functions f : Rd -. R? 4.18. Let p denote the Lebesgue measure on ([0, 1J', M([0,1Jd)). Prove that continuous functions are dense in LP(p) for every p > 1. That is, prove that given e > 0 and / E LP(p) we can find a continuous function g : [0, 1[d -. R such that I[f - gllp < e.
4.19. If f : Rk -, R satisfies f I f (x)IP dx < oo for some p > 1, then prove that f is continuous in LP(R"); i.e., limr_.oIRk If(x+e) - f(x)IPdx = 0. (HINT: Problem 4.18.) 4.20. Prove that the following exists, and compute its value: (4.45)
fI
al~mo I
f
I 1
x2"" 2n 1
dx.
4.21. Construct a o-finite measure space (f1,.9,µ) such that L2(µ) Q L1(µ).
4.22 (Mixtures). Suppose (f2 , .5) and (e,) are two measure spaces. Assume that v is a probability measure on I and Pe a probability measure on f for each 0 E e. Then prove that I, (A) = fe Pe(A) v(dO) (A E Jr) defines a probability measure on F, provided that 0... PO (A) is 4-measurable for each A E .9. The probability measure u is said to be a mixture of the PO's; the mixing measure is the probability measure v. 4.23 (Generalized Holder Inequality). Let {X,}!'1 be non-negative random variables. Prove that
for all p I , ... , pn > 1 that satisfy E'-.,p' = 1, E(X1 ... Xn) < rj
1 ]lX, IIp;
(HINT: Problem
4.7.)
4.24 (Chernoff's Inequality). Prove that for any random variable X and all t > 0, P{X > t} < inft>o exp{-tl; + In Eef" }, and P{X < t} < inft>o exp{t. + In Ee-EX }. 4.25 (Hadamard's Inequality). Suppose f is a convex and integrable function on (a,b). Then prove that (b - a) f ((a + b)/2) < fn f(x) dx. 4.26 (Young's Inequality). Suppose f is a continuous, strictly increasing function on [0, a], and f (O) = 0. Prove that ab < fo f (x) dx + f, f- I (x) dx for all b > f (a), with equality if b = f (a). Here, f-1 is the inverse function to f. Use this to find another proof of (4.9). (HINT: Plot f, and consider the areas under and over f, respectively.) 4.27 (An Uncertainty Principle). Prove that all continuously differentiable functions f : R -. R that have compact support satisfy the inequality 00
(4.46)
I If(x)12dx t} = 0.
oo n--
Prove that (Xn)n l is UI if limt-°° supn>, E{IXnI; lXnl > t} = 0. Also prove: (1) If IX,I < IYnl (n > 1) and {Yn}n__, is UI, then so is {Xn}n--1. (2) If {Xn}°n°__, and {Yn}n°_, are UI, then so is {Xn +Yn},°i=1. (3) {Xn}n 1 is UI as long as supn ]IXn]]p < co for some p > 1.
(4) Xn -» X in LI(P) if and only if. (a) Xn Z X; and (b) {Xn}n 1 is UI. 4.29 (Problem 4.28, Continued). Let p > 1, and consider X,XI,X2,... E LP(P) such that Xn -. X, in probability. Prove that either one of the following is equivalent to the uniform integrability of {lX,IP},° 1: (i) Xn -. X in LP(P); or (ii) E{IXnIp} -. E{IXIP} as n -. 00. 4.30 (Hoeffding's Inequality). Suppose EX = 0 and P{IXI < c} = 1 for some non-random constant c > 0. Prove that for all 4 E R, EeEX < exp(F2c2/2) and Ee(1X1 < 2exp(a:2c2/2).
(HINT: CC' < e1(c+x)/(2c)+e-t"(c-x)/(2c) for all x E [-c, c).) 4.31 (Hard). For all a > 0 compute (4.48)
nlim° n
f" expx 0
/ xs- I dx.
4.32 (The Good-Lambda Inequality; Hard). Let X and Y be two non-negative random variables,
and p > 1 a fixed constant. Suppose there exist 0 > 1, ry E (0,1), and b < (3-P such that P{X > OX, Y < 7a} < dP{X > J,} for all a > 0. Then prove that E[XP] < a-y PE[YP], where
a = (Q-P - b)-1. 4.33 (Harder). Let X and Y be two non-negative random variables such that X, Y, log Y E L3 (P). Suppose for all measurable sets A, E[X; A] > IE[Y; A))2. Then prove that E[log X] > -oo. (HINT:
Set A_, = {X > Y}, and for all n > 0 define An = {e-n-1Y < X < e-"Y} and Bn = An fl {Y < e-n/4}. Prove that En nP(An \ Bn) < 00. Use this to prove that E. nP(A,) < oo. Alternatively, see Dudley (1967).)
Notes
(1) The modern notions of abstract random variables (as measurable functions) and expectations (as integrals) seem to be due to Fréchet (1930). In concrete settings, these notions have been around for quite a long time. See, for example, the classic by Borel (1909).
(2) Kolmogorov (1933) created the modern, axiomatic theory of probability in his landmark book. Among other things, Kolmogorov's work is said to have solved a main part of Hilbert's sixth problem (Gnedenko, 1969).
(3) Much of the material of Section 5 is due to Lebesgue (1910). Notable exceptions to this remark are Fatou's lemma (1906) and the monotone convergence theorem of Levi (1906).
(4) Problem 4.15 is due, in its essence, to Slutsky (1925). (5) Problem 4.24 is due to Chernoff (1952). (6) Problem 4.27 is a disguised form of the Heisenberg uncertainty principle. In this form, it is due to H. Weyl. Another form will be discussed in Problem 7.36 on page 115. (7) Problem 4.30 is due to Hoeffding (1963, Lemma 1). (8) The good-A inequality (Problem 4.32) is a fundamental tool in probability and harmonic analysis. It was invented by Burkholder and Gundy (1970) and explored further by Coifman (1972) and Burkholder, Davis, and Gundy (1972). See also the expository account by Jones (1998).
Chapter 5
Product Spaces
Nature is an infinite sphere, whose center is everywhere and whose circumference is nowhere. -Blaise Pascal
If Al and A2 are sets, then their product Al x A2 is defined to be the collection of all ordered pairs (al , a2) where al E Al and a2 E A2. In a like manner, we define Al x A2 x A3, etc. We can even define infinite-product spaces of the type Al x A2 x ... . We have two main reasons for studying the measure theory of product spaces. The first one is that an understanding of product spaces allows for the construction and analysis of several random variables simultaneously; a theme that is essential to nearly all of probability theory. Our second reason for learning more about product spaces is less obvious at this point: We will need the so-called Fubini-Tonelli theorem that allows us to interchange the order of various multiple integrals. This is a central fact, and it leads to a number of essential computations.
1. Finite Products
Suppose (Ω_1, 𝔉_1, µ_1) and (Ω_2, 𝔉_2, µ_2) are two finite measure spaces. There is a natural σ-algebra 𝔉_1 × 𝔉_2 and a measure µ_1 × µ_2 that correspond to the product set Ω_1 × Ω_2. First consider the collection
(5.1)  𝒜_0 := {A_1 × A_2 : A_1 ∈ 𝔉_1, A_2 ∈ 𝔉_2}.
This is closed under finite (in fact arbitrary) intersections, but not under finite unions. For example, let A_1 = A_2 = [0, 1] and B_1 = B_2 = [1, 2] to see that (A_1 × A_2) ∪ (B_1 × B_2) is not of the form C_1 × C_2 for any C_1 and C_2. So 𝒜_0 is not an algebra. We correct this by adding to 𝒜_0 all finite disjoint unions of elements of 𝒜_0, and call the resulting collection 𝒜.
Lemma 5.1. The collection 𝒜 is an algebra, and σ(𝒜) = σ(𝒜_0).
Definition 5.2. We write 𝔉_1 × 𝔉_2 in place of σ(𝒜_0).
Define µ on 𝒜_0 as follows:
(5.2)  µ(A_1 × A_2) := µ_1(A_1) µ_2(A_2)   ∀A_1 ∈ 𝔉_1, A_2 ∈ 𝔉_2.
If A^1, ..., A^n ∈ 𝒜_0 are disjoint, then we define µ(∪_{i=1}^n A^i) := Σ_{i=1}^n µ(A^i). This constructs µ on the algebra 𝒜 in a well-defined manner. Indeed, suppose ∪_{i=1}^n A^i = ∪_{j=1}^m B^j where the A^i's are disjoint and the B^j's are also disjoint. Then, ∪_{i=1}^n A^i = ∪_{i=1}^n ∪_{j=1}^m (A^i ∩ B^j) is a disjoint union of nm sets. Therefore,
(5.3)  µ(∪_{i=1}^n A^i) = Σ_{i=1}^n Σ_{j=1}^m µ(A^i ∩ B^j) = µ(∪_{j=1}^m B^j),
by symmetry.
Theorem 5.3. There exists a unique measure µ_1 × µ_2 on (Ω_1 × Ω_2, 𝔉_1 × 𝔉_2) such that µ_1 × µ_2 = µ on 𝒜.
Definition 5.4. The measure µ_1 × µ_2 is called the product measure of µ_1 and µ_2; the space Ω_1 × Ω_2 is the corresponding product space, and 𝔉_1 × 𝔉_2 is the product σ-algebra. The measure space (Ω_1 × Ω_2, 𝔉_1 × 𝔉_2, µ_1 × µ_2) is the product measure space.
Remark 5.5. By induction, we can construct a product measure space (Ω, 𝔉, µ) based on any finite number of measure spaces (Ω_i, 𝔉_i, µ_i), i = 1, ..., n: Define Ω := Ω_1 × ⋯ × Ω_n, 𝔉 := 𝔉_1 × ⋯ × 𝔉_n, and µ := µ_1 × ⋯ × µ_n.
Proof of Theorem 5.3. By Caratheodory's extension theorem (p. 27), it suffices to prove that (µ_1 × µ_2) is countably additive on the algebra 𝒜. We accomplish this in three successive steps.
Step 1. Sections of Measurable Sets are Measurable. For all E ⊆ Ω_1 × Ω_2 and ω_2 ∈ Ω_2 define
(5.4)  E_{ω_2} := {ω_1 ∈ Ω_1 : (ω_1, ω_2) ∈ E}.
This is the section of E along ω_2. In the first step of the proof we demonstrate that if E is measurable, then for every ω_2 ∈ Ω_2, E_{ω_2} is measurable too: Fix ω_2 ∈ Ω_2 and consider the collection
(5.5)  ℳ := {E ∈ 𝔉_1 × 𝔉_2 : E_{ω_2} ∈ 𝔉_1}.
Because ℳ is a monotone class that contains 𝒜, the monotone class theorem (p. 30) implies that ℳ = 𝔉_1 × 𝔉_2. This concludes Step 1.
Step 2. Disintegration. Because E_{ω_2} is measurable, µ_1(E_{ω_2}) is well defined. We now show that Ω_2 ∋ ω_2 ↦ µ_1(E_{ω_2}) is measurable. First suppose E ∈ 𝒜_0, so that E = A_1 × A_2 where A_i ∈ 𝔉_i. Then E_{ω_2} = A_1 if ω_2 ∈ A_2, and E_{ω_2} = ∅ if ω_2 ∉ A_2. Consequently,
(5.6)  µ_1(E_{ω_2}) = µ_1(A_1) 1_{A_2}(ω_2).
It follows that µ_1(E_{ω_2}) is a measurable function of ω_2 ∈ Ω_2. Furthermore, (µ_1 × µ_2)(E) = µ_1(A_1)µ_2(A_2), and hence
(5.7)  (µ_1 × µ_2)(E) = ∫ µ_1(E_{ω_2}) µ_2(dω_2).
Equation (5.7) is called a disintegration formula.
Step 3. Countable Additivity. By finite additivity, (5.7) extends the definition of µ_1 × µ_2 to finite disjoint unions of elements of 𝒜_0; i.e., (5.7) holds for all E ∈ 𝒜. Furthermore, the dominated convergence theorem shows that µ_1 × µ_2 is countably additive on the algebra 𝒜. [It suffices to prove that if E_N ∈ 𝒜 satisfy E_N ↓ ∅, then (µ_1 × µ_2)(E_N) ↓ 0. But this follows from (5.7) and the monotone convergence theorem.] Therefore, owing to the Caratheodory extension theorem (p. 27), µ_1 × µ_2 can be extended uniquely to a measure on all of 𝔉_1 × 𝔉_2. This proves the theorem. In addition, it shows that (5.7) holds for all E ∈ 𝔉_1 × 𝔉_2. (The fact that ω_2 ↦ µ_1(E_{ω_2}) is measurable is proved implicitly here; why?) ∎
The following shows that the two possible ways of constructing Lebesgue's measure coincide.
Corollary 5.6. If m^d denotes the Lebesgue measure on ((0, 1]^d, ℬ((0, 1]^d)), then m^d = m^1 × ⋯ × m^1 (d times).
Proof. If E = (a_1, b_1] × ⋯ × (a_d, b_d] is a d-dimensional hypercube, then
(5.8)  m^d(E) = ∏_{j=1}^d (b_j − a_j) = (m^1 × ⋯ × m^1)(E).
By finite additivity, m^d and (m^1 × ⋯ × m^1) agree on the smallest algebra that contains hypercubes. By Caratheodory's extension theorem, m^d and (m^1 × ⋯ × m^1) agree on the σ-algebra generated by hypercubes. ∎
The following is an important consequence of this development.
The Fubini-Tonelli Theorem. If f : Ω_1 × Ω_2 → R is product measurable, then for each ω_1 ∈ Ω_1, ω_2 ↦ f(ω_1, ω_2) is 𝔉_2-measurable, and by symmetry, for each ω_2 ∈ Ω_2, ω_1 ↦ f(ω_1, ω_2) is 𝔉_1-measurable. If in addition f ∈ L^1(µ_1 × µ_2), then the following are a.e.-finite measurable functions:
(5.9)  ω_1 ↦ ∫ f(ω_1, ω_2) µ_2(dω_2),   ω_2 ↦ ∫ f(ω_1, ω_2) µ_1(dω_1).
Finally, the following change-of-variables formula is valid:
(5.10)  ∫ f d(µ_1 × µ_2) = ∫ (∫ f(ω_1, ω_2) µ_1(dω_1)) µ_2(dω_2) = ∫ (∫ f(ω_1, ω_2) µ_2(dω_2)) µ_1(dω_1).
Proof. (Sketch) If f = 1_E for some E ∈ 𝔉_1 × 𝔉_2, then (5.7) contains (5.9) and (5.10). By linearity, these equations continue to hold for all simple functions f. Finally, we take limits to prove the result for every function f ∈ L^1(µ_1 × µ_2). ∎
The following is an important corollary of the proof of Fubini-Tonelli's theorem. (Proof: Approximate f from below by simple functions; then appeal to the monotone convergence theorem.)
Corollary 5.7. If f : Ω_1 × Ω_2 → R is measurable and non-negative, then (5.10) holds in the sense that all three double integrals converge and diverge together, and are equal in the convergent case.
Remark 5.8. In fact, the proof shows that f ∈ L^1(µ_1 × µ_2) as long as one of the three integrals in (5.10) is finite when f is replaced by |f|.
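For non-negative integrands Corollary 5.7 says the two iterated integrals always agree. A tiny Python sketch (grid-based Riemann sums; the integrand x·y² is an arbitrary illustrative choice) makes this concrete on (0, 1]².

```python
import numpy as np

# Tonelli in numbers: for f(x, y) = x * y**2 >= 0 on (0, 1]^2, the two
# iterated (Riemann-sum) integrals agree and match the exact value 1/6.
n = 2000
x = (np.arange(n) + 0.5) / n
y = (np.arange(n) + 0.5) / n
f = np.outer(x, y ** 2)                  # f[i, j] = x_i * y_j**2

print(f.mean(axis=0).mean())             # integrate over x first, then y
print(f.mean(axis=1).mean())             # integrate over y first, then x
```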
The Fubini-Tonelli theorem is deceptively delicate: We cannot always interchange the order of double integrals. Our next examples highlight this fact.
Example 5.9. For all n ≥ 0 and x ∈ (0, 1] define
(5.11)  ψ_n(x) := 2^{n+1} if x ∈ (2^{−n−1}, 2^{−n}], and ψ_n(x) := 0 otherwise.
Note, in particular, that ∫_0^1 ψ_n(x) dx = 1. Define the measurable function f : (0, 1]² → R as follows:
(5.12)  f(x, y) := Σ_{n=0}^∞ [ψ_n(x) − ψ_{n+1}(x)] ψ_n(y)   ∀x, y ∈ (0, 1].
All but possibly one of these terms are zero, so the function f is well defined. Nonetheless, we argue next that the Fubini-Tonelli theorem is not applicable to the function f.
If y ∈ (2^{−n−1}, 2^{−n}] then f(x, y) = 2^{n+1}[ψ_n(x) − ψ_{n+1}(x)]. It follows that ∫_0^1 f(x, y) dx = 0, whence we have ∫_0^1 ∫_0^1 f(x, y) dx dy = 0. On the other hand,
(5.13)  ∫_{2^{−n−1}}^{2^{−n}} f(x, y) dy = ψ_n(x) − ψ_{n+1}(x).
Sum this from n = 0 to n = m − 1 to find that
(5.14)  ∫_{2^{−m}}^1 f(x, y) dy = ψ_0(x) − ψ_m(x).
But if x > 0, then lim_{n→∞} ψ_n(x) = 0 for all x ∈ (0, 1]. Thus,
(5.15)  ∫_0^1 f(x, y) dy = ψ_0(x).
We integrate this over all values of x ∈ (0, 1] to find that
(5.16)  ∫_0^1 ∫_0^1 f(x, y) dx dy = 0 ≠ 1 = ∫_0^1 ∫_0^1 f(x, y) dy dx.
Thus, Fubini-Tonelli's theorem does not apply, and the reason is that f is not absolutely integrable. (Prove it!) The preceding example is slightly complicated because we had to work with finite measures. But, in fact, the Fubini-Tonelli theorem is valid for sigma-finite measures as well (Problem 5.5). If we admit this, then we can greatly simplify the preceding example.
Example 5.10. Define f : R_+² → {−1, 0, 1} as follows:
(5.17)  f(x, y) := Σ_{n=0}^∞ [1_{(n,n+1)×(n,n+1)}(x, y) − 1_{(n,n+1)×(n+1,n+2)}(x, y)].
One can check directly that ∫_0^∞ ∫_0^∞ f dx dy = 1 whereas ∫_0^∞ ∫_0^∞ f dy dx = 0. As was the case in the preceding example, the Fubini-Tonelli theorem fails to apply here because ∫_0^∞ ∫_0^∞ |f| dx dy = ∞.
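The same phenomenon already occurs for double series (compare with Problem 5.4). In the Python sketch below (the array a_{i,j} simply mirrors Example 5.10), every row sums to 0 while the columns sum to 1, 0, 0, ..., so the two iterated sums differ; of course Σ|a_{i,j}| = ∞.

```python
# a_{i,j} = 1 if j = i, -1 if j = i + 1, and 0 otherwise (i, j = 0, 1, 2, ...).
N = 500                                           # rows/columns displayed
a = lambda i, j: 1 if j == i else (-1 if j == i + 1 else 0)

rows_first = sum(sum(a(i, j) for j in range(i + 2)) for i in range(N))   # full rows
cols_first = sum(sum(a(i, j) for i in range(j + 1)) for j in range(N))   # full columns
print(rows_first, cols_first)                     # prints 0 and 1
```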
Our next example, due to W. Sierpiński, illustrates that Fubini-Tonelli's theorem need not hold when f is not product-measurable. Throughout, we will rely on the axiom of choice as well as the continuum hypothesis, and let (Ω, 𝔉, P) designate the Steinhaus probability space.
Example 5.11. Define c to be the first uncountable ordinal; by the axiom of choice c exists. Next, define S to be the collection of all ordinal numbers strictly less than c; S is called Hartog's c-section of ordinal numbers. By the continuum hypothesis, S has the power of the continuum. That is, we can find a one-to-one map θ : [0, 1] → S.
Now consider the set
(5.18)  E := {(x, y) ∈ [0, 1]² : θ(x) < θ(y)}.
For all x E [0, 1] consider the x-section xE of E, (5.19)
E:= {y E [0,1] : (x, y) E E} = {y E [0,1] : 0(x) < O(y)1 .
Both E and xE are non-empty, because 0 is one-to-one. Moreover, ,E° is denumerable because O(xE`) is. [This follows from the definition of S.) Consequently, E is Borel measurable and P(xE) = 1 for all x E [0, 1]. W e can also define the y-section E y := {x E [0 ,1] : (x, y) E E} for any y E [0, 1]. Since Ey is denumerable, Ey E 9 and P(Ey) = 0. Hence, (5.20)
J0
1
P(xE) P(dx) = 10 0 =
1
P(Ey) P(dy).
So there is no disintegration formula (5.7). This, in turn, implies that (5.10)
does not hold for the bounded function f (x , y) = IE(x, y). Since P is a probability measure on the Borel subsets of [0, 1), all bounded measurable functions are P-integrable. Thus we see that the source of the difficulty is that f is not product-measurable although x '--* f f (x, y) P(dy) and y H f f (x , y) P(dx) are measurable (in fact constants).
2. Infinite Products
So far, our only nontrivial example of a measure is Lebesgue measure on (R^n, ℬ(R^n)). We have also seen that we can create other interesting product measures once we know some nice measures. We now wish to add to our repertoire of nontrivial measures by defining measures on infinite-product spaces that we take to be (0, 1]^∞ or R^∞, where for any Ω the set Ω^∞ is defined as the collection of all infinite sequences of the form (ω_1, ω_2, ...) where ω_i ∈ Ω.
In order to construct measures on (0, 1]^∞, or more generally R^∞, we first need a topology in order to have a Borel σ-algebra ℬ((0, 1]^∞).
Definition 5.12. Given a topological set Ω, a set A ⊆ Ω^∞ is called a cylinder set if either A = ∅, or it has the form A = A_1 × A_2 × ⋯ where A_i = Ω for all but a finite number of i's. A cylinder set A = A_1 × A_2 × ⋯ is open if every A_i is open in Ω. The product topology on Ω^∞ is the smallest topology that contains all open cylinder sets. This, in turn, gives us the Borel σ-algebra ℬ(Ω^∞).
Suppose we wanted to construct the Lebesgue measure on (0, 1]^∞. Note that any cylinder set has a perfectly well-defined Lebesgue measure. For example, let I_ℓ = (0, ℓ^{-1}] for ℓ = 1, 2, 3, and I_ℓ = (0, 1] for ℓ ≥ 4. Then, I = I_1 × I_2 × I_3 × I_4 × ⋯ is a cylinder set, and it would make perfectly good sense that the "Lebesgue measure" of I should be 1 × (1/2) × (1/3) = 1/6. It stands to reason that if m denotes one-dimensional Lebesgue measure then one should be able to define the Lebesgue measure m^∞ = m × m × ⋯ on ((0, 1]^∞, ℬ((0, 1]^∞)) as the (or perhaps a) "projective limit" of the n-dimensional Lebesgue measure m^n = m × ⋯ × m on ((0, 1]^n, ℬ((0, 1]^n)). This argument can be made rigorous not only for m^∞, but for a large class of other measures as well. But first we need some notation for projections.
For all 1 1 and all A1i ... , A. E -41. We will also say that {Pn}n°_1 is consistent.
Remark 5.15. There is another way to think of a consistent family {P^n}_{n=1}^∞: If 1 ≤ n := dim A < ∞ then
(5.25)  P^m(π_m(A)) = P^n(π_n(A))   ∀m ≥ n.
The notation is admittedly heavy-handed, but once you understand it you are ready for the beautiful theorem of A. N. Kolmogorov, the proof of which is spelled out later in §3.
The Kolmogorov Extension Theorem. Suppose {P^n}_{n=1}^∞ is a consistent family of probability measures on the spaces ((0, 1]^n, ℬ((0, 1]^n)). Then, there exists a unique probability measure P^∞ on ((0, 1]^∞, ℬ((0, 1]^∞)) such that P^∞(B) = P^n(π_n(B)) for all finite n and all n-dimensional sets B ∈ ℬ((0, 1]^∞).
Remark 5.16. One can use Kolmogorov's extension theorem to construct the Lebesgue measure on ((0, 1]^∞, ℬ((0, 1]^∞)).
Remark 5.17. One can just as easily prove Kolmogorov's extension theorem on the measurable space (R^∞, ℬ(R^∞)), where R^∞ is endowed with the product topology.
3. Complement: Proof of Kolmogorov's Extension Theorem First we establish the asserted uniqueness of POO. Indeed suppose there were It follows immediately two such measures P°O and Q°°, both on (I, x that POO(A) = Q°O(A) for all n-dimensional cylinder sets A = Al x An x (0, 11 x (0, 11 x . Therefore, P°O and Q00 agree on the algebra W generated by all cylinder sets; this is the smallest algebra that contains all cylinder sets. The monotone class theorem (p. 30) implies that P1 = Q°° on all of -10°O.
Here is the strategy of the remainder of the proof, in a nutshell: Let d denote the collection of all finite unions of cylinder sets of the form (5.26)
(al , bl] x (a2 , b2] x ... x (ak, bk] x (0, 11 x (0, 11 x
,
where 0 < ai < b; < 1 for all i, and k > 1. We also add 0 to sad, so that sat becomes an algebra that generates W°°. Our goal is to construct a countably additive measure on W and then appeal to Caratheodory's theorem (p. 27) to finish.
Our definition of P°O is both simple and intuitively appealing: First, define P0O(0) = 0 and P1 (I') = 1. This takes care of the trivial elements
of d. If A E 0 is such that 1 < n := dimA < oo, then we let P0°(A) _ [Check that this is well defined.] Step 1. Finite Additivity. Let us first prove that PO° is finitely additive
on sl. We want to show that if A, B E d are disjoint, then P' (A U B) _ P- (A) + P00(B). If A = 0 or I00 then B = A0, and finite additivity holds trivially from the fact that P1 (0) = 1 - P1 (11) = 0. If neither A nor B is I°O, then n = dim A and m = dim B are nontrivial natural numbers. We may assume without loss of generality that n > m. It
follows that
P- (A U B) = P' (7rn(A) U irn(B)) = P"(in(A)) + P' (7n(B))
(5.27)
This follows, since 7rn(A) n irn(B) = 0 and P' is a measure. On the other hand, P'(irn(A)) = P°O(A), and P"(7rn(B)) = Pm(7r,n(B)) = P°O(B) since {Pk}' is a consistent family (Remark 5.15). This verifies finite additivity. Step 2. Countable Additivity. Suppose P°° is countably additive on .sV'. Then, the Caratheodory's extension theorem implies that P°O can be extended uniquely to a countably additive measure on u(d) = .4°°. This extension, still written as PO°, is the probability measure on (I°°, .°°) that is stated in the present theorem. Thus, it suffices to establish the countable 1
additivity of P°O on d. In order to do this, we appeal to an argument that is similar to the proof that the Lebesgue measure on (0, 1] is countably additive on finite unions of intervals of the form (a , b]. Make certain that you understand the proof of Lemma 3.15 (p. 26) before proceeding with the present proof. Let A', A2, ... denote disjoint sets in .& such that UJt 1Al is also in a; Lj01 P°°(Aj). We write we need to verify that
and note that U1Aj and Ut N+IAj are disjoint
U
elements of W. By Step 1, 0o
(5.28)
P°O
U Aj j=N+1
N
o0
U Aj
= E P°O(Aj) + P°°
.
j=N+1
j=1
Thus, it suffices to show that if BN I 0-all in d-then P°O(BN) 1 0. We assume the contrary and derive a contradiction. That is, we suppose that
there exists e > 0 such that PI(BI) > e for all n > 1. These remarks make it clear that dim B' is strictly positive (i.e., Bn 34 0) and finite (i.e., Bn # I°°) for all n large. Henceforth, let y(n) := dim Bn, and note that the condition Bn 10 forces y(n) to be non-decreasing. Thus,
B' = Bi x ... x B,y(n) x (0, 1] x (0, 1] x
(5.29)
where
,
k(n,m) nm ] (m. < y(n)). Now define Cn to be an n := Uj=1 (a,'nm b,"
B,n
approximation from inside to B via closed intervals, viz., (5.30)
Cn=C1
where Cn = Uk=n1'm) [a: 'm, b, 'm] (m < 'Y(n)), and the a" 'm E (ar'm, 'm) bi are so close to the a's that (5.31)
PO°(Bj\Cj)
1.
5. Product Spaces
62
Proof. This can always be done because Pw(BJ \ C) is k(l,t)
p7(i) (i)
U
o?,tl x ... x
k(7.'Y(l))
U
l
a measure on (11 W, Sir 0)). 13
Therefore, thanks to (5.31), P°O(D') > (e/2), where Dn = fly 1C1 is a sequence of decreasing sets with D' C 10, jp(n) x (0, 1] x (0, 1] x
. Now
we argue that fln°_1Dn # 0; since Dn C Bn, this contradicts B" l 0 and our task is done. We know that Dn # 0 for any finite n since Poo(Dn) > (E/2). Moreover, we can write Dn = Di x D2 x where: (a) Dn = (0,1] for all j > y(n); and (b) Dn is closed in [0, 1] for all j < ry(n). Therefore, we can choose xn E Dn of the following form: 2,...,xnt(n)12
xn :_ (x 1,x
(5.32)
2....)
)
do >
Because ^t(n) is non-decreasing and DI D D2 D Di ) i
is a decreasing
sequence of closed subsets of [0, 11, z1 := lime-, xi is an element of fl° n°__1 Di .
Similarly, zj = lime.,, x E fl°=1 D for all j > 1. Thus, we have found a point z = (z i z2, z3, ...) in fln°.1Dn. This proves that fln_1D" # 0, whence the theorem.
Problems 5.1. Let fl be an uncountable set and SF the collection of all subsets A C 0 such that either A or A` is denumerable. (1) Prove that S9' is a o-algebra. (2) Define the set function P : 9 -. {0, 1) by: P(A) = 1 if A is uncountable, and P(A) = 0 if A is denumerable. Prove that P is a probability measure on (0, 5). (3) Use only the axiom of choice to construct a set 11, and an E C fl x 0 such that for
all x, y E 0, E and Ey are denumerable, where xE Ey := {x E fl : (z, y) E E}.
{y E 0 : (x, y) E E} and
5.2. Consider a finite measure tt on (R", .9(R")) such that p({x}) = 0 for all x E R". Define the diagonal D of R21 = R" x R" to be ((x, x) : x E R"). Then prove that (µ x µ)(D) = 0. 5.3 (Problem 5.2, Continued). Let (X, Y) be a random variable that takes values in R2. We say that X and Y are independent if (5.33)
Elf(X)9(Y)] = Ef(X) Eg(Y),
for all bounded measurable functions f,g : R -» R. Prove that if P{X = a} = 0 for all a E R, then P{X = Y} = 0. Why does this generalize Problem 5.2? (HINT: p(A) := P{X E A} and u(B) := P{Y E B} are probability measures.) 5.4. Prove that if {a,,i }° j=1 is a sequence indexed by N2, then 00
00
00
(5.34)
,=li=t
a',i =
00
ai,i+
provided that E,,, Ia,,ij < oo. Construct an example of {a,,i}w.j=t for which (5.34) fails to hold.
Problems
63
6.5. Prove that the Fubini-Tonelli theorem (p. 55) remains valid when p is a-finite.
5.6. If
are probability measures on ([0,1] , 53([0, 1])), then carefully make sense of the probability measure 11°_jµ,. Use this to construct the Lebesgue measure m on [0,1]O° endowed with its product a-algebra. Finally, if 1 > al > a2 > ... > a,, 10, then prove that
m fi[a,, 1] I >0 if E a, < oo.
(5.35)
/
,=1
We will do much more on this. Consult the Borel-Cantelli lemma on page 73.
5.7. Define (5.36)
f(x,y):=(x2-y2)(x2+y2)-2 for all x,y E (0,1], and verify that
f' 0
0
f'f(x,y)dxdy# f' f'f(x,y)dydx. 0
0
Why does the Fubini-Tonelli theorem not apply?
5.8. Define f(x,y) := xy(x2 + y2)-2 for all x, y E [-1, 11, and prove that f 0 L'([-I, I]') and yet t
t
t
1
t
f(x,y)dxdy= /- j f(x,y)dydx.
(5.37)
./
1
!
t
5.9. Define, for all x, y E R2,
f(x.u) _
(5.38)
r 1 if x2 + y2 = 1, 10 otherwise.
Respectively define µ and v to be the Lebesgue measure and the counting measure on R. Prove
that f't f 1 t f dµ d. 0 f't f't f dvdµ. Why does the Fubini-Tonelli theorem not apply? 5.10 (Monotone Rearrangements). Let f : R -» R+ be integrable, and define (5.39)
A,:={xER: f(x)>z}
Vz>0.
Prove that A, is measurable for all z > 0. Let fl(z) denote the Lebesgue measure of A,, and prove that: (i) fl is non-increasing and measurable; and (ii) f0 fl (z) dz = f f_ f (x) dx. 5.11. Compute explicitly the numerical value of oo
oa
/
x2
\
f!0 f exp l\ -- y- y l /dydx.
(5.40)
0
5.12. For all functions f : [a, b] -» R+ define the set (5.41)
A(f):={(x,y)ER2: 0 0, (5.44)
fe
sin x x
dx--it2 1. (HINT: an/n = fo xn-1 dx.) Hn :_
n° IanHn for all a E (0, 1), where
5.18. Use the Fubini-Tonelli theorem to compute
f 11-e as Ie-'3Tdx X
(5.46)
Jo
\
da,/3>0.
5.19 (Hard). Consider a set-valued function X on some given probability space (fl, 9, P). Specifically, X : fl -. 9(Rd), where 9(Rd) denotes the power set of Rd. We say that X is a random set
if (w,x) -. lx(,) (z) is product-measurable on the measure space (fl x Rd,.. x 57(R' )). Prove: (1) If A E .9(Rd), then A and X n A are both random sets. (2) If X,XI,X2,... are random sets, then so are X`, n,° IXn, and Un IXn. (3) If A E .W(Rd) satisfies A(A) < oo where A is a a-finite measure on (Rd,_4(Rd)), then A(X n A) is a finite random variable. (4) For all A E 9(Rd) such that A(A) < oo, and for all integers k > 1,
Ila(XnA)Ilk=fA ... f AP{x1EX,.,xkEX}A(dx,)...\(dxk). (5) P{x E X} = 0 for A-almost every x E Rd if and only if A(X) = 0, P-a.s. (6) There is a non-empty random set X such that P{x E X} = 0 for all x E Rd.
Notes (1) Example 5.11 is due to Sierpifiski (1920).
Mattner (1999) has constructed a Borel set A C R and two a-finite measures µl and N2 on 9(R) such that if we ignored measurability issues, then we would have the
'
following:
fx
\
(
\I-. 1A(x+Y)l+I(dx)/ 1 112(dv) 1 f
/ f IA(x+v)P2(dv)\1 --(d.).
1
o° \ °o Mattner's construction is interesting for at least two reasons: (i) It does not rely on the axiom of choice nor on the continuum hypothesis; and (ii) it shows that the "convolution" y .-. f f (x - y) pi (dx) need not be measurable with respect to the oo
smallest a-algebra with respect to which all functions { f (.--y); y E R} are measurable. (2) Problem 5.9 is motivated by two papers of Mukherjea (1972, 1974). (3) The FKG inequality (Problem 5.13) is due to Fortuin, Kasteleyn, and Ginibre (1971).
Chapter 6
Independence
Nothing is too wonderful to be true.
Attributed to Michael Faraday
Our review/development of measure theory is finally complete, and we begin studying probability theory in earnest. In this chapter we introduce the all-important notion of independence, and use it to prove a precise formulation of the so-called law of large numbers. In rough terms, the latter states that the sample average of a large random sample is close to the population average. Throughout, (Sl , Jr, P) is a probability space.
1. Random Variables and Distributions For every random variable X : SZ
R we can define a set function P o X-1
on (R,.V(R)) as follows: (6.1)
(P o X-1) (E) = P{X E E}
dE E R(R).
This notation is motivated by the fact that {X E E} is another way to write
X-1 (E), so that (P o X-1)(E) = P(X-1(E)). Lemma 6.1. P o X-1 is a probability measure on (R, -4(R)). Definition 6.2. The measure P o X
is called the distribution of the ran-
dom variable X.
Proof of Lemma 6.1. The proof is straightforward: (P o X-1)(0) = 0, and P o X-1 is countably additive on (R, -4(R)), since P is countably ad(] ditive on (Sl, .F) and X is a function. 65
6. Independence
66
Lemma 6.1 tells us that to each random variable we can associate a real probability space (R,.V(R), P o X-1). In a sense, the converse is also true: For every probability measure p on (R,,9(R)), we can define X(w) = w (w E R) to deduce that there exists a random variable X whose distribution is µ.
Definition 6.3. The (cumulative) distribution function F of a probability
measure µ on (R,R(R)) is defined by F(x) = p((-oo,x]) for all x E R. The distribution function F of a random variable X is the distribution function of P o X-1. In other words, F(x) := P{X < x} for all x E R. Note that: (i) F is non-decreasing, right-continuous, and has left limits;
F(x) = 0; and (iii) F(oo) := limx, F(x) = 1.
(ii) F(-oo) :=
These properties characterize F; i.e.,
Theorem 6.4. A function F : R -i [0, 1] is the distribution function of a probability measure p if and only if: (i) F is non-decreasing and right-
continuous; and (ii) F(-oo) = 0 and F(oo) = 1. In addition, F and p define one another uniquely.
Proof. (Sketch) The necessity of (i) and (ii) has already been established. Conversely, suppose F : R -- [0, 11 satisfies (i) and (ii). We can then define µ((a, b]) = F(b) - F(a) for all real numbers a < b. Extend the definition of p to finite disjoint unions of intervals of type (ai , bi] by setting (6.2)
µ
(C(ai , b1] I = E[F(bi) - F(ai)] i-1
/f
i=1
It is not difficult to check that: (a) This is a well-defined extension of p; and (b) p is countably additive on the algebra of all disjoint finite unions of intervals of the type (a, b]. These assertions are proved by adapting the proof of Lemma 3.15 on page 26 to the present case. Now we apply Caratheodory's
theorem (page 27), and extend p uniquely to a measure on all of R(R). It remains to check that this extended µ is a probability measure, but this µ((-oo, n]) = F(oo) = 1, thanks to the inner follows from µ(R) = continuity of measures.
Definition 6.5. If p > 0, then the pth moment of a random variable X is defined as E[XP] provided that X > 0 a.s., or X E L"(P). Lemma 6.6. If X > 0 a.s., then E[XP] = fn XP dP = f ° xP p(dx), where p denotes the distribution of X. More generally still, if h : R --+ R is Borel measurable, then (6.3)
Eh(X) =
Jn
h(X) dP =
provided that the integrals exist.
J
h(x) p(dx), 00
2. Independent Random Variables
67
Proof. The assertion Eh(X) = fn h(X) dP is a tautology. Next consider a simple function h, i.e., one of the form n
h(x) = E ailAi (x),
(6.4)
i=1
where A1,. .. , An E 9 are disjoint and al, ... , an E R. It follows that h(X) is a discrete random variable, and (6.5)
aip(Ai) = f h dµ.
aiP{X E Ai} _
Eh(X)
For a more general non-negative function h, we can choose simple functions hn T h and appeal to the monotone convergence theorem (p. 46). The rest follows from linearity. 0
Definition 6.7. The variance and the standard deviation of a random vari-
able X E L2(P) are respectively defined as VarX := E[(X - EX)2] and SD(X) := V-arX. If X,Y E L2(P), then the covariance and correlation between X and Y are respectively defined as (6.6)
Cov(X, Y) := E[(X - EX)(Y - EY)],
and
Cov(X, Y)
(6.7)
SD(X) SD(Y)'
where 0/0 := 1. Two random variables X and Y are said to be uncorrelated if p(X, Y) = 0.
Lemma 6.8. If X,Y E L2(P), then VarX = IIX - EX II2 = E[X2] - IEXIZ and Cov(X, Y) = E[XY] - EX EY. If X > 0 a.s. then
E[XP] = p r AP-1P{X > A} d)
dp>0.
00
CO
E P{X > n} < EX < E P{X > n}. n=1
n=0
2. Independent Random Variables Now we generalize the notion of independence that was touched on first in Chapter 1.
6. Independence
68
Definition 6.9. Events {Ei}T 1 are independent if for all distinct indices
i(1),...,i(l) in {1,...,n}, (6.10)
P (Ei(1) n ... n Eia)) = 11 P (Ei(j))
j-1
A collection {E,}QED of events is independent if EQII), ... , EQ(n) are indepen-
dent f o r all a(1), ... , a(n) E I. For an arbitrary index set I, the a-algebras {90aE1 are called independent if any finite number of events A"(;) E gQi;) (i = 1, . . . , n) are independent. Definition 6.10. The random variables XI, ... , Xn : S2 - Rd are independent if the events {X, 1(Ai)} 1 are independent for all A1, ... , An E .4(Rd). An arbitrary collection {XQ}QEJ is independent if XQii), ... , XQ(n) are inde-
pendent for all c (1), ... , o(n) E I. If {XQ}QE, are independent and identically distributed random variables, then we say that the XQ's are i. i. d. Equivalently, {Xi}
1 are independent if for all measurable {Ai} n
(6.11)
PIXIE A1,...,XnEAn}=flP{XjEAj}. j=1
Remark 6.11. One can construct random variables X1, X2, X3 such that Xi and Xj are independent whenever i 34 j, but X1, X2, X3 are not independent.
Lemma 6.12. Random variables {Xi} 1 are independent if for all measurable functions 01, ... , On : Rd -, R+, n
(6.12)
E
[f(x)] = fJE[4>j(Xj)]. j=1
j=1
Consequently, {Xi} 1 are independent if {hi(Xi)}= 1 are independent for
all Borel-measurable functions hl, ... , hn : R - R. Proof. The second assertion is a ready consequence of (6.12). Therefore, we derive only the latter equation. When the 4j's are elementary functions, (6.12) is the definition of independence. By linearity (in each of the O3's), (6.12) continues to hold when the 4j's are simple functions. Take limits to obtain the full result. O Soon we will see that assuming independence places severe restrictions on the random variables in question. But first, we need a definition or two.
Definition 6.13. The a-algebra generated by
is the smallest a-
algebra with respect to which all of the Xi's are measurable; it is written as u({Xi}LEA). When we say that {Xi}iEj is independent of a a-algebra ', we mean that a({Xi}iEI) is independent of 9.
2. Independent Random Variables
69
Definition 6.14. The tail a-algebra.I of the random variables {X1}°_1 is the a-algebra J = fln°_1a({Xi}°_°n). The following tells us that our definitions of independence are compatible. Moreover, the last portion implies that in order to prove that two
real-valued random variables X and Y are independent, it is necessary as well as sufficient to prove that (6.13)
P{X < x , Y < y} = P{X < x}P{Y < y}
''x, y E R.
Lemma 6.15. Let A and B denote two topological spaces. (i) For all random variables X : 52 -+ A,
a(X) = {X-1(A) : A E .2(A)}
.
(ii) If {Xj} 1 are random variables all taking values in A, then a Bvalued random variable Y is independent of {XX}jG if and only if Y is independent of {X}1 for every n > 1. (iii) Let 2 and sA be two subalgebras that respectively generate 2(B) and .Q(AO°). If Y-1(F) is independent of (XI, X2 .... )-'(E) for 1
all E E d and all F E 2, then Y and (Xj}j' 1 are independent. The proof of this is relegated to the exercises. Instead, we turn to the following consequence of independence. It confirms the assertion-made
earlier-that independence is a severe restriction.
Kolmogorov's Zero-One Law. If {Xi}°_1 are independent random variables, then their tail a-algebra 7 is trivial in the sense that for all E E .Y. P(E) = 0 or 1. Consequently, any .°T-measurable random variable is a constant almost surely.
Proof. Our strategy is to prove that every E E 9' is independent of itself,
so that P(E) = P(E fl E) = P(E)P(E). Since E E J, it follows that E is independent of a({Xi}n 1). Because this is true for each n, Lemma 6.15 (iii) ensures that E is independent of V'n'o a({Xi}i=11), which is defined to be the smallest a-algebra that contains U°° 1a({Xi}i1). In other words, E is independent of all the Xi's, and hence of itself. To conclude this proof suppose Y is Y-measurable. We intend to prove
that there exists a constant c such that P{Y = c} = 1. For any x E R, the event {Y < x} has probability zero or one. Therefore, the distribution function F of Y is necessarily of the form F(y) = where c denotes
the smallest p such that P{Y < p} = 1. This implies that Y = c almost surely.
0
6. Independence
70
Example 6.16. Suppose {Xi}°_1 are independent and define An:=X1+...+Xn.
(6.14)
n
Then, lira supn_,,. An and lim infn,. An are almost surely constants. Furthermore, the probability that limn-,, An exists in [-oo, oo] is zero or one. If this probability is 1, then limn-(,o An is a constant almost surely. Next we prove that independent random variables exist.
Theorem 6.17. If {µi}i=1 are probability measures on (Rd,.V(Rd)), then there exist independent random variables {Xi}°0=1, all on a suitable probability space, such that the distribution of X; is pi for each i = 1, 2,... .
Proof. For the sake of notational convenience we will assume that d = 1. x An for every Let fln := Rn, gn := .4(Rn), and µn := Al x n > 1. Clearly, {/Ln}°O_1 is a consistent family of probability measures. By the Kolmogorov extension theorem (p. 60; see also Remark 5.17 therein) there exists a probability measure P on (R°°,R(R°O)) that extends {µn}°n°__1.
Define X1(w) = wi for all w E R°° and i > 1. Because Xt 1(Ei) = {w E RO° : wi E Ei}, it follows that P{Xi E Ei} = µi(EE) and
P{Xl E El,..., Xn E En} = P(Xi 1(El) n ... n Xn 1(En)) (6.15)
n
_ flui(Ei). i=1
Therefore, the Xi's are independent and have the asserted distributions. 0 Let us conclude this section with two results of computational utility. The proofs are left to the reader.
Corollary 6.18. If X, Y E L2(P) are independent, then they are uncorrelated; i.e., Cov(X,Y) = 0. The converse is false in general; see Problem 6.7.
Corollary 6.19. If {Xi},n=1 are uncorrelated and in L2(P), then n
(6.16)
Var (Xi + ... + Xn) = E VarXj. j=1
In particular, this identity is valid if the Xi's are independent.
4. The Weak Law
71
3. An Instructive Example We now describe a class of distributions that do not fit into the classical probability models of Chapter 1. Let {Xi}°_1 denote i.i.d. random variables, all taking the values 0 and 1 with probability 2 each, and define Y :_ E°__14-'Xi. If µ denotes the distribution of Y, then p is a probability measure on .V([0,11) that is neither discrete nor absolutely continuous. It is also not a simple combination of the latter two types of distributions. The following makes these remarks more precise.
Theorem 6.20. The distribution µ of Y satisfies: (i) M({x}) = 0 for all x E [0, 1]; and (ii) there exists a measurable set A C R+ that has zero Lebesgue measure and yet µ(A) = 1.
Proof. For all n > 1 let Yn := F 14-'Xi and Yn the set of possible values of Yn. The cardinality of 9n is 2n because its elements are of the form En where yi = 0 or 1. Also, because IY-Ynl < Ei'n+14-` = 4-n/3, i=1 Y is a.s. within 4-n/3 of some y E Yn. Therefore, µ(,,) = P{Y E Vn} = 1 for all n > 1, where (6.17)
Vn:= U U
k=nyE9'
[c,+c].
Vn+1, A := nl,Vn has IL-measure one. On the other hand, if m denotes Lebesgue measure, then for all n > 1, Because Vn
00
(6.18)
m(A) < m(Vn) < 2 k=nyEYA,
4_k
2 00
3=3
2- k
k=n
=
22_n 3
This proves that m(A) = 0 although µ(A) = 1. It remains to prove that ;=14-'xi where ,u({x}) = 0 for all x E [0, 1]. To this end, write x = xi E {0,1, 2, 3}, and note that (6.19)
µ({x}) < P{X1 = x1i ... , Xn = x, } < 2-n.
Let n - oo to complete the proof.
D
4. Khintchine's Weak Law of Large Numbers The weak law of Khintchine (1929) states that, with high probability, sample
averages are close to population averages. We will soon see that this is a considerable improvement of the law of large numbers of Bernoulli (1713) on page 18. Although the weak law is subsumed by the forthcoming strong law of large numbers, we state and prove it first because it provides us
6. Independence
72
with a good opportunity to learn more about the Markov and Chebyshev inequalities, as well as Markov's "truncation method." Throughout this section {Xi}°_1 are i.i.d. (see Definition 6.10), realvalued, and
S. := X1 + ... + X.
(6.20)
do > 1.
The Weak Law of Large Numbers. If {Xi}°_1 are in L'(P) then, as
n - oo, Sn
- EX1 in LI(P), and hence in probability. n Example 6.21. Imagine n independent Bernoulli trials where the probability of success per trial is p E (0, 1). If E3 denotes the event that the jth trial is a success, then Sn := E i 1Ej is the total number of successes, and Sn = Bin(n,p). Because Sn is a sum of n i.i.d. mean-p random variables, Khintchine's weak law of large numbers includes Bernoulli's law of large numbers (p. 18) as a special case. (6.21)
Proof of the Weak Law. Thanks to Theorem 4.20 on page 43, it suffices to prove that $S_n/n \to \mathrm{E}X_1$ in $L^1(\mathrm{P})$. We do this in two steps.

Step 1. The $L^2$-Case. If $\{X_i\}_{i=1}^\infty$ are in $L^2(\mathrm{P})$, then Corollary 6.19 tells us that

(6.22)  $\left\| \frac{S_n}{n} - \mathrm{E}X_1 \right\|_2 = \mathrm{SD}(S_n/n) = \frac{\mathrm{SD}(X_1)}{\sqrt{n}} \to 0$.

That is, $S_n/n \to \mathrm{E}X_1$ in $L^2(\mathrm{P})$, and hence in $L^1(\mathrm{P})$ (Proposition 4.16, p. 42).
Step 2. The General Case. When the $X_i$'s are assumed only to be in $L^1(\mathrm{P})$ we use a truncation argument. Choose and fix a large $a > 0$, and define $X_i^a := X_i \mathbf{1}_{\{|X_i| \le a\}}$ and $S_n^a := X_1^a + \cdots + X_n^a$. By the triangle inequality, for all $n \ge 1$,

(6.23)  $\| S_n - S_n^a \|_1 \le \sum_{i=1}^n \mathrm{E}\big( |X_i| ;\, |X_i| > a \big) = n\, \mathrm{E}\big( |X_1| ;\, |X_1| > a \big)$.

Also, $|\mathrm{E}X_1 - \mathrm{E}[X_1^a]| \le \mathrm{E}\{ |X_1| ;\, |X_1| > a \}$. Therefore,

(6.24)  $\left\| \frac{S_n}{n} - \mathrm{E}X_1 \right\|_1 \le 2\, \mathrm{E}\big( |X_1| ;\, |X_1| > a \big) + \left\| \frac{S_n^a}{n} - \mathrm{E}[X_1^a] \right\|_1$.

Because the $X_i^a$'s are bounded and i.i.d., Step 1 ensures that the last $L^1$-norm converges to zero as $n \to \infty$. Therefore,

(6.25)  $\limsup_{n \to \infty} \left\| \frac{S_n}{n} - \mathrm{E}X_1 \right\|_1 \le 2\, \mathrm{E}\big( |X_1| ;\, |X_1| \ge a \big)$,
for all truncation levels a > 0. Let a -* oo and appeal to the dominated convergence theorem to finish.
5. Kolmogorov's Strong Law of Large Numbers

We are ready to state and prove the law of large numbers of Kolmogorov (1933). Throughout this section $\{X_i\}_{i=1}^\infty$ are i.i.d. random variables in $\mathbf{R}$. We will write $S_n := X_1 + \cdots + X_n$ as before.

The Strong Law of Large Numbers. If $X_1 \in L^1(\mathrm{P})$, then

(6.26)  $\lim_{n \to \infty} \frac{S_n}{n} = \mathrm{E}X_1$  a.s.

Conversely, if $\limsup_{n \to \infty} |S_n/n| < \infty$ with positive probability, then the $X_i$'s are in $L^1(\mathrm{P})$ and (6.26) holds.

Our proof hinges on two key technical results.
The Borel–Cantelli Lemma. Let $\{A_i\}_{i=1}^\infty$ be a collection of events. If $\sum_{n=1}^\infty \mathrm{P}(A_n) < \infty$ then $\sum_{n=1}^\infty \mathbf{1}_{A_n} < \infty$ a.s. Conversely, suppose that the $A_i$'s are pairwise independent; i.e., $\mathrm{P}(A_i \cap A_j) = \mathrm{P}(A_i)\mathrm{P}(A_j)$ whenever $i \ne j$. Then, $\sum_{n=1}^\infty \mathrm{P}(A_n) = \infty$ implies that $\sum_{n=1}^\infty \mathbf{1}_{A_n} = \infty$ a.s.
Proof. Let $p_n := \mathrm{P}(A_n)$ for all $n \ge 1$. By the monotone convergence theorem, $\sum_{n=1}^\infty p_n = \mathrm{E}\sum_{n=1}^\infty \mathbf{1}_{A_n}$. Any non-negative $[0,\infty]$-valued random variable that is in $L^1(\mathrm{P})$ is a.s.-finite. Therefore, if $\sum_{n=1}^\infty p_n < \infty$ then $\sum_{n=1}^\infty \mathbf{1}_{A_n} < \infty$ almost surely.

The converse is more interesting: Suppose that $\sum_{n=1}^\infty p_n = \infty$ and the $A_i$'s are pairwise independent. Let $Z_k := \sum_{n=1}^k \mathbf{1}_{A_n}$ and note that

(6.27)  $\mathrm{Var}\,Z_k = \sum_{n=1}^k p_n(1 - p_n) \le \sum_{n=1}^k p_n = \mathrm{E}Z_k$  for all $k \ge 1$.

See Corollaries 6.18 and 6.19. Chebyshev's inequality (p. 43) then yields

(6.28)  $\mathrm{P}\{ |Z_k - \mathrm{E}Z_k| > \epsilon\, \mathrm{E}Z_k \} \le \frac{\mathrm{Var}\,Z_k}{\epsilon^2 (\mathrm{E}Z_k)^2} \le \frac{1}{\epsilon^2\, \mathrm{E}Z_k}$.

Because $\mathrm{E}Z_k \to \infty$ as $k \to \infty$, it follows that $\lim_{k\to\infty} \mathrm{P}\{Z_k > \lambda\} = 1$ for all $\lambda > 0$. Because $\sum_{n=1}^\infty \mathbf{1}_{A_n} \ge Z_k$ for all $k$, this proves that $\sum_{n=1}^\infty \mathbf{1}_{A_n} = \infty$ almost surely. □
Before we proceed with our second technical result, let us prove the second half of the strong law.
Proof of the Strong Law (Necessity). Suppose that $\mathrm{E}|X_1| = \infty$; we plan to prove that $\limsup_{n\to\infty} |S_n/n| = \infty$ a.s. Because $|X_n| \le |S_n| + |S_{n-1}|$,

(6.29)  $\limsup_{n \to \infty} \frac{|X_n|}{n} \le 2 \limsup_{n \to \infty} \frac{|S_n|}{n}$.

Therefore, it suffices to prove that $\limsup_{n\to\infty} |X_n|/n = \infty$ a.s. Choose and fix an arbitrary $k > 0$, and observe that

(6.30)  $\frac{\mathrm{E}|X_1|}{k} \le \sum_{n=0}^\infty \mathrm{P}\{|X_1| > kn\} = \sum_{n=0}^\infty \mathrm{P}\{|X_n| > kn\}$.

See Lemma 6.8. Because the left-hand side is infinite, the second half of the Borel–Cantelli lemma implies that $\limsup_{n\to\infty} |X_n|/n \ge k$ a.s. Let $k \to \infty$ to finish. □
The following maximal L2-inequality of Kolmogorov (1933, 1950) is the second technical result that was promised earlier.
Kolmogorov's Maximal Inequality. Suppose $S_n = X_1 + \cdots + X_n$, where the $X_i$'s are independent and in $L^2(\mathrm{P})$. Then for all $\lambda > 0$ and $n \ge 1$,

(6.31)  $\mathrm{P}\left\{ \max_{1 \le k \le n} |S_k - \mathrm{E}S_k| \ge \lambda \right\} \le \frac{\mathrm{Var}\,S_n}{\lambda^2}$.
Proof. Without loss of generality we may assume that $\mathrm{E}X_i = 0$ for all $i$; otherwise, consider $X_i - \mathrm{E}X_i$ in place of $X_i$. Let $A_k$ denote the event that $|S_k| \ge \lambda$ but $|S_j| < \lambda$ for all $1 \le j < k$. The $A_k$'s are disjoint, and their union is the event $\{\max_{1 \le k \le n} |S_k| \ge \lambda\}$. Therefore, $\mathrm{E}[S_n^2] \ge \sum_{k=1}^n \mathrm{E}[S_n^2 ;\, A_k]$. Because $S_n^2 \ge 2(S_n - S_k)S_k + S_k^2$, this yields

(6.32)  $\mathrm{E}[S_n^2] \ge 2\sum_{k=1}^n \mathrm{E}\big[ S_k(S_n - S_k) ;\, A_k \big] + \sum_{k=1}^n \mathrm{E}\big[ S_k^2 ;\, A_k \big]$.

The event $A_k$ and the random variable $S_k$ both depend on $\{X_i\}_{i=1}^k$, whereas $S_n - S_k = X_{k+1} + \cdots + X_n$ is independent of $\{X_i\}_{i=1}^k$. Consequently, $S_n - S_k$ is independent of $S_k \mathbf{1}_{A_k}$ (Lemma 6.12). It follows that $\mathrm{E}[(S_n - S_k)S_k ;\, A_k] = \mathrm{E}[S_n - S_k]\, \mathrm{E}[S_k ;\, A_k] = 0$, since $\mathrm{E}S_n = \mathrm{E}S_k = 0$. Whenever $\omega \in A_k$ we have $S_k^2(\omega) \ge \lambda^2$. Therefore,

(6.33)  $\mathrm{E}[S_n^2] \ge \sum_{k=1}^n \mathrm{E}\big[ S_k^2 ;\, A_k \big] \ge \lambda^2 \sum_{k=1}^n \mathrm{P}(A_k) = \lambda^2\, \mathrm{P}\left( \bigcup_{k=1}^n A_k \right)$.
This proves the result. □
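As an aside, the maximal inequality is simple to test by simulation. The sketch below (our own illustration, with arbitrary parameters) compares the empirical probability that a centered $\pm1$ walk of length $n$ ever leaves $[-\lambda, \lambda]$ with the bound $\mathrm{Var}\,S_n/\lambda^2 = n/\lambda^2$.

```python
import random

def walk_exits(n, lam, rng):
    """One simulation: do the partial sums of n fair +/-1 steps ever reach absolute value lam?"""
    s = 0.0
    for _ in range(n):
        s += rng.choice((-1.0, 1.0))
        if abs(s) >= lam:
            return True
    return False

rng = random.Random(1)
n, lam, trials = 100, 25.0, 20_000
empirical = sum(walk_exits(n, lam, rng) for _ in range(trials)) / trials
print(empirical)        # empirical P{ max_k |S_k| >= lam }
print(n / lam ** 2)     # Kolmogorov's bound Var(S_n)/lam**2, since Var X_1 = 1 here
```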
Proof of the Strong Law (Sufficiency). We suppose that $X_1 \in L^1(\mathrm{P})$, and strive to prove that $\lim_{n\to\infty} S_n/n = \mathrm{E}X_1$ a.s. Throughout, we may and will assume, without loss of generality, that $\mathrm{E}X_1 = 0$; otherwise, we can consider $X_i - \mathrm{E}X_i$ in place of $X_i$. The proof of the strong law simplifies considerably when we assume further that $X_1 \in L^2(\mathrm{P})$. This will be done in the first step. The second step uses a truncation argument to reduce the matter to the $L^2$ case.

Step 1. The $L^2$ Case. If the $X_i$'s are in $L^2(\mathrm{P})$, then by the Kolmogorov maximal inequality, for all $n \ge 1$ and $\epsilon > 0$,

(6.34)  $\mathrm{P}\left\{ \max_{1 \le k \le n} |S_k| > \epsilon n \right\} \le \frac{\mathrm{E}[S_n^2]}{\epsilon^2 n^2} = \frac{\mathrm{E}[X_1^2]}{\epsilon^2 n}$,
because $\mathrm{E}[S_n^2] = \mathrm{Var}\,S_n = n\,\mathrm{E}[X_1^2]$ (Corollary 6.19). Replace $n$ by $2^r$ to deduce that $\sum_{r=1}^\infty \mathrm{P}\{ \max_{1 \le k \le 2^r} |S_k| > \epsilon 2^r \} < \infty$. By the Borel–Cantelli lemma, $\max_{1 \le k \le 2^r} |S_k| \le \epsilon 2^r$ for all but finitely many $r$'s, a.s. Since every $n$ lies between $2^r$ and $2^{r+1}$ for some $r$, it follows that $\limsup_{n\to\infty} |S_n|/n \le 2\epsilon$ a.s., and hence $S_n/n \to 0$ a.s., as desired.

Step 2. The General Case. Now we assume only that $X_1 \in L^1(\mathrm{P})$ with $\mathrm{E}X_1 = 0$, and define the truncated variables $X'_i := X_i \mathbf{1}_{\{|X_i| \le i\}}$ and $S'_n := X'_1 + \cdots + X'_n$. Since $\mathrm{E}X_1 = 0$, $\mathrm{E}S'_n = \sum_{j=1}^n \mathrm{E}[X_1 ;\, |X_1| \le j] = -\sum_{j=1}^n \mathrm{E}[X_1 ;\, |X_1| > j]$. We exchange the order of summation and expectation to find that

(6.38)  $|\mathrm{E}S'_n| \le \mathrm{E}\left\{ |X_1| \min\big( |X_1|, n \big) \right\}$.
By the dominated convergence theorem, $\mathrm{E}S'_n = o(n)$, as asserted.

Our next and final goal is to prove that $|S'_n - \mathrm{E}S'_n| = o(n)$ a.s. It follows from this that $|S'_n| = o(n)$ a.s., whence the proof would follow. For all $k \ge 1$, $S'_k$ is a sum of $k$ independent (though not i.i.d.) random variables. According to the Kolmogorov maximal inequality, for all $n \ge 1$ and $\epsilon > 0$,

(6.39)  $\mathrm{P}(E(n)) \le \frac{1}{n^2 \epsilon^2} \sum_{j=1}^n \mathrm{Var}\,X'_j$,  where  $E(n) := \left\{ \max_{1 \le k \le n} |S'_k - \mathrm{E}S'_k| > n\epsilon \right\}$.

Since $\mathrm{Var}\,X'_j \le \mathrm{E}[X_1^2 ;\, |X_1| \le j]$, replacing $n$ by $2^n$ and summing yields

$\sum_{n=1}^\infty \mathrm{P}(E(2^n)) \le \frac{1}{\epsilon^2} \sum_{j=1}^\infty \mathrm{E}\big[ X_1^2 ;\, |X_1| \le j \big] \sum_{n:\, 2^n \ge j} 4^{-n}$.

If $x \ge 0$, then $\sum_{n \ge x} 4^{-n}$ is at most $\sum_{n \ge \lfloor x \rfloor} 4^{-n} = 4^{1 - \lfloor x \rfloor}/3 \le 4^{2 - x}/3$, where $\lfloor \cdot \rfloor$ denotes the greatest-integer function. Hence, applying this with $x := \log_2 j$,

(6.42)  $\sum_{n=1}^\infty \mathrm{P}(E(2^n)) \le \frac{16}{3\epsilon^2}\, \mathrm{E}\left[ X_1^2 \sum_{j \in \mathbf{N}:\, j \ge |X_1|} \frac{1}{j^2} \right]$.

It is not hard to prove that $x \sum_{j \ge x} j^{-2} \le 2$ for all $x > 0$. Set $x := |X_1|$ and plug this into (6.42) to find that

(6.43)  $\sum_{n=1}^\infty \mathrm{P}(E(2^n)) \le \frac{32}{3\epsilon^2}\, \mathrm{E}|X_1| < \infty$.

By the Borel–Cantelli lemma, a.s. only finitely many of the events $E(2^n)$ occur; interpolating between the powers of two as in Step 1, this shows that $|S'_n - \mathrm{E}S'_n| = o(n)$ a.s., and the theorem follows. □

6. Applications

6.1. The Weierstrass Approximation Theorem. The Weierstrass approximation theorem states that every continuous function on $[0,1]$ is the uniform limit of a sequence of polynomials. We now follow Bernstein (1913) and derive the Weierstrass theorem from Khintchine's weak law of large numbers.
Definition 6.23. For every continuous function $f : [0,1] \to \mathbf{R}$ define the Bernstein polynomial $B_n f$ by

(6.44)  $(B_n f)(x) := \sum_{j=0}^n \binom{n}{j} x^j (1-x)^{n-j} f\!\left( \frac{j}{n} \right)$  for all $x \in [0,1]$.

Then, $B_n f$ is a polynomial of order at most $n$ for each $n \ge 1$.
Theorem 6.24. $\lim_{n\to\infty} B_n f = f$, uniformly on $[0,1]$.
Proof. Choose and fix some $p \in (0,1]$, and define $\{X_i\}_{i=1}^\infty$ to be independent random variables that take the values 1 and 0 with respective probabilities $p$ and $1-p$. Recall that $S_n := X_1 + \cdots + X_n = \mathrm{Bin}(n, p)$, and note that $(B_n f)(p) = \mathrm{E}f(S_n/n)$. Consider the "modulus of continuity of $f$":

(6.45)  $m_f(\delta) := \sup_{\substack{0 \le r, s \le 1 \\ |r - s| \le \delta}} |f(r) - f(s)|$  for all $\delta > 0$.
Suppose, in addition, that $f$ is Hölder continuous of some order $\alpha > 0$; i.e., there exists $L$ such that $m_f(\delta) \le L\delta^\alpha$. It follows that

(6.50)  $\sup_{0 \le p \le 1} |(B_n f)(p) - f(p)| \le \frac{A}{n^{\alpha/2}}$,

for a constant $A$ that depends only on $\alpha$ and $L$. We derive (6.50) next. The only case of interest is $\alpha \in (0,1]$. For if $\alpha > 1$ then $f' \equiv 0$, in which case $f$ is a constant and there is nothing to prove. Henceforth, choose and fix $\alpha \in (0,1]$, and define $D_n := |(B_n f)(p) - f(p)|$ for brevity. Because $D_n = |\mathrm{E}\{ f(S_n/n) - f(p) \}|$, Hölder continuity of $f$ implies that $D_n \le L\, \mathrm{E}\{ |(S_n/n) - p|^\alpha \}$. Since $\alpha \in (0,1]$, Hölder's inequality asserts that $\mathrm{E}(|Z|^\alpha) \le (\mathrm{E}\{Z^2\})^{\alpha/2}$ for all random variables $Z$. Therefore,

(6.51)  $|(B_n f)(p) - f(p)| \le L \left[ \mathrm{Var}\!\left( \frac{S_n}{n} \right) \right]^{\alpha/2} = L \left[ \frac{p(1-p)}{n} \right]^{\alpha/2}$.

The elementary inequality $p(1-p) \le 1/4$ yields (6.50) with $A := L 2^{-\alpha}$.
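The Bernstein polynomials are also easy to compute directly. The following sketch (ours, not the book's; it uses the concrete Lipschitz function $g(x) = |x - \tfrac12|$) evaluates $B_n g$ on a grid and shows the sup-norm error decaying roughly like $n^{-1/2}$, in line with (6.50) for $\alpha = 1$.

```python
from math import comb

def bernstein(f, n, x):
    """Evaluate the n-th Bernstein polynomial (B_n f)(x) for x in [0, 1]."""
    return sum(comb(n, j) * x ** j * (1 - x) ** (n - j) * f(j / n) for j in range(n + 1))

g = lambda x: abs(x - 0.5)          # Lipschitz, hence Hoelder continuous with alpha = 1
for n in (10, 100, 1000):
    grid = [k / 200 for k in range(201)]
    err = max(abs(bernstein(g, n, x) - g(x)) for x in grid)
    print(n, err)                    # the error shrinks roughly like n**(-1/2)
```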
6.2. The Asymptotic Equipartition Property. Our next application of independence is one of the starting points of the work of Shannon (1948; 1949), who discovered various startling connections between the thermodynamical notion of entropy and the mathematical theory of communication. We explore one of these connections here.

First, we need some jargon from communication theory: Consider a fixed finite set $A := \{a_1, \ldots, a_m\}$ with $m$ distinct elements. Any element $a_i$ of $A$ is a letter (or symbol), and $A$ itself is the alphabet. A word (or code) $w := (w_1, \ldots, w_n)$ of length $n$ is a vector of $n$ letters in $A$. The relative frequency of the letter $a_k$ in the word $w$ is then

(6.52)  $f_n(a_k, w) := \frac{1}{n} \sum_{j=1}^n \mathbf{1}_{\{a_k\}}(w_j)$.

There are a total of $m^n$ words of length $n$. If we were to select one at random (all $n$-letter words being equally likely) and write the resulting random word as $W := (W_1, \ldots, W_n)$, then $W_1, \ldots, W_n$ are i.i.d., and $\mathrm{P}\{W_1 = a_i\} = 1/m$ for all $i = 1, \ldots, m$ (check!). Therefore, by the weak law of large numbers (p. 72), for any $k = 1, \ldots, m$, $f_n(a_k, W) \to 1/m$ in probability. That is, for a "very typical word of indefinite length," the asymptotic relative frequency of any letter is $1/m$.

What about long words with (possibly) other asymptotic frequencies? To answer this, we choose and fix a probability vector $(p_1, \ldots, p_m)$ throughout; i.e., $p_j > 0$ for $j = 1, \ldots, m$, and $p_1 + \cdots + p_m = 1$.
Definition 6.26. If $\epsilon > 0$, then an $n$-letter word $w$ is said to be $\epsilon$-typical if

(6.53)  $|f_n(a_k, w) - p_k| < \epsilon$  for all $k = 1, \ldots, m$.

Otherwise, $w$ is said to be $\epsilon$-atypical.
The following is, in essence, Shannon's fundamental theorem of data compression (Shannon, 1948). For an improvement see Problem 6.33 below.
Theorem 6.27. For every $n \ge 1$ and $\epsilon > 0$ define $T_n(\epsilon)$ to be the number of $\epsilon$-typical words of length $n$. Then,

(6.54)  $\left( 1 - \frac{m}{4n\epsilon^2} \right) 2^{n(H(p) - c\epsilon)} \le T_n(\epsilon) \le 2^{n(H(p) + c\epsilon)}$,

where $\log_2$ denotes the base-2 logarithm, $c := -\sum_{k=1}^m \log_2 p_k > 0$, and $H(p) := -\sum_{i=1}^m p_i \log_2 p_i$ is the entropy of the vector $p = (p_1, \ldots, p_m)$.

Suppose $n$ tends to infinity and $\epsilon = \epsilon_n$ goes to zero so that $\epsilon_n^2 n \to \infty$. Then the preceding says that of the $m^n$ words of length $n$, $2^{n(H(p) + o(1))}$ have more or less the property that the letters $a_1, \ldots, a_m$ have asymptotic relative frequencies $p_1, \ldots, p_m$, respectively. The point is that unless $p_1 = \cdots = p_m = 1/m$, $2^{n(H(p) + o(1))}$ is asymptotically far smaller than the total number of $n$-letter words $m^n$; compare with Problem 6.25. This observation is the starting point of data compression.
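Before turning to the proof, the bound (6.54) can be checked by brute force for a tiny alphabet. The sketch below (an illustration of ours; the alphabet size, word length, and $\epsilon$ are arbitrary) counts the $\epsilon$-typical words directly and compares $T_n(\epsilon)$ with the upper bound $2^{n(H(p)+c\epsilon)}$ and with the total number of words $m^n$.

```python
from itertools import product
from math import log2

def entropy(p):
    """Base-2 entropy H(p) = -sum_k p_k log2 p_k."""
    return -sum(q * log2(q) for q in p if q > 0)

def count_typical(p, n, eps):
    """Brute-force count of eps-typical words of length n over an alphabet of size len(p)."""
    m, count = len(p), 0
    for word in product(range(m), repeat=n):
        if all(abs(word.count(k) / n - p[k]) < eps for k in range(m)):
            count += 1
    return count

p, n, eps = (0.75, 0.25), 12, 0.05
c = -sum(log2(q) for q in p)
print(count_typical(p, n, eps))             # T_n(eps)
print(2 ** (n * (entropy(p) + c * eps)))    # upper bound from (6.54)
print(len(p) ** n)                          # total number of n-letter words
```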
Proof of Theorem 6.27. Let $\{X_i\}_{i=1}^\infty$ denote i.i.d. random variables, all taking the values $a_1, \ldots, a_m$ with probabilities $p_1, \ldots, p_m$. Write $W_n := (X_1, \ldots, X_n)$. For any fixed $n$-letter word $w$, $\mathrm{P}\{W_n = w\} = \prod_{k=1}^m p_k^{n f_n(a_k, w)}$. If $w$ is $\epsilon$-typical, then the latter probability is at least $\prod_{k=1}^m p_k^{n(p_k + \epsilon)} = 2^{-n(H(p) + c\epsilon)}$; also it is at most $2^{-n(H(p) - c\epsilon)}$. Rearrange and sum over $\epsilon$-typical $w$'s to find that

(6.55)  $2^{n(H(p) - c\epsilon)}\, \mathrm{P}\{ W_n \text{ is } \epsilon\text{-typical} \} \le T_n(\epsilon) \le 2^{n(H(p) + c\epsilon)}$.
6.3. The Glivenko–Cantelli Theorem. Let $\{X_i\}_{i=1}^\infty$ be i.i.d. real-valued random variables with distribution function $F$, and let $F_n$ denote the empirical distribution function, $F_n(x) := \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{\{X_i \le x\}}$. By the strong law of large numbers (p. 73), for each fixed $x \in \mathbf{R}$,

(6.60)  $\lim_{n\to\infty} F_n(x) = F(x)$  a.s.

The theorem of Glivenko and Cantelli asserts that the convergence is in fact uniform in $x$: $\sup_x |F_n(x) - F(x)| \to 0$ a.s. To prove this, choose and fix $\epsilon > 0$, and find real numbers $x_0 < x_1 < \cdots < x_m$ such that $F(x_0) < \epsilon$, $F(x_m) > 1 - \epsilon$, and

(6.61)  $\sup_{x_{j-1} < x < x_j} |F(x) - F(x_{j-1})| \le \epsilon$  for all $j = 1, \ldots, m$.
According to (6.60),

(6.62)  $\max_{0 \le j \le m} |F_n(x_j) - F(x_j)| \le \epsilon$  for all but finitely many $n$'s, a.s.

Hence it follows that if $x \in [x_{j-1}, x_j)$ for some $1 \le j \le m$, then $|F_n(x) - F(x)| \le 3\epsilon$ for all but finitely many $n$'s, a.s.; the same bound holds when $x \ge x_m$. Similarly, if $x < x_0$, then $|F(x) - F_n(x)| \le F(x_0) + F_n(x_0) \le 3\epsilon$. Consequently, with probability one,

(6.66)  $\sup_{x \in \mathbf{R}} |F_n(x) - F(x)| \le 3\epsilon$  for all but finitely many $n$'s.

This proves the theorem.
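The uniform convergence just proved is also visible in simulation. Here is a short sketch (our own, for the Unif(0,1) distribution; the grid only approximates the supremum over $\mathbf{R}$) showing that the largest discrepancy between the empirical distribution function $F_n$ and $F$ shrinks as $n$ grows.

```python
import random

def empirical_cdf(sample, x):
    """F_n(x): the proportion of sample points that are <= x."""
    return sum(s <= x for s in sample) / len(sample)

rng = random.Random(2)
F = lambda x: min(max(x, 0.0), 1.0)        # distribution function of Unif(0, 1)
grid = [k / 1000 for k in range(1001)]
for n in (10, 100, 10_000):
    sample = [rng.random() for _ in range(n)]
    print(n, max(abs(empirical_cdf(sample, x) - F(x)) for x in grid))
```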
6.4. The Erdos Bound on Ramsey Numbers. Let us begin with a definition or two from graph theory.
Definition 6.30. The complete graph Km on m vertices is a collection of m distinct vertices any two of which are connected by a unique edge. The nth (diagonal) Ramsey number Rn is the smallest integer N such that any bi-chromatic coloring of the edges of KN yields a Kn C KN whose edges are all of the same color.
To understand this definition suppose $R_n = N$. Then, no matter how we color the edges of $K_N$ using only the colors red and blue, somewhere inside $K_N$ there exists a $K_n$ whose edges are either all blue or all red, and $N$ is the smallest such value. It is possible to check that $R_2 = 2$ and $R_3 = 6$, for example.
Ramsey (1930) introduced these and other Ramsey numbers to discuss ways of checking the consistency of a logical formula. See also Skolem (1933) and Erdos and Szekeres (1935).
As a key step in his proofs Ramsey proved that R,, < oo for all n > 1.
Evidently, Rn - oo as n - oo; in fact, it is obvious that Rn > n. The following theorem of Erdos (1948) presents a much better lower bound.
Theorem 6.31. As $n \to \infty$, $R_n \ge (c + o(1))\, n 2^{n/2}$, where $1/c := e\sqrt{2}$.

Proof. We aim to prove that, given any two integers $N > n$,

(6.67)  $\binom{N}{n} 2^{1 - \binom{n}{2}} < 1$  implies  $R_n > N$.

Let us assume that this is the case, for the time being, and apply it with $N := \lfloor c n 2^{n/2} \rfloor$, where $\lfloor \cdot \rfloor$ denotes the greatest-integer function. Because $N!/(N-n)! \le N^n$, our particular choice of $N$ yields

(6.68)  $\binom{N}{n} \le \frac{(c n 2^{n/2})^n}{n!}$.

Consequently, by Stirling's formula (p. 21),

(6.69)  $\binom{N}{n} 2^{1 - \binom{n}{2}} \le \frac{2\, c^n n^n 2^{n/2}}{n!} \sim \frac{2}{\sqrt{2\pi n}}$,

which is strictly less than 1 for all large $n$.

It remains to verify (6.67). Consider a random coloring of the edges of $K_N$; i.e., if $E_N$ denotes the set of all edges of $K_N$, then consider an i.i.d. collection of random variables $\{X_e\}_{e \in E_N}$ where $\mathrm{P}\{X_e = 1\} = \mathrm{P}\{X_e = -1\} = 1/2$. We color $e$ red if $X_e = 1$, and blue if $X_e = -1$. The probability that any $n$ given vertices form a monochromatic $K_n$ is $2^{1 - \binom{n}{2}}$. Since there are $\binom{N}{n}$ choices of these $n$ vertices, the probability that there exist $n$ vertices that form a monochromatic $K_n$ is less than or equal to $\binom{N}{n} 2^{1 - \binom{n}{2}}$. If this is strictly less than one, then there must exist bi-chromatic colorings of $K_N$ that yield no monochromatic $K_n$, and hence (6.67) follows. □
6.5. Percolation. Consider the d-dimensional integer lattice Zd as a graph; i.e., two points x, y E Zd are connected by an edge if and only if the Euclidean
distance between x and y is one. Let Ed denote the resulting collection of edges; Zd denotes both the vertices and the graph. Fix a number p E [0, 1], and consider the resulting random graph of Zd that is obtained by deleting any edge with probability 1 - p; else, the edge is kept. The decisions from edge to edge are made independently. More precisely, let {Xe(p)}eEEd be i.i.d. with P{Xe(p) = 1} = 1 - P{Xe(p) = 0} = p. As always, XX(p) = Xe(p,w) depends on w, but we ignore the w. If Xe(p) = 0, then we think of e as a deleted edge. The resulting random subgraph can be identified with r(p) := {e E Ed : XX(p) = 1}. We say that percolation occurs if r(p) has an infinite connected subgraph. By the Kolmogorov zero-one law, the probability of this event is zero or one. The basic question is to decide when the probability of percolation is one.
This, and some generalizations, were introduced by Broadbent and Hammersley (1957). Since then, the subject has grown to be a vast area in mathematics and physics alike; see Grimmett (1999) for a lively introduction.
Next, we prove one of the basic results of percolation. Namely, that there exists a critical probability of percolation.
Proposition 6.32. For any d > 2, there exists a "critical probability" pc(Zd) E [0, 1] such that whenever p < pe(Zd) percolation does not occur. But if p > p,(Zd), then percolation occurs a.s.
Remark 6.33.
(1) The same is true for d = 1, but this is the trivial case since p,(Z) = 1. Indeed, if p < 1, then with probability one, we end up deleting infinitely many edges on both sides of the origin; see Problem 6.16. Thus, unless p = 1, there is no percolation on Z.
(2) It can be shown that if $d \ge 2$, then $p_c(\mathbf{Z}^d)$ lies (strictly) in $(0,1)$. In addition, $p_c(\mathbf{Z}^2) = \frac12$; the lower bound on $p_c(\mathbf{Z}^2)$ is due to Harris (1960), and the upper bound to Kesten (1980). To establish the weaker bound, $p_c(\mathbf{Z}^2) \ge \frac13$, one can employ far simpler arguments. See Problem 6.28.
Proof. The trick is to appeal to a monotonicity argument: We can construct all of the $X_e(p)$'s, all on the same probability space, such that [for all $\omega$]:

(6.70)  If $p \le r$, then $X_e(p) \le X_e(r)$ for all $e \in E^d$.

In order to do this, we first construct, on an appropriate product space, random variables $\{U_e\}_{e \in E^d}$ that are i.i.d. and distributed uniformly on $[0,1]$. Then we define $X_e(p) := \mathbf{1}_{\{U_e \in (0,p]\}}$ for all $p \in [0,1]$ and $e \in E^d$. Evidently, $\{X_e(p)\}_{e \in E^d}$ are i.i.d., and $\mathrm{P}\{X_e(p) = 1\} = 1 - \mathrm{P}\{X_e(p) = 0\} = p$. Moreover, (6.70) is manifest. In particular, on this probability space, if $p \le r$, then $\Gamma(p) \subseteq \Gamma(r)$ [for all $\omega$]. The result follows from letting $p_c(\mathbf{Z}^d)$ denote the smallest $p \in [0,1]$ such that $\Gamma(p)$ contains an infinite connected subgraph with positive probability. □
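The monotone coupling used in the proof is straightforward to implement on a finite box. The following sketch (ours; the box size and the values of $p$ are arbitrary) draws one uniform variable per edge and verifies that the resulting open subgraphs $\Gamma(p)$ are nested in $p$.

```python
import random

def coupled_open_edges(edges, p_values, seed=0):
    """Monotone coupling: one uniform U_e per edge, X_e(p) = 1{U_e <= p}; returns open edges per p."""
    rng = random.Random(seed)
    u = {e: rng.random() for e in edges}
    return {p: {e for e in edges if u[e] <= p} for p in p_values}

# Nearest-neighbour edges of the box {0, ..., 4}^2.
L = 5
edges = [((x, y), (x + 1, y)) for x in range(L - 1) for y in range(L)] + \
        [((x, y), (x, y + 1)) for x in range(L) for y in range(L - 1)]
kept = coupled_open_edges(edges, (0.3, 0.5, 0.7))
print([len(kept[p]) for p in (0.3, 0.5, 0.7)])   # more edges survive as p grows
print(kept[0.3] <= kept[0.5] <= kept[0.7])       # nestedness: Gamma(p) is contained in Gamma(r) for p <= r
```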
6.6. Monte Carlo Integration. Suppose we were asked to find, or estimate, the value of some $n$-dimensional integral

(6.71)  $I(\phi) := \int_{(0,1)^n} \phi(x)\, dx$.

Here, $\phi : \mathbf{R}^n \to \mathbf{R}$ is a Lebesgue-integrable function that is so complicated that $I(\phi)$ is not explicitly computable. One way to proceed is to first select i.i.d. random variables $X_1, \ldots, X_N$ uniformly at random in the $n$-cube $[0,1]^n$. By Lemma 6.6, $\mathrm{E}\phi(X_j) = I(\phi)$
for all $j = 1, \ldots, N$. Because $\{\phi(X_j)\}_{j=1}^N$ are i.i.d. random variables with expectation $I(\phi)$, the strong law of large numbers (p. 73) implies that

(6.72)  $\lim_{N \to \infty} \frac{1}{N} \sum_{j=1}^N \phi(X_j) = I(\phi)$  a.s.

The preceding suggests a way of finding numerical approximations to $I(\phi)$: First simulate $N$ independent uniform-$[0,1]^n$ random variables $\{X_j\}_{j=1}^N$, and then average $\phi(X_1), \ldots, \phi(X_N)$. This procedure is called Monte Carlo integration. It was first used in the 1930s by E. Fermi in the calculation of neutron diffusion. The full power of Monte Carlo integration was discovered in the 1940s by J. von Neumann and S. Ulam. Monte Carlo integration outperforms most other numerical integration methods when $N$ is large.

We conclude this section with a brief discussion on random-number generation. For this discussion, it might help to recall that a random "variable" is in fact a function, or a procedure. Stated loosely, most random-number generators simulate the said procedure as follows: Start with an initial number $\omega_0$, called a seed, and some predetermined function $f : \mathbf{R} \to \mathbf{R}$. Then, define iteratively $\omega_{i+1} := f(\omega_i)$. If $f$ is sufficiently "chaotic," and if $N$ is sufficiently large, then $\omega_N$ simulates a realization $X(\omega_0)$ of a certain random variable $X$. To obtain more simulations we start over with other seeds. In order to simulate a $\mathrm{Unif}(0,1)$ random variable $X$, $f$ has to be chosen with care. The most common choice is to use a linear congruential generator (LCG). A linear congruential generator is described by

(6.73)  $f(x) = (ax + b) \pmod{c}$,

where $a$, $b$, and $c$ are "well-chosen" prescribed parameters. Knuth (1981) discusses this and related methods. In particular, one finds there intuitive as well as rigorous methods for finding good choices of $a$, $b$, and $c$. There are also interesting non-linear examples; for a sampler see Problems 6.31 and 6.36.
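Here is a minimal sketch of both ideas (ours, not the book's; the test integrand, the number of samples, and the LCG constants are arbitrary illustrative choices): Monte Carlo estimation of $I(\phi)$ for $\phi(x) = x_1^2 + x_2^2 + x_3^2$ on $[0,1]^3$, whose exact value is 1, and a bare-bones linear congruential generator of the form (6.73).

```python
import random

def monte_carlo_integral(phi, dim, N, rng):
    """Estimate the integral of phi over [0,1]**dim by averaging phi at N uniform points."""
    return sum(phi([rng.random() for _ in range(dim)]) for _ in range(N)) / N

def lcg(seed, a=1103515245, b=12345, c=2 ** 31):
    """Linear congruential generator w_{i+1} = (a*w_i + b) mod c, rescaled to [0, 1)."""
    w = seed
    while True:
        w = (a * w + b) % c
        yield w / c

rng = random.Random(3)
phi = lambda x: sum(t ** 2 for t in x)              # exact integral over [0,1]^3 is 1
print(monte_carlo_integral(phi, dim=3, N=100_000, rng=rng))
stream = lcg(seed=42)
print([round(next(stream), 4) for _ in range(5)])   # a few pseudo-random numbers from the LCG
```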
Problems 6.1. Consider a continuously differentiable function f : R+ -. R+ that satisfies f (O) = 0. Suppose that either f' > 0 a.e., or f' is integrable. Then prove that for all non-negative random variables X. (6.74)
I,
Ef(X) =
0
f'(,\)P{X > A}dA= /
f'(A)P{X >
0
Deduce Lemma 6.8 from this.
6.2. Prove that two real-valued random variables X and Y. both defined on the same probability space, are independent if and only if for all x, y E R, (6.75)
P{X < x,Y < y} = P{X < x}P{Y < y}.
6.3. Improve Lemma 6.8 as follows: If X, Y E L2(P) then Cov(X, Y) is equal to (6.76)
1
J
(P{X > x,Y > y} - P{X > x}P{Y > y}) dxdy.
6.4. Prove that if the distribution of (X, Y) is absolutely continuous with respect to two-dimensional Lebesgue measure, then:
(1) There exists f E L1(R2) such that P{(X,Y) E A} = ff,, f(x,y)dxdy for all A E M(R2),
(2) There exist f, f} E L1(R) such that P{X E B} = f8 fx(x)dx and P{Y E B) _ fe fv (y) dy for all B E .S!(R). (3) X and Y are independent iff f (x, y) = f,, (x)f,. (y) for almost every (x, y) E R2.
6.5. Verify the claim of Remark 6.11 by constructing three random variables X1, X2, and X3, such that X, and Xj are independent whenever i j, but {X1,X2,X3} are not. 6.6. Prove Lemma 6.15.
6.7. Construct two random variables X and Y on the same probability space such that X and Y are uncorrelated but not independent. 6.8. Prove Corollaries 6.18 and 6.19.
Improve the latter by proving that given any sequence
{X,}°=1 of random variables in L2(P),
+Xn)_
VarX,+2 Cov(X, , X2). =1 IGV ry-t > 0. Verify that this improves the independence half of the Borel-Cantelli lemma (p. 73).
6.21 (Problem 6.20, Continued). Prove that if Z is a non-negative random variable in L2(P), then P(Z = 0) < VarZ/E(Z2]. 6.22 (Normal Numbers). Suppose X is uniformly distributed on [0, 1], and write its decimal expansion as X = F,-=, 10-'Xj, where Xj = 0,...,9 (with some convention for terminating expansions). Prove that {X,}.__, are i.i.d. Find their distribution. Derive the normal-number theorem of Borel (1909): "Lebesgue-almost every number w E [0, 11 satisfies limn_- Nt(i.,)/n = 0.1 fort= 0,... , 9." Here, Nn' (w) = E° I 1{x,(W)=t} denotes the number of times that the digit I appears in the first n binary digits of w. [If you do not find this surprising then you may wish to try to decide whether some given irrational such as 1/f, 7r/10, or In 2, is a normal number.] 6.23 (Problem 6.22, Continued). Choose and fix an integer b > 1. If X is distributed uniformly on (0, 11, then write its b-ary expansion X(w) = Eat b-jXj(w). In the case that there are two ways to choose the X,'s, we always opt for the finite expansion. Prove that {Xj},_, are i.i.d. and take the values 0, ... , b - 1 with equal probability. Prove: n (6.81)
P
vt nHIM
n F_ I{x;(W)=t} = 6
= 0,...,b - 1, vb > 2
= 1.
=i
6.24. Recall that if f : R R+ is measurable and integrates to one, then it is a probability f (y) dy defines a density function. Prove that if, in addition, f is continuous then F(x) distribution function with F' = f almost everywhere. 6.25. Suppose p = (pi , . Pm) is a vector of probabilities; i.e., p, E [0, 11 for all i = 1, ... , m, and the p,'s add up to one. Recall the entropy H(p); you may need to define OInO := 0 to make sense of this in general. Prove that the (discrete) uniform distribution maximizes the entropy uniquely among all probability vectors on m fixed points. Calculate this maximum entropy. This exemplifies the method of "the most probable distribution" of statistical thermodynamics (Schrodinger, 1946). (HINT: For all x > 0, -xlnx < 1 - x.)
6.26 (Information Inequality). First prove that
f f(x)Ing(x)dx< f f(x)Inf(x)dx, 0o co
(6.82)
for all density functions f and g on R, where 0 In 0 := 0. This is called the information inequality.
Let H(f) := - f f. f (x) In f (x) dx denote the entropy of f. Then prove that: (1) The Unif(a,b) density is of maximum entropy among all density functions that are supported on (a, b). (2) The N(µ, 02) density is of maximum entropy among all densities on R that have mean p and variance a2.
6.27. Prove that if {X,) 1 areri.i.d. and in L2(P), and if Sj := X1 +
+ X,, then
E I max IS, - ES, 1] < 2SD(Xl)
(6.83)
1<j1 tj+1/tj > 1. The latter gap condition cannot be improved dramatically (Buczolich and Mauldin, 1999). (16) The limiting result in Problem 6.40 can be shown to hold almost surely, and the end result is called the almost-sure central limit theorem (Lacey and Philipp, 1990). For a prefatory version consult Levy (1937, p. 270). This problem and its history have been recently surveyed by Berkes (1998).
Chapter 7
The Central Limit Theorem
Experimentalists think that it is a mathematical theorem, while the mathematicians believe it to be an experimental fact. -Gabriel Lippman, in a discussion with J. H. Poincare about the CLT
Let $S_n$ denote the total number of successes in $n$ independent Bernoulli trials, where the probability of success per trial is some fixed number $p \in (0,1)$. The De Moivre–Laplace central limit theorem (p. 19) asserts that for all real numbers $a < b$,

(7.1)  $\lim_{n \to \infty} \mathrm{P}\left\{ a < \frac{S_n - np}{\sqrt{np(1-p)}} < b \right\} = \int_a^b \frac{e^{-x^2/2}}{\sqrt{2\pi}}\, dx$.
We will soon see that (7.1) implies that the distribution of Sn is close to that of N(np, np(1 - p)); see Example 7.3 below. In this chapter we discuss the definitive formulation of this theorem. Its statement involves the notion of weak convergence which we discuss next.
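A quick simulation makes (7.1) concrete. The sketch below (our illustration; the parameters are arbitrary) compares the empirical probability that the standardized binomial lies in $(-1,1)$ with the corresponding normal probability $\Phi(1) - \Phi(-1) \approx 0.6827$.

```python
import random
from math import erf, sqrt

def standardized_binomial_in(a, b, n, p, trials, rng):
    """Empirical P{ a < (S_n - np)/sqrt(np(1-p)) < b } for S_n = Bin(n, p)."""
    hits = 0
    for _ in range(trials):
        s = sum(rng.random() < p for _ in range(n))
        hits += a < (s - n * p) / sqrt(n * p * (1 - p)) < b
    return hits / trials

Phi = lambda x: 0.5 * (1 + erf(x / sqrt(2)))   # standard normal distribution function
rng = random.Random(4)
print(standardized_binomial_in(-1, 1, n=500, p=0.3, trials=5_000, rng=rng))
print(Phi(1) - Phi(-1))                        # about 0.6827
```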
1. Weak Convergence

Definition 7.1. Let $X$ denote a topological space, and suppose $\mu, \mu_1, \mu_2, \ldots$ are probability (or more generally, finite) measures on $(X, \mathscr{B}(X))$. We say that $\mu_n$ converges weakly to $\mu$, and write $\mu_n \Rightarrow \mu$, if

(7.2)  $\lim_{n \to \infty} \int f\, d\mu_n = \int f\, d\mu$,
for all bounded continuous functions $f : X \to \mathbf{R}$. If the respective distributions of $X_n$ and $X$ are $\mu_n$ and $\mu$, and if $\mu_n \Rightarrow \mu$, then we also say that $X_n$ converges weakly to $X$ and write $X_n \Rightarrow X$. This is equivalent to saying that

(7.3)  $\lim_{n \to \infty} \mathrm{E}f(X_n) = \mathrm{E}f(X)$,

for all bounded continuous functions $f : X \to \mathbf{R}$.
The following result of Levy (1937) characterizes weak convergence on R.
Theorem 7.2. Let $\mu, \mu_1, \mu_2, \ldots$ denote probability measures on $(\mathbf{R}, \mathscr{B}(\mathbf{R}))$ with respective distribution functions $F, F_1, F_2, \ldots$. Then, $\mu_n \Rightarrow \mu$ if and only if

(7.4)  $\lim_{n \to \infty} F_n(x) = F(x)$,

for all $x \in \mathbf{R}$ at which $F$ is continuous.
Lemma 7.5. If $X$ is a random variable, then $J := \{ x \in \mathbf{R} : \mathrm{P}\{X = x\} > 0 \}$ is denumerable.
Proof. Define

(7.7)  $J_n := \left\{ x \in \mathbf{R} : \mathrm{P}\{X = x\} > \frac{1}{n} \right\}$.

Since $J = \bigcup_{n=1}^\infty J_n$, it suffices to prove that $J_n$ is finite. Indeed, if $J_n$ were infinite, then we could select a countable set $K_n \subseteq J_n$, and observe that

(7.8)  $1 \ge \sum_{x \in K_n} \mathrm{P}\{X = x\} \ge \frac{|K_n|}{n}$,

where $|\cdots|$ denotes cardinality. This contradicts the assumption that $K_n$ is infinite. □
Proof of Theorem 7.2. Throughout, we let $X_n$ denote a random variable whose distribution is $\mu_n$ ($n = 1, 2, \ldots$), and $X$ a random variable with distribution $\mu$.

Suppose first that $X_n \Rightarrow X$. For all fixed $x \in \mathbf{R}$ and $\epsilon > 0$, we can find a bounded continuous function $f : \mathbf{R} \to \mathbf{R}$ such that

(7.9)  $f(y) \le \mathbf{1}_{(-\infty, x]}(y) \le f(y - \epsilon)$  for all $y \in \mathbf{R}$.

[Try a piecewise-linear function $f$.] It follows that

(7.10)  $\mathrm{E}f(X_n) \le F_n(x) \le \mathrm{E}f(X_n - \epsilon)$.

Let $n \to \infty$ to obtain

(7.11)  $\mathrm{E}f(X) \le \liminf_{n\to\infty} F_n(x) \le \limsup_{n\to\infty} F_n(x) \le \mathrm{E}f(X - \epsilon)$.

Equation (7.9) is equivalent to the following:

(7.12)  $\mathbf{1}_{(-\infty, x - \epsilon]}(y) \le f(y)$  and  $f(y - \epsilon) \le \mathbf{1}_{(-\infty, x + \epsilon]}(y)$.

We apply this with $y := X$ and take expectations to see that

(7.13)  $F(x - \epsilon) \le \mathrm{E}f(X)$  and  $\mathrm{E}f(X - \epsilon) \le F(x + \epsilon)$.

This and (7.11) together imply that

(7.14)  $F(x - \epsilon) \le \liminf_{n\to\infty} F_n(x) \le \limsup_{n\to\infty} F_n(x) \le F(x + \epsilon)$.
Let $\epsilon \downarrow 0$ to deduce that $F_n(x) \to F(x)$ whenever $F$ is continuous at $x$.

For the converse we suppose that $F_n(x) \to F(x)$ for all continuity points $x$ of $F$. Our goal is to prove that $\lim_{n\to\infty} \mathrm{E}f(X_n) = \mathrm{E}f(X)$ for all bounded continuous functions $f : \mathbf{R} \to \mathbf{R}$. In accord with Lemma 7.5, for any $\delta, N > 0$, we can find real numbers $\cdots < x_{-2} < x_{-1} < x_0 < x_1 < x_2 < \cdots$ (depending only on $\delta$ and $N$) such that: (i) $\max_{-N \le j \le N} \sup_{y \in (x_j, x_{j+1}]} |f(y) - f(x_j)| < \delta$; (ii) $F$ is continuous at every $x_j$; and (iii) $F(x_N) > 1 - \delta$ and $F(x_{-N}) < \delta$. Let $A_N := (x_{-N}, x_{N+1}]$. By (i),

(7.15)  $\left| \mathrm{E}[f(X_n) ;\, X_n \in A_N] - \sum_{j=-N}^N f(x_j)\big[ F_n(x_{j+1}) - F_n(x_j) \big] \right| \le \sum_{j=-N}^N \mathrm{E}\{ |f(X_n) - f(x_j)| ;\, X_n \in (x_j, x_{j+1}] \} \le \delta$.

This remains valid if we replace $X_n$ and $F_n$ respectively by $X$ and $F$. Note that $N$ is held fixed, and $F_n$ converges to $F$ at all continuity-points of $F$. Therefore, as $n \to \infty$,

(7.16)  $\sum_{j=-N}^N f(x_j)\big[ F_n(x_{j+1}) - F_n(x_j) \big] \to \sum_{j=-N}^N f(x_j)\big[ F(x_{j+1}) - F(x_j) \big]$.
lil 0 for all x. Step 1. The Lower Bound. For any p > 0 choose and fix a function fp E C,,(Rk) such that:
(1) For all x E [-p,p]k, fp(x) = f(x). (2) For all x ¢ [-p - 1, p + 11k, fp(x) = 0. (3) For all x E Rk, 0 < fp(x) < f (x), and fp(x) T f (x) as p T oo. It follows that (7.22)
lim inf f f dµn > lim f fp dµn = n-oo
n-oo
J fp dµ.
Let P T oo and apply the dominated convergence theorem to deduce that (7.23)
lim inf
n-oo
ff dµn >
J
f dp.
This proves half of the theorem.
Step 2. A Variant. In this step we prove that, in (7.23), f can be replaced by the indicator function of an open k-dimensional hypercube. More precisely, given any real numbers al < bl,... , ak < bk, (7.24)
liminfµn((ai,bi) x ... x (ak,bk)) > p((ai,bi) x ... x (ak,bk))
To prove this, we first find continuous functions z/P,n I pointwise. By definition, On E C,,(Rk) for all m > 1, and (7.25)
lim f lin ((al , bl) x ... X (ak , bk)) ? n-oo
imoo
m dµn
V"m d{!.
n J Let m T 00 to deduce (7.24) from the dominated convergence theorem.
Step 3. The Upper Bound. Recall fP from Step 1 and write
f f dpn = f (726)
1.
JJZ(n1JJ
By independence, the characteristic function of Z( n) is f (t) := exp(-IItfl2/2) Because f is rotation-invariant, Z(n) and MZ(n) have the same characteristic
function as long as M is an (n x n) rotation matrix. Consequently, Z(") and MZ(n) have the same distribution for all rotations M; confer with the uniqueness theorem on page 99. It follows that the distribution of X(n) is rotation invariant, and hence the existence of a uniform distribution on Sn-1 follows. Next we prove the more interesting uniqueness portion. For all e > 0 and all sets A C Sn-1 define KA(e) to be the largest number of disjoint open balls of radius a that can fit inside A. By compactness, if A is closed then KA(e) is finite. The function KA is known as Kolmogorov c-entropy, Kolmogorov complexity, as well as the packing number of A.
Let p and v be two uniform probability measures on ..(S11-1) By the maximality condition in the definition of KA, and by the rotational invariance of µ and v, for all closed sets A C (7.57)
Sn-1
KA(e)µ(B,) S µ(A) 5 (KA(E) + 1)µ(B,),
where B, := {x E Sn-1 : IIxli < e}. The preceding display remains valid if we replace µ by v everywhere. Therefore, for all closed sets A that have
positive v-measure, (7.58)
KA(E) + 1) (_K,()
p(A) < < v(A) - v(BE)
K() + 1 KA(E)
(A) v(A)
Consequently, (7.59)
I p(A)
p(BE) I
v(A)
v(B,)
1
,u(A)
KA(E) v(A)
We apply this with A := Sn-1 to find that (7.60)
I1
A(BE)
1 < Ks.-.(E)
)
We plug this back in (7.59) to conclude that for all closed sets A with positive v-measure, 1 p(A) 1 dE>0. kA(E) v(A) + Ks,.-, (E) As f tends to zero, the right-hand side converges to zero. This implies that p(A) = v(A) for all closed sets A E .O(Si-1) that have positive v-measure.
(7.61)
p(A) v(A)
1
Next, we reverse the roles of p and v to find that u(A) = v(A) for all closed sets A E ..(Si-1). Because closed sets generate all of R(Sn-'), the monotone class theorem (p. 30) implies that p = v.
0
Proof of Theorem 7.21. We follow the proof of Theorem 7.23 closely, and 1 observe that by the strong law of large numbers (p. 73), IIZl">II/v a.s. Therefore, f X(") - Z1 a.s. The latter is standard normal. a.s.-convergence implies weak convergence, the theorem follows.
Since
0
6.3. The Replacement Method of Liapounov. There are other approaches to the CLT than the harmonic-analytic ones of the previous sections. In this section we present an alternative probabilistic method of Lindeberg (1922) who, in turn, used an ingenious "replacement method" of Liapounov (1900, pp. 362-364). This method makes clear the fact that the CLT is a local phenomenon. By this we mean that the structure of the CLT does not depend on the behavior of any fixed number of the increments. In words, the method proceeds as follows: We estimate the distribution of S" by replacing the increments, one at a time, by independent normal random variables. Then we use an idea of Lindeberg, and appeal to Taylor's theorem of calculus to keep track of the errors incurred by the replacement method. As a nice by-product we obtain quantitative bounds on the error-rate in the CLT without further effort. To be concrete, we derive the following using the Liapounov method; the heart of the matter lies in its derivation.
Theorem 7.24. Fix an integer n > 1, and suppose {Xi}! 1 are independent Xi and s' mean-zero random variables in L3(P). Define Sn VarSn. Then for any three times continuously differentiable function f,
IEf(Sn) - Ef (N(O,s2))I
:r} dx.
In particular, suppose U is non-negative, and there exists r > 1 such that. (7.90)
P{V>x}:r}
vx>0.
Then, Ee°v < rEeau for all a > 0. Proof. Because a°v(.) = 1 +a fo 1 {V (w)>r}eax dx and the integrand is nonnegative, we can take expectations and use Fubini-Tonelli to deduce (7.89). Because r > 1, the second assertion is a ready corollary of the first
Proof of Theorem 7.26. Throughout, let Z := X1 + X2; Z is normally distributed. We can assume without loss of generality that EZ = 0; else we consider Z - EZ in place of Z. The proof is now carried out in two natural steps.
Step 1. Identifying the Modulus. We begin by finding the form of IEeitXk
fork=1,2. Because EZ = 0, there exists or > 0 such that Eexp(zZ) = exp(z2a2) for all z E C. Since IZI > IXII - 1X21, if IXII > A and IX21 < m then IZl > A - m. Therefore, by independence,
P{IZI > A-m} > P{IXII > A}P{IX21 <m} (7.91)
> 4P {IX1I > A},
provided that we choose a sufficiently large m. Choose and fix such an m. In accord with Lemma 7.33, EectXiI < 4ec"tEe0Z1 for all c > 0. But Ee`IZI < EecZ +
(7.92)
Ee-cZ
0.
Consequently, (7.93)
I EezX' I < EeIzI'IXtI < 8exp (lzlm + a2lzl2)
Vz E C.
Because IZI > IX21 - IXII, the same bound holds if we replace Xl by X2 everywhere. This proves that fk(z) := Eexp(zXk) exists for all z E C, and defines an entire function (why?). To summarize, R a t '-- fk(it) is the characteristic function of Xk, and (7.94)
Ifk(z)I I exp(z2o2)I >- eXp (-Izl2o2) . It follows from this and (7.94) that for all z E C and k = 1, 2, (7.96)
8 exp (-Izlm - 2a21z12) 5 Ifk(z)I < 8exp (Izlm +a21Z12) .
Consequently, in l fkI is an entire function that satisfies the growth condition (7.79) of Lemma 7.29 with n = 2, and hence, (7.97)
lf1(z)l = exp (a0 + a1z + a2z2)
ez E C.
A similar expression holds for 1f2(z)j. Step 2. Estimating the Imaginary Part. Because fk is non-vanishing and entire, we can write (7.98)
fk(z) = exp(gk(z)),
where gk is entire for k = 1, 2. To prove this we first note that fL/ fk is entire, and therefore so is (7.99)
gk(Z) := fz
fk(w) dw.
Next we compute directly to find that (e-9k fk)'(z) = 0 for all z E C. Because fk(0) = 1 and gk(O) = 0, it follows that fk(z) = exp(gk(z)), as asserted.
It follows then that Ifk(z)I = exp(Regk(z)), and Step 1 implies that Re 9k is a complex quadratic polynomial for k = 1, 2. Thanks to this and Lemma 7.32, we can deduce that the entire function gk satisfies (7.79) with n = 2. Therefore, by Liouville's theorem, gk(z) = ak + AZ + ykz2 where al, a2, /31, 02, ^11,1'2 are complex numbers. Consequently, (7.100)
Ee`tXk = fk(it) = exp (ak + it,Qk - t2yk)
dt E R, k = 1, 2.
Plug in t = 0 to find that ak = 0. Also part (1) of Lemma 7.9 implies that fk(-it) is the complex conjugate of fk(it). We can write this out to find that (7.101)
exp(-it,3k - t2-yk) = exp(-it,Ok - t2^1k)
et E R.
This proves that (7.102)
it(3k - t2^1k = it[3k
- t2^1k + 27riN(t),
where N(t) is integer-valued for every t E R. All else being continuous, this proves that N is a continuous integer-valued function. Therefore, N(t) = N(0) = 0, and so it follows from the preceding display that Qk and yk are
real-valued. Because Ifk(it)l < 1, we have also that ^1k > 0. The result
0
follows from these calculations.
Problems 7.1. Define C'°(Rk) to be the collection of all infinitely differentiable functions f : Rk -+ R that have compact support. If µ,µt,p2,... are probability measures on (Rk,R(Rk)), then prove that
µiff f
f fdpfor all f
7.2. If µ,µt , µ2, ... , µ is a sequence of probability measures on (Rd, R(Rd)), then show that the following are characteristic functions of probability measures:
(1) µ; (2) Re µ, (3) lµ12;
(4) n;=t Ih; and (5) E'=lpiµ, where pi__ p. > 0 and
pi = 1.
Also prove that µ(T) Consequently, if p is a symmetric measure (i.e., µ(-A) = p(A) for all A E .A(Rd)) then µ is a real-valued function.
7.3. Use characteristic functions to derive Problem 1.17 on page 14. Apply this to prove that if X = Unif(-1, 1), then we can write it as (7.103)
X:= _t X p
where the X,'s are i.i.d., taking the values ±1 with probability
each.
7.4 (Problem 7.3, continued). Prove that sinx
(7.104)
= VT cos (
I
vx E R \ {0).
\ 2k x /
k=1
X
By continuity, this is true also for x = 0.
7.5. Let X and Y denote two random variables on the same probability space. Suppose that X + Y and X - Y are independent standard-normal random variables. Then prove that X and Y are independent normal random variables. You may not use Theorem 7.26 or its proof. 7.6. Suppose X1 and X2 are independent random variables. Use characteristic functions to prove that:
(1) IfX,=Bin(n,,p)forthesame pE(0,11,then X1+X2=Bin(n1+n2,p). (2) If X, = Poiss(A1), then X1 + X2 = Poiss(Ai +a2) (3) If X. = N(µ, , a'), then X1 + X2 = N(µ1 + µ2 , of + o2). 7.7. Let X have the gamma distribution with parameters (a, A). Compute, carefully, the characteristic function of X. Use it to prove that if X1, X2, ... are i.i.d. exponential random variables with parameter A each, then S., := X1 + + X has a gamma distribution. Identify the latter distribution's parameters. 7.8. Let f be a symmetric and bounded probability density function on R. Suppose there exists C > 0 and a E (0, 11 such that
f(x) - Clxl-(1+0)
(7.105)
as lx( -. oo.
Prove that
f(t) = I - Dltle +o(Itl°) as Itl -. 0, and compute D. Check also that D < oo. What happens if a > 1? (7.106)
7.9 (Levy's Concentration Inequality)).. Prove that if µ is a probability measure on the line, then
(7.107)
1
vc>0.
(HINT: Start with the right-hand side.)
7.10 (Fourier Series). Suppose X is a random variable that takes values in Zd and has mass function p(x) = P{X = x}. Define pit) = Eet1'X, and derive the following inversion formula:
p(x) = (2n)d -1
(7.108)
,.p(- it x) p(t) dt
i- ..]d
vx E Zd.
Is the latter identity valid for all x E Rd? 7.11. Derive the following variant of Plancherel's theorem (p. 99): For any a < b and all probability measures µ on (//R, 53(R)), .In a-ieb\ µ({b}) i1 (7.109)
lio
-
J
,212,2 re
00 e
`
/I
µ(t)dt = µ((a,8))+ l+({a}) z
7.12 (Inversion Theorem). Derive the inversion theorem: If It is a probability measure on 99(Rk such that µ is integrable (dxl, then µ is absolutely continuous with respect to the Lebesgue measure on Rk. Moreover, then µ has a uniformly continuous density function f, and (7.110)
P X) = (21I )k
f
a-u : f(t) dt k
vx E Rk.
7.13 (The Triangular Distribution). Consider the density function f(x) := (1 - lxl)+ for x E R. If the density function of X is f, then compute the characteristic function of X. Prove that f itself is the characteristic function of a probability measure. (HINT: Problem 7.12.)
7.14. Suppose f is a probability density function on R; i.e., f > 0 a.e. and f . f (x) dx = 1.
(1) We say that f is of positive type if f is non-negative and integrable. Prove that if f is of positive type, then f (x) < f (O) for all x E R. (2) Prove that if f is of positive type, then g(x) f (x)/(2a f (0)) is a density function, and g(t) = f(t)/f(0). (HINT: Problem 7.12.) (3) Compute the characteristic function of g(x) = z exp(-Ixl). Use this to conclude that f(x) := ar-I(1 +x2)-' is a probability density function whose characteristic function is f(t) = exp(-Itl). The function f defines the so-called Cauchy density function. [Alternatively, you may use contour integration to arrive at the end result.[
7.15 (Riemann-Lebesgue lemma). Prove that Ee't'X = 0 for all k-dimensional absolutely continuous random variables X. Can the absolute-continuity condition be removed altogether? (HINT: Consider first a nice X.) 7.16. Suppose X and Y are two independent random variables; X is absolutely continuous with density function f, and the distribution of Y is µ. Prove that X + Y is absolutely continuous with density function
f f(x-y)Ft(dy)
(f'tp)(x)
(7.111)
Prove also that if Y is absolutely continuous with density function g, then the density function of
X+Y isfsg.
7.17. Prove that the CLT (p. 100) continues to hold when o = 0.
7.18. A probability measure it on (R,R(R)) is said to be infinitely divisible if for any n > 1 there exists a probability measure v such that µ = (E)n. Prove that the normal and the Poisson distributions are infinitely divisible. So is the probability density
f(x) := x(1 +I x2)
(7.112)
vx E R.
This is called the Cauchy distribution. (HINT: Problem 7.14.)
7.19. Prove that if (X,)l I are i.i.d. uniform-[0,11 random variables, then (7.113)
4 i-3
2-
rt 2
converges weakly.
Identify the limiting distribution.
7.20 (Extreme Values). If {X,}°_, are i.i.d. standard normal random variables, then find nonrandom sequences an,bn -. oo such that an maxl 1, where Z = (Z1, ... , Zk) and the Z,'s are i.i.d. standard normals.
7.33. Choose and fix an integer n > 1 and let X1,X2,... be i.i.d. with common distribution given by P{X1 = k} = 1/n for k = 1,...,n. Let Tn denote the smallest integer I > 1 such that P{Tn = k} for all k. Xt + + XI > n, and compute 7.34 (Uniform Integrability). Suppose X, XI, X2,... are real-valued random variables such that: (i) Xn X; and (ii) sups IIXnIIp < oo for some p > 1. Then prove that limn, EXn = EX. (HINT: See Problem 4.28 on page 51.) Use this to prove the following: Fix some po E (0, 1), and
define f(t) = It - pol (t E [0, 1]). Then prove that there exists a constant c > 0 such that the Bernstein polynomial Ian f satisfies (7.126)
do > 1.
I(13nf)(PO) - f(PO)I ?
Thus, (6.50) on page 78 is sharp (Kac, 1937).
7.35 (Hard). Define the Fourier map -Ff = j for f E Ll(Rk). Prove that (7.127)
IIfIIL2(Rk) =
"f E Lt(Rk)nL2(Rk).
1k/2II'FfIIL2(Rk) (2n)
This is sometimes known as the Plancherel theorem. Use it to extend F to a homeomorphism from L2(Rk) onto itself. Conclude from this that if it is a finite measure on .S(Rk) such that fRk Iµ(t)I2 dt < oo, then p is absolutely continuous with respect to the Lebesgue measure on Rk. WARNING: The formula (.Ff)(t) = fak f(x)ei1'xdx is valid only when f E L1(Rk). 7.36 (An Uncertainty Principle; Hard). Prove that if f : R -. R is a probability density function
that is zero outside [-7r, w), then there exists t 0 [-1/2,1/2] such that f(t) # 0 (Donoho and Stark, 1989). (Hint: View f as a function on [-rr,rr], and develop it as a Fourier series. Then study the Fourier coefficients.)
7.37 (Hard). Choose and fix \I ...... \- > 0 and al,...,a,, E R. Then prove that if m < oo, then fn, defines the characteristic function of a probability measure, where (7.128)
aj (1 - cos(ajt)))
f,.(t) := exp
"t E R, I < m < oo.
Prove that fo is a characteristic function provided that F,, (a2 A Iajl)aj < oo. (HINT: Consult Example 7.14 on page 97.)
7.38 (Lindeberg CLT; Hard). Let {X,},° 1 be independent L2(P)-random variables in R, and for
all n define sn = Ej=1 VarXj and p" = EXn. In addition, suppose that s" -. co, and (7.129)
lim
1
. 8n j=1
E [(Xj - µj)2; IX, - µj] > tan] = 0
Prove the Lindeberg CLT (1922):
S. (7.130)
- E,=1 Ikj 3n
. N(0,1).
Check that the variables of Problem 7.21 do not satisfy (7.129).
'e > 0.
7.39 (Hard). Let (X, Y) be a random vector in R2 and for all 9 E (0,2,r) define Xe := cos(0)X +sin(0)Y
(7.131)
and
Y9 := sin(0)X - cos(0)Y.
Prove that if Xs and Ye are independent for all 0 E (0, 2a], then X and Y are independent normal variables. (HINT: Use Cramer's theorem to reduce the problem to the case that X and Y are symmetric; or you can consult the original paper of Kac (1939).) 7.40 (Skorohod's Theorem; Hard). Weak convergence does not imply a.s. convergence. To wit, X => X does not even imply that any of the random variables {Xn},°.t and/or X live on the X whenever same probability space. The converse, however, is always true; check that X. X almost surely. On the other hand, if you are willing to work on some probability space, Xn then weak convergence is equivalent to as. convergence as we now work to prove. (1) If F is a distribution function on R that has a continuous inverse, and if U is uniformly
distributed on (0,1), then find the distribution function of F-' (U).
(2) Suppose F F: All are distribution functions; each has a continuous inverse. Then prove that Bin- FT' (U) = F-' (U) a.s. X0, we can find, on a suitable probability (3) Use this to prove that whenever X. space, random variables X;, and X' such that: (i) For every 1 < n < oo, X;, has the same distribution as X,,; and (ii) lim X;, = X' almost surely Skorohod (1961, 1965). (HINT: Problem 6.9.)
7.41 (Ville's CLT; Hard). Let fl denote the collection of all permutations of 1, ... , n, and let P be the probability measure that puts mass (n!)-I on each of the n! elements of fl. For each w E fl define XI(w) = 0, and for all k = 2,....n let Xk(w) denote the number of inversions of k in the permutation w; i.e., the number of times 1, ... , k - 1 precede k in the permutation W. [For instance, suppose n = 4. If w = (3,1,4,2), then X2(W) = 1, X3(W) = 0, and X4(W) = 2.] Prove that {X,}° , are independent. Compute their distribution, and prove that the total X. in a random permutation satisfies number of inversions S.
S
(7.132)
n3;2/4)
N(0, 1/36).
(HINT: Problem 7.38.)
7.42 (A Poincar6 Inequality; Hard). Suppose X and Y are independent standard normal random variables. (1)
Prove that for all twice continuously differentiable functions f, g : R -» R that have bounded derivatives,
f
Cov(f(X),g(X)) = f E [f'(x)g' (ax +
1- s2 Y)] de.
(HINT: Check it first for f (x) := exp(itx) and g(x) := exp(irx).) (2) Conclude the "Poincar6 inequality" of Nash (1958):
Vaf(X) 0; else, consider Z+ and Z- separately. Now (8.1) holds tautologically if Z = 1A for some A E 9. So it holds also for a simple 9-measurable function Z. The rest of the claim follows from the monotone convergence theorem.
Since 9 C 9, both v and P are finite measures on 9. Because v 0 a.s. then E[X I91 > 0 a.s. Also, E[E(X I9)] = EX, E[X I .] = X, and E(X [ (0, 12}) = EX a.s. (2) If Xl, X2, ... , Xn E L' (P) and al, a2, ..., an E R, then a.s., n
$\mathrm{E}\left[ \left. \sum_{j=1}^n a_j X_j \,\right|\, \mathscr{G} \right] = \sum_{j=1}^n a_j\, \mathrm{E}[X_j \mid \mathscr{G}]$.
(3) If $Z$ is $\mathscr{G}$-measurable and $ZX \in L^1(\mathrm{P})$, then $\mathrm{E}[ZX \mid \mathscr{G}]$ a.s. exists and is equal to $Z\,\mathrm{E}[X \mid \mathscr{G}]$.

(4) (Conditional Jensen) If $\psi : \mathbf{R} \to \mathbf{R}$ is convex and $\psi(X) \in L^1(\mathrm{P})$, then with probability one, $\mathrm{E}[\psi(X) \mid \mathscr{G}] \ge \psi(\mathrm{E}[X \mid \mathscr{G}])$.

(5) (Conditional Fatou) If $\{X_i\}_{i=1}^\infty$ are integrable and non-negative, then with probability one, $\mathrm{E}[\liminf_{n\to\infty} X_n \mid \mathscr{G}] \le \liminf_{n\to\infty} \mathrm{E}[X_n \mid \mathscr{G}]$.

(6) (Conditional Bounded Convergence) If $\{X_i\}_{i=1}^\infty$ are bounded and a.s.-convergent, then with probability one, $\mathrm{E}[\lim_{n\to\infty} X_n \mid \mathscr{G}] = \lim_{n\to\infty} \mathrm{E}[X_n \mid \mathscr{G}]$.

(7) (Conditional Monotone Convergence) If $X_1 \le X_2 \le X_3 \le \cdots$ are all in $L^1(\mathrm{P})$, then with probability one, $\mathrm{E}[X_n \mid \mathscr{G}] \nearrow \mathrm{E}[\lim_{n\to\infty} X_n \mid \mathscr{G}]$ as $n \to \infty$.

(8) (Conditional Dominated Convergence) If $\mathrm{E}\{\sup_n |X_n|\} < \infty$ and $\lim_{n\to\infty} X_n$ exists a.s., then with probability one, $\mathrm{E}[\lim_{n\to\infty} X_n \mid \mathscr{G}] = \lim_{n\to\infty} \mathrm{E}[X_n \mid \mathscr{G}]$.

(9) (Conditional Hölder) Suppose $X \in L^p(\mathrm{P})$ for some $p > 1$ and $Y \in L^q(\mathrm{P})$ where $p^{-1} + q^{-1} = 1$. Then with probability one, $|\mathrm{E}[XY \mid \mathscr{G}]| \le (\mathrm{E}\{|X|^p \mid \mathscr{G}\})^{1/p} (\mathrm{E}\{|Y|^q \mid \mathscr{G}\})^{1/q}$.

(10) (Conditional Minkowski) If $X, Y \in L^p(\mathrm{P})$ for some $p \ge 1$, then with probability one, $(\mathrm{E}\{|X+Y|^p \mid \mathscr{G}\})^{1/p} \le (\mathrm{E}\{|X|^p \mid \mathscr{G}\})^{1/p} + (\mathrm{E}\{|Y|^p \mid \mathscr{G}\})^{1/p}$.

We can think of $\{X_i\}_{i=1}^\infty$ as a random process that evolves in (discrete) time. Then, a sensible prediction of the value of the process at time $n+1$, given the values of the process by time $n$, is $\mathrm{E}[X_{n+1} \mid \mathscr{F}_n]$. We say that $X = \{X_i\}_{i=1}^\infty$ is a martingale if this predicted value is $X_n$. In this way, you should convince yourself that fair games are martingales, and in a sense, the converse is also true.
Definition 8.12. A stochastic process $X = \{X_n\}_{n=1}^\infty$ is a submartingale with respect to a filtration $\mathscr{F} = \{\mathscr{F}_n\}_{n=1}^\infty$ if: (i) $X$ is adapted to $\mathscr{F}$; (ii) $X_n \in L^1(\mathrm{P})$ for all $n \ge 1$; (iii) for each $n \ge 1$, $\mathrm{E}[X_{n+1} \mid \mathscr{F}_n] \ge X_n$ a.s.

The process $X$ is a supermartingale if $-X$ is a submartingale. It is a martingale if it is both a sub- and a supermartingale; it is a semi-martingale if it can be written as $X_n = Y_n + Z_n$ where $\{Y_i\}_{i=1}^\infty$ is a martingale and $\{Z_i\}_{i=1}^\infty$ is a bounded-variation process; i.e., $Z_n = U_n - V_n$ where $U_1 \le U_2 \le \cdots$ and $V_1 \le V_2 \le \cdots$ are integrable adapted processes.

Occasionally, we call a finite sequence $\{X_i\}_{i=1}^n$ a submartingale if: (i) $X_i$ is $\mathscr{F}_i$-measurable for all $1 \le i \le n$; (ii) $X_i \in L^1(\mathrm{P})$ for $1 \le i \le n$; and (iii) $\mathrm{E}[X_{i+1} \mid \mathscr{F}_i] \ge X_i$ for all $1 \le i < n$. A similar remark applies to super- and semi-martingales.

Here are a few examples of martingales.
Example 8.13 (Independent Sums). Suppose that a fair game is played repeatedly. Every game is independent of all others, and results in ±1 dollar
for the gambler. We can model this by letting $\{X_i\}_{i=1}^\infty$ be i.i.d. random variables with the values $\pm1$ with probability one-half each. We think of $X_i$ as the gambler's win from game $i$, where negative win means loss. In this way, the gambler's cumulative fortune after $k$ games is $S_k := X_1 + \cdots + X_k$. By independence, $\mathrm{E}[X_k \mid X_1, \ldots, X_{k-1}] = \mathrm{E}X_k = 0$. Therefore, $S$ is a martingale with respect to the filtration $\mathscr{F}$ where $\mathscr{F}_n := \sigma(\{X_i\}_{i=1}^n)$. More generally still, if $S_n = X_1 + \cdots + X_n$, where the $X_i$'s are independent
(not necessarily i.i.d.) and have mean zero, then S is a martingale with respect to .9P. Check that .fin is also equal to a({Si}° 1). For our second class of examples, we need a definition.
Definition 8.14. A stochastic process $\{A_i\}_{i=1}^\infty$ is previsible with respect to a given filtration $\{\mathscr{F}_i\}_{i=1}^\infty$ if $A_n$ is $\mathscr{F}_{n-1}$-measurable for every $n \ge 1$, where $\mathscr{F}_0$ always denotes the minimal $\sigma$-algebra $\{\varnothing, \Omega\}$.

Example 8.15 (Martingale Transforms). Let $S = \{S_n\}_{n=1}^\infty$ be a martingale with respect to a filtration $\{\mathscr{F}_n\}_{n=1}^\infty$. Define $S_0 := 0$, $y_0$ a constant, and $A$ a previsible process with respect to $\{\mathscr{F}_n\}_{n=1}^\infty$. Now consider the process $Y$ defined by

(8.14)  $Y_n := y_0 + \sum_{j=1}^n A_j (S_j - S_{j-1})$  for all $n \ge 0$.

The process $Y$ is called the martingale transform of $S$. It is a straightforward task to prove that $Y$ is a martingale. Here is an example of how $Y$ arises naturally: Suppose we play a fair game repeatedly and independently every time. Let $X_i$ denote the amount of win/loss for the $i$th play of the game, so that $S_n := X_1 + \cdots + X_n$ denotes the total win/loss by the $n$th play. Of course, $S$ is a martingale (Example 8.13). Now suppose we can bet $A_n$ dollars on the $n$th play; we are allowed to choose $A_n$ based on what we have seen so far. That is, at time $n$, we are privy to the values of $X_1, \ldots, X_{n-1}$. Then, the martingale transform $Y_n = \sum_{i=1}^n A_i X_i = \sum_{i=1}^n A_i (S_i - S_{i-1})$ describes the win/loss after the $n$th play. Note that the martingale transform of (8.14) has the equivalent definition, $Y_{n+1} - Y_n = A_{n+1}(S_{n+1} - S_n)$ for $n \ge 0$, where $Y_0 = y_0$. We may think of this, informally, as the discrete analogue of the "stochastic differential identity," $dY = A\, dS$.
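The "no free lunch" character of the martingale transform can be checked empirically. The sketch below (ours; the particular betting rule is just one arbitrary previsible strategy) simulates $Y_n = \sum_{j \le n} A_j(S_j - S_{j-1})$ for a fair $\pm1$ game and shows that the average terminal value stays near $y_0 = 0$, whatever the strategy.

```python
import random

def martingale_transform(steps, strategy, y0=0.0):
    """Y_n = y0 + sum_j A_j * (S_j - S_{j-1}), where A_j = strategy(past increments) is previsible."""
    y, history = y0, []
    for x in steps:
        a = strategy(history)      # the bet is chosen before the j-th increment is revealed
        y += a * x
        history.append(x)
    return y

rng = random.Random(5)
strategy = lambda hist: 1.0 + sum(x < 0 for x in hist)   # bet more after every loss (previsible)
terminal = []
for _ in range(20_000):
    steps = [rng.choice((-1.0, 1.0)) for _ in range(20)]
    terminal.append(martingale_transform(steps, strategy))
print(sum(terminal) / len(terminal))   # close to y0 = 0: the transform is again a mean-zero martingale
```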
Example 8.16 (Doob Martingales). Let S be a filtration and Y E L1(P). Then martingales of the form Xn := E[Y 1 9n] are called Doob martingales.
Lemma 8.17. If X is a submartingale with respect to a filtration 5, then it is also a submartingale with respect to the filtration generated by X itself. That is, for all n, E[Xn+1 I X1.... , Xn] > Xn a.s.
Because a(X1,... , Xn) Proof. For all n and all A E -4(R), Xn 1(A) E is the smallest a-algebra that contains Xn 1(A) for all A E R(R), it follows that a(Xj, .... Xn) C 5n for all n. Consequently, by the towering property of conditional expectations (Theorem 8.5), with probability one, 1
(8.15)
E[Xn+1 I X1, ... , Xn] = E [E(Xn+1 15n) I X1, ... , Xn]
> E[Xn I Xi,... , Xn] = Xn.
The last equality is a consequence of Theorem 8.3.
0
Lemma 8.18. If X is a martingale and 1' is convex, then '1i(X) is a submartingale, provided that 1'(Xn) E Ll (P) for all n. If X is a submartingale and ib is a nondecreasing convex function, then rlt(X) is a submartingale, provided that ?P(Xn) E L1(P) for all n.
Proof. Thanks to the conditional form of Jensen's inequality (Theorem 8.3), E[iI'(Xn+i) 19n] > '(E[Xn+1 1-9"n]) a.s. This holds for any process X and any convex function ip as long as O(Xn) E L1(P). If X is a martingale, then Vi(E[Xn+1 1 ,fin]) = ii(Xn) a.s., whence follows the result.
If, in addition, ip is nondecreasing but X is a submartingale, then Vi(E[Xn+1 I Win]) ? V,(X,) a.s., which has the desired result.
Remark 8.19. If X is a martingale, then X+, IXIP, and ex are submartingales, provided that they are integrable at each time n. If X is a submartingale then X+ and ex are also submartingales as long as they are integrable. However, one can construct a submartingale whose absolute value is not a submartingale; e.g., consider Xk := -1/k. The definition of semi-martingales is motivated by the following, whose proof is as interesting as the fact itself:
Doob's Decomposition. Any submartingale X can be written as Xn = Yn + Zn, where Y is a martingale, and Z is a non-negative previsible a.s.increasing process with Zn E L'(P) for all n. In particular, sub- and supermartingales are semi-martingales, and any semi-martingale can be written as the difference of a sub- and a supermartingale.
Proof. Define Xo := 0 and dj := Xj - Xj_1 (j = 1,2,...), so that Xn En En dj. Let Zn := En E[dj I and Yn 1(dj-E[dj j1 1
A
direct computation reveals that this yields the promised decomposition.
The preceding is one among many decomposition theorems for semimartingales. Next is the decomposition theorem of Krickeberg (1963, Satz 33, p. 131). See also Krickeberg (1965, Theorem 33, p. 144). Before introducing it however, we need a brief definition. Definition 8.20. {Xi}i=1 is bounded in L?(P) if sup,, IIXnIIp < 00.
Krickeberg's Decomposition. Suppose X is a submartingale which is bounded in L'(P). Then we can write Xn = Yn-Zn, where Y is a martingale and Z is a non-negative supermartingale.
Proof. By the submartingale property, Y. = limn-,,. E[Xm I Vin] exists a.s. as an increasing limit. Note that Y is an adapted process, and Yn >
3. Stopping Times and Optional Stopping
129
X. Moreover, by the monotone convergence theorem, EY = limn EXm = sup,,, EX,,,, which is finite since X is bounded in L1(P). Finally, we appeal to the towering property of conditional expectations (Theorem 8.5) and the conditional form of the monotone convergence theorem (Theorem 8.3) to E[Xm I . n] = Yn a.s. This proves that Y find that E[Yn+1 I JFn] = is a martingale and Zn = Y,, - Xn > 0. Also, because Y is a martingale and X is a submartingale, I Jn] - E[Xn+1 I .stn] < Y. - X. = Zn, almost surely. This completes our proof. (8.16)
E[Zn+1 I
One of the implications of Doob's decomposition is that any submartingale X is bounded below by some martingale. The Krickeberg decomposition implies a powerful converse to this: Every L1-bounded submartingale is also bounded above by a martingale.
Remark 8.21. The preceding processes Y and Z are bounded in L'(P). Here is a proof: EIYnI < supk EIXkI + EZn; thus it suffices to show that EZ is bounded in n. But the martingale property of Y implies that EZn = EY1 - EXn, whence we have IEZ,, I < IEY1I + supk EIXk1. Also, we may remark that Y > 0 whenever X > 0.
3. Stopping Times and Optional Stopping Definition 8.22. A stopping time (with respect to a filtration Jr) is a random variable T : Il N U {oo} such that IT = k} E Jrk for all k E N. This is equivalent to saying that IT < k} E 5k for every k E N. You should think of 5k as the total amount of information available by time k. For example, if we know 5k, then we know whether or not A E .5ik
has occurred by time k. With this in mind, the above can be interpreted as saying that T is a stopping time if and only if we only need to know the state of things by time k to decide measurably whether or not T < k. Example 8.23. Non-random times are stopping times (check!). Next suppose {Xi}°° 1 is a stochastic process that is adapted to a filtration 9. If A E then T(w) := inf{n > 1 : X,,(w) E A} is a stopping time provided that we define inf 0 := oo. Indeed, IT = k} = n3-i {Xj ¢ Al n {Xk E A} for every k > 2. Because {T = 1} _ {X1 E A}, we find that IT = k} E 5k
for every k > 1. The random variable T is the first time the process X enters the set A. Likewise, one shows that the kth time that X enters A is a stopping time for all k > 1. This example is generic in the following sense:
If T is a stopping time with respect to a filtration, then there exists an adapted process X such that T := inf{j > 1 : Xj = 1}, where inf 0 := oo. A simple recipe for X is Xj(w) := 1{T=j}(0)
Remark 8.24. The previous example shows that the kth time a process enters a Borel set is a stopping time for any k > 1. Although this produces a large collection of stopping times, not all random times are stopping times. For instance, consider (8.17)
L(w) := sup{n > 1 : Xn(w) E A},
where sup 0 := 0 and A is a Borel set. Thus, L is the last time X enters
A, and {L = k} = njtk+1{Xj ¢ A} n {Xk E A}. This is in Fk if and only if Xk, Xk+l,... are all Borel functions of X1,. .. , Xk; a property that does not generally hold. (For example, consider the case when the Xn's are independent.)
Lemma 8.25. If {T}1 are stopping times, then so too are T1 +
+ Tn,
T and maxi j XS a.s. If X is a supermartingale, then E[XT IFS] < XS a.s. If X is martingale, then E[XT I .FS] = XS a.s.
This result has the following interpretation in terms of a fair game. are i.i.d. mean-zero random variables so that Sn := X1 + + Xn can be thought of as the reward-or loss-at time n in a fair game. Because ESn = 0 for all n > 1, we do not expect to win with certitude at non-random times n. The optional stopping theorem states that the same fact holds for bounded stopping times n. In other words, when playing a Suppose
fair game, there is no free lunch unless you are clairvoyant.
4. Applications to Random Walks
131
Proof. It suffices to consider the submartingale case. We can find a non-random K > 0 such that with probability one, S < 1, where T < K. Now the trick is to write things in terms of 4 := Xn Xa := 0. Equivalently, X,, = E'=1 dj, and hence XT = EK 1 djl{j E[Xs; A], which is equivalent to the desired result.
Our next result follows immediately from the preceding one. But it is important, and deserves special mention.
Corollary 8.27. Suppose T is a stopping time with respect to a filtration 9 and X is a submartingale (respectively, supermartingale or martingale) with respect to 9. Then, {XTnn}n 1 is a submartingale (respectively, supermartingale or martingale) with respect to
4. Applications to Random Walks

Definition 8.28. If {X_i}_{i=1}^∞ are i.i.d. random variables in R^m, then {S_n}_{n≥1} is called a random walk, where S_n := X_1 + ⋯ + X_n.

Henceforth, consider the case that m = 1. It follows, after centering, that every L¹ random walk is a martingale.

Lemma 8.29. If S_n = X_1 + ⋯ + X_n defines a random walk in one dimension and X_1 ∈ L¹(P), then {S_n − nEX_1}_{n≥1} is a mean-zero martingale. Suppose, in addition, that X_1 ∈ L²(P) and EX_1 = 0. Then {S_n² − nVarX_1}_{n≥1} is a mean-zero martingale.
4.1. Wald's Identity. If {S_n}_{n≥1} is a random walk whose increments {X_i}_{i=1}^∞ have a finite mean μ, then Lemma 8.29 implies immediately that ES_n = nμ. This identity generalizes to stopping times, as we prove next.
Theorem 8.30. Consider a random walk defined by S_n = Σ_{j=1}^n X_j, where X_1 ∈ L¹(P). Let F_n denote the σ-algebra generated by {X_j}_{j=1}^n (n = 1, 2, …), and let T be a stopping time with respect to {F_n}_{n≥1} with ET < ∞. Then
E[S_T] = EX_1 · ET.

Proof. Combine Corollary 8.27 with Lemma 8.29 to find that for all n ≥ 1, E[S_{T∧n}] = EX_1 · E[T ∧ n]. As n tends to infinity, E[T ∧ n] ↗ ET. Thus, it suffices to prove that
(8.20)  E[ sup_n |S_{T∧n}| ] < ∞.
Because for every n ≥ 1, |S_{T∧n}| = | Σ_{k=1}^n 1_{{T∧n ≥ k}} X_k | ≤ Σ_{k=1}^∞ 1_{{T ≥ k}} |X_k|, we can take expectations to find that E sup_n |S_{T∧n}| is at most Σ_{k=1}^∞ E[|X_k|; T ≥ k]. (Why can we interchange the infinite sum with the expectation integral?) Because {T ≥ k} = {T ≤ k−1}^c ∈ F_{k−1} is independent of X_k, this sum equals E|X_1| · Σ_{k=1}^∞ P{T ≥ k}, and (8.20) follows from Lemma 6.12 on page 68. □
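Wald's identity is easy to check by simulation. The following sketch is not from the text; the increment distribution (uniform on {1, 2, 3}, so EX_1 = 2) and the stopping rule (first time the walk reaches 50) are illustrative choices for which T is a stopping time with finite mean.

```python
import numpy as np

rng = np.random.default_rng(1)

def run_once(threshold=50):
    # Walk S_n = X_1 + ... + X_n with X_i uniform on {1, 2, 3},
    # stopped at T = inf{n >= 1 : S_n >= threshold}.
    s, n = 0, 0
    while s < threshold:
        s += int(rng.integers(1, 4))   # X_{n+1} in {1, 2, 3}
        n += 1
    return s, n

samples = [run_once() for _ in range(20000)]
mean_S_T = np.mean([s for s, _ in samples])
mean_T = np.mean([n for _, n in samples])

print("E[S_T]        ~", mean_S_T)
print("E[X_1]*E[T]   ~", 2.0 * mean_T)   # the two numbers should nearly agree
```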
4.2. Gambler's Ruin Problem.

Definition 8.31. A random walk S_n = X_1 + ⋯ + X_n is called a nearest-neighborhood walk if X_1 ∈ {−1, +1} a.s.; i.e., if at all times n = 1, 2, …, we have S_n = S_{n−1} ± 1 almost surely.

In other words, S_n is a nearest-neighborhood walk if there exists p ∈ [0, 1] such that P{X_1 = 1} = p = 1 − P{X_1 = −1}. The case p = 1/2 is particularly special and has its own name.

Definition 8.32. If P{X_1 = 1} = P{X_1 = −1} = 1/2, then {S_n}_{n≥1} is called the simple walk.
We can think of a nearest-neighborhood walk S_n as the amount of money won (lost if negative) in n independent plays of a game, where in each play one wins or loses a dollar with probabilities p and 1 − p respectively. Then the simple walk corresponds to the fortune process of the gambler in the case that the game is fair.
Suppose that the gambler is playing against the house, there is a maximum house limit of h dollars, and the gambler's resources amount to a total of g dollars. Then consider the first time that either the house or the gambler is forced to stop playing. That is,
(8.21)  T := inf{ j ≥ 1 : S_j = −g or S_j = h },
where T(ω) = inf ∅ := ∞ amounts to the statement that the particular realization ω of the game is played indefinitely.
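The stopping time (8.21) is easy to simulate, and Lemma 8.33 below guarantees that such a simulation terminates. The sketch that follows is not part of the text; the stakes g, h and the win probability p are illustrative. For the fair case p = 1/2 the simulated ruin frequency should be close to h/(g + h), the classical answer obtained from the optional stopping theorem, and the average duration should be close to g·h (Wald's second identity; cf. Problem 8.25).

```python
import numpy as np

rng = np.random.default_rng(2)

def play(g, h, p):
    """Run one nearest-neighborhood walk until it first hits -g or h.

    Returns (ruined, T): ruined means the walk reached -g (the gambler lost
    all g dollars), and T is the stopping time of (8.21).
    """
    s, t = 0, 0
    while -g < s < h:
        s += 1 if rng.random() < p else -1
        t += 1
    return s == -g, t

g, h, p = 10, 15, 0.5              # illustrative stakes; p = 1/2 is the fair game
results = [play(g, h, p) for _ in range(20000)]

print("estimated P{ruin}:", np.mean([r for r, _ in results]))
print("h/(g + h)        :", h / (g + h))
print("estimated E[T]   :", np.mean([t for _, t in results]))  # compare with g*h = 150
```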
Lemma 8.33. With probability one, T < oo.
Proof. The proof appeals to the following "continuity property" of S: If S_n > h, then there exists i < n such that S_i = h. Define m := g + h, and consider the events
(8.22)  E_{i,m} := { X_{im+1} = X_{im+2} = ⋯ = X_{(i+1)m} = −1 }   ∀ i ≥ 0.
Evidently, {E_{i,m}}_{i=0}^∞ are independent, have the same probability of occurring, and P(E_{i,m}) = (1 − p)^m > 0 for all i ≥ 0. By the Borel-Cantelli lemma (p. 73), infinitely many of the E_{i,m}'s occur a.s. In particular, there a.s. exists a random finite integer r such that E_{r,m} occurs. [More precisely, E_{k,m} ∩ {r = k} ≠ ∅ for all k ≥ 1.] On E_{r,m} the walk moves m = g + h consecutive steps in the same direction, and therefore S must leave the interval (−g, h) by time (r + 1)m. Hence T ≤ (r + 1)m < ∞ a.s. □

5. Inequalities and Convergence

Doob's Inequalities. If {X_n}_{n≥1} is a submartingale, then for every λ > 0 and n ≥ 1,
(8.26)  λ P{ max_{1≤j≤n} X_j ≥ λ } ≤ E[X_n^+]   and   λ P{ min_{1≤j≤n} X_j ≤ −λ } ≤ E[X_n^+] − E[X_1].

Proof. Define T := inf{1 ≤ j ≤ n : X_j ≥ λ}, where inf ∅ := ∞, and note that {T ≤ n} = {max_{1≤j≤n} X_j ≥ λ}. Because E[X_n; T ≤ n] = Σ_{j=1}^n E[X_n; T = j], and since {T = j} ∈ F_j for all j ≥ 1, the submartingale property implies that E[X_n^+] ≥ Σ_{j=1}^n E[X_j; T = j] ≥ λ P{T ≤ n}. This proves the first Doob inequality. For the second portion of (8.26) define
(8.28)  τ := inf{1 ≤ j ≤ n : X_j ≤ −λ},
where inf ∅ := ∞. By the optional stopping theorem, EX_1 ≤ EX_{τ∧n}. Since X_τ ≤ −λ on {τ < ∞},
(8.29)  EX_{τ∧n} = E[X_τ; τ ≤ n] + E[X_n; τ > n] ≤ −λ P{τ ≤ n} + E[X_n^+].
This completes the proof.
□
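As a quick sanity check of the first inequality in (8.26), here is a small simulation sketch (not from the text; the walk length, the number of repetitions, and the level λ are illustrative). It uses the non-negative submartingale X_n = |S_n| built from a simple walk.

```python
import numpy as np

rng = np.random.default_rng(3)

n, reps, lam = 200, 20000, 20.0
steps = rng.choice([-1, 1], size=(reps, n))
S = np.cumsum(steps, axis=1)
X = np.abs(S)                               # |S_n| is a non-negative submartingale

lhs = lam * np.mean(X.max(axis=1) >= lam)   # lambda * P{max_{j<=n} X_j >= lambda}
rhs = np.mean(X[:, -1])                     # E[X_n^+]; here X_n >= 0, so X_n^+ = X_n

print("lambda * P{max >= lambda} ~", lhs)
print("E[X_n^+]                  ~", rhs)   # the first value should not exceed the second
```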
The following convergence theorem of Doob (1940) is a consequence of the Doob inequalities.
The Martingale Convergence Theorem. Let X be a submartingale. Suppose either: (i) X is bounded in L¹(P); or (ii) X is non-positive a.s. Then lim_{n→∞} X_n exists and is finite a.s.

Proof. We follow the general outline of the proof of the strong law of large numbers (p. 73): We first prove things in the L²-case; then truncate down to L¹(P). This is achieved in four easy steps.
Step 1. The Non-negative L²-Bounded Case. If X is non-negative and bounded in L²(P), then for all n, k ≥ 1,
(8.30)
||X_{n+k} − X_n||_2² = ||X_{n+k}||_2² + ||X_n||_2² − 2 E[X_{n+k} X_n]
  = ||X_{n+k}||_2² + ||X_n||_2² − 2 E[ E(X_{n+k} | F_n) X_n ]
  ≤ ||X_{n+k}||_2² − ||X_n||_2².
According to Lemma 8.18, X² is a submartingale since X_n ≥ 0 for all n. Therefore, ||X_n||_2 ↗ sup_m ||X_m||_2 < ∞ as n ↗ ∞. It follows that {X_n}_{n≥1} is a Cauchy sequence in L²(P), and so it converges in L²(P). Let X_∞ be the L²(P)-limit of X_n, and find n_k ↑ ∞ such that ||X_∞ − X_{n_k}||_2 ≤ 2^{−k}. By Chebyshev's inequality,
(8.31)  Σ_{k=1}^∞ P{ |X_∞ − X_{n_k}| ≥ ε } < ∞   for all ε > 0,
so that X_{n_k} → X_∞ a.s. along the subsequence {n_k}. Moreover, by Doob's inequality, for all ε > 0,
(8.33)  Σ_{k=1}^∞ P{ max_{n_k ≤ j ≤ n_{k+1}} |X_j − X_{n_k}| ≥ ε } ≤ ε^{−2} Σ_{k=1}^∞ ||X_{n_{k+1}} − X_{n_k}||_2² ≤ ε^{−2} Σ_{k=1}^∞ 2^{−k+2} < ∞.
We have used the fact that {X_{n+j} − X_n}_{j≥0} is a submartingale for each fixed n with respect to the filtration {F_{j+n}}_{j≥0}, and that this submartingale starts at 0. It follows from the Borel-Cantelli lemma that
(8.34)  lim_{k→∞} max_{n_k ≤ j ≤ n_{k+1}} |X_j − X_{n_k}| = 0   a.s.
Consequently, lim_{n→∞} X_n = X_∞ exists and is finite a.s. in this case.

Fix k ≥ 1 and note that E[L; B] = lim_{n→∞} E[P(A | F_n); B] = P(A ∩ B) for all B ∈ F_k. By the monotone class theorem (p. 30), E[L; B] = P(A ∩ B) for all B ∈ F_∞, where F_∞ denotes the smallest σ-algebra that contains ∪_{n=1}^∞ F_n. Because L is F_∞-measurable and A ∈ F_∞, this ensures that L = P(A | F_∞) = 1_A a.s. The result follows.
6.2. Levy's Borel-Cantelli Lemma. In this section we describe an optimal improvement to the Borel-Cantelli lemma (p. 73; see also Problem 6.20, p. 86). This improvement is due to Levy (1937, Corollary 68, p. 249). First, we need a definition.
Definition 8.36. If E and F are events, then we say that E = F almost surely when 1E(w) = 1F(w) for almost every w.
Theorem 8.37. If {F_n}_{n≥1} is a filtration and E_1, E_2, … are events such that E_n ∈ F_n for all n ≥ 1, then
(8.35)  {E_n occurs infinitely often} = { Σ_{n=2}^∞ P(E_n | F_{n−1}) = ∞ }   a.s.
Consequently, the two events F_1 := {Σ_{n=2}^∞ P(E_n | F_{n−1}) = ∞} and F_2 := {E_n occurs infinitely often} have the same probability.
The proof of Levy's Borel-Cantelli lemma rests on a general result about martingales with bounded increments.
Theorem 8.38. Suppose {X_n}_{n≥1} is a martingale such that |X_n − X_{n−1}| ≤ a a.s. for all n ≥ 2, where a is a positive non-random constant. Consider the events L_1 := {sup_n X_n < ∞}, L_2 := {inf_n X_n > −∞}, and L_3 := {lim_{n→∞} X_n exists and is finite}. Then L_1 = L_2 = L_3 a.s.
Proof. For any λ > 0 define
(8.36)  T_λ := inf{ n ≥ 1 : X_n ≥ λ },
where inf ∅ := ∞. By the optional stopping theorem (Corollary 8.27), {X_{n∧T_λ}}_{n≥1} is a martingale. Moreover, the fact that the increments of X are at most a implies that X_{T_λ} ≤ a + λ on {T_λ < ∞}. Therefore, a + λ − X_{n∧T_λ} defines a non-negative martingale. This must converge a.s. Consequently, for any λ > 0 there exists a null set off which lim_{n→∞} X_{n∧T_λ} exists and is finite. Take the union of these null sets, as λ ranges over all positive rationals, to deduce that outside one null set N, lim_{n→∞} X_{n∧T_λ} exists and is finite for all rational λ > 0.
If ω ∈ L_1, then T_λ(ω) is infinite for all rational λ > sup_n X_n(ω). Therefore, L_1 ∩ N^c ⊆ L_3. By considering the martingale −X we find also that L_2 ∩ N^c ⊆ L_3. This proves that (L_1 ∪ L_2) ∩ N^c ⊆ L_3, whence 1_{L_1∪L_2} ≤ 1_{L_3} a.s. Since L_3 ⊆ L_1 ∩ L_2, the result follows. □

Proof of Theorem 8.37. The variables X_n := Σ_{i=2}^n { 1_{E_i} − P(E_i | F_{i−1}) } (n ≥ 2) define a martingale with bounded increments. In the notation of Theorem 8.38, L_1 = {Σ_i 1_{E_i} < ∞} and L_2 = {Σ_i P(E_i | F_{i−1}) < ∞}. [This merits a moment's thought.] The proof follows. □
6.3. Khintchine's LIL. Suppose {X_i}_{i=1}^∞ are i.i.d. random variables taking the values ±1 with probability 1/2 each, and define S_n := X_1 + ⋯ + X_n. By the strong law of large numbers (p. 73), S_n/n → 0 a.s. This particular form of the strong law first appeared in the context of the normal number theorem of Borel (1909). See Problem 6.22 on page 86.
One would like to know how fast S_n/n converges to zero. The central limit theorem suggests that S_n/n cannot tend to zero much faster than
n^{−1/2}. The correct asymptotic size of S_n/n was found in a series of successive improvements by Hausdorff in 1913 (see 1949, pp. 420-421), Hardy and Littlewood (1914), Steinhaus (1922), and Khintchine (1923). The definitive result, along these lines, is the law of the iterated logarithm (LIL) of Khintchine (1924):
(8.37)  limsup_{n→∞} S_n / (2n ln ln n)^{1/2} = − liminf_{n→∞} S_n / (2n ln ln n)^{1/2} = 1   a.s.
Khintchine's LIL has a remarkable extension that is valid for sums of general i.i.d. random variables with finite variance.
The Law of the Iterated Logarithm. If {X_i}_{i=1}^∞ are i.i.d. random variables in L²(P) and S_n := X_1 + ⋯ + X_n, then
(8.38)  limsup_{n→∞} (S_n − nEX_1) / (2n ln ln n)^{1/2} = SD(X_1)   a.s.
When the X_i's are bounded this was proved by Kolmogorov (1929). Cantelli (1933a) improved Kolmogorov's theorem to the case that X_1 ∈ L^{2+δ}(P) for some δ > 0. Then the LIL for general mean-zero finite-variance increments remained elusive for nearly a decade, until Hartman and Wintner
(1941) devised an ingenious truncation method which reduced the general LIL to that of Kolmogorov. We will derive the LIL only in the case that the X's are normal. The theorem, in its full generality, is much more difficult to prove.
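Purely as a numerical illustration (not from the text, and with an arbitrary simulation horizon), one can track S_n/(2n ln ln n)^{1/2} for a ±1 walk. Over the ranges of n that are feasible to simulate, the running maximum stays noticeably below 1, consistent with (8.37) but also a reminder that the convergence in the LIL is very slow.

```python
import numpy as np

rng = np.random.default_rng(4)

N = 10**6                                  # illustrative horizon
S = np.cumsum(rng.choice([-1, 1], size=N))

n = np.arange(3, N + 1)                    # start at n = 3 so that ln ln n > 0
ratio = S[2:] / np.sqrt(2 * n * np.log(np.log(n)))

# Khintchine's LIL says the limsup of this ratio is 1 a.s.; for finite N the
# observed maximum is typically well below 1.
print("max of S_n/(2 n ln ln n)^{1/2} for n <= 10^6:", ratio.max())
```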
Proof of the LIL for Normal Increments. Without loss of generality, we may assume that the X_i's are standard normal random variables. Define
(8.39)  Λ := limsup_{m→∞} S_m / (2m ln ln m)^{1/2}.
According to the Kolmogorov 0-1 law (p. 69), Λ is almost surely a constant. Our task is to prove that Λ = 1. We do this in three steps.
Step 1. A Large-Deviations Estimate. Fix a t > 0 and define M_n := exp(tS_n − t²n/2). Let F_n := σ({X_i}_{i=1}^n), and verify that M is a non-negative mean-one martingale. Moreover,
(8.40)  { max_{1≤j≤n} S_j ≥ nt } ⊆ { max_{1≤j≤n} M_j ≥ exp(nt²/2) }.
According to Doob's maximal inequality (p. 134), for all integers n ≥ 1 and all real numbers t > 0,
(8.41)  P{ max_{1≤j≤n} S_j ≥ nt } ≤ exp(−nt²/2).
Step 2. The Upper Bound. Choose and fix c > θ > 1, and define θ_k := ⌊θ^k⌋ (k = 1, 2, …). We apply (8.41) to deduce that as k → ∞,
(8.42)  P{ max_{1≤j≤θ_k} S_j ≥ (2c θ_{k−1} ln ln θ_{k−1})^{1/2} } ≤ exp( − c θ_{k−1} ln ln θ_{k−1} / θ_k ) = k^{−(c/θ)+o(1)}.
Thus, the left-hand side is summable in k. By the Borel-Cantelli lemma (p. 73), with probability one there exists a random variable k_0 such that for all k ≥ k_0, max_{1≤j≤θ_k} S_j ≤ (2c θ_{k−1} ln ln θ_{k−1})^{1/2}. For all m ≥ θ_{k_0} we can find k ≥ k_0 such that θ_{k−1} ≤ m ≤ θ_k, and hence
(8.43)  S_m ≤ max_{1≤j≤θ_k} S_j ≤ (2c θ_{k−1} ln ln θ_{k−1})^{1/2} ≤ (2c m ln ln m)^{1/2}.
This proves that Λ ≤ c^{1/2}. Because the latter holds for all c > θ > 1, it follows that Λ ≤ 1. This is fully one-half of the LIL in the case of standard normal increments. We can also apply the preceding to the process −S to obtain the following:
(8.44)  limsup_{m→∞} |S_m| / (2m ln ln m)^{1/2} ≤ 1   a.s.
Step 3. The Lower Bound. We use the following elementary bound on the standard normal tail: there exists A > 0 such that
(8.45)  P{X_1 ≥ λ} ≥ (A/λ) e^{−λ²/2}   ∀ λ ≥ 1.
See Problem 1.16, page 14 for a hint. Choose and fix θ > 1, and define θ_k := ⌊θ^k⌋. Consider the events
(8.46)  E_k := { S_{θ_{k+1}} − S_{θ_k} ≥ (2 a_{k+1} ln⁺ ln⁺ θ_k)^{1/2} },
where ln⁺ x := ln(x ∨ e), and
(8.47)  a_{k+1} := Var( S_{θ_{k+1}} − S_{θ_k} ) = θ_{k+1} − θ_k.
The E_k's are independent events, and because of (8.45), for all k large,
(8.48)  P(E_k) = P{ N(0,1) ≥ (2 ln ln θ_k)^{1/2} } ≥ A / [ ln θ_k · (2 ln ln θ_k)^{1/2} ].
Consequently, Σ_k P(E_k) = ∞, and hence by the independence part of the Borel-Cantelli lemma,
(8.49)  limsup_{k→∞} ( S_{θ_{k+1}} − S_{θ_k} ) / ( 2(θ_{k+1} − θ_k) ln ln θ_k )^{1/2} ≥ 1   a.s.
Since θ_{k+1} − θ_k ∼ θ_{k+1}(1 − θ^{−1}) as k → ∞,
(8.50)  limsup_{k→∞} ( S_{θ_{k+1}} − S_{θ_k} ) / ( 2 θ_{k+1} ln ln θ_k )^{1/2} ≥ (1 − 1/θ)^{1/2}   a.s.
Thanks to this, (8.44), and the fact that θ_{k+1} ∼ θ · θ_k as k → ∞,
(8.51)
Λ ≥ limsup_{k→∞} S_{θ_{k+1}} / ( 2 θ_{k+1} ln ln θ_k )^{1/2}
  ≥ limsup_{k→∞} ( S_{θ_{k+1}} − S_{θ_k} ) / ( 2 θ_{k+1} ln ln θ_k )^{1/2} − limsup_{k→∞} |S_{θ_k}| / ( 2 θ_{k+1} ln ln θ_k )^{1/2}
  ≥ (1 − 1/θ)^{1/2} − θ^{−1/2}   a.s.
Let θ ↑ ∞ to find that Λ ≥ 1. □
6.4. Lebesgue's Differentiation Theorem. The fundamental theorem of calculus asserts that if f : R → R is continuous, then F(w) := ∫_0^w f(x) dx is differentiable and F' = f. In fact, we have the stronger result that
(8.52)  lim_{δ↓0} (1/δ) ∫_w^{w+δ} f(y) dy = f(w),
uniformly for all w in a given compact set. Here is why: For all w ∈ R and δ > 0,
(8.53)  | (1/δ) ∫_w^{w+δ} f(y) dy − f(w) | ≤ (1/δ) ∫_w^{w+δ} |f(y) − f(w)| dy.
Therefore, (8.52) follows from the uniform continuity of f on compact sets. There is a surprising extension of this, due to H. Lebesgue, that holds for all integrable functions f. The following is the celebrated differentiation theorem of Lebesgue.
Theorem 8.39. If ∫_0^1 |f(x)| dx < ∞, then (8.52) holds for almost every w ∈ [0, 1].
Consider the Steinhaus probability space ([0,1], B([0,1]), P). In probabilistic language, Theorem 8.39 states that (8.52) holds almost surely provided that f ∈ L¹(P). In order to derive this formulation we need a maximal inequality for the following function Mf, known as the Hardy-Littlewood maximal function (1930, Theorem 17). First we extend f to a function on R by setting f(w) := f(0) if w < 0 and f(w) := f(1) if w > 1. Then we define:
(8.54)  (Mf)(w) = M(f)(w) := sup_{δ∈(0,1−w)} (1/δ) ∫_w^{w+δ} |f(y)| dy   ∀ w ∈ [0, 1],
where 0/0 := 0 to ensure that (Mf)(1) = 0.
Theorem 8.40. For all λ > 0, p ≥ 1, and f ∈ L^p(P),
(8.55)  P{Mf > λ} ≤ (8/λ)^p ||f||_p^p.
Let us first prove Theorem 8.39 assuming the preceding maximal-function inequality. Theorem 8.40 is proved subsequently.
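Before turning to the proofs, here is a small numerical sketch of the averaging that appears in (8.52) and in Theorem 8.39. The step function f, the point w, and the grid below are illustrative assumptions, not from the text; at any point of continuity of f one sees δ^{-1} ∫_w^{w+δ} f(y) dy approach f(w) as δ ↓ 0.

```python
import numpy as np

# An illustrative integrable function on [0, 1]: a step function.
def f(y):
    y = np.asarray(y, dtype=float)
    return np.where(y < 0.3, 2.0, np.where(y < 0.7, -1.0, 0.5))

def average(w, delta, grid=10**5):
    """Numerically approximate (1/delta) * integral of f over [w, w + delta]."""
    y = np.linspace(w, w + delta, grid)
    return np.trapz(f(y), y) / delta

w = 0.65                       # f is continuous at w, with f(w) = -1
for delta in (0.2, 0.1, 0.01, 0.001):
    print(f"delta = {delta:6.3f}:  average = {average(w, delta): .4f}")
print("f(w) =", float(f(w)))
```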
Proof of Theorem 8.39. For notational convenience, define the "averaging operators" A6 as follows: 1
(8.56) A6(f)(w) := (A6f)(w) :=
b
f
w+6
f(y)dy dw E [0,1], f E L1(P)
Thus, we have the pointwise equality, (M f)(w) = suP6E[o,1-w] A6(I f I)(w)
Throughout this proof we tacitly extend the domain of all continuous functions g : [0, 11 - R to R by setting g(w) := g(0) for w < 0 and g(w) := g(1) for w > 1. Because continuous functions are dense in L1(P) (Problem 4.18, p. 50), for every n > 1 we can find a continuous function gn such that II9n - f II i n-1. Let 2 := lim sup6lo I A6f - f I to find that
.2 A, then by the triangle inequality one of the two terms on the right-most side must be at least A/2. Therefore, we can write (8.58)
P {2 > Al < T1 +T2,
where (8.59)
T1:=P{M(I9n-fl)> 2}
and
T2:=P{I9n-fI>
2}.
We estimate T1 and T2 separately. On one hand, we can apply Theorem 8.40, with p = 1, to deduce that (8.60)
T15
16 A
16 II9n-fII1 0. Also, we will extend the domain of the definition of f by setting f (w) := 0 for all w E R \ (0, 1]. Define .9'n to be the collection of all dyadic intervals in (0, 1]. That is, 1 E . !FnO if and only if I = (j2-n, (j + 1)2-n] where j E {0 , ... , 2n - 11 and n > 0. Define .$n to be the o-algebra generated by .fin. Since every element of FnO is a union of two of the elements of .fin+1+ it follows that gn C JFn+1
(8.62)
do > 0.
That is, {.$n}n°_o is a filtration; it is known as the dyadic filtration. We can view the function f as a random variable, and compute Mn E[f I 9n] using Corollary 8.8: (8.63)
r Mn(w) _ E 1Q(w)2n f f (y) dy
for almost all w E [0, 11.
QE.
It should be recognized that the preceding sum consists of one term only. Next define i to be the collection of all shifted dyadic intervals of the
form J = (j2-n +
2_n-1
,
(j + 1)2-n +
2-n_1),
where j E Z and n > 0.
Let Wn denote the o-algebra generated by the intervals in Wno, and define Nn := E[ f ] 8on]. Because f vanishes outside [0, 11, (8.64)
Nn(w) _
1Q(w)2n
JQ
for almost all w E [0, 11.
f (y) dy
Consider w E (0,1) and b E (0,1 - w). There exists n = n(w) > 0 such that 2-n-1 < 5 < 2-n. We can find 1(w) E JrnO and J(w) E fl-both containing w-such that (w, w + 6) C I (w) U J(w). Because f > 0, f =- 0 off [0,1], and b > 2-n-1, this implies that 1
(8.65)
d
f+' f (y) dy 0, (8.66)
P{Mf >A} n>0
4
A}+P(supNn> n>O
Al 4
Note that Mn + N,, is not a martingale because M and N are adapted to different filtrations. However, M and N are martingales in their respective filtrations. We apply the first maximal inequality of Doob (p. 134) to the
submartingale defined by IMnII" to find that (8.67)
P j suP Mn > l n>0
41
0
P. Ap IIfIIp
[The last inequality follows from the conditional form of Jensen's inequality.]
A similar inequality holds for N. We can combine our bounds to obtain, 2,4P (8.68)
P{Mf > A}
1 and f E LP(P) then (8.69)
f l I(Mf)(t)I" dt < (p8p1)p f 1 If(t)Ip dt.
6.5. Option-Pricing in Discrete Time. We now take a look at an application of martingale theory to the mathematics of securities in finance. In this example we consider a simplified case where there is only one
type of stock whose value changes at times n = 1,2,3, ... , N. We start with yo dollars at time 0. During the time period (n, n + 1) we look at the performance of this stock up to time n. Based on this information, we may decide to buy An+i-many shares. Negative investments are also allowed in the marketplace: If An(w) < 0 for some n and w, then we are selling short for that w. This means that we sell An(w) stocks that we do not own, hoping that when the clock strikes N, we will earn enough to pay our debts. Let Sn denote the value of the stock at time n. We simplify the model further by assuming that ISn+i -SnI = 1. That is, the stock value fluctuates b y exactly one unit at each time step, and the stock value is updated precisely at time n f o r every n = 1, 2, .... The only unexplained variable is the ending time N; this is the so-called time to maturity and will be explained later. Now we can place things in a more precise framework. Let St denote the collection of all possible w = (w1, ... , wN) where every
wj takes the values ±1. Intuitively, wj = 1 if and only if the value of our stock went up by 1 dollar at time j. Thus, wj = -1 means that the stock went down by a dollar, and Sl is the collection of all stock movements that are theoretically possible. Define the functions Si, ... , SN by So(w) := 0, and (8.70)
Sn(wi,...,wn)
wi +...+wn
dn= 1,...,N.
We may abuse the notation slightly and write &(w) in place of S,,(wi ...... a ).
In this way, S,,(w) represents the value of the stock at time n, and corresponds to the stock movements w1, ... , w,,. During the time interval (n, n + 1), we may look at w1,... , w,,, choose a number A,,+1(w) =A.+, (w, .... , wn ), and buy which might depend on shares. If our starting fortune at time 0 is yo, then our fortune at time n depends on { A; (w) };_11, and is given by n (8.71)
YY(w) = Yn(wl,...,wn) = yo + E A.i(w)[Sj(w) - Sy-1(w)], J=1
as n ranges from 1 to N. The sequence {A=(w)}( 1 is our investment strategy.
Recall that it depends on the stock movements {w;}N I in a "previsible manner": i.e., for each n > 2, A,,(w) depends only on w,.... ,Wn-l- JAI does not depend on w.]
A European call option is a gamble wherein we purchase the right to buy the stock at a given price C-the strike or exercise price-at time N. Suppose we have the option to call at C dollars. If it happens that SN(w) > C, then we have gained (SN(w) - C) dollars. This is because we can buy the stock at C dollars and then instantaneously sell the stock at SN(W). On the other hand, if SN(w) < C then it is not wise to buy at C. Therefore, no matter what happens, the value of our option at time N is (SN(w) - C)+. An important question that needs to be settled is this: (8.72)
What is the fair price for a call at C?
This was answered by Black and Scholes (1973) and Merton (1973) for a related, but slightly different, model. The connections to probability were discovered later by Harrison and Kreps (1979) and Harrison and Pliska (1981). The present model, the so-called "binomial pricing model," is due to Cox, Ross, and Rubenstein (1979). In order to explain their solution to (8.72) we need a brief definition from finance.
Definition 8.42. A strategy A is a hedging strategy if-
(i) Using A does not lead us to bankruptcy; i.e., Yn(w) > 0 for all
n = 1,...,N. (ii) Y attains the value of the stock at time N; i.e., YN(W) = (SN(w) - C)+. Of course any strategy A is also previsible.
Let us posit that there are no "arbitrage opportunities," where arbitrage is synonymous to "free lunch." That is, we assume there are no risk-free investments. Then, in terms of our model, yo is the "fair price of a given
option" if, starting with yo dollars, we can find a hedging/investment strategy that yields the value of the said option at time N, no matter how the stock values behave.
The solution of Black and Scholes (1973), transcribed to the present simplified setting, depends on first making (Q, .1'(l)) into a probability space. Here, Y(Q) denotes the power set of Q. Define the probability measure P so that Xj (w) = wj are i.i.d. taking the values ±1 with probability each. In words, under the measure P, the stock values fluctuate at random
but in a fair manner. Another, yet equivalent, way to define P is as the product measure: (8.73)
P(dw) = Q(dwl) ... Q(dwN)
dw E 1,
where Q({1}) = Q({-1}) = 1/2. Using this probability space (1, .9(l), P), {A;}°°1, {S1}°O1, and {Yt}°_1 are stochastic processes, and we can present the so-called Black-Scholes formula for the fair price yo of a European option.
The Black-Scholes Formula. A hedging strategy exists if and only if
(8.74)  y_0 = E[ (S_N − C)^+ ].
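A direct numerical sketch of the formula (not from the text; the maturity N and strike C below are illustrative): under the fair-coin measure P of (8.73), each path ω ∈ {−1, +1}^N has probability 2^{−N}, so y_0 = E[(S_N − C)^+] can be computed by enumerating all 2^N paths.

```python
from itertools import product

def fair_price(N, C):
    """Compute y0 = E[(S_N - C)^+] under the fair-coin measure of (8.73)."""
    total = 0.0
    for omega in product((-1, 1), repeat=N):
        S_N = sum(omega)                 # S_N(omega) = omega_1 + ... + omega_N
        total += max(S_N - C, 0)
    return total / 2 ** N

# Illustrative time to maturity and strike price.
print("fair price for N = 10, C = 2:", fair_price(10, 2))
```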
Proof (Necessity). We first prove Theorem 6.5 assuming that a hedging strategy A exists. If so then the process Yn defined in (8.71) is a martingale; see Example 8.15. Moreover, by the definition of a hedging strategy, Yn > 0
for all n, and YN = (SN - C)+ a.s. (in fact for all w). On the other hand, martingales have a constant mean; i.e., EYN = EY1 = yo, thanks to (8.71). Therefore, we have shown that yo = E[(SN - C)+] as desired. (] In order to prove the second half we need the following.
The Martingale Representation Theorem. In (Q, .9(1k), P), the process S is a mean-zero martingale. Any other martingale M is a martingale transform of S; i.e., there exists a previsible process H such that n
(8.75)
Mn = EMI + > H3 (Si - Sj-1)
do = 1, ... , N.
j=1
Proof. Because Lemma 8.29 proves that S is a mean-zero martingale, we can concentrate on proving that M is a martingale transform. Since M is adapted, Mn is a function of w1, ... , wn only. We abuse the notation slightly, and write (8.76)
Mn(w) = Mn(w1, ... , Wn)
Vw E I1.
The martingale property states that E[Mn+1 1,9n] = Mn a.s. Now suppose 01, ... , On are bounded and Oj is a function of wj only. Then, thanks to the
independence of the wj's,
E
OLi.Mn+h] =1
(8.77)
[Jctj(wj)Mn+l(wl,...,wn,-1)Q(dwl) ... Q(dwn)
-2 1
+2
J j=1
j(wj)Mn+1(w1,...,wn,1)
G1
1) ... Q(dwn)-
That is, we can write n
n
E f Oj M-+1 = E H Oj . Nn
(8.78)
j=1
,
j=1
where (8.79)
1
Nn(w)
1
=2Mn+1(wl,...,wn, 1)+2Mn+1(wl,...,wn,
Note that Nn is Fn measurable for every n = 1, . . . , N. Therefore, (8.77) and the martingale property of M together show that M = N a.s. This leads us to the formula (8.80)
1
1
Mn(w) = 2Mn+l(wl,... ,wn, l) + 2Mn+1(wl,...,wn, -l),
valid for almost all w E 52.1 In fact, because fl is finite and P assigns positive measure to each wj, the stated equality must hold for all w. Moreover, since
go = 10, 52}, the preceding discussion continues to hold for n = 0 if we define Mo = EMI. Since Mn(w) = ZMn(w) + Z1Lin(w), the following holds for all 0 < n
0 such that for all x, y E (0, 11, (8.84)
If (x)
- f(y)I < AIx - yl.
The optimal choice of A is called the Lipschitz constant of f . If f' exists and is continuous, then one can perform a one-term Taylor expansion to note that f is Lipschitz continuous. The following theorem of Rademacher (1919) asserts a remarkable converse.
Theorem 8.43. If f : (0, 1] -+ R is Lipschitz-continuous then it is differentiable almost everywhere.
Proof. Let ((0,1] , 4((0,1]) , P) denote the Steinhaus probability space, so that P is Lebesgue measure, and "a.e." is the same thing as "a.s." Also let {.`fi'n}°° 1 and {.9n}n°_1 respectively denote the dyadic intervals and filtration
(p. 142).
two numbers, f(Q) and We associate to all dyadic intervals Q E r(Q): t(Q) is the left end-point of Q and r(Q), the right one. For instance, if Q = (k2--, (k + 1)2-n], then f(Q) = k2-n and r(Q) = (k + 1)2-n. Define (8.85)
Xn(w) :_ QE.9"n
f(r(Q)) - f(e(Q))1Q(w) r(Q) - e(Q)
dw E (0, 1).
This is a difference quotient because the sum consists of exactly one term and the Lipschitz continuity off ensures that sup,, I Xn I is a bounded random variable. From here on, the proof splits into two steps. Step 1. X is a martingale with respect to .9. To prove this, write (8.86)
Xn(w) = E E
f (r(Q))2- f (E(Q))1Q(w)
JE.9 _1 QE.$,o,:
QCJ
By Corollary 8.8, if w E J E .fin-1 and Q E gn is a subset of J, then P(Q I gn_1)(w) is the classical probability P(Q I J), which is z. If U) ¢ J then P(Q I.n_1)(w) = 0. Thus, E[Xn 19n-1] =
E
2
f (t(Q))1 J
f (r(Q)
QE.Fn:
2-
n
QCJ (8.87)
f (r(J)) - f (t(J))1J
_
2-'
JE.3n_1
= Xn_1.
According to the martingale convergence theorem, all bounded martingales
converge a.s. and in L1(P) (p. 134). Therefore, we can find X. such that Xn
X,,,, a.s. [P] and in L1(P).
Step 2. The Conclusion. Suppose I, J E .fin for the same n > 1, and I lies to the left of J; that is, every u E I is less than every v E J. Then we denote this by I < J. For all Q E .5n, fQ Xn(w) du) = f (r(Q)) - f (f(Q)). Therefore, for all
(8.88)
r f (r(J)) - f (0) = E / Xn(w) dw = IEYfO:
I<J
I
j
"M
Xn(w) dw.
Given any x E (0, 11 and n > 1, we can find a unique J E .'ro such that w E .`ro . If A denotes the Lipschitz constant of f, then
- f (r(J))I < Aix - r(J) I < 2 Therefore, If (x) - f (0) - fo ") Xn(w) dwi < A2-n. Also, (8.89)
If (x)
jr(j) Xn(w)
(8.90)
rx+2-^
dw - fox Xn(w) dw
1 such that the sequence {X1, ... , Xk} contains either "01" or "11." Then, we argue as before and
find that E[WT - ZT,2) = 0. But WT - ZT,2 = q-1 on ("01" comes up first}, and WT - ZT,2 = -(1/p) - (1/p2) = -(p + 1)/p2 on {"11" comes up first}. Therefore,
0=E[WT-ZT,21 (8.98)
= P { "01" comes up first} -
p p+21
P { "11" comes up first}.
Solve to find that (8.99)
P ("01" comes up before "11" } = 1 - p2.
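The pattern probability (8.99) is easy to confirm by simulation. The following sketch is not from the text; the choice of p and the number of repetitions are illustrative. It scans an i.i.d. {0,1}-valued sequence with P{1} = p and records which of the patterns "01" and "11" appears first.

```python
import numpy as np

rng = np.random.default_rng(8)

def zero_one_first(p):
    """Return True if "01" appears before "11" in an i.i.d. Bernoulli(p) sequence."""
    prev = int(rng.random() < p)
    while True:
        cur = int(rng.random() < p)
        if prev == 0 and cur == 1:
            return True                  # "01" completed first
        if prev == 1 and cur == 1:
            return False                 # "11" completed first
        prev = cur

p = 0.6                                   # illustrative success probability
freq = np.mean([zero_one_first(p) for _ in range(50000)])
print('estimated P{"01" before "11"}:', freq)
print("1 - p^2                      :", 1 - p**2)
```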
6.8. Random Quadratic Forms. Let {Xi}i=1 be a sequence of i.i.d. random variables. For a given double array {ai,,, } 2,7 =1 of real numbers, we wish
to consider the "quadratic form" process, do > 1.
Q,,:= >2 > ai,jXiXj
(8.100)
1 1, hk(t) = Ee1Xk exists and is finite for all t E (-to, to) for a fixed to > 0. Prove that whenever III < to, Mn(t) _ e1Sn / {Zk=1 hk(t) defines a mean-one martingale. [As usual, S. denotes E
1
X,.]
8.16 (Likelihood Ratios). Suppose f and g are two strictly positive probability density functions on R. Prove that if {X,};__1 are i.i.d. random variables with probability density f, then Fl,'= I [g(X, )/ f (Xj)J defines a mean-one martingale. When does it converge. and in what sense(s)?
8.17 (Ptilya's Urns). An urn initially contains R red and B black balls. Except for their colors, the balls are identical. A ball is chosen at random. If it comes up red (reap. black), then it is replaced with two red (reap. black) balls. Let X denote the number of red balls in the urn after n draws. Prove that the fraction fn = Xn/(n+ R + B) of red balls has an almost-sure limit. 8.18. Prove that Definition 8.12(iii) is equivalent to E[Xn+k I ,fnJ >- Xn as, for all k, n > 1. 8.19. Prove that Doob's decomposition (p. 128) is a.s.-unique, as long as we insist that {Zn}n°__1 is previsible and Z1 = 0.
8.20. Let {Xn}n 1 be a martingale. Prove that X is bounded in L1 (P) if sup E[X,+, J < oo.
8.21. Prove that if X is a martingale, then (8.105)
r PtlmaxJXi[>a <E(]X.[P) 111
j_
-
"p > 1, n > 1, \ > 0.
Also, prove that Doob'a inequalities imply Kolmogorov's maximal inequality (p. 74).
8.22 (Doob's LP Inequality). Suppose l; and ( are a.s. non-negative random variables such that va > 0.
P{(> a} < 1 E[(;(> a]
(8.106)
a
Prove that for all p > 1, (8.107)
IIfIIP 5 (p p 1) IKIIP
Use this show the strong LP-inequality of Doob: If X is a non-negative submartingale and Xn E LP(P) for all n > 1 and some p > 1, then r
11
E Ilmax X?J
1), and consider an f-stopping time T that has a finite mean. Prove Wald's second identity, VarSr = VarXi ET. (HINT: Problem 8.22.)
8.26. Consider two random variables X and Y, both of which are defined on a common probability space (11, Jr, P). Define (8.109)
(-) 1{xeL,2
Xn =
2n
7=-a
" (7+1)2 ^))
vn
1.
Prove that for any Y E LI(P), limn_ ELY I Xn] = E[Y I X) a.s. and in Lt (P).
8.27. Suppose that Y E L1(P) is real-valued, and that X is a random variable that takes values in Rn. Prove that there exists a Borel measurable function f such that E[Y I X] = f(X) almost surely.
8.28. Suppose that {Xi}1,=1 are independent mean-zero random variables in L2(P) that are bounded; that is, that there exists a constant B such that almost surely, IXnI < B for all n. Prove that for allA>Oandn>1, (8.110)
(
P(ma!cn lS,I-A} 1925).
(B
+
VarSn
(Khintchine and Kolmogorov,
8.29 (Martingale Convergence in LP). Refine the martingale convergence theorem by showing that limn-- Xn exists in LP(P) whenever X is bounded in LP for some p > 1. In addition, prove
that if X = lim-a, Xn, then Xn = E[X I9n] as. for all n > 1. (HINT: Use Problem 8.22.)
8.30 (Double-or-Nothing). Let {y,) 1 denote a sequence of i.i.d. random variables with P{11 = 0) = P{71 = 1) = 1. Consider the stochastic process X, where X1 := 1, and Xn := 2Xn_i7n for all n > 2. Prove that X is an L1-bounded martingale that does not converge in L'(P). Consequently, Problem 8.29 can fail for p = 1. Compute the almost-sure limit of X.. 8.31. Let {Xn}n 1 be independent standard normal random variables and define S.
X,
(n > 1). Prove that Mn = (n + 1)-1/2exp{Sn/(2n + 2)} defines a mean-one martingale (Woodroofe, 1975, Problem 12.10, p. 344).
8.32 (Problem 8.31, Continued). Define Mn as in Problem 8.31. Use only the martingale convergence theorem (p. 134) and the CLT (p. 100) to prove that limn-,c Mn = 0 a.s. Derive the following precursory formulation of the LIL (Steinhaus, 1922):
Sn=o((ninn) 1/2)
(8.111)
a.s.asn-+oo.
8.33. Suppose X is a submartingale with bounded increments; i.e., there exists a non-random finite constant B such that almost surely, ]X,, - Xn_ 1 I < B for all n > 2. Then prove that limn Xn exists as. on the set {sup,, IXm] < oo}.
8.34. Suppose {.fin}- 1 is a filtration of a-algebras, and Y E L1(P) is fixed. Define M,, _ E[Y ]5n] to be the corresponding Doob martingale. Prove that for all finite stopping times T, MT = ELY I.5T1 as. (Dubins and Freedman, 1966).
8.35. Prove that X is a martingale if and only if EMT = EMI for all bounded stopping times T. Characterize super- and submartingales similarly. 8.36. Let {Xn}°,,°-e be a non-negative supermartingale that attains the value zero at some a.s.finite time. Prove that limk_,o Xk = 0 as. 8.37. Follow the proof of Theorem 8.40 and prove that c1 A(f) < M f < c2A(f) where cl and c2 are positive and finite constants that do not depend on f, and A(f) := sup,, Afn + sup,, Nn. 8.38. The following is a variant of Problem 8.17. First choose and fix A E (0, 1). Then consider random variables X,, E (0, 11, adapted to .in, such that a.s. for all n > 1,
P{Xn+1=A+(1-A)XnIAn}=Xn (8.112)
P(Xn+l =0-X)Xnl9n}=1-Xn.
Prove that X := limn-,o Xn exists as. and in LP(P) for all p > 1, and that X_ is zero or one almost surely. Compute P{X = *-
8.39. Suppose that {X,},'=1 are i.i.d. with P{X1 = 1} = P{Xl = -1} = 1/2. As before, let
S,, :=X1+...+Xn.
(1) Prove that Eexp(tS,) 1 and t E R(2) Prove that (8.41) continues to hold in the present setting.
(3) Prove the following half of the LIL for (±1) random variables: limsup
Sn
1.
8.42 (Theorem 8.39, Continued). Prove that if f : (0, 1)k -» R is integrable (k > 2), then for almost all x E (0, 1)k 1
(8.113)
:k+a
610 Sk I lim k
...
S,+e I1
j(u) duI ... duk =
f (X).
8.43. Let XI be uniformly distributed on (0,1). Conditionally on XI, define X2 to be uniformly distributed on (0, XI); i.e., P{X2 E A I XI) = m(A n [0, X I ])/X 1 , where m denotes the Lebesgue measure. Iteratively define (8.114)
P{Xn E AI X1,..., Xn-1} =
m(An(0, Xn-1]) Xn-1
Explore the structure of {Xn}n I, and the behavior of Xn for large n. 8.44 (Patterns). Verify Lemma 8.44. Also, find the probability that we see f consecutive ones before k consecutive zeros.
8.45 (U-Statistics). Prove that that (8.115)
0 in the proof of Theorem 8.45. From this conclude
E [(Qn - An)2] =4 F
alt + Var(X?)
a2 ;.
1 1. Prove that Esupn ISn/ni and E{IXII In+ IXII) converge and diverge together (Burkholder. 1962). 8.58 (Problem 8.39, Continued; Harder). Prove the other half of the LIL (p. 138) for (±1) random variables: lim supn_,o Sn/an > 1 a.s., where an := (2n In In n)1 /2. You may use the following argument (de Acosta, 1983, Lemma 2.4):
(1) Prove that it suffices to show that for all c > 0,
InP {Sn > cI/2an} > -c. 1 n-. lnlnn (2) To establish (1) choose pn -. oo such that n divides p,,, and then prove that liminf
(
n
P {Sn > cI/Zan} > \P {Spy > eI/ZanPn/n}) (3) Use the central limit theorem and the preceding with pn - an/(Inlnn) to prove (1). Conclude the proof of the LIL for (±1) random variables.
(HINT: For part (2) first write S. = 5,,,, +(S2p - S,,,)+ +(Sn -S(n_I)p,,/n). Next observe that if each of these n/p terms is greater than pn A/n, then S. > A. Finally choose A judiciously. For part (3) optimize over the choice of a.) 8.57 (Problem 8.24, Continued; Harder). Suppose {Xn }n= is a martingale with respect to some filtration {.9rn},a 1. Suppose also that do = X. - Xn_I satisfies Idnl S a for all n > 1, where Xo = 0, Sto = (0,f1), and a is a non-random positive constant.
(1) Prove that for all x E R, e'' < I + x + x2el=I. Use this to prove that for all I E R
and all i=1,2,..., ed,
I
._
1.
Moreover, verify that EMn < 1 for all n > 1. (3) Prove that if X. > 0, then limn-.oo X. IA. exists and is finite as. Prove also that
lim Xn(w) = 0
for almost all w E {A, = oo}.
Notes (1) Martingales were first introduced and studied by Ville (1939). The current powerful theory was formed by Doob (1940, 1949) shortly thereafter. (2) Our proof of the martingale convergence theorem (p. 134) is due to Isaac (1965). Aside from this and the original proof of Doob (1940) there are other nice proofs of Doob's martingale convergence theorem. For example, see Chatterji (1968), Helms and Loeb (1982), Lamb (1973), and C. lonescu Tulcea and A. lonescu Tulcea (1963). (3) An enormous literature is devoted to the study of the law of the iterated logarithm and its variants. An excellent starting point is the theorem of Strassen (1967). It implies that if X1, X2,... are i.i.d., EXi = 0, and VarXj = 1, then on some suitable probability space there exist i.i.d. N(0, 1) random variables {G;},°, such that
i_t Xi - E" _tGi n
nlim(n log log n) 1/2
l
-0
a.s.
In particular, this shows that the general LIL follows from the one proved here. More-
over, if the X;'s have higher moments than two, then the rate of approximation can be improved upon. This is the starting point of a theory of "strong approximations." Csorg6 and Rhvbsz (1981) is an excellent treatment. Two scholarly reviews of the LIL are Feller (1945), for the classical theory, and Bingham (1986), for the more modern advances.
(4) Equation (8.45) has the following improvement, due to Laplace (1805, pp. 490-493):
P{X1>a)=
A2/2
kr
A2
1+
A2
1+2 1+3
A2
1+4
(5) Theorem 8.39 is also known as the Lebesgue density theorem. It states that the antiderivative of every f E L'(dx) is f a.e. On the other hand, it is the case that "most" continuous functions are nowhere differentiable (Banach, 1931; Mazurkiewicz, 1931; Paley, Wiener, and Zygmund, 1933; Kahane, 1997, 2000, 2001). (6) The material on option-pricing (§6.5) is based in part on the discussions of Baxter and Rennie (1996, Chapter 2) and Williams (1991, Section 15.2). There you will learn, among other things, that there are in fact hedging strategies that never sell short. This demonstration requires only a little more effort than the proof described here, and is worth looking at. (7) The notion of Lipschitz continuity is due to the work of Lipschitz (1876) on differential equations. (8) One can streamline the method of §6.7; see Li (1980) and Gerber and Li (1981). (9) Problem 8.46 is, in essence, borrowed from the exciting book of Mahmoud (2000, pp. 48-51) on sorting. It can be shown that
f
R. - (n/2)
N(0, 1/12).
(ibid., Proposition 1.10, p. 51).
(10) Problem 8.48 is due to de Finetti (1937), but the proof outlined here is borrowed from Doob (1949). Aldous (1985) presents a masterly modern-day account of exchangeability and related topics. (11) When the increments of X are independent, Problem 8.50 is due to Hoeffding (1963). The general case is due to Azuma (1967), and is proved by the same argument.
(12) Problem 8.54 states that all exchangeable sequences of zeros and ones are "conditionally i.i.d." The proof outlined here is motivated by Exercise 6.3 of Durrett (1996, p. 271). For a detailed historical account see Cifarelli and Regazzini (1996). Remarkably enough, de Finetti's theorem has consequences in diverse subjects such as the philosophy of statistics (de Finetti, 1937; Kyhurg and Smokier, 1980), statistical mechanics (Georgii. 1988), and geometry of Hilbert spaces (Bretagnolle and Dacunha-Castelle. 1969). (13) Problem 8.57 generalizes the 'law of large numbers" of Dubins and Freedman (1965). The central ideas used here come from a paper of de Acosta (1983).
Chapter 9
Brownian Motion
The theory of random functions always makes the impression of a much greater degree of artificiality than corresponds to the facts.
-Raymond E. A. C. Paley and Norbert Wiener
On March 29, 1900, a doctoral student of J. H. Poincare by the name of Louis Jean Baptiste Alphonse Bachelier presented his thesis to the Faculty of Sciences of the Academy of Paris. Louis Bachelier's work was chiefly concerned with finding "a formula which expresses the likelihood of a market fluctuation" (Bachelier, 1964, p. 17). Bachelier's solution to this problem required the introduction of a number of novel ideas, one of which was today's "Brownian motion." See also the English translation in the volume edited by Cootner (Bachelier, 1964). In 1828, the botanist Robert Brown noted empirically that the grains of pollen in water undergo erratic motion. Brown himself admitted to not having a scientific explanation for this phenomenon. And it was years later, in 1905, that an explanation was found by Albert Einstein. The key idea in Einstein's solution was the introduction of a stochastic process that Einstein called "the Brownian motion." Unaware of the earlier work of Bachelier in economics, Einstein had rediscovered that Brownian motion is related concretely to the diffusion of particles. As a main application of his theory, Einstein found a very good estimate for Avogadro's constant. Einstein's theory was tacitly based on the assumption that the Brownian motion process exists. Nearly two decades later, Wiener (1923a) proved the validity of Einstein's assumption. In the present context, the contributions of von Smoluchowski (1918) and Perrin (1913) are also particularly noteworthy. 159
From a mathematical point of view, Bachelier's work went further than Einstein's. However, we introduce the latter's work because it is easier to describe. Thus, we begin with a modern statement of Einstein's postulates: Brownian motion .( W(t) }t>o is a random function oft (= "time") such that:
(P-a) W(0) = 0, and for any given time t > 0, the distribution of W(t) is normal with mean zero and variance t.
(P-b) For any 0 < s < t, W(t) - W(s) is independent of Think of s as the current time. Then, this condition is saying that "given the value of W at the present time, the future is independent of the past." This is called the Markov property.
(P-c) The random variable W(t) - W(s) has the same distribution as W(t - s). That is, Brownian motion has stationary increments. (P-d) The random path t' -+ W(t) is continuous with probability one.
Remark 9.1. One can also have a Brownian motion B that starts at an arbitrary point x E R by defining (9.1)
B(t) := x + W(t),
where W is a Brownian motion started at the origin. One can check directly that B has all the properties of W, except that B(t) has mean x for all t > 0, and B(0) = x. Unless stated to the contrary, our Brownian motions always
start at the origin.
So why are (P-a) through (P-d) postulates and not facts? The sticky point is the a.s.-continuity (P-d). In fact, Levy (1937, Theoreme 54.2, p. 181) has proven that if in (P-a) we replace the normal by any other distribution,
then either there is no process that satisfies (P-a)-(P-c), or else (P-d) fails to hold. In summary, while the predictions of theoretical physics were correct, a more solid understanding of Brownian motion required a rather in-depth undertaking such as that of N. Wiener. Since Wiener's work Brownian motion has been studied by multitudes of mathematicians. This and the next chapter aim to whet your appetite to learn more about this elegant theory.
1. Gaussian Processes Let us temporarily leave aside the question of the existence of Brownian motion, and first study normal distributions, Gaussian random variables, and Gaussian processes. Before proceeding further, you may wish to recall §5.2 (p. 11), as well as Examples 3.18 (p. 28) and 7.12 (p. 97), where normal random variables and their characteristic functions have been introduced.
1. Gaussian Processes
161
1.1. Normal Random Variables. Definition 9.2. An R-valued random variable Y is said to be centered if Y E L' (P) and EY = 0. An R"-valued random variable Y = (1',.. . , Yn)' is said to be centered if each Y is. If, in addition, E{IYj2} < 00 for all i = 1,. .. , n, then the covariance matrix Q = (Q,,j) of Y is the matrix whose (i,j)th entry is the covariance of Y and Yj; i.e., Qjj = E[YYj]. Suppose that X = (X1,. .. , Xn)' is a centered n-dimensional random variable in L2(P). Let a E R" denote a constant (column) vector, and note that a'X = a X = En 1 a;X1 is a centered R-valued random variable in L2(P) with variance n
n
Var(cz X) _ E E aiE[X;Xj[aj = a'Qa,
(9.2)
i=1 j=1
where Q = (Q;,j) is the covariance matrix of X. Since the variance of any random variable is non-negative, we have the following. Lemma 9.3. If Q denotes the covariance matrix of a centered L2 (p) -valued
random variable X = (X1, ... , Xn)' in R", then Q is a symmetric nonnegative definite matrix. Moreover, the diagonal terms of Q are given by Qj,j = VarXj. Definition 9.4. An R"-valued random variable X = (X1,.. . , Xn)' is centered normal (or centered Gaussian) if for all a E R",
= e-za'Qa where Q is a symmetric non-negative definite real matrix. The matrix A is called the covariance matrix of X. (9.3)
Ee'a.X
We have seen in Lemma 9.3 that covariance matrices are symmetric and non-negative definite. The following implies that the converse is true also.
Theorem 9.5. Let Q be a symmetric non-negative definite (n x n) matrix of real numbers. Then there exists a centered normal random variable X = (X1, . . , Xn) whose covariance matrix is Q. If Q is non-singular, then the distribution of X is absolutely continuous with respect to the n-dimensional .
Lebesgue measure and has the density o f Example 3.18 (p. 28) with Q replac-
ing E there. Finally, an R"-valued random variable X = (X1, ... , Xn) is centered Gaussian if and only if a'X is a mean-zero normal random variable
for all a E R. denote the n eigenvalues of Q. The )j's 1 are real and non-negative. Let {v;};'_1 denote the respective orthonormal eigenvectors, and view them as column vectors. Then the (n x n) matrix
Proof (Sketch). Let {a;}
P = (vi, ... , v,) is orthogonal. Moreover, we can write Q = P'AP, where A is the diagonal matrix of the eigenvalues al, ... , .\n. Next, let {Z;} 1 denote n independent standard normal random variables. It is not difficult to see that Z = (Z1,. .. , Zn)' is a centered Rnvalued random variable whose covariance is the identity matrix. Define
X := P'A1/2Z, where A'/2 denotes the diagonal matrix whose jth diagonal entry is x 2. Since X = (X1, ... , Xn)' is a linear combination of centered random variables, it too is a centered Rn-valued random variable. Define n
(9.4)
ll at (P'A1/2)1,k
Ak =
vk = 1.... , n.
1=1
Then a. X = Ek=1 ZkAk for all a E Rn. By independence, n
Eet°'X = [J Ee`4Ak = e-2 Ek=, Ak. k=1
One can check readily that Ek=1 Ak = a'Qa. Therefore, we have constructed a centered Gaussian process X that has covariance matrix Q. To check that Q is indeed the matrix of the covariances of X, make another round of computations to see that E[X;XJJ = Q. Next we suppose that Q is non-singular. Let j2 denote the distribution
of X. We have shown that µ(t) = exp(-Zt'Qt). Because µ is absolutely integrable on R1, the inversion theorem (see Problem 7.12, p. 112) implies that the probability density f = du/dx exists and is given by the formula (9.6)
AX) =
1
e-it"_2t'Qt
dt
vx E R.
Write Q:= P'A1/2A1/2P, and change variables [s = A2 Pt]. This transforms the preceding n-dimensional integral into a product of n one-dimensional integrals. Each of the latter integrals can be computed painlessly by completing the square. Therefrom follows the form of f. To complete this proof, we derive the assertion about linear combinations. First suppose a'X is a centered Gaussian variable in R. We have seen already that its variance is a'Qa, where Q denotes the covariance matrix of
X. In particular, thanks to Example 7.12 (p. 97), Ee'a'X = exp(-Za'Qa). Thus, if a'X is a mean-zero normal variable in R for all a E Rn, then X is centered Gaussian. The converse is proved similarly.
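The construction in this proof translates directly into a sampling recipe. The following sketch is not from the text: it takes an arbitrary illustrative covariance matrix Q, forms X = P'Λ^{1/2}Z from i.i.d. standard normals exactly as above, and checks the empirical covariance.

```python
import numpy as np

rng = np.random.default_rng(5)

# An illustrative symmetric non-negative definite matrix Q.
Q = np.array([[2.0, 0.8, 0.3],
              [0.8, 1.0, 0.2],
              [0.3, 0.2, 0.5]])

# Spectral decomposition Q = V diag(lambda) V' with V orthogonal
# (the columns of V are orthonormal eigenvectors of Q).
eigvals, V = np.linalg.eigh(Q)
sqrt_Lambda = np.diag(np.sqrt(eigvals))

def sample(n):
    """Draw n independent copies of X = V Lambda^{1/2} Z, with Z standard normal in R^3."""
    Z = rng.standard_normal(size=(3, n))
    return V @ sqrt_Lambda @ Z

X = sample(200_000)
print("empirical covariance:\n", np.cov(X))
print("target Q:\n", Q)
```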
Remark 9.6. According to Theorem 9.5, the covariance matrix of a centered normal random variable determines its distribution. However, it can happen that Xl and X2 are normally distributed even though (X1, X2) is not; see Problem 9.4 below. This demonstrates that in general the normality
1. Gaussian Processes
of (X1, ... ,
163
is a stronger property than the normality of the individual
XD's.
The following important corollary asserts that for normal random vectors independence and uncorrelatedness are one and the same.
Corollary 9.7. Let (X 1, ... , X,,, Y1,. .. , Ym) be a centered normal random variable such that Cov(X;, YY) = 0 for all i = 1, ... , n and j = 1, ... , m. Then (XI, ... , and (Y1,. .. , Ym) are independent.
1.2. Brownian Motion as a Gaussian Process. Definition 9.8. We say that a real-valued process X is centered Gaussian if (X (tl ), ... , X (tk)) is a centered normal random variable in Rk for all 0 < t1, t2, t3, ... , tk. The function Q(s, t) := E[X (s)X (t)] is called the covariance function of the process X.
For the time being, we assume that Brownian motion exists, and derive some of its elementary properties. We establish the existence later on.
Theorem 9.9. If W := {W(t)}t>o denotes a Brownian motion, then it is a centered Gaussian process with covariance function Q(s, t) := s A t. Conversely, any a.s.-continuous centered Gaussian process that has covariance function Q and starts at 0 is a Brownian motion. Furthermore:
(1) (Quadratic Variation) For each t > 0, as n V n-I
jr
oo,
- W \ \n) t)J2
t.
(2) (The Markov Property) For any T > 0, the process
{W(t + T) - W(T)}t>o is a Brownian motion that is independent of of{W(r)}o 0, so we can take expectations and apply Minkowski's inequality to obtain rr
E suP IA(t)1 2 L t>0
(9.20)
l J
2n+1
<E
2
k=24+1 1 2n}1
_E
k=2n+l
2n-1
2n+1 -1
1=1
k=2n+l
+2
+
2n-1 2 t=1
V.
XkXl+k k(k + 1)
2'+1 -1
E
k=2n+1
2
XkXl+k 2
k(k+l)
1
12
(Why?) The final squared-L2(P)-norm is equal to k-2(k + l)-2. On the other hand, by monotonicity, .k=2n+1 k-2 < 2_n and .1=1 1(1+2n)-1 < 1. Therefore, (9.21)
E [SUP f''n(t)12]
< 2-n +2.2
-n/2.
It follows that En Supt>0 IS2n+1 (t) - S2n (t)I < oo a.s. Thus, as n -+ oo, S2n (t) converges uniformly in t > 0 to the limiting random process (9.22)
sin(jnt)Xi
Si(t) j=1
7
In particular, W (t) = limn-,,,, W2n (t) exists uniformly in t > 0 almost surely. Step 2. Continuity and Distributional Properties. The random map t -+ W2n (t) defined in (9.15) is obviously continuous. Because W is an a.s.uniform limit of continuous functions, it is a.s. continuous; see Step 3 below
for a technical note on this issue. Moreover, since W2n is a mean-zero
Gaussian process, so is W (Problem 9.2). Since W(0) = 0, it remains to prove that (9.23)
E [IW(t) - W(s)12] = t - s
dO < s < t.
But the independence of the X's, together with Lemma 6.8 (p. 67), yields
E [IW(t) - W(s)12] = (t - s)2 + (9.24)
_ (t - s)2 +
T2 E
[(S.(t) - Soo(s))2]
J=1J 2
(sin(Jlrt)
"0
sin(as)) 2
2
Define f(x) = 1[,ra,nti(x)+1[-nt,-as](x) (x E [-7r, -7r)) and On(x) = ei"z/ 2ir (x E [-7r, 7r]; n = 0, ±1, ±2,...). Then, 00
E [IW(t) - W(8)12] = (t - s)2 + (9.25)
1
-
2ar
00
f
2
f (x)O, (x) dx 2
IIn
f(x) b (x) dxl =-oo
a
By the Riesz-Fischer theorem (Theorem A.5, p. 205), the right-hand side is equal to (21r)-' f "'r lf (x) I2 dx = (t - s). This yields (9.23). Step 3. Technical Wrap-up. Now we tie up a subtle loose end: The uniform limit W (t) = limn-W W2.. (t) is known to exist, but only with probability one. This is insufficient because we need to define W (t, w) for all W. Thus, we define
W(t) := limsupW2n(t). n-oo The process W is well defined and continuous a.s. The remainder of the calculations of Step 2 goes through, since by redefining a random variable on a set of measure zero we do not change its distribution (unscramble this!). Finally, the completion is needed to ensure that the event C that W is continuous is measurable; in Step 1, we showed that C° is a subset of a null set. Because the underlying probability is complete, C` is null. (9.26)
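The construction just completed can be simulated by truncating the random Fourier series. The sketch below is not from the text: the normalization √2/π is the standard one for the Wiener series on [0, 1] (the book's own bookkeeping in (9.15) may differ), and the truncation level is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(6)

def brownian_path(t, n_terms=4096):
    """Truncated random Fourier series approximating Brownian motion on [0, 1].

    W(t) ~ t*X_0 + (sqrt(2)/pi) * sum_{j=1}^{n_terms} sin(j*pi*t) * X_j / j,
    with X_0, X_1, ... i.i.d. standard normal; this normalization gives
    E[W(s)W(t)] = min(s, t).
    """
    X = rng.standard_normal(n_terms + 1)
    j = np.arange(1, n_terms + 1)
    series = np.sin(np.pi * np.outer(t, j)) / j     # matrix of sin(j*pi*t)/j
    return t * X[0] + (np.sqrt(2.0) / np.pi) * series @ X[1:]

t = np.linspace(0.0, 1.0, 501)
W = brownian_path(t)
print("W(1) for this sample path:", W[-1])          # exactly X_0 here, so N(0, 1)

# Sample variance of W(1/2) over many paths should be close to 1/2.
values = [brownian_path(np.array([0.5]))[0] for _ in range(2000)]
print("sample variance of W(1/2):", np.var(values))
```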
3. Nowhere-Differentiability We are in a position to prove the following striking theorem of Paley, Wiener, and Zygmund (1933). Throughout, W denotes a Brownian motion.
Theorem 9.13. Suppose the underlying probability space is complete. Then, Brownian motion is nowhere differentiable almost surely.
3. Nowhere-Differentiability
169
Proof. For any A > 0 and n > 1, consider the event (9.27)
Ea = { 3s E [0,1] :
IW(s) - W(t)I < sup tE[s-2-°,s+2 ^]
A2-n JJ
(Why is this measurable?) We intend to show that En P(Ea) < oo for all
A>0. Indeed, suppose there exists s E [0, 11 such that for all t within 2-n of s, IW(s) - W(t)l < A2-1. Then there must exist a possibly random j = 0, ... , 2' - 1 such that s E D(j; n), where D(j; n) := [j2-n, (j +1)2 -nj . Thus, IW(s) -W(t)I < A2-n for all t E D(j;n). By the triangle inequality, 2A2-n = A2-n+1 for all u, v E D(j; n). we can deduce that I W (u) - W (v) I < Subdivide D(j; n) into four smaller dyadic intervals, and note that the successive differences in the values of W (at the endpoints of the subdivided intervals) are at most A2-n+1 This leads us to the following: (9.28)
2n-1 3
En g U n {Io,tI
0, the "post-T" process t i--, W(T + t) - W(T) is a Brownian motion that is independent of a({W (u)}o 0. Given a stopping time T, we can define OT by (9.34)
v/T={AE.,F: An{T0}.
T is called a simple stopping time if there exist 0 < To, 7-1.... < oo such that T(w) E {ro,T1i...} for all w E Q. Define (9.35)
.3° = o ({W(u)}ot.`3', s1t = nE>o
for all t>0. Next we construct interesting stopping times.
Proposition 9.17. If A C R is either open or closed, then the first hitting time TA := inf{t > 0 : W(t) E A}, where inf 0 := oo, is a stopping time with respect to the Brownian filtration F. Remark 9.18. If you know what Fo- and G6-sets are, then convince yourself that when A is of either variety, then TA is a stopping time. You may ask further, "What about TA when A is measurable but is neither Fo nor G6?" The answer is given by a quite deep theorem of Hunt (1957): TA is a stopping time for all Borel sets A.
Proof of Proposition 9.17. We prove this proposition in two steps. Step 1. TA is a stopping time when A is open. Suppose A is open. We wish to prove that {TA < t} E .fit for all t > 0. It suffices to prove that (TA < t} E .fit for all t > 0, because the right-continuity of Jr ensures that {TA < t} = n,>o{TA < t + e} E F. But (TA < t} is the event that there exists a time s before t at which W (s) E A. Let C denote the collection of all w such that t H W (t , w) is continuous. We know that P(C) = 1. And
since A is open, {TA < t} n C is the event that there exists a rational s < t at which W(s) E A. That is,
{TA T.
To check that Tn is a stopping time, note that {Tn < (k + 1)2-n} _ {T < (k + 1)2-n} E 9(k+1)2-n, since T is a stopping time. Now given any
t > 0, we can find k and n such that t E [k2-n, (k + 1)2-n). Therefore, {T < t} = {Tn < k2-n} = IT < k2-n} E 5k2-n C St. This proves that the Tn's are non-increasing simple stopping times. Moreover, Tn converges to T since 0 < Tn - T < 2-n. It remains to prove that 5T nn5T,,; see Proposition 9.17 but replace 9° by 9 everywhere.
IfAEnk
An{Tn _ landt>_0.
Therefore, 00
(9.41)
00
An{T0
The lemma follows from the right-continuity of the filtration 5.
Proof of Proposition 9.19. First suppose that T is a simple 3-stopping time which takes values in {T° , -r1 , ... }. In this case, given any Bore] set A
and any t > 0, (9.42)
{W(T) E Al n IT < t} = U {W(Tn) E Al n IT = Tn} E S. n>O:
-r. o is a Brownian motion that is independent of 9T.
Proof. We prove this first for simple stopping times, and then approximate, using Lemma 9.20, a general stopping time with simple ones. Step 1. Simple Stopping Times. If T is a simple stopping time, then there exist To < rl < such that T E {To, 7-1i ...} a.s. Now for any A E .?T, and for all B1,... , Bm E R(R),
P(An{'i<m: W(T+ti)-W(ti)EBi}) (9.45)
P(An{t'i<m: W(7k+ti)-W(Tk)EBi, T=Tk}). k=0
But A n IT = Tk} = A n IT < Tk} n IT < Tk_1}' is in .irk since A E 9T. Therefore, by the Markov property (Theorem 9.9),
P(An{'i<m: W(T+ti)-W(ti)EBi}) (9.46)
=EP{di<m: W(Tk+ti)-W(Tk)EBi}P({T=Tk}nA) k=0
= P {di < m : W(ti) E Bi} P(A). This proves the theorem in the case that T is a simple stopping time. Indeed,
to deduce that t H W(t + T) - W(T) is a Brownian motion, simply set A = R. The asserted independence also follows since A E 9T is arbitrary. Step 2. The General Case. In the general case, we approximate T by simple stopping times as in Lemma 9.20. Namely, we find Tn I T-all simple stopping times-such that nn,!FTn = 9T. Now for any A E 9T, and for all open B1i... , B,n C R,
P(An{di<m: W(T+ti)-W(ti)EBi}) (9.47)
= n-oo limp (An { 'i < m : W(Tn + ti) - W(ti) E Bi}) = lim P {Vi < n-oo
m : W(ti) E Bi} P(A).
In the first equation we used the fact that the B's are open and W is continuous, while in the second equation we used the fact that A E .99'T for all n, together with the result of Step 1 applied to T. This proves the theorem.
6. The Reflection Principle The following "reflection principle" is a prime example of how the strong Markov property (Theorem 9.21) can be applied to make nontrivial computations for the Brownian motion.
Theorem 9.22. If t is non-random and positive, then supo 0, (9.48)
I
P ( sup W(s) > a = lo<s 0 : W(s) > a} where inf 0 = oo. Thanks to Proposition 9.17, Ta is an f- and hence an f-stopping time. Step 1. T. is finite a.s. By scaling (Theorem 9.9), for any t > 0, the event {W(t) > v'} has probability (21r) -1/2 f1 ° e-x2/2 dx = c > 0. Consequently,
P{t.oo limsup
(9.49)
()>1}>c>0.
-
JJJ -
(Why?). Among other things, this and the zero-one law (Problem 9.11) together imply that lim supt_ao W (t) = oo a.s. Since W is continuous a.s., it must then hit a at some finite time a.s. Therefore, with probability one, T. is finite and W(Ta) = a. Step 2. Reflection. Note that {supo<s a} = {Ta < t} E .fit. Moreover, P{Ta < t} is equal to P {Ta < t
,
W(t) > a} + P {Ta < t, W(t) < a}
=P{W(t)>a}+P{Ta OI9T};Ta a}+P{Ta < t, W(Ta + (t - Ta)) - W(Ta) > 0}
=P{W(t)>a}+P{Taal = 2P {W (t) > a),
because P{W(t) = a} = 0. The latter is manifestly equal to the integral in the statement of the theorem. In addition, thanks to symmetry (Theorem 9.9),
2P{W(t) > a} = P{W(t) > a} +P{-W(t) > a} (9.52)
= P{IW(t)I > a}.
This completes our proof.

The reflection principle has the following peculiar consequence: While we expect Brownian motion to reach a level $a$ at some finite time, this time has infinite expectation. That is:

Corollary 9.23. Let $T_a := \inf\{s > 0 : W(s) = a\}$ denote the first time Brownian motion reaches the level $a$. Then for all $a \neq 0$, $T_a < \infty$ a.s. but $E T_a = \infty$.

Proof. (Sketch) We have seen already that $T_a$ is a.s. finite; let us sketch a proof that $T_a$ has infinite expectation. Without loss of generality, we can assume that $a > 0$ (why?). Then, thanks to Theorem 9.22,

(9.53)  $P\{T_a > t\} = \sqrt{\frac{2}{\pi}} \int_0^{a/\sqrt{t}} e^{-y^2/2}\, dy.$

See Problem 9.13 for more details. The preceding formula demonstrates that

(9.54)  $P\{T_a > t\} \sim a \Big( \frac{2}{\pi t} \Big)^{1/2}$ as $t \to \infty$.

Therefore, $\sum_{n=1}^{\infty} P\{T_a > n\} = \infty$, and Lemma 6.8 (p. 67) finishes the proof.
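Both statements are easy to examine numerically. The Python sketch below is an illustration only; the level $a$, horizon $t$, step size, and sample count are arbitrary choices. It compares the empirical probability $P\{\sup_{0\le s\le t} W(s) \ge a\}$ with $2P\{W(t) \ge a\}$ from Theorem 9.22, and evaluates the tail approximation (9.54), whose slow decay is what makes $E T_a$ infinite.

import numpy as np
from math import erf, sqrt, pi

# Check of the reflection principle (Theorem 9.22) and the tail asymptotics (9.54).
rng = np.random.default_rng(1)
a, t, dt, n_paths = 1.0, 1.0, 1e-3, 10_000
n_steps = int(t / dt)

W = np.cumsum(rng.normal(0.0, sqrt(dt), size=(n_paths, n_steps)), axis=1)
emp = np.mean(W.max(axis=1) >= a)                               # empirical P{sup W >= a}
exact = 2.0 * (1.0 - 0.5 * (1.0 + erf(a / sqrt(2.0 * t))))      # 2 P{W(t) >= a}
print("P{sup_{s<=t} W(s) >= a}: empirical", emp, " reflection principle", exact)

# Tail of T_a: P{T_a > u} ~ a * sqrt(2/(pi*u)), a non-summable tail, so E[T_a] = infinity.
for u in (10.0, 100.0, 1000.0):
    print("u =", u, "  approximate P{T_a > u} =", a * sqrt(2.0 / (pi * u)))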
Problems

Throughout, $W$ denotes a Brownian motion.

9.1. Prove the following: If $X$ and $Y$ are respectively $\mathbf{R}^n$- and $\mathbf{R}^m$-valued random variables, then $X$ and $Y$ are independent if and only if

(9.55)  $E e^{i u \cdot X + i v \cdot Y} = E e^{i u \cdot X}\, E e^{i v \cdot Y} \qquad \forall u \in \mathbf{R}^n,\ v \in \mathbf{R}^m.$

Use this to prove Corollary 9.7.
9.2. Suppose that for every $n = 1, 2, \dots$, $G^n = (G^n_1, \dots, G^n_k)$ is an $\mathbf{R}^k$-valued centered normal random variable. Suppose further that $Q_{ij} := \lim_{n\to\infty} E[G^n_i G^n_j]$ exists and is finite. Then prove that $Q$ is a symmetric nonnegative-definite matrix, and that $G^n$ converges weakly to a centered normal random variable $G = (G_1, \dots, G_k)$ whose covariance matrix is $Q$.
9.3 (Linear Regression). Suppose $G = (G_1, \dots, G_n)$ is an $\mathbf{R}^n$-valued centered normal random variable, and let $\mathscr{F}$ denote the $\sigma$-algebra generated by $(G_1, \dots, G_m)$, where $m < n$. Prove that, conditionally on $\mathscr{F}$, $(G_{m+1}, \dots, G_n)$ is a normal random variable. Find the conditional mean, as well as the covariance matrix.
9.4. Construct an example of two random variables $X_1$ and $X_2$ such that each of them is standard normal, although $(X_1, X_2)$ is not a Gaussian random variable. (HINT: If $X = \pm 1$ with probability $1/2$ each, and $Z$ is an independent $N(0,1)$, then $X|Z|$ is an $N(0,1)$ also.)
9.5. Prove that if $W$ denotes Brownian motion, then $\{\int_0^t W(s)\, ds\}_{t\ge 0}$ is a continuously differentiable Gaussian process. Use this as a guide to construct a $k$-times continuously differentiable Gaussian process.

9.6. Let $t > 0$ be fixed and define $V_n(t)$ as in Theorem 9.9. Prove: (1) The first two moments of $V_n(t)$ are respectively $t$ and $(2t^2/n) + t^2$. Use this to verify that $V_n(t)$ converges to $t$ in probability. (2) There exists a constant $A > 0$ such that for all $n \ge 1$, $\| V_n(t) - t \|_4 \le$ [...]

[...] For each fixed $s > 0$, $t \mapsto s^{-1/2} Z(s,t)$ is a Brownian motion on $[0,1]$. The process $\{Z(s,t)\}_{s,t\in[0,1]}$ is the so-called Brownian sheet on $[0,1]^2$.
9.27 (Problem 9.16, continued; Harder). Prove that the embedding scheme of Problem 9.16 satisfies $E T_1 = 1$ and $E[T_1^2] < \infty$. Conclude that $|T_n - n| = O\big( (n \ln\ln n)^{1/2} \big)$ as $n \to \infty$ a.s. Use this to prove that for all $p > 1/4$,

(9.64)  $\lim_{n\to\infty} \frac{|W(T_n) - W(n)|}{n^p} = 0$  a.s.

Conclude that, on a suitable probability space, we can construct a simple walk $\{S_n\}_{n\ge 1}$ and a Brownian motion $\{W(t)\}_{t\ge 0}$ such that

$\max_{1 \le k \le n} |S_k - W(k)| = o(n^p)$  as $n \to \infty$ a.s.
$H_{k,n+1} \Delta_{k+1,n+1}$, and split the sum according to whether $k = 2j$ or $k = 2j + 1$:

(10.2)  $Z_{n+1}(H) = \sum_{j=0}^{\infty} H_{j,n}\, \Delta_{2j+1,n+1} + \sum_{j=0}^{\infty} H_{2j+1,n+1}\, \Delta_{2j+2,n+1} = \sum_{j=0}^{\infty} H_{j,n} \big( \Delta_{2j+1,n+1} + \Delta_{2j+2,n+1} \big) + \sum_{j=0}^{\infty} \big( H_{2j+1,n+1} - H_{j,n} \big) \Delta_{2j+2,n+1}.$

Because $\Delta_{2j+1,n+1} + \Delta_{2j+2,n+1} = \Delta_{j+1,n}$, the first term is equal to $Z_n(H)$, whence follows the lemma.
Definition 10.3. A process $H = \{H(s)\}_{s\ge 0}$ is adapted to the Brownian filtration $\mathscr{F}$ if $H(s)$ is $\mathscr{F}_s$-measurable for each $s \ge 0$. We say that $H$ is a compact-support process if there exists a non-random $T > 0$ such that with probability one, $H(s) = 0$ for all $s > T$. We also need the following technical definition.
Definition 10.4. Choose and fix $p \ge 1$. We say that $H$ is Dini-continuous in $L^p(P)$ if $H(s) \in L^p(P)$ for all $s \ge 0$ and

(10.3)  $\int_0^1 \frac{\psi_p(r)}{r}\, dr < \infty,$  where  $\psi_p(r) := \sup_{s,t\ge 0:\, |s-t| \le r} \| H(s) - H(t) \|_p.$

Any such $H$ is in particular uniformly continuous in $L^p(P)$; i.e., $\lim_{t\to 0} \psi_p(t) = 0$. Dini-continuity in $L^p(P)$ ensures that $\psi_p$ converges to zero at some minimum rate. Here are a few examples:
Example 10.5.

(a) Suppose $H$ is a.s. differentiable with a derivative that satisfies $K := \sup_t \| H'(t) \|_p < \infty$. [Since $(s,\omega) \mapsto H(s,\omega)$ is product measurable, $\int |H'(r)|^p\, dr$ is a random variable, and hence the $\| H'(t) \|_p$ are well defined.] By the fundamental theorem of calculus, if $t > s \ge 0$ then

$\| H(t) - H(s) \|_p \le \int_s^t \| H'(r) \|_p\, dr \le K |t - s|.$

Therefore, $\psi_p(r) \le K r$, and $H$ is Dini-continuous in $L^p(P)$.

(b) Suppose $H(s) = f(W(s))$, where $f$ is a non-random Lipschitz-continuous function; i.e., there exists $L$ such that

$| f(x) - f(y) | \le L |y - x|$ for all $x, y \in \mathbf{R}$.

It follows then that $\psi_p(r) = O(r^{1/2})$ as $r \to 0$, and this yields the Dini-continuity of $H$ in $L^p(P)$ for any $p \ge 1$.

(c) Consider $H(s) = f(W(s), s)$, where $f(x, t)$ is a non-random function, twice continuously differentiable in each variable with bounded derivatives. Suppose, in addition, that there exists a non-random $T > 0$ such that $f(x, s) = 0$ for all $s > T$. Because $| H(s) - H(t) |$ is bounded above by

$| f(W(s), s) - f(W(t), s) | + | f(W(t), s) - f(W(t), t) |,$

by the fundamental theorem of calculus we can find a constant $M$ such that for all $s, t \ge 0$,

$| H(s) - H(t) | \le M \big( | W(s) - W(t) | + | t - s | \big).$

By Minkowski's inequality, for all $p \ge 1$,

$\| H(s) - H(t) \|_p \le M \big( \| W(s) - W(t) \|_p + | t - s | \big) = M \big( c_p\, | t - s |^{1/2} + | t - s | \big),$

where $c_p := \| N(0,1) \|_p$. Therefore, $\psi_p(r) = O(r^{1/2})$ as $r \to 0$, whence follows the Dini-continuity of $H$ in $L^p(P)$ for any $p \ge 1$.
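The $r^{1/2}$ modulus in Example 10.5(b) can be seen numerically. The Python sketch below is an illustration only: it takes the Lipschitz function $f(x) = |x|$, estimates $\| H(s+r) - H(s) \|_2$ at a single time $s$ by Monte Carlo (a stand-in for the supremum defining $\psi_2$), and compares it with $\sqrt{r}$; the values of $s$, $r$, and the sample size are arbitrary choices.

import numpy as np

# Monte Carlo estimate of the L^2(P) increment of H(s) = |W(s)|, i.e. Example
# 10.5(b) with f(x) = |x|; the Lipschitz bound gives ||H(s+r) - H(s)||_2 <= sqrt(r).
rng = np.random.default_rng(2)
n_samples, s = 200_000, 1.0

for r in (0.1, 0.01, 0.001):
    W_s = rng.normal(0.0, np.sqrt(s), n_samples)
    W_sr = W_s + rng.normal(0.0, np.sqrt(r), n_samples)   # W(s + r) given W(s)
    diff = np.abs(W_sr) - np.abs(W_s)                      # H(s + r) - H(s)
    print("r =", r, "  ||H(s+r) - H(s)||_2 =", np.sqrt(np.mean(diff ** 2)),
          "  sqrt(r) =", np.sqrt(r))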
Remark 10.6 (Cauchy Summability Test). Dini-continuity in $L^p(P)$ is equivalent to the summability of $\sum_n \psi_p(2^{-n})$. Indeed, the change of variables $r = 2^{-t}$ yields

(10.4)  $\int_0^1 \frac{\psi_p(r)}{r}\, dr = (\ln 2) \int_0^{\infty} \psi_p(2^{-t})\, dt.$

Because $\psi_p$ is nondecreasing,

(10.5)  $\sum_{n=0}^{\infty} \psi_p(2^{-n-1}) \le \int_0^{\infty} \psi_p(2^{-t})\, dt \le \sum_{n=0}^{\infty} \psi_p(2^{-n}).$
We can now define $\int H\, dW$ for adapted compact-support processes that are Dini-continuous in $L^2(P)$. We will improve the construction later on.
Theorem 10.7. Suppose $H$ is an adapted compact-support stochastic process that is Dini-continuous in $L^2(P)$. Then $\lim_{n\to\infty} Z_n(H)$ exists in $L^2(P)$. If we write $\int H\, dW$ for this limit, then

(10.6)  $E\Big[ \int H\, dW \Big] = 0$  and  $E\Big[ \Big( \int H\, dW \Big)^2 \Big] = E\Big[ \int_0^{\infty} H^2(s)\, ds \Big].$

If $a, b \in \mathbf{R}$, and $V$ is another adapted compact-support stochastic process that is Dini-continuous in $L^2(P)$, then with probability one,

(10.7)  $\int (aH + bV)\, dW = a \int H\, dW + b \int V\, dW.$

Definition 10.8. The second identity in (10.6) is called the Itô isometry (Itô, 1944).
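As a concrete illustration of the Itô isometry, consider $H(s) = W(s) \mathbf{1}_{[0,1]}(s)$, which is adapted and has compact support. In this case $E[(\int H\, dW)^2] = E[\int_0^1 W^2(s)\, ds] = 1/2$. The Python sketch below forms left-endpoint dyadic sums in the spirit of $Z_n(H)$ and checks this numerically; the dyadic level and the number of sample paths are arbitrary choices, not part of the text.

import numpy as np

# Numerical illustration of the Ito isometry (10.6) for H(s) = W(s) on [0, 1]:
# E[(int_0^1 W dW)^2] should equal E[int_0^1 W(s)^2 ds] = 1/2.
rng = np.random.default_rng(3)
n, n_paths = 10, 10_000                      # dyadic level and Monte Carlo sample size
k = 2 ** n
dt = 1.0 / k

dW = rng.normal(0.0, np.sqrt(dt), size=(n_paths, k))
W_left = np.cumsum(dW, axis=1) - dW          # W at the left endpoint of each dyadic interval
Z = np.sum(W_left * dW, axis=1)              # left-endpoint dyadic sum for int_0^1 W dW

print("E[Z] =", Z.mean(), "(should be near 0)")
print("E[Z^2] =", np.mean(Z ** 2), "(should be near 1/2)")
print("E[int_0^1 W(s)^2 ds] =", np.mean(np.sum(W_left ** 2, axis=1) * dt), "(also near 1/2)")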
Proof (Sketch). For $t > s$, $W(t) - W(s)$ is independent of $\mathscr{F}_s$ (Theorem 9.21, p. 174), and $H(u)$ is $\mathscr{F}_s$-measurable for $u \le s$. Therefore, Lemma 10.2 implies the following; we use the notation introduced in the proof of Lemma 10.2 to simplify the typesetting:

(10.8)  $\| Z_{n+1}(H) - Z_n(H) \|_2^2 = \sum_{0 \le j < \infty} \big\| H_{2j+1,n+1} - H_{j,n} \big\|_2^2\; E\big[ \Delta_{2j+2,n+1}^2 \big] \le \cdots$

[...] For $t > 0$ define

(10.17)  $(\mathscr{A}H)(t) := \sup_{n \ge 1}\ n \int_{t - (1/n)}^{t} |H(s)|\, ds.$
For each $\omega$, $(\mathscr{A}H)(t)$ is a (one-sided) Hardy-Littlewood maximal function of $H$. Also, $\mathscr{A}H$ is an adapted process, and by the Hardy-Littlewood maximal inequality, for almost all $\omega$,

$\int_0^{\infty} | (\mathscr{A}H)(s) |^2\, ds \le C \int_0^{\infty} H^2(s)\, ds,$

where $C$ is a universal constant. Recall that $H(s) = 0$ for $s > T$ and $\sup_t \| H(t) \|_2 < \infty$; therefore, $\int_0^{\infty} H^2(s)\, ds < \infty$ for almost all $\omega$. Moreover, we take first expectations (Fubini-Tonelli), and then square roots, to deduce that

(10.19)  $\| \mathscr{A}H \|_{L^2(m \times P)} \le \sqrt{C}\, \| H \|_{L^2(m \times P)}.$

[...] (2) If $t > s \ge 0$ then $E[ M(t) \mid \mathscr{F}(s) ] = M(s)$ a.s. The process $M$ is a continuous $L^2$-martingale if $t \mapsto M(t)$ is almost-surely continuous, and $M(t) \in L^2(P)$ for all $t \ge 0$. Much of the theory of discrete-time martingales transfers to continuous $L^2(P)$-martingales. Here is a first sampler.
The Optional Stopping Theorem. If $M$ is a continuous $L^2$-martingale and $S \le T$ are bounded $\mathscr{F}$-stopping times, then

(10.21)  $E[ M(T) \mid \mathscr{F}_S ] = M(S)$  a.s.
Proof. Throughout, choose and fix some non-random $K > 0$ such that $T \le K$ almost surely. If $S$ and $T$ are simple stopping times, then the optional stopping theorem is a consequence of its discrete-time namesake (p. 130). In general, let $S_n \downarrow S$ and $T_n \downarrow T$ be the simple stopping times of Lemma 9.20 (p. 172). Note that the condition $S \le T$ imposes $S_m \le T_n + 2^{-m}$ for all $n, m \ge 1$. Because $T_n + 2^{-m}$ is a simple stopping time, it follows that

(10.22)  $E\big[ M(T_n + 2^{-m}) \mid \mathscr{F}_{S_m} \big] = M(S_m)$  a.s.

Moreover, this very argument implies that $\{ M(T_n + 2^{-m}) \}_{n=1}^{\infty}$ is a discrete-time martingale. Since $T_n \le T + 2^{-n} \le K + 2^{-n}$, Problem 8.22 on page 153 implies that

(10.23)  $E\Big[ \sup_n \big| M(T_n + 2^{-m}) \big|^2 \Big] < \infty.$

[...] $P\Big\{ \sup_{0 \le s \le t} | M(s) | \ge \lambda \Big\} \le \cdots$

The following is of paramount importance to us, since it says something about the properties of the random function $t \mapsto \int_0^t H\, dW$.
Theorem 10.12. If $H$ is an adapted process such that $E[\int_0^t H^2(s)\, ds] < \infty$ for all $t > 0$, then we can construct the process $\{\int_0^t H\, dW\}_{t \ge 0}$ so that it is a continuous $L^2$-martingale.
Proof. According to Theorem 10.9, $\int_0^t H\, dW$ exists. Thus, we can proceed by verifying the assertions of the theorem. We do so in three steps.

Step 1. Reduction to $H$ that is Dini-continuous in $L^2(P)$. Suppose we have proved the theorem for all processes $H$ that are adapted and Dini-continuous in $L^2(P)$. In this first step we prove that this implies the remaining assertions of the theorem. Let $H$ be an adapted process such that $E[\int_0^t H^2(s)\, ds] < \infty$ for all $t > 0$. We can find adapted processes $H_n$ that are Dini-continuous in $L^2(P)$ and satisfy

(10.28)  $\lim_{n\to\infty} E\Big[ \int_0^t \big( H_n(s) - H(s) \big)^2\, ds \Big] = 0.$

Indeed, we can apply Proposition 10.10 to $H \mathbf{1}_{[0,t]}$, and use the recipe of the said proposition for $H_n$; this can be done for every $t > 0$ as well. By (10.28) and the Itô isometry (10.6),

(10.29)  $\lim_{n\to\infty} E\Big[ \Big( \int_0^t H\, dW - \int_0^t H_n\, dW \Big)^2 \Big] = 0.$

Because $\int_0^t H_n\, dW - \int_0^t H_{n+1}\, dW = \int_0^t (H_n - H_{n+1})\, dW$ defines a continuous $L^2$-martingale, by Doob's maximal inequality (p. 189), for all non-random but fixed $T > 0$,

(10.30)  $\lim_{n,m\to\infty} E\Big[ \sup_{0 \le t \le T} \Big( \int_0^t H_n\, dW - \int_0^t H_m\, dW \Big)^2 \Big] = 0.$

Consequently, there exists a subsequence $n' \to \infty$ such that

(10.31)  $\lim_{n'\to\infty} \sup_{0 \le t \le T} \Big| \int_0^t H_{n'}\, dW - \int_0^t H\, dW \Big| = 0$  a.s.
(10.39)  $\limsup_{n\to\infty} E\big[ \big( \mathscr{J}_{n+1}(H)(t) - \cdots$

[...] for all $t > s \ge 0$,
(10.55)
f(9(t)) - f(9(s)) = f tf'(9(u))9 (u) du. e
For example, let f (x) = x2 to find that (10.56)
92(t)-92(0)= fgdg t
where dg(s) = g'(s) ds. What if g were replaced by Brownian motion? As a consequence of our next result we have t
(10.57)
W2(t) - W2(0) = f W dW + 2
a.s.
0
Compared with (10.56), this has an extra factor (t/2).
Itô's Formula 1. If $f : \mathbf{R} \to \mathbf{R}$ has two continuous derivatives, then for all $t > s \ge 0$, the following holds a.s.:

(10.58)  $f(W(t)) - f(W(s)) = \int_s^t f'(W(r))\, W(dr) + \frac{1}{2} \int_s^t f''(W(r))\, dr.$

Itô's formula is different from the chain rule for ordinary integrals because the nowhere-differentiability of $W$ forces us to replace the right-hand side of (10.55) with a stochastic integral plus a second-derivative term.

Remark 10.14. Itô's formula continues to hold even if we assume only that $f''$ exists almost everywhere and $\int_s^t (f'(W(r)))^2\, dr < \infty$ a.s. Of course then we have to make sense of the stochastic integral, etc.
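Itô's formula can also be checked numerically along a single discretized path. In the Python sketch below (an illustration only; the step size and horizon are arbitrary choices) we take $f(x) = \cos x$ and compare $f(W(t)) - f(W(0))$ with the left-endpoint discretization of $\int_0^t f'(W)\, dW$ plus one half of the time integral of $f''(W)$.

import numpy as np

# One-path numerical check of Ito's formula (10.58) with f(x) = cos(x):
#   cos(W(t)) - cos(W(0))  ~  sum f'(W) dW  +  (1/2) sum f''(W) dt,
# where f'(x) = -sin(x), f''(x) = -cos(x), and the stochastic sum uses left endpoints.
rng = np.random.default_rng(4)
t, n = 1.0, 100_000
dt = t / n

dW = rng.normal(0.0, np.sqrt(dt), n)
W = np.concatenate(([0.0], np.cumsum(dW)))     # W on the grid, W[0] = 0
W_left = W[:-1]

lhs = np.cos(W[-1]) - np.cos(W[0])
stochastic = np.sum(-np.sin(W_left) * dW)      # approximates int_0^t f'(W) dW
drift = 0.5 * np.sum(-np.cos(W_left)) * dt     # approximates (1/2) int_0^t f''(W) dr
print("f(W(t)) - f(W(0))  =", lhs)
print("stochastic + drift =", stochastic + drift)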
Proof in the Case that $f'''$ is Bounded and Continuous. We assume, without loss of too much generality, that $s = 0$. The proof of Itô's formula starts out in the same manner as that of (10.55). Namely, by telescoping the sum we first write

(10.59)  $f\big( W(2^{-n} \lfloor 2^n t \rfloor) \big) - f(0) = \sum_{j=0}^{\lfloor 2^n t \rfloor - 1} \Big[ f\big( W((j+1) 2^{-n}) \big) - f\big( W(j 2^{-n}) \big) \Big].$
$f(x, t) = \cos\Big( \frac{n \pi x}{2} \Big) \exp\Big( \frac{n^2 \pi^2 t}{8} \Big),$ where $n$ is an odd integer.

The function $f$ solves the partial differential equation

(10.81)  $\frac{\partial f}{\partial t} + \frac{1}{2} \frac{\partial^2 f}{\partial x^2} = 0,$  subject to  $f(\pm 1, t) = 0$ for all $t > 0$.

This is a kind of heat equation on $[-1, 1]$ with Dirichlet boundary conditions. Barring technical conditions, Itô's formula 2 tells us that $f(W(t), t) - 1$ defines a mean-zero martingale. [This uses the fact that $f(0,0) = 1$.] By the optional stopping theorem (p. 187), $E[ f(W(T \wedge t), T \wedge t) ] = 1$. Equivalently,

(10.82)  $E[ f(W(T), T);\ T \le t ] + E[ f(W(t), t);\ T > t ] = 1.$

Because $W(T) = \pm 1$ a.s. and $f(\pm 1, s) = 0$, the first term in (10.82) vanishes. Whence we obtain the following cosine formula:

(10.83)  $E\Big[ \cos\Big( \frac{n \pi W(t)}{2} \Big);\ T > t \Big] = \exp\Big( -\frac{n^2 \pi^2 t}{8} \Big).$
2. A Fourier Series. Let $L^2(-2,2)$ denote the collection of all measurable functions $g : [-2,2] \to \mathbf{R}$ such that $\int_{-2}^{2} g^2(x)\, dx < \infty$. Theorem A.5 (p. 205), after a little fidgeting with the variables, shows that

$\tfrac{1}{2},\quad 2^{-1/2} \sin(n\pi x/2),\quad 2^{-1/2} \cos(m\pi x/2) \qquad (n, m = 1, 2, \dots)$

form an orthonormal basis for $L^2(-2,2)$. In particular, any $\phi \in L^2(-2,2)$ has the representation

(10.84)  $\phi(x) = \frac{A_0}{2} + \sum_{n=1}^{\infty} \Big[ A_n \cos\Big( \frac{n\pi x}{2} \Big) + B_n \sin\Big( \frac{n\pi x}{2} \Big) \Big],$

where:

- the infinite sums converge in $L^2(-2,2)$;
- $A_0 = 2^{-1} \int_{-2}^{2} \phi(x)\, dx$;
- $A_n = 2^{-1} \int_{-2}^{2} \phi(x) \cos(n\pi x/2)\, dx$ for $n \ge 1$; and
- $B_n = 2^{-1} \int_{-2}^{2} \phi(x) \sin(n\pi x/2)\, dx$ for $n \ge 1$.
Step 3. Putting it Together. We can apply the result of Step 2 to the function $\phi(x) := \mathbf{1}_{(-1,1)}(x)$ to obtain

(10.85)  $\mathbf{1}_{(-1,1)}(x) - \frac{1}{2} = \frac{2}{\pi} \sum_{n=0}^{\infty} \frac{(-1)^n}{2n+1} \cos\Big( \frac{(2n+1)\pi x}{2} \Big).$

We "plug in" $x := W(t, \omega)$, multiply by $\mathbf{1}_{\{T(\omega) > t\}}$, and then apply expectations to find that

(10.86)  $P\{ W(t) \in (-1,1),\ T > t \} - \frac{1}{2} P\{T > t\} = \frac{2}{\pi} \sum_{n=0}^{\infty} \frac{(-1)^n}{2n+1} E\Big[ \cos\Big( \frac{(2n+1)\pi W(t)}{2} \Big);\ T > t \Big].$

Since the left-hand side is equal to $\frac{1}{2} P\{T > t\}$, the cosine formula of Step 1 completes our proof.

The preceding proof is a sketch only because: (i) we casually treat the $L^2$-identity in (10.85) as a pointwise identity; and (ii) we exchange expectations with an infinite sum without actually justifying the exchange. With a little effort, these gaps can be filled.
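The resulting series is easy to test numerically. Combining (10.83) and (10.86) gives $P\{T > t\} = \frac{4}{\pi} \sum_{n\ge 0} \frac{(-1)^n}{2n+1} \exp\big( -(2n+1)^2 \pi^2 t / 8 \big)$, where $T$ is the exit time of $(-1,1)$. The Python sketch below compares this series with a Monte Carlo estimate of $P\{T > t\}$; the step size and sample count are arbitrary choices, and the discretization slightly overestimates the probability since excursions between grid points are missed.

import numpy as np

# Compare a Monte Carlo estimate of P{T > t}, T = exit time of (-1, 1),
# with the cosine-series formula derived in Steps 1-3.
rng = np.random.default_rng(5)
t, dt, n_paths = 1.0, 1e-3, 10_000
n_steps = int(t / dt)

W = np.cumsum(rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps)), axis=1)
inside = np.all(np.abs(W) < 1.0, axis=1)     # path stayed inside (-1, 1) up to time t
print("Monte Carlo estimate of P{T > t}:", inside.mean())

series = (4.0 / np.pi) * sum(
    (-1) ** n / (2 * n + 1) * np.exp(-((2 * n + 1) ** 2) * np.pi ** 2 * t / 8.0)
    for n in range(50)
)
print("Series value of P{T > t}        :", series)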
Problems

10.1. In this exercise we construct a Dini-continuous process in $L^p(P)$ that is not a.s. continuous.

(1) Prove that if $0 < s$ [...]

[...] whereas $\mathscr{C}^1$ includes but is not limited to all continuously differentiable functions.

(2) Prove that if $0 < \alpha \le 1$, then $\mathscr{C}^{\alpha}$ is a complete normed linear space that is normed by

$\| f \|_{\mathscr{C}^{\alpha}} := \sup_{s,t \in [0,1],\, s \neq t} \frac{| f(s) - f(t) |}{| s - t |^{\alpha}} + \sup_{t \in [0,1]} | f(t) |.$

(3) Given two functions $f$ and $g$, define for all $n \ge 1$,

$\int_0^1 f\, \delta_n g := \sum_{k=1}^{2^n} f\big( k 2^{-n} \big) \Big[ g\big( k 2^{-n} \big) - g\big( (k-1) 2^{-n} \big) \Big].$
Suppose $f \in \mathscr{C}^{\alpha}$ and $g \in \mathscr{C}^{\beta}$ for some $0 < \alpha, \beta \le 1$. Prove that $\int_0^1 f\, \delta g := \lim_n \int_0^1 f\, \delta_n g$ exists whenever $\alpha + \beta > 1$. Note that when we let $g(x) = x$, we recover the Riemann integral of $f$; i.e., $\int_0^1 f\, \delta g = \int_0^1 f(x)\, dx$.

(4) Prove that $\int_0^1 g\, \delta f$ is well defined, and

$\int_0^1 f\, \delta g = f(1) g(1) - f(0) g(0) - \int_0^1 g\, \delta f.$

The integral $\int f\, \delta g$ is called a Young integral. (HINT: Lemma 10.2.)

10.3. In this problem you are asked to derive Doob's maximal inequality (p. 189) and its variants. We say that $M$ is a submartingale if it is defined as a martingale, except

(10.88)  $E[ M(t) \mid \mathscr{F}_s ] \ge M(s)$  a.s. whenever $t > s \ge 0$.

$M$ is a supermartingale if $-M$ is a submartingale. A process $M$ is said to be a continuous $L^2$-submartingale (respectively, continuous $L^2$-supermartingale) if it is a submartingale (respectively, supermartingale), $\{M(t)\}_{t \ge 0}$ is a.s. continuous, and $M(t) \in L^2(P)$ for all $t \ge 0$. Prove:

(1) If $Y$ is in $L^2(P)$, then $M(t) := E[Y \mid \mathscr{F}_t]$ is a martingale. This is a Doob martingale in continuous time.

(2) If $M$ is a martingale and $\psi$ is convex, then $\psi(M)$ is a submartingale provided that $\psi(M(t)) \in L^1(P)$ for each $t \ge 0$.

(3) If $M$ is a submartingale, $\psi$ is a nondecreasing convex function, and $\psi(M(t)) \in L^1(P)$ for all $t \ge 0$, then $\psi(M)$ is a submartingale.

(4) The first Doob inequality on page 189 holds if $|M|$ is replaced by any a.s.-continuous submartingale. (HINT: Prove first that $\sup_{0 \le s \le t}$ [...]
10.5 (Gambler's Ruin). If $W$ denotes a Brownian motion, then for any $a \in \mathbf{R}$, define $T_a := \inf\{ s > 0 : W(s) = a \}$, where $\inf \varnothing := \infty$. Recall that $T_a$ is an $\mathscr{F}$-stopping time (Proposition 9.17, p. 171). If $a, b > 0$ then prove that

$P\{ T_a < T_{-b} \} = \frac{b}{a + b}.$

[...] we have $\ell(f) = (\pi, f)$ for all $f \in H$. It remains to prove uniqueness, but this too is easy, for if there were two such functions, say $\pi_1$ and $\pi_2$, then for all $f \in H$, $(f, \pi_1 - \pi_2) = 0$. In particular, let $f = \pi_1 - \pi_2$ to see that $\pi_1 = \pi_2$. $\square$
2. Fourier Series

Throughout this section, we let $\mathbf{T} = [-\pi, \pi]$ denote the torus of length $2\pi$, and consider some elementary facts about the trigonometric Fourier series on $\mathbf{T}$ that are based on the following functions:

(A.6)  $\phi_n(x) := \frac{e^{inx}}{\sqrt{2\pi}}, \qquad x \in \mathbf{T},\ n = 0, \pm 1, \pm 2, \dots$
Let $L^2(\mathbf{T})$ denote the Hilbert space of all measurable functions $f : \mathbf{T} \to \mathbf{C}$ such that

(A.7)  $\| f \|_{\mathbf{T}}^2 := \int_{\mathbf{T}} | f(x) |^2\, dx < \infty.$

As usual, $L^2(\mathbf{T})$ is equipped with the (semi-)norm $\| f \|_{\mathbf{T}}$ and inner product

(A.8)  $(f, g) := \int_{\mathbf{T}} f(x)\, \overline{g(x)}\, dx.$

Our goal is to prove the following theorem.
Theorem A.5. The collection $\{\phi_n\}_{n\in\mathbf{Z}}$ is a complete orthonormal system in $L^2(\mathbf{T})$. Consequently, every $f \in L^2(\mathbf{T})$ can be written as

(A.9)  $f = \sum_{n=-\infty}^{\infty} (f, \phi_n)\, \phi_n,$

where the convergence takes place in $L^2(\mathbf{T})$. Furthermore,

(A.10)  $\| f \|_{\mathbf{T}}^2 = \sum_{n=-\infty}^{\infty} | (f, \phi_n) |^2.$

The proof is not difficult, but requires some preliminary developments.
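Before turning to those developments, the content of (A.9)-(A.10) can be verified numerically for a concrete function. The Python sketch below is an illustration only: it approximates the Fourier coefficients of the test function $f(x) = x$ on $\mathbf{T}$ by a Riemann sum and compares the truncated sum $\sum_{|n| \le 200} |(f, \phi_n)|^2$ with $\| f \|_{\mathbf{T}}^2 = 2\pi^3/3$; the grid size and truncation level are arbitrary choices.

import numpy as np

# Numerical check of Parseval's identity (A.10) for f(x) = x on [-pi, pi].
x = np.linspace(-np.pi, np.pi, 20_001)
dx = x[1] - x[0]
f = x                                           # test function f(x) = x

norm_sq = np.sum(f ** 2) * dx                   # ||f||_T^2, exactly 2*pi^3/3
coeff_sq_sum = 0.0
for n in range(-200, 201):
    phi_n = np.exp(1j * n * x) / np.sqrt(2.0 * np.pi)
    c_n = np.sum(f * np.conj(phi_n)) * dx       # inner product (f, phi_n)
    coeff_sq_sum += abs(c_n) ** 2

print("||f||_T^2                     =", norm_sq, "( 2*pi^3/3 =", 2 * np.pi ** 3 / 3, ")")
print("sum over |n|<=200 of |(f,phi_n)|^2 =", coeff_sq_sum)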
Definition A.6. A trigonometric polynomial is a finite linear combination of the $\phi_n$'s. An approximation to the identity is a sequence of integrable functions $\psi_0, \psi_1, \dots : \mathbf{T} \to \mathbf{R}_+$ such that:
(i) $\int_{\mathbf{T}} \psi_n(x)\, dx = 1$ for all $n$.

(ii) There exists $\varepsilon_0 > 0$ such that $\lim_{n\to\infty} \int_{-\varepsilon}^{\varepsilon} \psi_n(x)\, dx = 1$ for all $\varepsilon \in (0, \varepsilon_0]$.

Note that (a) all the $\psi_n$'s are nonnegative; and (b) the preceding display shows that all of the area under $\psi_n$ is concentrated near the origin when $n$ is large. In other words, as $n \to \infty$, $\psi_n$ looks more and more like a point mass.
For $n = 0, 1, 2, \dots$ and $x \in \mathbf{T}$ consider

(A.11)  $\psi_n(x) := \frac{(1 + \cos x)^n}{\alpha_n},$  where  $\alpha_n := \int_{\mathbf{T}} (1 + \cos x)^n\, dx.$
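A short numerical experiment makes the behavior of these kernels visible; this is exactly what Lemma A.7 below asserts. The Python sketch is illustrative only (the grid, the cutoff $\varepsilon = 0.25$, and the values of $n$ are arbitrary choices): it normalizes $(1+\cos x)^n$ on a grid and reports how little mass remains outside a small neighborhood of the origin as $n$ grows.

import numpy as np

# Mass of psi_n(x) = (1 + cos x)^n / alpha_n outside a small neighborhood of 0.
# As n grows this mass tends to 0, so {psi_n} behaves like an approximation to the identity.
x = np.linspace(-np.pi, np.pi, 100_001)
dx = x[1] - x[0]
eps = 0.25

for n in (1, 10, 100, 1000):
    kernel = (1.0 + np.cos(x)) ** n
    kernel = kernel / (np.sum(kernel) * dx)          # normalize so the integral is 1
    mass_outside = np.sum(kernel[np.abs(x) > eps]) * dx
    print("n =", n, "  mass of psi_n outside |x| >", eps, ":", mass_outside)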
Lemma A.7. $\{\psi_n\}_{n=0}^{\infty}$ is an approximation to the identity.

Proof. Choose and fix $\varepsilon \in (0, \pi/2]$. Then,

(A.12)  $\int_{\varepsilon}^{\pi} (1 + \cos x)^n\, dx \le \pi (1 + \cos \varepsilon)^n.$

By symmetry, this estimates the integral away from the origin. To estimate the integral near the origin, we use a method of P.-S. Laplace and write

(A.13)  $\int_0^{\varepsilon} (1 + \cos x)^n\, dx = \int_0^{\varepsilon} e^{n g(x)}\, dx,$  where  $g(x) := \ln(1 + \cos x).$

Apply Taylor's theorem with remainder to deduce that for any $x \in [0, \varepsilon]$ there exists $\xi \in [0, x]$ such that $g(x) = \ln 2 - \frac{x^2}{2(1 + \cos \xi)}$. But $\cos \xi \ge 0$ because $0 \le \xi \le \varepsilon \le \pi/2$, so $g(x) \ge \ln 2 - \frac{x^2}{2}$. Thus, for all $n \ge 1$,

(A.14)  $\int_0^{\varepsilon} (1 + \cos x)^n\, dx \ge 2^n \int_0^{\varepsilon} e^{-n x^2/2}\, dx = \frac{2^n}{\sqrt{n}} \int_0^{\varepsilon\sqrt{n}} e^{-z^2/2}\, dz.$
It follows from this and (A.12) that