Graduate Texts in Mathematics
230
Editorial Board: S. Axler, F. W. Gehring, K. A. Ribet
Graduate Texts in Mathematics
1 Takeuti/Zaring. Introduction to Axiomatic Set Theory. 2nd ed.
2 Oxtoby. Measure and Category. 2nd ed.
3 Schaefer. Topological Vector Spaces. 2nd ed.
4 Hilton/Stammbach. A Course in Homological Algebra. 2nd ed.
5 Mac Lane. Categories for the Working Mathematician. 2nd ed.
6 Hughes/Piper. Projective Planes.
7 J.-P. Serre. A Course in Arithmetic.
8 Takeuti/Zaring. Axiomatic Set Theory.
9 Humphreys. Introduction to Lie Algebras and Representation Theory.
10 Cohen. A Course in Simple Homotopy Theory.
11 Conway. Functions of One Complex Variable I. 2nd ed.
12 Beals. Advanced Mathematical Analysis.
13 Anderson/Fuller. Rings and Categories of Modules. 2nd ed.
14 Golubitsky/Guillemin. Stable Mappings and Their Singularities.
15 Berberian. Lectures in Functional Analysis and Operator Theory.
16 Winter. The Structure of Fields.
17 Rosenblatt. Random Processes. 2nd ed.
18 Halmos. Measure Theory.
19 Halmos. A Hilbert Space Problem Book. 2nd ed.
20 Husemoller. Fibre Bundles. 3rd ed.
21 Humphreys. Linear Algebraic Groups.
22 Barnes/Mack. An Algebraic Introduction to Mathematical Logic.
23 Greub. Linear Algebra. 4th ed.
24 Holmes. Geometric Functional Analysis and Its Applications.
25 Hewitt/Stromberg. Real and Abstract Analysis.
26 Manes. Algebraic Theories.
27 Kelley. General Topology.
28 Zariski/Samuel. Commutative Algebra. Vol. I.
29 Zariski/Samuel. Commutative Algebra. Vol. II.
30 Jacobson. Lectures in Abstract Algebra I. Basic Concepts.
31 Jacobson. Lectures in Abstract Algebra II. Linear Algebra.
32 Jacobson. Lectures in Abstract Algebra III. Theory of Fields and Galois Theory.
33 Hirsch. Differential Topology.
34 Spitzer. Principles of Random Walk. 2nd ed.
35 Alexander/Wermer. Several Complex Variables and Banach Algebras. 3rd ed.
36 Kelley/Namioka et al. Linear Topological Spaces.
37 Monk. Mathematical Logic.
38 Grauert/Fritzsche. Several Complex Variables.
39 Arveson. An Invitation to C*-Algebras.
40 Kemeny/Snell/Knapp. Denumerable Markov Chains. 2nd ed.
41 Apostol. Modular Functions and Dirichlet Series in Number Theory. 2nd ed.
42 J.-P. Serre. Linear Representations of Finite Groups.
43 Gillman/Jerison. Rings of Continuous Functions.
44 Kendig. Elementary Algebraic Geometry.
45 Loève. Probability Theory I. 4th ed.
46 Loève. Probability Theory II. 4th ed.
47 Moise. Geometric Topology in Dimensions 2 and 3.
48 Sachs/Wu. General Relativity for Mathematicians.
49 Gruenberg/Weir. Linear Geometry. 2nd ed.
50 Edwards. Fermat's Last Theorem.
51 Klingenberg. A Course in Differential Geometry.
52 Hartshorne. Algebraic Geometry.
53 Manin. A Course in Mathematical Logic.
54 Graver/Watkins. Combinatorics with Emphasis on the Theory of Graphs.
55 Brown/Pearcy. Introduction to Operator Theory I: Elements of Functional Analysis.
56 Massey. Algebraic Topology: An Introduction.
57 Crowell/Fox. Introduction to Knot Theory.
58 Koblitz. p-adic Numbers, p-adic Analysis, and Zeta-Functions. 2nd ed.
59 Lang. Cyclotomic Fields.
60 Arnold. Mathematical Methods in Classical Mechanics. 2nd ed.
61 Whitehead. Elements of Homotopy Theory.
62 Kargapolov/Merzljakov. Fundamentals of the Theory of Groups.
63 Bollobás. Graph Theory.
(continued after index)
Daniel W. Stroock
An Introduction to Markov Processes
Springer
Daniel W. Stroock
Department of Mathematics, Rm. 2-272
MIT
77 Massachusetts Ave
Cambridge, MA 02139-4307, USA
dws@math.mit.edu
Editorial Board

S. Axler
Mathematics Department
San Francisco State University
San Francisco, CA 94132, USA
axler@sfsu.edu

F. W. Gehring
Mathematics Department, East Hall
University of Michigan
Ann Arbor, MI 48109, USA
fgehring@math.lsa.umich.edu

K. A. Ribet
Mathematics Department
University of California at Berkeley
Berkeley, CA 94720-3840, USA
ribet@math.berkeley.edu
Mathematics Subject Classification (2000): 60-01, 60J10, 60J27

ISSN 0072-5285
ISBN 3-540-23499-3 Springer Berlin Heidelberg New York
Library of Congress Control Number: 2004113930

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable for prosecution under the German Copyright Law.

Springer is a part of Springer Science+Business Media
springeronline.com

© Springer-Verlag Berlin Heidelberg 2005
Printed in Germany

The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

Typesetting: Camera-ready by the translator
Cover design: design & production GmbH, Heidelberg
Printed on acid-free paper 41/3142 XT 5 4 3 2 1 0
This book is dedicated to my longtime colleague: Richard A. Holley
Contents

Preface ... xi

Chapter 1. Random Walks: A Good Place to Begin ... 1
1.1. Nearest Neighbor Random Walks on Z ... 1
1.1.1. Distribution at Time n ... 2
1.1.2. Passage Times via the Reflection Principle ... 3
1.1.3. Some Related Computations ... 4
1.1.4. Time of First Return ... 6
1.1.5. Passage Times via Functional Equations ... 7
1.2. Recurrence Properties of Random Walks ... 8
1.2.1. Random Walks on Z^d ... 9
1.2.2. An Elementary Recurrence Criterion ... 9
1.2.3. Recurrence of Symmetric Random Walk in Z^2 ... 11
1.2.4. Transience in Z^3 ... 13
1.3. Exercises ... 16

Chapter 2. Doeblin's Theory for Markov Chains ... 23
2.1. Some Generalities ... 23
2.1.1. Existence of Markov Chains ... 24
2.1.2. Transition Probabilities & Probability Vectors ... 24
2.1.3. Transition Probabilities and Functions ... 26
2.1.4. The Markov Property ... 27
2.2. Doeblin's Theory ... 27
2.2.1. Doeblin's Basic Theorem ... 28
2.2.2. A Couple of Extensions ... 30
2.3. Elements of Ergodic Theory ... 32
2.3.1. The Mean Ergodic Theorem ... 33
2.3.2. Return Times ... 34
2.3.3. Identification of π ... 38
2.4. Exercises ... 40

Chapter 3. More about the Ergodic Theory of Markov Chains ... 45
3.1. Classification of States ... 46
3.1.1. Classification, Recurrence, and Transience ... 46
3.1.2. Criteria for Recurrence and Transience ... 48
3.1.3. Periodicity ... 51
3.2. Ergodic Theory without Doeblin ... 53
3.2.1. Convergence of Matrices ... 53
3.2.2. Abel Convergence ... 55
3.2.3. Structure of Stationary Distributions ... 57
3.2.4. A Small Improvement ... 59
3.2.5. The Mean Ergodic Theorem Again ... 61
3.2.6. A Refinement in the Aperiodic Case ... 62
3.2.7. Periodic Structure ... 65
3.3. Exercises ... 67

Chapter 4. Markov Processes in Continuous Time ... 75
4.1. Poisson Processes ... 75
4.1.1. The Simple Poisson Process ... 75
4.1.2. Compound Poisson Processes on Z^d ... 77
4.2. Markov Processes with Bounded Rates ... 80
4.2.1. Basic Construction ... 80
4.2.2. The Markov Property ... 83
4.2.3. The Q-Matrix and Kolmogorov's Backward Equation ... 85
4.2.4. Kolmogorov's Forward Equation ... 86
4.2.5. Solving Kolmogorov's Equation ... 86
4.2.6. A Markov Process from its Infinitesimal Characteristics ... 88
4.3. Unbounded Rates ... 89
4.3.1. Explosion ... 90
4.3.2. Criteria for Non-explosion or Explosion ... 92
4.3.3. What to Do When Explosion Occurs ... 94
4.4. Ergodic Properties ... 95
4.4.1. Classification of States ... 95
4.4.2. Stationary Measures and Limit Theorems ... 98
4.4.3. Interpreting π̂_ii ... 101
4.5. Exercises ... 102

Chapter 5. Reversible Markov Processes ... 107
5.1. Reversible Markov Chains ... 107
5.1.1. Reversibility from Invariance ... 108
5.1.2. Measurements in Quadratic Mean ... 108
5.1.3. The Spectral Gap ... 110
5.1.4. Reversibility and Periodicity ... 112
5.1.5. Relation to Convergence in Variation ... 113
5.2. Dirichlet Forms and Estimation of β ... 115
5.2.1. The Dirichlet Form and Poincaré's Inequality ... 115
5.2.2. Estimating β₊ ... 117
5.2.3. Estimating β₋ ... 119
5.3. Reversible Markov Processes in Continuous Time ... 120
5.3.1. Criterion for Reversibility ... 120
5.3.2. Convergence in L²(π̂) for Bounded Rates ... 121
5.3.3. L²(π̂)-Convergence Rate in General ... 122
5.3.4. Estimating λ ... 125
5.4. Gibbs States and Glauber Dynamics ... 126
5.4.1. Formulation ... 126
5.4.2. The Dirichlet Form ... 127
5.5. Simulated Annealing ... 130
5.5.1. The Algorithm ... 131
5.5.2. Construction of the Transition Probabilities ... 132
5.5.3. Description of the Markov Process ... 134
5.5.4. Choosing a Cooling Schedule ... 134
5.5.5. Small Improvements ... 137
5.6. Exercises ... 138

Chapter 6. Some Mild Measure Theory ... 145
6.1. A Description of Lebesgue's Measure Theory ... 145
6.1.1. Measure Spaces ... 145
6.1.2. Some Consequences of Countable Additivity ... 147
6.1.3. Generating σ-Algebras ... 148
6.1.4. Measurable Functions ... 149
6.1.5. Lebesgue Integration ... 150
6.1.6. Stability Properties of Lebesgue Integration ... 151
6.1.7. Lebesgue Integration in Countable Spaces ... 153
6.1.8. Fubini's Theorem ... 155
6.2. Modeling Probability ... 157
6.2.1. Modeling Infinitely Many Tosses of a Fair Coin ... 158
6.3. Independent Random Variables ... 162
6.3.1. Existence of Lots of Independent Random Variables ... 163
6.4. Conditional Probabilities and Expectations ... 165
6.4.1. Conditioning with Respect to Random Variables ... 166

Notation ... 167
References ... 168
Index ... 169
Preface
To some extent, it would be accurate to summarize the contents of this book as an intolerably protracted description of what happens when either one raises a transition probability matrix P (i.e., all entries (P)_ij are non-negative and each row of P sums to 1) to higher and higher powers or one exponentiates R(P − I), where R is a diagonal matrix with non-negative entries. Indeed, when it comes right down to it, that is all that is done in this book. However, I, and others of my ilk, would take offense at such a dismissive characterization of the theory of Markov chains and processes with values in a countable state space, and a primary goal of mine in writing this book was to convince its readers that our offense would be warranted. The reason why I, and others of my persuasion, refuse to consider the theory here as no more than a subset of matrix theory is that to do so is to ignore the pervasive role that probability plays throughout. Namely, probability theory provides a model which both motivates and provides a context for what we are doing with these matrices. To wit, even the term "transition probability matrix" lends meaning to an otherwise rather peculiar set of hypotheses to make about a matrix. Namely, it suggests that we think of the matrix entry (P)_ij as giving the probability that, in one step, a system in state i will make a transition to state j. Moreover, if we adopt this interpretation for (P)_ij, then we must interpret the entry (Pⁿ)_ij of Pⁿ as the probability of the same transition in n steps. Thus, as n → ∞, Pⁿ is encoding the long-time behavior of a randomly evolving system for which P encodes the one-step behavior, and, as we will see, this interpretation will guide us to an understanding of lim_{n→∞} (Pⁿ)_ij. In addition, and perhaps even more important, is the role that probability plays in bridging the chasm between mathematics and the rest of the world.
Indeed, it is the probabilistic metaphor which allows one to formulate mathematical models of various phenomena observed in both the natural and social sciences. Without the language of probability, it is hard to imagine how one would go about connecting such phenomena to Pⁿ. In spite of the propaganda at the end of the preceding paragraph, this book is written from a mathematician's perspective. Thus, for the most part, the probabilistic metaphor will be used to elucidate mathematical concepts rather than to provide mathematical explanations for non-mathematical phenomena. There are two reasons for my having chosen this perspective. First, and foremost, is my own background. Although I have occasionally tried to help people who are engaged in various sorts of applications, I have not accumulated a large store of examples which are easily translated into terms which are appropriate for a book at this level. In fact, my experience has taught me that people engaged in applications are more than competent to handle the routine problems which they encounter, and that they come to someone like me only as a last resort. As a consequence, the questions which they
ask me tend to be quite difficult and the answers to those few which I can solve usually involve material which is well beyond the scope of the present book. The second reason for my writing this book in the way that I have is that I think the material itself is of sufficient interest to stand on its own. In spite of what funding agencies would have us believe, mathematics qua mathematics is a worthy intellectual endeavor, and I think there is a place for a modern introduction to stochastic processes which is unabashed about making mathematics its top priority. I came to this opinion after several semesters during which I taught the introduction to stochastic processes course offered by the M.I.T. department of mathematics. The clientele for that course has been an interesting mix of undergraduate and graduate students, less than half of whom concentrate in mathematics. Nonetheless, most of the students who stay with the course have considerable talent and appreciation for mathematics, even though they lack the formal mathematical training which is requisite for a modern course in stochastic processes, at least as such courses are now taught in mathematics departments to their own graduate students. As a result, I found no ready-made choice of text for the course. On the one hand, the most obvious choice is the classic text A First Course in Stochastic Processes, either the original one by S. Karlin or the updated version [4] by S. Karlin and H. Taylor. Their book gives a no-nonsense introduction to stochastic processes, especially Markov processes, on a countable state space, and its consistently honest, if not always easily assimilated, presentation of proofs is complemented by a daunting number of examples and exercises. On the other hand, when I began, I feared that adopting Karlin and Taylor for my course would be a mistake of the same sort as adopting Feller's book for an undergraduate introduction to probability, and this fear prevailed the first two times I taught the course. However, after using, and finding wanting, two derivatives of Karlin's classic, I took the plunge and assigned Karlin and Taylor's book. The result was very much the one which I predicted: I was far more enthusiastic about the text than were my students. In an attempt to make Karlin and Taylor's book more palatable for the students, I started supplementing their text with notes in which I tried to couch the proofs in terms which I hoped they would find more accessible, and my efforts were rewarded with a quite positive response from my students. In fact, as my notes became more and more extensive and began to diminish the importance of the book, I decided to convert them into what is now this book, although I realize that my decision to do so may have been stupid. For one thing, the market is already close to glutted with books which purport to cover this material. Moreover, some of these books are quite popular, although my experience with them leads me to believe that their popularity is not always correlated with the quality of the mathematics they contain. Having made that pejorative comment, I will not make public which are the books which led me to this conclusion. Instead, I will only mention the books on this topic, besides Karlin and Taylor's, which I very much liked. Namely,
J. Norris's book [5] is an excellent introduction to Markov processes which, at the same time, provides its readers with a good place to exercise their measure-theoretic skills. Of course, Norris's book is only appropriate for students who have measure-theoretic skills to exercise. On the other hand, for students who possess those skills, Norris's book is a place where they can see measure theory put to work in an attractive way. In addition, Norris has included many interesting examples and exercises which illustrate how the subject can be applied. The present book includes most of the mathematical material contained in [5], but the proofs here demand much less measure theory than his do. In fact, although I have systematically employed measure-theoretic terminology (Lebesgue's Dominated Convergence Theorem, the Monotone Convergence Theorem, etc.), which is explained in Chapter 6, I have done so only to familiarize my readers with the jargon which they will encounter if they delve more deeply into the subject. In fact, because the state spaces in this book are countable, the applications which I have made of Lebesgue's theory are, with one notable exception, entirely trivial. The one exception, which is made in §6.2, is that I have included a proof that there exist countably infinite families of mutually independent random variables. Be that as it may, the reader who is ready to accept that such families exist has no need to consult Chapter 6 except for terminology and the derivation of a few essentially obvious facts about series. For more advanced students, an excellent treatment of Markov chains on a general state space can be found in the book [6] by D. Revuz.

The organization of this book should be more or less self-evident from the table of contents. In Chapter 1, I give a bare-hands treatment of the basic facts, with particular emphasis on recurrence and transience, about nearest neighbor random walks on the square, d-dimensional lattice Z^d. Chapter 2 introduces the study of ergodic properties, and this becomes the central theme which ties together Chapters 2 through 5. In Chapter 2, the systems under consideration are Markov chains (i.e., the time parameter is discrete), and the driving force behind the development there is an idea which was introduced by Doeblin. Restricted as the applicability of Doeblin's idea may be, it has the enormous advantage over the material in Chapters 3 and 4 that it provides an estimate on the rate at which the chain is converging to its equilibrium distribution. After giving a reasonably thorough account of Doeblin's theory, in Chapter 3 I study the ergodic properties of Markov chains which do not necessarily satisfy Doeblin's condition. The main result here is the one summarized in equation (3.2.15). Even though it is completely elementary, the derivation of (3.2.15) is, without doubt, the most demanding piece of analysis in the entire book. So far as I know, every proof of (3.2.15) requires work at some stage. In supposedly "simpler" proofs, the work is hidden elsewhere (either in measure theory, as in [5] and [6], or in operator theory, as in [2]). The treatment given here, which is a reworking of the one in [4] based on Feller's renewal theorem, demands nothing more of the reader than a thorough understanding of arguments involving limits superior, limits inferior, and their
role in proving that limits exist. In Chapter 4, Markov chains are replaced by continuous-time Markov processes (still on a countable state space). I do this first in the case when the rates are bounded and therefore problems of possible explosion do not arise. Afterwards, I allow for unbounded rates and develop criteria, besides boundedness, which guarantee non-explosion. The remainder of the chapter is devoted to transferring the results obtained for Markov chains in Chapter 3 to the continuous-time setting. Aside from Chapter 6, which is more like an appendix than an integral part of the book, the book ends with Chapter 5. The goal in Chapter 5 is to obtain quantitative results, reminiscent of, if not as strong as, those in Chapter 2, when Doeblin's theory either fails entirely or yields rather poor estimates. The new ingredient in Chapter 5 is the assumption that the chain or process is reversible (i.e., the transition probability is self-adjoint in the L²-space of its stationary distribution), and the engine which makes everything go is the associated Dirichlet form. In the final section, the power of the Dirichlet form methodology is tested in an analysis of the Metropolis (a.k.a. simulated annealing) algorithm. Finally, as I said before, Chapter 6 is an appendix in which the ideas and terminology of Lebesgue's theory of measure and integration are reviewed. The one substantive part of Chapter 6 is the construction, alluded to earlier, in §6.2.1.

Finally, I have reached the traditional place reserved for thanking those individuals who, either directly or indirectly, contributed to this book. The principal direct contributors are the many students who suffered with various and spontaneously changing versions of this book. I am particularly grateful to Adela Popescu whose careful reading brought to light many minor and a few major errors which have been removed and, perhaps, replaced by new ones.
Thanking, or even identifying, the indirect contributors is trickier. Indeed, they include all the individuals, both dead and alive, from whom I received my education, and I am not about to bore you with even a partial list of who they were or are. Nonetheless, there is one person who, over a period of more than ten years, patiently taught me to appreciate the sort of material treated here. Namely, Richard A. Holley, to whom I have dedicated this book, is a true probabilist. To wit, for Dick, intuitive understanding usually precedes his mathematically rigorous comprehension of a probabilistic phenomenon. This statement should lead no one to doubt Dick's powers as a rigorous mathematician. On the contrary, his intuitive grasp of probability theory not only enhances his own formidable mathematical powers, it has saved me and others from blindly pursuing flawed lines of reasoning. As all who have worked with him know, reconsider what you are saying if ever, during some diatribe into which you have launched, Dick quietly says "I don't follow that." In addition to his mathematical prowess, every one of Dick's many students will attest to his wonderful generosity. I was not his student, but I was his colleague, and I can assure you that his generosity is not limited to his students.

Daniel W. Stroock, August 2004
CHAPTER 1
Random Walks
A Good Place to Begin
The purpose of this chapter is to discuss some examples of Markov processes which can be understood even before the term "Markov process" is. Indeed, anyone who has been introduced to probability theory will recognize that these processes all derive from consideration of elementary "coin tossing."

1.1 Nearest Neighbor Random Walks on Z

Let p be a fixed number from the open interval (0,1), and suppose that {B_n : n ∈ Z⁺}¹ is a sequence of {−1,1}-valued, identically distributed Bernoulli random variables² which are 1 with probability p. That is, for any n ∈ Z⁺ and any ε = (ε₁, ..., ε_n) ∈ {−1,1}ⁿ,

    P(B₁ = ε₁, ..., B_n = ε_n) = p^{N(ε)} q^{n−N(ε)}, where q ≡ 1 − p and

(1.1.1)    N(ε) ≡ #{m : ε_m = 1} = (n + S_n(ε))/2  when  S_n(ε) ≡ Σ_{m=1}^{n} ε_m.

Next, set

(1.1.2)    X₀ = 0 and X_n = Σ_{m=1}^{n} B_m for n ∈ Z⁺.

The existence of the family {B_n : n ∈ Z⁺} is the content of §6.2.1. The above family of random variables {X_n : n ∈ N} is often called a nearest neighbor random walk on Z. Nearest neighbor random walks are examples of Markov processes, but the description which we have just given is the one which would be given in elementary probability theory, as opposed to a course, like this one, devoted to stochastic processes. Namely, in the study of stochastic processes the description should emphasize the dynamic aspects

¹ Z is used to denote the set of all integers, of which N and Z⁺ are, respectively, the non-negative and positive members.
² For historical reasons, mutually independent random variables which take only two values are often said to be Bernoulli random variables.
of the family. Thus, a stochastic process oriented description might replace (1.1.2) by

(1.1.3)    P(X₀ = 0) = 1 and P(X_n − X_{n−1} = ε | X₀, ..., X_{n−1}) = p if ε = 1 and = q if ε = −1,

where P(X_n − X_{n−1} = ε | X₀, ..., X_{n−1}) denotes the conditional probability (cf. §6.4.1) that X_n − X_{n−1} = ε given σ({X₀, ..., X_{n−1}}). Notice that (1.1.3) is indeed a more dynamic description than the one in (1.1.2). Specifically, it says that the process starts from 0 at time n = 0 and proceeds so that, at each time n ∈ Z⁺, it moves one step forward with probability p or one step backward with probability q, independent of where it has been before time n.

1.1.1. Distribution at Time n: In this subsection, we will present two approaches to computing P(X_n = m). The first computation is based on the description given in (1.1.2). Namely, from (1.1.2) it is clear that P(|X_n| ≤ n) = 1. In addition, it is clear that

    n odd ⟹ P(X_n is odd) = 1 and n even ⟹ P(X_n is even) = 1.

Finally, given m ∈ {−n, ..., n} with the same parity as n and a string ε = (ε₁, ..., ε_n) ∈ {−1,1}ⁿ with (cf. (1.1.1)) S_n(ε) = m, N(ε) = (n+m)/2, and so

    P(B₁ = ε₁, ..., B_n = ε_n) = p^{(n+m)/2} q^{(n−m)/2}.

Hence, because, when C(ℓ,k) ≡ ℓ!/(k!(ℓ−k)!) is the binomial coefficient "ℓ choose k," there are C(n, (m+n)/2) such strings ε, we see that

(1.1.4)    P(X_n = m) = C(n, (m+n)/2) p^{(n+m)/2} q^{(n−m)/2} if m ∈ Z, |m| ≤ n, and m has the same parity as n,

and is 0 otherwise.

Our second computation of the same probability will be based on the more dynamic description given in (1.1.3). To do this, we introduce the notation (P_n)_m ≡ P(X_n = m). Obviously, (P₀)_m = δ_{0,m}, where δ_{k,ℓ} is the Kronecker symbol which is 1 when k = ℓ and 0 otherwise. Further, from (1.1.3), we see that P(X_n = m) equals

    P(X_{n−1} = m−1 & X_n = m) + P(X_{n−1} = m+1 & X_n = m)
        = pP(X_{n−1} = m−1) + qP(X_{n−1} = m+1).

That is,

(1.1.5)    (P_n)_m = p(P_{n−1})_{m−1} + q(P_{n−1})_{m+1}.
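The agreement between the closed form (1.1.4) and the recursion (1.1.5) is easy to verify numerically. The sketch below (plain Python; the function names are mine, not the book's) computes the distribution of X_n both ways and compares them.

```python
from math import comb

def dist_closed_form(n, p):
    """P(X_n = m) via (1.1.4): m = 2k - n, where k = (n+m)/2 counts the +1 steps."""
    q = 1 - p
    return {2 * k - n: comb(n, k) * p**k * q**(n - k) for k in range(n + 1)}

def dist_recursion(n, p):
    """P(X_n = m) via (1.1.5): push mass forward one step at a time."""
    q = 1 - p
    P = {0: 1.0}                      # (P_0)_m = delta_{0,m}
    for _ in range(n):
        nxt = {}
        for m, pr in P.items():
            nxt[m + 1] = nxt.get(m + 1, 0.0) + p * pr   # step forward w.p. p
            nxt[m - 1] = nxt.get(m - 1, 0.0) + q * pr   # step backward w.p. q
        P = nxt
    return P

a = dist_closed_form(6, 0.3)
b = dist_recursion(6, 0.3)
assert all(abs(a[m] - b.get(m, 0.0)) < 1e-12 for m in a)
assert abs(sum(a.values()) - 1.0) < 1e-12
```

Running the recursion from (P₀)_m = δ_{0,m} reproduces, row by row, the (weighted) Pascal's triangle mentioned below (1.1.5).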
Obviously, (1.1.5) provides a complete, albeit implicit, prescription for computing the numbers (P_n)_m, and one can easily check that the numbers given by (1.1.4) satisfy this prescription. Alternatively, one can use (1.1.5) plus induction on n to see that (P_n)_m = 0 unless m = 2ℓ − n for some 0 ≤ ℓ ≤ n and that (C_n)_ℓ = (C_{n−1})_{ℓ−1} + (C_{n−1})_ℓ when (C_n)_ℓ ≡ p^{−ℓ} q^{ℓ−n} (P_n)_{2ℓ−n}. In other words, the coefficients {(C_n)_ℓ : n ∈ N & 0 ≤ ℓ ≤ n} are given by Pascal's triangle and are therefore the binomial coefficients.

1.1.2. Passage Times via the Reflection Principle: More challenging than the computation in §1.1.1 is finding the distribution of the first passage time to a point a ∈ Z. That is, given a ∈ Z \ {0}, set³

(1.1.6)    ζ_a ≡ inf{n ≥ 1 : X_n = a} (≡ ∞ when X_n ≠ a for any n ≥ 1).

Then ζ_a is the first passage time to a, and our goal here is to find its distribution. Equivalently, we want an expression for P(ζ_a = n), and clearly, by the considerations in §1.1.1, we need only worry about n's which satisfy n ≥ |a| and have the same parity as a. Again we will present two approaches to this problem, here based on (1.1.2) and in §1.1.5 on (1.1.3). To carry out the one based on (1.1.2), assume that a ∈ Z⁺, suppose that n ∈ Z⁺ has the same parity as a, and observe first that

    P(ζ_a = n) = P(X_n = a & ζ_a > n−1) = pP(ζ_a > n−1 & X_{n−1} = a−1).

Hence, it suffices for us to compute P(ζ_a > n−1 & X_{n−1} = a−1). For this purpose, note that for any ε ∈ {−1,1}^{n−1} with S_{n−1}(ε) = a−1, the event {(B₁, ..., B_{n−1}) = ε} has probability p^{(n+a)/2 − 1} q^{(n−a)/2}. Thus,

(*)    P(ζ_a = n) = N(n,a) p^{(n+a)/2} q^{(n−a)/2},

where N(n,a) is the number of ε ∈ {−1,1}^{n−1} with the properties that S_ℓ(ε) ≤ a−1 for 0 ≤ ℓ ≤ n−1 and S_{n−1}(ε) = a−1. That is, everything comes down to the computation of N(n,a). Alternatively, since N(n,a) = C(n−1, (n+a)/2 − 1) − N′(n,a), where N′(n,a) is the number of ε ∈ {−1,1}^{n−1} such that S_{n−1}(ε) = a−1 and S_ℓ(ε) ≥ a for some ℓ ≤ n−1, we need only compute N′(n,a). For this purpose we will use a beautiful argument known as the reflection principle. Namely, consider the set P(n,a) of paths (S₀, ..., S_{n−1}) ∈ Zⁿ with the properties that S₀ = 0, S_m − S_{m−1} ∈ {−1,1} for 1 ≤ m ≤ n−1, and S_m ≥ a for some 1 ≤ m ≤ n−1. Clearly, N′(n,a) is the number of paths in the set L(n,a) consisting of those (S₀, ..., S_{n−1}) ∈ P(n,a) for which S_{n−1} = a−1, and, as an application of the reflection principle, we will show that the set L(n,a) has the same number of elements as the set U(n,a) whose elements are those paths (S₀, ..., S_{n−1}) ∈ P(n,a) for which S_{n−1} = a+1. Since (S₀, ..., S_{n−1}) ∈ U(n,a) if and only if S₀ = 0, S_m − S_{m−1} ∈ {−1,1}

³ As the preceding indicates, we take the infimum over the empty set to be +∞.
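For small n, the reflection-principle identity |L(n,a)| = |U(n,a)| can be confirmed by brute force: enumerate all ε ∈ {−1,1}^{n−1} and count the paths in P(n,a) according to their endpoint. A minimal sketch (the function names are mine, not the book's):

```python
from itertools import product

def partial_sums(eps):
    """The path S_0 = 0, S_1, ..., S_{n-1} determined by the steps eps."""
    s, out = 0, [0]
    for e in eps:
        s += e
        out.append(s)
    return out

def count_endpoint(n, a, end):
    """Number of (n-1)-step paths that reach level a >= 1 and end at `end`."""
    cnt = 0
    for eps in product((-1, 1), repeat=n - 1):
        S = partial_sums(eps)
        # max(S) >= a detects "S_m >= a for some m", since S_0 = 0 < a
        if max(S) >= a and S[-1] == end:
            cnt += 1
    return cnt

n, a = 9, 3
assert count_endpoint(n, a, a - 1) == count_endpoint(n, a, a + 1)
```

The reflection that proves the identity maps a path in L(n,a) to one in U(n,a) by flipping the portion of the path after its first visit to level a; the count above merely checks the resulting equality of cardinalities.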
with (1.1.16), we have shown that

(1.1.17)    E[s^{ζ_a}] = ((1 − √(1 − 4pqs²))/(2qs))^a if a ∈ Z⁺ and E[s^{ζ_a}] = ((1 − √(1 − 4pqs²))/(2ps))^{−a} if −a ∈ Z⁺,

for |s| < 1.
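The generating function of the first passage time to 1 can be checked numerically. Assuming the standard closed form E[s^{ζ₁}] = (1 − √(1 − 4pqs²))/(2qs) (the root of the functional equation u = ps + qs·u² that vanishes at s = 0), the sketch below compares it with an exact dynamic-programming computation of P(ζ₁ = n); the helper is my own construction, not code from the book.

```python
from math import sqrt

def first_passage_probs(p, T):
    """Exact P(zeta_1 = n) for n = 1..T: evolve the walk, killing it on
    its first visit to 1 and harvesting the killed mass."""
    q = 1 - p
    f = {1: p}            # reach 1 on the first step w.p. p
    alive = {-1: q}       # otherwise the walk is at -1
    for n in range(2, T + 1):
        nxt = {}
        for m, pr in alive.items():
            for step, w in ((1, p), (-1, q)):
                nxt[m + step] = nxt.get(m + step, 0.0) + w * pr
        f[n] = nxt.pop(1, 0.0)   # mass arriving at 1 for the first time
        alive = nxt
    return f

p, s = 0.4, 0.7
q = 1 - p
f = first_passage_probs(p, 120)
series = sum(pr * s**n for n, pr in f.items())      # truncated E[s^{zeta_1}]
closed = (1 - sqrt(1 - 4 * p * q * s * s)) / (2 * q * s)
assert abs(series - closed) < 1e-8
```

Because |s| < 1, the truncation error of the series at T = 120 is far below the tolerance, so the comparison is a genuine test of the closed form.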
1.2 Recurrence Properties of Random Walks
In §1.1.4, we studied the time ρ₀ of first return of a nearest neighbor random walk to 0. As we will see in Chapters 2 and 3, times of first return are critical (cf. §2.3.2) for an understanding of the long time behavior of random walks and related processes. Indeed, when the random walk returns to 0, it starts all over again. Thus, if it returns with probability 1, then the entire history of the walk will consist of epochs, each epoch being a sojourn which begins and ends at 0. Because it marks the time at which one epoch ends and a second, identically distributed, one begins, a time of first return is often called a recurrence time, and the walk is said to be recurrent if P(ρ₀ < ∞) = 1. Walks which are not recurrent are said to be transient.

In this section, we will discuss the recurrence properties of nearest neighbor random walks. Of course, we already know (cf. (1.1.13)) that a nearest neighbor random walk on Z is recurrent if and only if it is symmetric. Thus, our interest here will be in higher dimensional analogs. In particular, in the hope that it will be convincing evidence that recurrence is subtle, we will show that the recurrence of the nearest neighbor, symmetric random walk on Z persists when Z is replaced by Z² but disappears in Z³.
1.2.1. Random Walks on Z^d: To describe the analog on Z^d of a nearest neighbor random walk on Z, we begin by thinking of N₁ ≡ {−1,1} as the set of nearest neighbors in Z of 0. It should then be clear why the set N_d of nearest neighbors of the origin in Z^d consists of the 2d points in Z^d for which (d−1) coordinates are 0 and the remaining coordinate is in N₁. Next, we replace the N₁-valued Bernoulli random variables in §1.1 by their d-dimensional analogs, namely: independent, identically distributed N_d-valued random variables B₁, ..., B_n, ... .⁸ Finally, a nearest neighbor random walk on Z^d is a family {X_n : n ≥ 0} of the form

    X₀ = 0 and X_n = Σ_{m=1}^{n} B_m for n ≥ 1.

The equivalent, stochastic process oriented description of {X_n : n ≥ 0} is P(X₀ = 0) = 1 and, for n ≥ 1 and ε ∈ N_d,

(1.2.1)    P(X_n − X_{n−1} = ε | X₀, ..., X_{n−1}) = p_ε, where p_ε ≡ P(B₁ = ε).

When B₁ is uniformly distributed on N_d, the random walk is said to be symmetric.
In keeping with the notation and terminology introduced above, we define the time $\rho_0$ of first return to the origin to be equal to $n$ if $n \ge 1$, $X_n = 0$, and $X_m \ne 0$ for $1 \le m < n$, and we take $\rho_0 = \infty$ if no such $n \ge 1$ exists. Also, we will say that the walk is recurrent or transient according to whether $\mathbb{P}(\rho_0 < \infty)$ is 1 or strictly less than 1.

1.2.2. An Elementary Recurrence Criterion: Given $n \ge 1$, let $\rho_0^{(n)}$ be the time of the $n$th return to 0. That is, $\rho_0^{(1)} = \rho_0$ and, for $n \ge 2$,
$$\rho_0^{(n-1)} < \infty \implies \rho_0^{(n)} = \inf\bigl\{m > \rho_0^{(n-1)} : X_m = 0\bigr\} \quad\text{and}\quad \rho_0^{(n-1)} = \infty \implies \rho_0^{(n)} = \infty.$$

Equivalently, if $g : (N_d)^{\mathbb{Z}^+} \to \mathbb{Z}^+ \cup \{\infty\}$ is determined so that

$$g(\epsilon_1, \dots, \epsilon_\ell, \dots) > n \quad\text{if}\quad \sum_{\ell=1}^{m} \epsilon_\ell \ne 0 \ \text{ for } 1 \le m \le n,$$

then $\rho_0 = g(B_1, \dots, B_\ell, \dots)$, and

$$\rho_0^{(n)} = m \implies \rho_0^{(n+1)} = m + \rho_0 \circ \Sigma_m, \quad\text{where } \rho_0 \circ \Sigma_m \equiv g(B_{m+1}, \dots, B_{m+\ell}, \dots).$$

In particular, this leads to

$$\mathbb{P}\bigl(\rho_0^{(n+1)} < \infty\bigr) = \sum_{m=1}^{\infty} \mathbb{P}\bigl(\rho_0^{(n)} = m \ \&\ \rho_0 \circ \Sigma_m < \infty\bigr)$$
$^8$ The existence of the $B_n$'s can be seen as a consequence of Theorem 6.3.2. Namely, let $\{U_n : n \in \mathbb{Z}^+\}$ be a family of mutually independent random variables which are uniformly distributed on $[0,1)$. Next, let $(k_1, \dots, k_{2d})$ be an ordering of the elements of $N_d$, set $\beta_0 = 0$ and $\beta_m = \sum_{\ell=1}^{m} \mathbb{P}(B_1 = k_\ell)$ for $1 \le m \le 2d$, define $F : [0,1) \to N_d$ so that $F \restriction [\beta_{m-1}, \beta_m) = k_m$, and set $B_n = F(U_n)$.
since $\{\rho_0^{(n)} = m\}$ depends only on $(B_1, \dots, B_m)$, and is therefore independent of $\rho_0 \circ \Sigma_m$, and the distribution of $\rho_0 \circ \Sigma_m$ is the same as that of $\rho_0$. Thus, we have proved that

(1.2.2) $\quad \mathbb{P}\bigl(\rho_0^{(n)} < \infty\bigr) = \mathbb{P}(\rho_0 < \infty)^n$ for all $n \ge 1$.

One dividend of (1.2.2) is that it supports the epochal picture given above about the structure of recurrent walks. Namely, it says that if the walk returns once to 0 with probability 1, then, with probability 1, it will do so infinitely often. This observation has many applications. For example, it shows that if the mean value of the $i$th coordinate of $B_1$ is different from 0, then $\{X_n : n \ge 0\}$ must be transient. To see this, use $Y_n$ to denote the $i$th coordinate of $X_n$, and observe that $\{Y_n - Y_{n-1} : n \ge 1\}$ is a sequence of mutually independent, identically distributed $\{-1, 0, 1\}$-valued random variables with mean value $\mu \ne 0$. But, by the Strong Law of Large Numbers (cf. Exercise 1.3.4 below), this means that $\frac{Y_n}{n} \to \mu \ne 0$ with probability 1, which is possible only if $|X_n| \ge |Y_n| \to \infty$ with probability 1, and clearly this eliminates the possibility that, even with positive probability, $X_n = 0$ infinitely often.

A second dividend of (1.2.2) is the following. Define
$$T_0 = \sum_{n=0}^{\infty} \mathbf{1}_{\{0\}}(X_n)$$

to be the total time that $\{X_n : n \ge 0\}$ spends at the origin. Since $X_0 = 0$, $T_0 \ge 1$. Moreover, for $n \ge 1$, $T_0 > n$ if and only if $\rho_0^{(n)} < \infty$. Thus,

$$\mathbb{E}[T_0] = \sum_{n=0}^{\infty} \mathbb{P}(T_0 > n) = 1 + \sum_{n=1}^{\infty} \mathbb{P}\bigl(\rho_0^{(n)} < \infty\bigr) = 1 + \sum_{n=1}^{\infty} \mathbb{P}(\rho_0 < \infty)^n,$$

and so

(1.2.3) $\quad \displaystyle \mathbb{E}[T_0] = \frac{1}{1 - \mathbb{P}(\rho_0 < \infty)} = \frac{1}{\mathbb{P}(\rho_0 = \infty)}.$

Before applying (1.2.3) to the problem of recurrence, it is interesting to note that $T_0$ is a random variable for which the following peculiar dichotomy holds:

(1.2.4) $\quad \mathbb{P}(T_0 < \infty) > 0 \implies \mathbb{E}[T_0] < \infty \quad\text{and}\quad \mathbb{E}[T_0] = \infty \implies \mathbb{P}(T_0 = \infty) = 1.$

Indeed, if $\mathbb{P}(T_0 < \infty) > 0$, then, with positive probability, $X_n$ cannot be 0 infinitely often and so, by (1.2.2), $\mathbb{P}(\rho_0 < \infty) < 1$, which, by (1.2.3), means that $\mathbb{E}[T_0] < \infty$. On the other hand, if $\mathbb{E}[T_0] = \infty$, then (1.2.3) implies that $\mathbb{P}(\rho_0 < \infty) = 1$ and therefore, by (1.2.2), that $\mathbb{P}(T_0 > n) = \mathbb{P}\bigl(\rho_0^{(n)} < \infty\bigr) = 1$ for all $n \ge 1$. Hence (cf. (6.1.3)), $\mathbb{P}(T_0 = \infty) = 1$.
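The recurrence dichotomy just described is easy to probe numerically. The following Python sketch (not from the text; the horizon and trial counts are arbitrary) estimates the probability that the symmetric nearest neighbor walk on $\mathbb{Z}^d$ returns to the origin within a fixed number of steps.

```python
import random

def return_frequency(d, steps, trials, seed=0):
    """Fraction of simulated walks that revisit the origin within `steps` steps."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        pos = [0] * d
        for _ in range(steps):
            i = rng.randrange(d)            # which coordinate moves
            pos[i] += rng.choice((-1, 1))   # direction of the move
            if all(c == 0 for c in pos):
                hits += 1
                break
    return hits / trials

for d in (1, 2, 3):
    print(d, return_frequency(d, steps=2000, trials=500))
```

For $d = 1$ and $d = 2$ the estimate climbs toward 1 as the horizon grows, while for $d = 3$ it stalls near $1/3$, in line with the recurrence/transience dichotomy proved in this section.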
1.2.3. Recurrence of Symmetric Random Walk in $\mathbb{Z}^2$: The most frequent way that (1.2.3) gets applied to determine recurrence is in conjunction with the formula

(1.2.5) $\quad \displaystyle \mathbb{E}[T_0] = \sum_{n=0}^{\infty} \mathbb{P}(X_n = 0).$
Although the proof of (1.2.5) is essentially trivial (cf. Theorem 6.1.15):

$$\mathbb{E}[T_0] = \sum_{n=0}^{\infty} \mathbb{E}\bigl[\mathbf{1}_{\{0\}}(X_n)\bigr] = \sum_{n=0}^{\infty} \mathbb{P}(X_n = 0),$$

in conjunction with (1.2.3) it becomes powerful. Namely, it says that

(1.2.6) $\quad \{X_n : n \ge 0\}$ is recurrent if and only if $\displaystyle \sum_{n=0}^{\infty} \mathbb{P}(X_n = 0) = \infty,$
and, since $\mathbb{P}(X_n = 0)$ is more amenable to estimation than quantities which involve knowing the trajectory at more than one time, this is valuable information.

In order to apply (1.2.6) to symmetric random walks, it is important to know that when the walk is symmetric, then 0 is the most likely place for the walk to be at any even time. To verify this, note that if $k \in \mathbb{Z}^d$,

$$\mathbb{P}(X_{2n} = k) = \sum_{\ell\in\mathbb{Z}^d} \mathbb{P}(X_n = \ell \ \&\ X_{2n} - X_n = k - \ell) = \sum_{\ell\in\mathbb{Z}^d} \mathbb{P}(X_n = \ell)\,\mathbb{P}(X_n = k - \ell) \le \sum_{\ell\in\mathbb{Z}^d} \mathbb{P}(X_n = \ell)^2,$$

where, in the passage to the last expression, we have applied Schwarz's inequality (cf. Exercise 1.3.1 below). Up to this point we have not used symmetry. However, if the walk is symmetric, then $\mathbb{P}(X_n = -\ell) = \mathbb{P}(X_n = \ell)$, and so the last expression in the preceding can be continued as

$$\sum_{\ell\in\mathbb{Z}^d} \mathbb{P}(X_n = \ell)\,\mathbb{P}(X_{2n} - X_n = -\ell) = \mathbb{P}(X_{2n} = 0).$$

Thus,

(1.2.7) $\quad \{X_n : n \ge 0\}$ symmetric $\implies \mathbb{P}(X_{2n} = 0) = \max_{k\in\mathbb{Z}^d} \mathbb{P}(X_{2n} = k).$
To develop a feeling for how these considerations get applied, we begin by using them to give a second derivation of the recurrence of the nearest neighbor, symmetric random walk on $\mathbb{Z}$. For this purpose, note that, because $\mathbb{P}(|X_n| \le n) = 1$, (1.2.7) implies that

$$1 = \sum_{\ell=-2n}^{2n} \mathbb{P}(X_{2n} = \ell) \le (4n+1)\,\mathbb{P}(X_{2n} = 0),$$

and therefore, since the harmonic series diverges, that $\sum_{n=0}^{\infty} \mathbb{P}(X_n = 0) = \infty$.

The analysis for the symmetric, nearest neighbor random walk in $\mathbb{Z}^2$ requires an additional ingredient. Namely, the $d$-dimensional analog of the preceding line of reasoning would lead to $\mathbb{P}(X_{2n} = 0) \ge (4n+1)^{-d}$, which is inconclusive except when $d = 1$. In order to do better, we need to use the fact that

(1.2.8) $\quad \mathbb{E}\bigl[|X_n|^2\bigr] = n.$

To prove (1.2.8), note that each coordinate of $B_n$ is a random variable with mean value 0 and variance $\frac1d$. Hence, because the $B_n$'s are mutually independent, the second moment of each coordinate of $X_n$ is $\frac nd$. Knowing (1.2.8), Markov's inequality (6.1.12) says that

$$\mathbb{P}\bigl(|X_{2n}| \ge 2\sqrt n\bigr) \le \frac{\mathbb{E}\bigl[|X_{2n}|^2\bigr]}{4n} = \frac12,$$
which allows us to sharpen the preceding argument to give$^9$

$$\frac12 \le \mathbb{P}\bigl(|X_{2n}| < 2\sqrt n\bigr) = \sum_{|\ell| < 2\sqrt n} \mathbb{P}(X_{2n} = \ell) \le \bigl(4\sqrt n + 1\bigr)^d\,\mathbb{P}(X_{2n} = 0).$$

Hence $\mathbb{P}(X_{2n} = 0) \ge \tfrac12\bigl(4\sqrt n + 1\bigr)^{-d}$, and so, when $d = 2$, $\sum_{n=0}^{\infty} \mathbb{P}(X_{2n} = 0) = \infty$. That is, by (1.2.6), the symmetric, nearest neighbor random walk on $\mathbb{Z}^2$ is recurrent.

1.2.4. Transience in $\mathbb{Z}^3$: To handle $d \ge 3$, we need an upper bound on $\mathbb{P}(X_{2n} = 0)$, and the first step is an upper bound for the walk on $\mathbb{Z}$. A computation with binomial coefficients shows that

$$\mathbb{P}(X_{2n} = 2\ell) \ge e^{-\frac{2\ell^2}{n+1}}\,\mathbb{P}(X_{2n} = 0) \quad\text{as long as } 0 \le \ell \le \tfrac{n+1}{3}.$$

Because $\mathbb{P}(X_{2n} = 2\ell) = \mathbb{P}(X_{2n} = -2\ell)$, we can now say that $\mathbb{P}(X_{2n} = 0) \le e^2\,\mathbb{P}(X_{2n} = 2\ell)$ for $|\ell| \le \sqrt n$. But, because $\sum_{\ell} \mathbb{P}(X_{2n} = 2\ell) = 1$, this means that $\bigl(2\lfloor\sqrt n\rfloor + 1\bigr)\mathbb{P}(X_{2n} = 0) \le e^2$, and so

(1.2.11) $\quad \mathbb{P}(X_{2n} = 0) \le e^2\bigl(2\lfloor\sqrt n\rfloor + 1\bigr)^{-1}, \quad n \ge 1,$

when $\{X_n : n \ge 0\}$ is the symmetric, nearest neighbor random walk on $\mathbb{Z}$.

If, as they most definitely are not, the coordinates of the symmetric, nearest neighbor random walk on $\mathbb{Z}^d$ were independent, then (1.2.11) would yield the sort of upper bound for which we are looking. Thus it is reasonable to examine to what extent we can relate the symmetric, nearest neighbor random walk on $\mathbb{Z}^d$ to $d$ mutually independent symmetric, nearest neighbor random walks $\{X_{i,n} : n \ge 0\}$, $1 \le i \le d$, on $\mathbb{Z}$. To this end, refer to (1.2.1), in which now $p_\epsilon = \frac{1}{2d}$, and think of choosing $X_n - X_{n-1}$ in two steps: first choose the coordinate which is to be nonzero and then choose whether it is to be $+1$ or $-1$. With this in mind, let $\{I_n : n \ge 1\}$ be a sequence of $\{1, \dots, d\}$-valued, mutually
independent, uniformly distributed random variables which are independent of $\{X_{i,n} : 1 \le i \le d \ \&\ n \ge 0\}$, set, for $1 \le i \le d$, $N_{i,0} = 0$ and $N_{i,n} = \sum_{m=1}^{n} \mathbf{1}_{\{i\}}(I_m)$ when $n \ge 1$, and consider the sequence $\{Y_n : n \ge 0\}$ given by

(1.2.12) $\quad Y_n = \bigl(X_{1,N_{1,n}}, \dots, X_{d,N_{d,n}}\bigr).$
Without too much effort, one can check that $\{Y_n : n \ge 0\}$ satisfies the conditions in (1.2.1) for the symmetric, nearest neighbor random walk on $\mathbb{Z}^d$ and therefore has the same distribution as $\{X_n : n \ge 0\}$. In particular, by (1.2.11),

$$\mathbb{P}(X_{2n} = 0) = \sum_{\mathbf{m}\in\mathbb{N}^d} \mathbb{P}\bigl(X_{i,2m_i} = 0 \ \&\ N_{i,2n} = 2m_i \text{ for } 1 \le i \le d\bigr)$$
$$\le \sum_{\substack{\mathbf{m}\in\mathbb{N}^d \\ m_1\wedge\cdots\wedge m_d \ge \frac{n}{3d}}} \Bigl(\prod_{i=1}^{d} \mathbb{P}(X_{i,2m_i} = 0)\Bigr)\,\mathbb{P}\bigl(N_{i,2n} = 2m_i \text{ for } 1 \le i \le d\bigr) \;+\; \sum_{\substack{\mathbf{m}\in\mathbb{N}^d \\ m_1\wedge\cdots\wedge m_d < \frac{n}{3d}}} \mathbb{P}\bigl(N_{i,2n} = 2m_i \text{ for } 1 \le i \le d\bigr)$$
$$\le e^{2d}\Bigl(2\Bigl\lfloor\sqrt{\tfrac{n}{3d}}\Bigr\rfloor + 1\Bigr)^{-d} \;+\; \mathbb{P}\Bigl(N_{i,2n} \le \tfrac{2n}{3d} \text{ for some } 1 \le i \le d\Bigr).$$

Thus, we will have proved that there is a constant $A(d) < \infty$ such that

(1.2.13) $\quad \mathbb{P}(X_{2n} = 0) \le A(d)\,n^{-\frac d2}, \quad n \ge 1,$

once we show that there is a constant $B(d) < \infty$ such that

(1.2.14) $\quad \mathbb{P}\bigl(N_{i,2n} \le \tfrac{2n}{3d} \text{ for some } 1 \le i \le d\bigr) \le B(d)\,n^{-\frac d2}, \quad n \ge 1.$

In particular, this will complete the proof that

$$d \ge 3 \implies \sum_{n=0}^{\infty} \mathbb{P}(X_{2n} = 0) < \infty,$$

and therefore, by (1.2.6), that the symmetric, nearest neighbor random walk on $\mathbb{Z}^d$ is transient when $d \ge 3$.
To prove (1.2.14), note that, for each $i$, $N_{i,2n}$ is the sum of $2n$ mutually independent $\{0,1\}$-valued Bernoulli random variables with mean $\frac1d$. Thus, let $\{Z_m : m \ge 1\}$ be mutually independent $\{0,1\}$-valued random variables with $\mathbb{P}(Z_m = 1) = p = 1 - q$, and set $\psi(\lambda) = \log \mathbb{E}\bigl[e^{\lambda(Z_1 - p)}\bigr] = \log\bigl(qe^{-\lambda p} + pe^{\lambda q}\bigr)$. Then $\psi(0) = \psi'(0) = 0$ and

$$\psi''(\lambda) = \frac{pq\,e^{\lambda}}{\bigl(q + pe^{\lambda}\bigr)^2} \le \frac14,$$

so Taylor's formula allows us to conclude that

(1.2.15) $\quad \mathbb{E}\Bigl[e^{\lambda\sum_{m=1}^{n}(Z_m - p)}\Bigr] \le e^{\frac{n\lambda^2}{8}} \quad\text{for all } \lambda \in \mathbb{R}.$

Starting from (1.2.15), there are many ways to arrive at (1.2.14). For example, for any $\lambda > 0$ and $R > 0$, Markov's inequality (6.1.12) plus (1.2.15) say that

$$\mathbb{P}\Bigl(\sum_{m=1}^{n} Z_m \le np - nR\Bigr) \le e^{-\lambda nR}\,\mathbb{E}\Bigl[e^{-\lambda\sum_{m=1}^{n}(Z_m - p)}\Bigr] \le e^{-\lambda nR + \frac{n\lambda^2}{8}},$$

which, when $\lambda = 4R$, gives

(1.2.16) $\quad \mathbb{P}\Bigl(\sum_{m=1}^{n} Z_m \le np - nR\Bigr) \le e^{-2nR^2}.$

Returning to the notation used earlier and using the remark with which our discussion of (1.2.14) began, one sees from (1.2.16), applied with $p = \frac1d$ and $R = \frac{2}{3d}$ to the $2n$ summands of $N_{i,2n}$, that

$$\mathbb{P}\Bigl(N_{i,2n} \le \tfrac{2n}{3d} \text{ for some } 1 \le i \le d\Bigr) \le d\,e^{-\frac{16n}{9d^2}},$$

which is obviously far more than is required by (1.2.14).
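To see how much room a bound of the form (1.2.16) leaves, one can compare it with an exact binomial lower tail. The following Python sketch is illustrative (the parameters $n$, $p$, $R$ are arbitrary choices, not from the text).

```python
from math import comb, exp

def lower_tail(n, p, k):
    """P(Binomial(n, p) <= k), computed exactly from binomial coefficients."""
    return sum(comb(n, m) * p**m * (1 - p)**(n - m) for m in range(k + 1))

n, p, R = 200, 1 / 3, 0.1
threshold = int(n * p - n * R)        # largest integer <= np - nR
exact = lower_tail(n, p, threshold)   # true probability of the event
bound = exp(-2 * n * R * R)           # the Hoeffding-type bound (1.2.16)
assert exact <= bound                 # the bound dominates the exact tail
print(exact, bound)
```

The exact tail here is roughly an order of magnitude below the exponential bound, which is typical: (1.2.16) trades sharpness for the simple exponential decay needed in the proof.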
The argument which we have used in this subsection is an example of an extremely powerful method known as coupling. Loosely speaking, the coupling method entails writing a random variable about which one wants to know more (in our case $X_{2n}$) as a function of random variables about which one knows quite a bit (in our case $\{(X_{i,m}, N_{i,m}) : 1 \le i \le d \ \&\ m \ge 0\}$). Of course, the power of the method reflects the power of its user: there are lots of ways in which to couple a random variable to other random variables, but most of them are useless.
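The coupling (1.2.12) is easy to realize in simulation. The Python sketch below (the names are illustrative) runs $d$ independent one dimensional walks on the random clocks $N_{i,n}$ and checks that the assembled process moves like a nearest neighbor walk on $\mathbb{Z}^d$, as (1.2.1) requires.

```python
import random

def coupled_walk(d, steps, rng):
    """Build Y_n = (X_{1,N_{1,n}}, ..., X_{d,N_{d,n}}) as in (1.2.12)."""
    X = [0] * d      # the d independent one dimensional walks
    N = [0] * d      # N[i] = number of times coordinate i has been chosen
    path = [tuple(X)]
    for _ in range(steps):
        i = rng.randrange(d)          # I_n: the coordinate that moves
        N[i] += 1
        X[i] += rng.choice((-1, 1))   # next step of the i-th walk
        path.append(tuple(X))
    assert sum(N) == steps            # each step advances exactly one clock
    return path

rng = random.Random(42)
path = coupled_walk(3, 10, rng)
# Each increment of Y is a nearest neighbor step in Z^3.
for a, b in zip(path, path[1:]):
    assert sum(abs(x - y) for x, y in zip(a, b)) == 1
```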
1.3 Exercises
EXERCISE 1.3.1. Schwarz's inequality comes in many forms, the most elementary of which is the statement that, for any $\{a_n : n \in \mathbb{Z}\} \subseteq \mathbb{R}$ and $\{b_n : n \in \mathbb{Z}\} \subseteq \mathbb{R}$,

$$\Bigl(\sum_{n\in\mathbb{Z}} |a_n b_n|\Bigr)^2 \le \Bigl(\sum_{n\in\mathbb{Z}} a_n^2\Bigr)\Bigl(\sum_{n\in\mathbb{Z}} b_n^2\Bigr).$$

Moreover, when the right hand side is finite, then equality holds if and only if there is an $\alpha \in \mathbb{R}$ for which either $b_n = \alpha a_n$, $n \in \mathbb{Z}$, or $a_n = \alpha b_n$, $n \in \mathbb{Z}$. Here is an outline of one proof of these statements.

(a) Begin by showing that it suffices to treat the case in which $a_n = 0 = b_n$ for all but a finite number of $n$'s.

(b) Given a real, quadratic polynomial $P(x) = Ax^2 + 2Bx + C$, use the quadratic formula to see that $P \ge 0$ everywhere if and only if $C \ge 0$ and $B^2 \le AC$. Similarly, show that $P > 0$ everywhere if and only if $C > 0$ and $B^2 < AC$.

(c) Assuming that $a_n = 0 = b_n$ for all but a finite number of $n$'s, set $P(x) = \sum_n (a_n x + b_n)^2$, and apply (b) to get the desired conclusions. Finally, use (a) to remove the restriction on the $a_n$'s and $b_n$'s.

EXERCISE 1.3.2. Let $\{Y_n : n \ge 1\}$ be a sequence of mutually independent, identically distributed random variables satisfying $\mathbb{E}[|Y_1|] < \infty$. Set $X_n = \sum_{m=1}^{n} Y_m$ for $n \ge 1$. The Weak Law of Large Numbers says that $\frac{X_n}{n} \to \mathbb{E}[Y_1]$ in probability.
In fact,

(1.3.3) $\quad \displaystyle \lim_{n\to\infty} \mathbb{E}\Bigl[\Bigl|\frac{X_n}{n} - \mathbb{E}[Y_1]\Bigr|\Bigr] = 0,$

from which the above follows as an application of Markov's inequality. Here are steps which lead to (1.3.3).

(a) First reduce to the case when $\mathbb{E}[Y_1] = 0$. Next, assume that $\mathbb{E}[Y_1^2] < \infty$, and show that $\mathbb{E}\bigl[\bigl(\frac{X_n}{n}\bigr)^2\bigr] = \frac{\mathbb{E}[Y_1^2]}{n} \to 0$. Hence the result is proved when $Y_1$ has a finite second moment.
1.3 Exercises R) � (b) Given R > 0, set y = Yn l[o , R) ( ! Ynl)  IE [Yn, !Ynl < R) � 2::::: �= 1 y . Note that, for any R > 0,
[[ IJ
n IE :
oo , and combine this with the results in ( a) and (b) to arrive at
and
… Because $\int_{-\infty}^{\infty} e^{-\frac{\sigma^2}{2}}\,d\sigma = \sqrt{2\pi}$, it is clear that (1.3.6) follows after one lets $R \nearrow \infty$.
EXERCISE 1.3.7. The argument in §1.2.3 is quite robust. Indeed, let $\{X_n : n \ge 0\}$ be any symmetric random walk on $\mathbb{Z}^2$ whose jumps have finite second moment. That is, $X_0 = 0$ and $\{X_n - X_{n-1} : n \ge 1\}$ are mutually independent, identically distributed, symmetric ($X_1$ has the same distribution as $-X_1$), $\mathbb{Z}^2$-valued random variables with finite second moment. Show that $\{X_n : n \ge 0\}$ is recurrent in the sense that $\mathbb{P}(\exists n \ge 1\ X_n = 0) = 1$.$^{10}$
$^{10}$ A unit exponential random variable is a random variable $T$ for which $\mathbb{P}(T > t) = e^{-(t\vee 0)}$.
EXERCISE 1.3.8. Let $\{X_n : n \ge 0\}$ be a random walk on $\mathbb{Z}^d$: $X_0 = 0$ and $\{X_n - X_{n-1} : n \ge 1\}$ are mutually independent, identically distributed, $\mathbb{Z}^d$-valued random variables. Further, for each $1 \le i \le d$, let $(X_n)_i$ be the $i$th coordinate of $X_n$. If, for some $C < \infty$ and $(\alpha_1, \dots, \alpha_d) \in [0,\infty)^d$ with $\sum_{i=1}^{d} \alpha_i > 1$,

$$\mathbb{P}\bigl((X_n)_i = 0\bigr) \le C n^{-\alpha_i}, \quad n \ge 1,$$

show that $\{X_n : n \ge 0\}$ is transient in the sense that $\mathbb{P}(\exists n \ge 1\ X_n = 0) < 1$.
EXERCISE 1.3.9. Let $\{X_n : n \ge 0\}$ be a random walk on $\mathbb{Z}^d$, as in the preceding. Given $k \in \mathbb{Z}^d$, set

$$T_k = \sum_{n=0}^{\infty} \mathbf{1}_{\{k\}}(X_n) \quad\text{and}\quad \zeta_k = \inf\{n \ge 0 : X_n = k\}.$$

Show that

(1.3.10) $\quad \displaystyle \mathbb{E}[T_k] = \mathbb{P}(\zeta_k < \infty)\,\mathbb{E}[T_0] = \frac{\mathbb{P}(\zeta_k < \infty)}{\mathbb{P}(\rho_0 = \infty)},$
where $\rho_0 = \inf\{n \ge 1 : X_n = 0\}$ is the time of first return to 0. In particular, if $\{X_n : n \ge 0\}$ is transient in the sense described in the preceding exercise, show that

$$\mathbb{E}\Bigl[\sum_{n=0}^{\infty} \mathbf{1}_{B(r)}(X_n)\Bigr] < \infty \quad\text{for all } r \in (0,\infty),$$

where $B(r) = \{k : |k| \le r\}$; and from this conclude that $|X_n| \to \infty$ with probability 1. On the other hand, if $\{X_n : n \ge 0\}$ is recurrent, show that $X_n = 0$ infinitely often with probability 1. Hence, either $\{X_n : n \ge 0\}$ is recurrent and $X_n = 0$ infinitely often with probability 1 or it is transient and $|X_n| \to \infty$ with probability 1.

EXERCISE 1.3.11. Take $d = 1$ in the preceding, $X_0 = 0$, and $\{X_n - X_{n-1} : n \ge 1\}$ to be mutually independent, identically distributed random variables for which $0 < \mathbb{E}[|X_1|] < \infty$ and $\mathbb{E}[X_1] = 0$. By a slight variation on the argument given in §1.2.1, we will show here that this random walk is recurrent but that
$$\varlimsup_{n\to\infty} X_n = \infty \quad\text{and}\quad \varliminf_{n\to\infty} X_n = -\infty \quad\text{with probability } 1.$$

(a) First show that it suffices to prove that $\sup_n X_n = \infty$ and that $\inf_n X_n = -\infty$ with probability 1. Next, use the Weak Law of Large Numbers (cf. Exercise 1.3.2) to show that

$$\lim_{n\to\infty} \frac{1}{n}\max_{1\le m\le n} \mathbb{E}\bigl[|X_m|\bigr] = 0.$$
(b) For $n \ge 1$, set $T_k^{(n)} = \sum_{m=0}^{n-1} \mathbf{1}_{\{k\}}(X_m)$, show that $\mathbb{E}\bigl[T_k^{(n)}\bigr] \le \mathbb{E}\bigl[T_0^{(n)}\bigr]$ for all $k \in \mathbb{Z}$, and use this to arrive at …

… If $\mu \equiv \mathbb{E}[E_1] > 0$, show that $\mathbb{P}(M_\infty = \infty) = 1$. When $\mu = 0$, use Exercise 1.3.11 to reach the same conclusion. Hence, when $\mathbb{E}[E_1] \ge 0$, $\mathbb{P}(Q_n = j) \to 0$ for all $j \in \mathbb{N}$. That is, when the expected number of arrivals is at least as large as the expected number of people served, then, with probability 1, the queue grows infinitely long.

(d) Now assume that $\mu = \mathbb{E}[E_1] < 0$. Then the Strong Law of Large Numbers (cf. Exercise 1.3.4 for the case when $E_1$ has a finite fourth moment and Theorem 1.4.11 in [9] for the general case) says that $\frac{X_n}{n} \to \mu$ with probability 1. In particular, conclude that $M_\infty < \infty$ with probability 1 and therefore that $\sum_{j\in\mathbb{N}} \nu_j = 1$ when $\nu_j \equiv \lim_{n\to\infty} \mathbb{P}(Q_n = j) = \mathbb{P}(M_\infty = j)$.
… $\pi = \lim_{n\to\infty} \mu P^{n+1} = \bigl(\lim_{n\to\infty} \mu P^{n}\bigr)P = \pi P$. That is, $\pi$ would have to be a left eigenvector for $P$ with eigenvalue 1. A probability vector $\pi$ is, for obvious reasons, called a stationary distribution for the transition probability matrix $P$ if $\pi = \pi P$.

Although we were thinking about finite state spaces in the preceding discussion, there are situations in which these musings apply even to infinite state spaces. Namely, if, no matter where the chain starts, it has a positive probability of visiting some fixed state, then, as the following theorem shows, it will stabilize.
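For a concrete feeling for the equation $\pi = \pi P$, here is a Python sketch (the $3\times3$ matrix is made up for illustration) that finds the stationary vector by power iteration — exactly the "$\mu P^n$ stabilizes" heuristic above.

```python
def stationary(P, iterations=10_000):
    """Approximate the stationary row vector by power iteration mu -> mu P."""
    n = len(P)
    mu = [1.0 / n] * n                     # any initial probability vector
    for _ in range(iterations):
        mu = [sum(mu[i] * P[i][j] for i in range(n)) for j in range(n)]
    return mu

P = [[0.5, 0.25, 0.25],
     [0.2, 0.6,  0.2 ],
     [0.3, 0.3,  0.4 ]]
pi = stationary(P)
# Check stationarity: pi P = pi (up to round-off).
piP = [sum(pi[i] * P[i][j] for i in range(3)) for j in range(3)]
assert all(abs(a - b) < 1e-12 for a, b in zip(pi, piP))
print([round(x, 4) for x in pi])
```

For a finite chain one could equally well extract $\pi$ as the left eigenvector of $P$ with eigenvalue 1; the iteration above is simply the probabilistic route.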
2.2.1 DOEBLIN'S THEOREM. Let $P$ be a transition probability matrix with the property that, for some state $j_0 \in \S$ and $\epsilon > 0$, $(P)_{ij_0} \ge \epsilon$ for all $i \in \S$. Then $P$ has a unique stationary probability vector $\pi$, $(\pi)_{j_0} \ge \epsilon$, and, for all initial distributions $\mu$,

$$\|\mu P^n - \pi\|_v \le (1-\epsilon)^n \|\mu - \pi\|_v \quad\text{for } n \ge 0.$$
PROOF: The key to the proof lies in the observations that if $\rho \in \mathbb{R}^{\S}$ is a row vector with $\|\rho\|_v < \infty$, then

(2.2.2) $\quad \displaystyle \sum_{j\in\S} (\rho P)_j = \sum_{i\in\S} (\rho)_i \quad\text{and}\quad \sum_{i\in\S} (\rho)_i = 0 \implies \|\rho P^n\|_v \le (1-\epsilon)^n \|\rho\|_v \ \text{ for } n \ge 1.$
The first of these is trivial, because, by Theorem 6.1.15,

$$\sum_{j\in\S} (\rho P)_j = \sum_{j\in\S}\sum_{i\in\S} (\rho)_i (P)_{ij} = \sum_{i\in\S} (\rho)_i \sum_{j\in\S} (P)_{ij} = \sum_{i\in\S} (\rho)_i.$$

As for the second, we note that, by an easy induction argument, it suffices to check it when $n = 1$. Next, suppose that $\sum_i (\rho)_i = 0$, and observe that

$$\bigl|(\rho P)_j\bigr| = \Bigl|\sum_{i\in\S} (\rho)_i (P)_{ij}\Bigr| = \Bigl|\sum_{i\in\S} (\rho)_i \bigl((P)_{ij} - \epsilon\delta_{j,j_0}\bigr)\Bigr| \le \sum_{i\in\S} |(\rho)_i|\bigl((P)_{ij} - \epsilon\delta_{j,j_0}\bigr),$$

and therefore that

$$\|\rho P\|_v \le \sum_{j\in\S}\sum_{i\in\S} |(\rho)_i|\bigl((P)_{ij} - \epsilon\delta_{j,j_0}\bigr) = \sum_{i\in\S} |(\rho)_i|\Bigl(\sum_{j\in\S}\bigl((P)_{ij} - \epsilon\delta_{j,j_0}\bigr)\Bigr) = (1-\epsilon)\|\rho\|_v.$$

Now let $\mu$ be a probability vector, and set $\mu_n = \mu P^n$. Then, because $\mu_n = \mu_{n-m}P^m$ and $\sum_i \bigl((\mu_{n-m})_i - (\mu)_i\bigr) = 1 - 1 = 0$,

$$\|\mu_n - \mu_m\|_v = \bigl\|(\mu_{n-m} - \mu)P^m\bigr\|_v \le (1-\epsilon)^m \|\mu_{n-m} - \mu\|_v \le 2(1-\epsilon)^m$$
for $1 \le m < n$. Hence, $\{\mu_n\}_1^\infty$ is Cauchy convergent; and therefore there exists a $\pi$ for which $\lim_{n\to\infty}\|\mu_n - \pi\|_v = 0$. Since each $\mu_n$ is a probability vector, it is clear that $\pi$ must also be a probability vector. In addition, $\pi = \lim_{n\to\infty}\mu P^{n+1} = \bigl(\lim_{n\to\infty}\mu P^n\bigr)P = \pi P$, and so $\pi$ is stationary. In particular,

$$(\pi)_{j_0} = \sum_{i\in\S} (\pi)_i (P)_{ij_0} \ge \epsilon.$$

Finally, if $\nu$ is any probability vector, then

$$\|\nu P^n - \pi\|_v = \bigl\|(\nu - \pi)P^n\bigr\|_v \le (1-\epsilon)^n \|\nu - \pi\|_v,$$

which, of course, proves both the stated convergence result and the uniqueness of $\pi$ as the only stationary probability vector for $P$. $\square$

It is instructive to understand what Doeblin's Theorem says in the language of spectral theory. Namely, as an operator on the space of bounded functions (a.k.a. column vectors with finite uniform norm), $P$ has the function $\mathbf{1}$ as a right eigenfunction with eigenvalue 1: $P\mathbf{1} = \mathbf{1}$. Thus, at least if $\S$ is finite, general principles say that there should exist a row vector which is a left eigenvector of $P$ with eigenvalue 1. Moreover, because $\mathbf{1}$ and the entries of $P$ are real, this left eigenvector can be taken to have real components. Thus, from the spectral point of view, it is no surprise that there is a nonzero row vector $\mu \in \mathbb{R}^{\S}$ with the property that $\mu P = \mu$. On the other hand,
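The contraction estimate in (2.2.2) can be watched directly. In the Python sketch below (the matrix is made up; every entry of its column $j_0 = 0$ is at least $\epsilon = 0.2$), a zero-sum row vector shrinks in $\|\cdot\|_v$ by at least a factor $1-\epsilon$ per application of $P$.

```python
def v_norm(rho):
    """The variation-style norm: sum of absolute values of the entries."""
    return sum(abs(x) for x in rho)

def apply_P(rho, P):
    """Row vector times matrix: (rho P)_j = sum_i rho_i P_ij."""
    n = len(P)
    return [sum(rho[i] * P[i][j] for i in range(n)) for j in range(n)]

# Every state reaches state 0 with probability >= eps = 0.2 in one step.
P = [[0.2, 0.5, 0.3],
     [0.4, 0.1, 0.5],
     [0.3, 0.6, 0.1]]
eps = 0.2
start = [0.7, -0.2, -0.5]            # a signed row vector with zero sum
assert abs(sum(start)) < 1e-15

rho = list(start)
for n in range(1, 6):
    rho = apply_P(rho, P)
    assert abs(sum(rho)) < 1e-12                                   # sum preserved
    assert v_norm(rho) <= (1 - eps) ** n * v_norm(start) + 1e-12   # (2.2.2)
```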
standard spectral theory would not predict that $\mu$ can be chosen to have nonnegative components, and this is the first place where Doeblin's Theorem gives information which is not readily available from spectral theory, even when $\S$ is finite.

To interpret the estimate in Doeblin's Theorem, let $M_1(\S;\mathbb{C})$ denote the space of row vectors $\nu \in \mathbb{C}^{\S}$ with $\|\nu\|_v = 1$. Then $\|\nu P\|_v \le 1$ for all $\nu \in M_1(\S;\mathbb{C})$, and so $\sup\{|\alpha| : \nu P = \alpha\nu \text{ for some } \nu \in M_1(\S;\mathbb{C})\} \le 1$ (cf. [2]). …
Namely, if, for some $M \in \mathbb{Z}^+$, $j_0 \in \S$, and $\epsilon > 0$, $(P^M)_{ij_0} \ge \epsilon$ for all $i \in \S$, then

(2.2.3) $\quad \|\mu P^n - \pi\|_v \le 2(1-\epsilon)^{\lfloor n/M\rfloor}, \quad n \ge 1,$

for all probability vectors $\mu$ and a unique stationary probability vector $\pi$. To see this, let $\pi$ be the stationary probability vector for $P^M$, the one guaranteed by Theorem 2.2.1, and note that, for any probability vector $\mu$, any $m \in \mathbb{N}$, and any $0 \le r < M$,

$$\|\mu P^{mM+r} - \pi\|_v = \bigl\|(\mu P^r)(P^M)^m - \pi\bigr\|_v \le (1-\epsilon)^m \|\mu P^r - \pi\|_v \le 2(1-\epsilon)^m,$$

from which both (2.2.3) and the $P$-stationarity and uniqueness of $\pi$ follow just as before. …

2.3.6 THEOREM. For all $m \in \mathbb{Z}^+$ and $(i,j) \in \S^2$,

$$\mathbb{P}\bigl(\rho_j^{(m)} < \infty \mid X_0 = i\bigr) = \mathbb{P}(\rho_j < \infty \mid X_0 = i)\,\mathbb{P}(\rho_j < \infty \mid X_0 = j)^{m-1}.$$

In particular, if $j$ is recurrent, then $\mathbb{P}\bigl(\rho_j^{(m)} < \infty \mid X_0 = j\bigr) = 1$ for all $m \in \mathbb{Z}^+$. In fact, if $j$ is recurrent, then, conditional on $X_0 = j$, $\{\rho_j^{(m)} - \rho_j^{(m-1)} : m \ge 1\}$ (with $\rho_j^{(0)} \equiv 0$) is a sequence of mutually independent random variables each of which has the same distribution as $\rho_j$.
PROOF: To prove the first statement, we apply (2.1.13) and the Monotone Convergence Theorem, Theorem 6.1.9, to justify

$$\begin{aligned}
\mathbb{P}\bigl(\rho_j^{(m)} < \infty \mid X_0 = i\bigr)
&= \sum_{n=1}^{\infty} \mathbb{P}\bigl(\rho_j^{(m)} < \infty \ \&\ \rho_j^{(m-1)} = n \mid X_0 = i\bigr) \\
&= \sum_{n=1}^{\infty} \lim_{N\to\infty} \mathbb{E}\bigl[1 - F_{N,j}(X_n, \dots, X_{n+N}),\ \rho_j^{(m-1)} = n \mid X_0 = i\bigr] \\
&= \sum_{n=1}^{\infty} \lim_{N\to\infty} \mathbb{E}\bigl[1 - F_{N,j}(X_0, \dots, X_N) \mid X_0 = j\bigr]\,\mathbb{P}\bigl(\rho_j^{(m-1)} = n \mid X_0 = i\bigr) \\
&= \sum_{n=1}^{\infty} \mathbb{P}(\rho_j < \infty \mid X_0 = j)\,\mathbb{P}\bigl(\rho_j^{(m-1)} = n \mid X_0 = i\bigr) \\
&= \mathbb{P}(\rho_j < \infty \mid X_0 = j)\,\mathbb{P}\bigl(\rho_j^{(m-1)} < \infty \mid X_0 = i\bigr).
\end{aligned}$$

Turning to the second statement, note that it suffices for us to prove that

$$\mathbb{P}\bigl(\rho_j^{(m+1)} > n + n_m \mid X_0 = j,\ \rho_j^{(1)} = n_1, \dots, \rho_j^{(m)} = n_m\bigr) = \mathbb{P}(\rho_j > n \mid X_0 = j).$$
But, again by (2.1.13), the expression on the left is equal to

$$\mathbb{E}\bigl[F_{n,j}(X_{n_m}, \dots, X_{n_m+n}) \mid X_0 = j,\ \rho_j^{(1)} = n_1, \dots, \rho_j^{(m)} = n_m\bigr] = \mathbb{E}\bigl[F_{n,j}(X_0, \dots, X_n) \mid X_0 = j\bigr] = \mathbb{P}(\rho_j > n \mid X_0 = j). \quad\square$$
Reasoning as we did in §1.2.2, we can derive from the first part of Theorem 2.3.6 that

(2.3.7) $\quad \displaystyle \mathbb{E}[T_j \mid X_0 = i] = \delta_{i,j} + \frac{\mathbb{P}(\rho_j < \infty \mid X_0 = i)}{\mathbb{P}(\rho_j = \infty \mid X_0 = j)},$

and so

$$\mathbb{E}[T_j \mid X_0 = j] = \infty \iff \mathbb{P}(T_j = \infty \mid X_0 = j) = 1 \quad\text{and}\quad \mathbb{E}[T_j \mid X_0 = j] < \infty \iff \mathbb{P}(T_j < \infty \mid X_0 = j) = 1,$$

where $T_j = \sum_{m=0}^{\infty} \mathbf{1}_{\{j\}}(X_m)$ is the total time the chain spends in the state $j$. Indeed, because

$$\mathbb{P}(T_j > m \mid X_0 = i) = \begin{cases} \mathbb{P}\bigl(\rho_j^{(m)} < \infty \mid X_0 = i\bigr) & \text{if } i = j, \\ \mathbb{P}\bigl(\rho_j^{(m+1)} < \infty \mid X_0 = i\bigr) & \text{if } i \ne j, \end{cases}$$
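A quick Monte Carlo sanity check of (2.3.7) (a Python sketch on a made-up two-state chain, not from the text): from state 0 the chain stays at 0 with probability $\tfrac12$ and otherwise jumps to an absorbing state, so $\mathbb{P}(\rho_0 = \infty \mid X_0 = 0) = \tfrac12$ and (2.3.7) predicts $\mathbb{E}[T_0 \mid X_0 = 0] = 2$.

```python
import random

def total_time_at_zero(rng):
    """Run the two-state chain from 0 until absorption; count visits to 0."""
    visits = 0
    state = 0
    while state == 0:
        visits += 1
        state = 0 if rng.random() < 0.5 else 1   # state 1 is absorbing
    return visits

rng = random.Random(7)
trials = 20_000
mean_T0 = sum(total_time_at_zero(rng) for _ in range(trials)) / trials
# (2.3.7): E[T_0 | X_0 = 0] = 1 / P(rho_0 = infinity | X_0 = 0) = 2.
assert abs(mean_T0 - 2.0) < 0.1
print(mean_T0)
```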
… $> 0$, and so

$$\begin{aligned}
\mathbb{P}\bigl(\rho_j^{(m)} = \infty \mid X_0 = j\bigr) &\ge \mathbb{P}\bigl(\rho_j^{(m)} = \infty \ \&\ X_m = j_0 \mid X_0 = j\bigr) \\
&\ge \lim_{N\to\infty} \mathbb{E}\bigl[F_{N,j}(X_m, \dots, X_{m+N}),\ X_m = j_0 \mid X_0 = j\bigr] \\
&= \lim_{N\to\infty} \mathbb{E}\bigl[F_{N,j}(X_0, \dots, X_N) \mid X_0 = j_0\bigr]\,\mathbb{P}(X_m = j_0 \mid X_0 = j) \\
&= \mathbb{P}(\rho_j = \infty \mid X_0 = j_0)\,(P^m)_{jj_0} > 0.
\end{aligned}$$
Now set $u(n,i) = \mathbb{P}(\rho_j > nM \mid X_0 = i)$ for $n \in \mathbb{Z}^+$ and $i \in \S$. Then, by (2.1.13),

$$\begin{aligned}
u(n+1, i) &= \sum_{k\in\S} \mathbb{P}\bigl(\rho_j > (n+1)M \ \&\ X_{nM} = k \mid X_0 = i\bigr) \\
&= \sum_{k\in\S} \mathbb{E}\bigl[F_{M,j}(X_{nM}, \dots, X_{(n+1)M}),\ \rho_j > nM \ \&\ X_{nM} = k \mid X_0 = i\bigr] \\
&= \sum_{k\in\S} \mathbb{P}(\rho_j > M \mid X_0 = k)\,\mathbb{P}\bigl(\rho_j > nM \ \&\ X_{nM} = k \mid X_0 = i\bigr) \\
&= \sum_{k\in\S} u(1,k)\,\mathbb{P}\bigl(\rho_j > nM \ \&\ X_{nM} = k \mid X_0 = i\bigr).
\end{aligned}$$

Hence, $u(n+1, i) \le U u(n, i)$, where $U \equiv \max_{k\in\S} u(1,k)$. Finally, since $u(1,k) = 1 - \mathbb{P}(\rho_j \le M \mid X_0 = k)$ and

$$\mathbb{P}(\rho_j \le M \mid X_0 = k) \ge \max_{0\le m<M} (P^m)_{kj} \ge (A_M)_{kj} \ge \epsilon,$$

$U \le 1 - \epsilon$. In particular, this means that $u(n+1, i) \le (1-\epsilon)u(n,i)$, and therefore that $\mathbb{P}(\rho_j > nM \mid X_0 = j) \le (1-\epsilon)^n$, from which

$$\begin{aligned}
\mathbb{E}\bigl[\rho_j^p \mid X_0 = j\bigr] &= \sum_{n=1}^{\infty} n^p\,\mathbb{P}(\rho_j = n \mid X_0 = j) \le \sum_{m=1}^{\infty}\ \sum_{n=(m-1)M+1}^{mM} (mM)^p\,\mathbb{P}(\rho_j = n \mid X_0 = j) \\
&\le M^p \sum_{m=1}^{\infty} m^p\,\mathbb{P}\bigl(\rho_j > (m-1)M \mid X_0 = j\bigr) \le M^p \sum_{m=1}^{\infty} m^p (1-\epsilon)^{m-1} < \infty
\end{aligned}$$

follows immediately. $\square$
2.3.3. Identification of $\pi$: Under the conditions in Theorem 2.2.5, we know that there is precisely one $P$-stationary probability vector $\pi$. In this section, we will give a probabilistic interpretation of $(\pi)_j$. Namely, we will show that

(2.3.9) $\quad \displaystyle \sup_{M\ge 1}\,\sup_{j\in\S}\,\inf_{i\in\S} (A_M)_{ij} > 0 \implies (\pi)_j = \frac{1}{\mathbb{E}[\rho_j \mid X_0 = j]} \quad (= 0 \text{ if } j \text{ is transient}).$

The idea for the proof of (2.3.9) is that, on the one hand (cf. (2.3.3)), $(\pi)_j$ is the long run average amount of time the chain spends at $j$, while, on the other hand, when $X_0 = j$,

$$\sum_{\ell=0}^{\rho_j^{(m)}-1} \mathbf{1}_{\{j\}}(X_\ell) = m.$$

Thus, since $\rho_j^{(m)}$ is the sum of $m$ mutually independent copies of $\rho_j$, the preceding combined with the Weak Law of Large Numbers should lead to (2.3.9).

To carry out the program suggested above, we will actually prove a stronger result. Namely, we will show that, for each $j \in \S$,$^6$

(2.3.10) $\quad \displaystyle \mathbb{P}\Bigl(\lim_{n\to\infty} \frac{T_j^{(n)}}{n} = \frac{1}{\mathbb{E}[\rho_j \mid X_0 = j]} \,\Bigm|\, X_0 = j\Bigr) = 1.$

In particular, because $0 \le \frac{T_j^{(n)}}{n} \le 1$, Lebesgue's Dominated Convergence Theorem, Theorem 6.1.11, says that $(\pi)_j = \frac{1}{\mathbb{E}[\rho_j \mid X_0 = j]}$ follows from (2.3.10). Thus, we need only prove (2.3.10). To this end, choose $M$ and $\epsilon > 0$ so that $(A_M)_{ij_0} \ge \epsilon$ for all $i$. If $j_0 \not\rightarrow j$, then, by Theorem 2.3.8, $j$ is transient, and so, by (2.3.7), $\mathbb{P}(T_j < \infty \mid X_0 = j) = 1$. Hence,
$^6$ Statements like the one which follows are called individual ergodic theorems because they, as distinguished from the first part of Theorem 2.3.4, are about convergence with probability 1 as opposed to convergence in mean. See Exercise 3.3.9 below for more information.
$$\mathbb{P}\Bigl(\lim_{n\to\infty} \frac{T_j^{(n)}}{n} = 0 \,\Bigm|\, X_0 = j\Bigr) = 1,$$

and, since $j$ is transient, $\mathbb{E}[\rho_j \mid X_0 = j] = \infty$. Hence, we have proved (2.3.10) in the case when $j_0 \not\rightarrow j$. Next assume that $j_0 \rightarrow j$. Then, again by Theorem 2.3.8, $r_j \equiv \mathbb{E}[\rho_j \mid X_0 = j] < \infty$ and, conditional on $X_0 = j$, $\{\rho_j^{(m)} - \rho_j^{(m-1)} : m \ge 1\}$ is a sequence of mutually independent random variables with the same distribution as $\rho_j$. In particular, by the Strong Law of Large Numbers (cf. Exercise 1.3.4),

$$\mathbb{P}\Bigl(\lim_{m\to\infty} \frac{\rho_j^{(m)}}{m} = r_j \,\Bigm|\, X_0 = j\Bigr) = 1.$$
On the other hand, when $X_0 = j$ and $\rho_j^{(m)} < n \le \rho_j^{(m+1)}$, $T_j^{(n)} = m + 1$, and therefore

$$\frac{m+1}{\rho_j^{(m+1)}} \le \frac{T_j^{(n)}}{n} < \frac{m+1}{\rho_j^{(m)}}.$$

Hence, since $\frac{\rho_j^{(m)}}{m} \to r_j$ with probability 1, both bounds tend to $r_j^{-1}$ as $n$ (and therefore $m$) tends to infinity, and so

$$\Bigl|\frac{T_j^{(n)}}{n} - \frac{1}{r_j}\Bigr| \longrightarrow 0 \quad\text{with probability 1},$$

which is precisely (2.3.10). $\square$
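The convergence asserted in (2.3.10) shows up clearly in simulation. A Python sketch (the two-state matrix is made up; for it $\pi = (\tfrac56, \tfrac16)$, so $\mathbb{E}[\rho_0 \mid X_0 = 0] = \tfrac65$ and the occupation frequency of state 0 should settle at $\tfrac56$):

```python
import random

P = [[0.9, 0.1],
     [0.5, 0.5]]

def occupation_frequency(j, n, rng):
    """T_j^(n)/n along one trajectory started at j."""
    state, visits = j, 0
    for _ in range(n):
        visits += (state == j)
        state = 0 if rng.random() < P[state][0] else 1
    return visits / n

rng = random.Random(0)
freq = occupation_frequency(0, 200_000, rng)
# pi solves pi = pi P: pi = (5/6, 1/6), and (2.3.10) gives T_0^(n)/n -> 5/6.
assert abs(freq - 5 / 6) < 0.01
print(freq)
```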
… $(A_n)_{ij_0} > 0$ for all $i \in \S$. In particular, if such a $j_0$ exists, conclude that, for all probability vectors $\mu$,

$$\|\mu A_n - \pi\|_v \le \frac{2(N-1)}{n\epsilon}, \quad n \ge 1,$$

where $\pi$ is the unique stationary probability vector for $P$.
EXERCISE 2.4.3. Here is a version of Doeblin's Theorem which sometimes gives a slightly better estimate. Namely, assume that $(P)_{ij} \ge \epsilon_j$ for all $(i,j)$, and set $\epsilon = \sum_j \epsilon_j$. If $\epsilon > 0$, show that the conclusion of Theorem 2.2.1 holds and that $(\pi)_i \ge \epsilon_i$ for each $i \in \S$.

EXERCISE 2.4.4. Assume that $P$ is a transition probability matrix on the finite state space $\S$, and show that $j \in \S$ is recurrent if and only if $\mathbb{E}[\rho_j \mid X_0 = j] < \infty$ …, where $\pi$ is the unique stationary probability vector for $P$. Of course, the "if" part is trivial and has nothing to do with the …

…
The last conclusion has the ominous implication that, when the expected number of progeny is larger than 1, then the population either becomes extinct or, what may be worse, grows indefinitely.

EXERCISE 2.4.8. Continue with the setting and notation in Exercise 2.4.7. We want to show in this exercise that there are significant differences between the cases when $\gamma < 1$ and $\gamma = 1$.

(a) Show that $\mathbb{E}[X_n \mid X_0 = i] = i\gamma^n$. Hence, when $\gamma < 1$, the expected size of the population goes to 0 at an exponential rate. On the other hand, when $\gamma = 1$, the expected size remains constant, this in spite of the fact that $\mathbb{P}(X_n = 0 \mid X_0 = i) \to 1$. Thus, when $\gamma = 1$, we have a typical situation of the sort which demonstrates why Lebesgue had to make the hypotheses he did in his dominated convergence theorem, Theorem 6.1.11. In the present case, the explanation is simple: as $n \nearrow \infty$, with large probability $X_n = 0$ but, nonetheless, with positive probability $X_n$ is enormous.

(b) Let $\rho_0$ be the time of first return to 0. Show that

$$\mathbb{P}(\rho_0 \le n \mid X_0 = i) = \mathbb{P}(X_n = 0 \mid X_0 = i) = \bigl(f^{\circ(n-1)}(\mu_0)\bigr)^i,$$

and use this to get the estimate

$$\mathbb{E}[\rho_0 \mid X_0 = i] \le i\,\mathbb{E}[\rho_0 \mid X_0 = 1].$$

In particular, this shows that $\mathbb{E}[\rho_0 \mid X_0 = i] < \infty$ when $\gamma < 1$.

(c) Now assume that $\gamma = 1$. Under the additional condition that $\beta \equiv f''(1) = \sum_{k\ge 2} k(k-1)\mu_k < \infty$, start from (b) and show that $\mathbb{E}[\rho_0 \mid X_0 = i] = \infty$ for all $i \ge 1$.
Hint: Begin by showing that … for $n > m$. Next, use this to show that

$$\infty > \mathbb{E}[\rho_0 \mid X_0 = 1] = 1 + \sum_{n=0}^{\infty}\bigl(1 - f^{\circ n}(\mu_0)\bigr)$$

would lead to a contradiction.

(d) Here we want to show that the conclusion in (c) will, in general, be false without the finiteness condition on the second derivative. To see this, let $\theta \in (0,1)$ be given, and check that $f(s) = s + \frac{(1-s)^{1+\theta}}{1+\theta} = \sum_{k=0}^{\infty} \mu_k s^k$, where $\mu = (\mu_0, \dots, \mu_k, \dots)$ is a probability vector for which $\mu_k > 0$ unless $k = 1$. Now use this choice of $\mu$ to see that, when the second derivative condition in (c) fails, $\mathbb{E}[\rho_0 \mid X_0 = 1]$ can be finite even though $\gamma = 1$.

Hint: Set $a_n = 1 - f^{\circ n}(\mu_0)$, note that $a_n - a_{n+1} = \frac{a_n^{1+\theta}}{1+\theta}$, and use this first to see that $\frac{a_{n+1}}{a_n} \to 1$ and then that there exist $0 < c_1 < c_2 < \infty$ such that $c_1 \le n^{\frac1\theta} a_n \le c_2$ for all $n \ge 1$.
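The branching chain of these exercises is easy to simulate. In the Python sketch below (the offspring law is made up, with mean $\gamma = \tfrac12 < 1$), extinction occurs essentially immediately, as the bound $\mathbb{P}(X_n \ne 0 \mid X_0 = i) \le \mathbb{E}[X_n \mid X_0 = i] = i\gamma^n$ from part (a) predicts.

```python
import random

OFFSPRING = [(0, 0.6), (1, 0.3), (2, 0.1)]   # mean number of progeny = 0.5

def offspring(rng):
    """Sample one offspring count from the (made-up) law above."""
    u = rng.random()
    acc = 0.0
    for k, p in OFFSPRING:
        acc += p
        if u < acc:
            return k
    return OFFSPRING[-1][0]

def extinct_by(n, i, rng):
    """Run the branching process from X_0 = i for n generations."""
    pop = i
    for _ in range(n):
        if pop == 0:
            return True
        pop = sum(offspring(rng) for _ in range(pop))
    return pop == 0

rng = random.Random(3)
trials = 2000
freq = sum(extinct_by(30, 3, rng) for _ in range(trials)) / trials
# Subcritical (gamma < 1): extinction is certain and fast, so E[rho_0] < infinity.
assert freq > 0.97
print(freq)
```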
… for all $i \in \hat\S \equiv \S \setminus \Gamma$.

(a) Show that $h(i) = \sum_{j\in\hat\S} (P)_{ij}\,h(j)$ for all $i \in \hat\S$, and conclude that the matrix $\hat P$ given by $(\hat P)_{ij} = h(i)^{-1}(P)_{ij}h(j)$ for $(i,j) \in (\hat\S)^2$ is a transition probability matrix on $\hat\S$.

(b) For all $n \in \mathbb{N}$ and $(j_0, \dots, j_n) \in (\hat\S)^{n+1}$, show that, for each $i \in \hat\S$,
$$\hat{\mathbb{P}}(X_0 = j_0, \dots, X_n = j_n \mid X_0 = i) = \mathbb{P}(X_0 = j_0, \dots, X_n = j_n \mid \rho_\Gamma = \infty \ \&\ X_0 = i),$$

where $\hat{\mathbb{P}}$ is used here to denote probabilities computed for the Markov chain on $\hat\S$ whose transition probability matrix is $\hat P$. That is, the Markov chain determined by $\hat P$ is the Markov chain determined by $P$ conditioned to never hit $\Gamma$.$^7$

$^7$ The "h" comes from the connection with harmonic functions.
EXERCISE 2.4.10. Here is another example of an h-transform. Namely, assume that $j_0 \in \S$ is transient but that $i \rightarrow j_0$ for all $i \in \S$.$^8$ Set $h(i) = \mathbb{P}(\rho_{j_0} < \infty \mid X_0 = i)$ for $i \ne j_0$ and $h(j_0) = 1$.

(a) After checking that $h(i) > 0$ for all $i \in \S$, define $\hat P$ so that

$$(\hat P)_{ij} = \begin{cases} \dots & \text{if } i = j_0, \\ h(i)^{-1}(P)_{ij}h(j) & \text{if } i \ne j_0. \end{cases}$$

Show that $\hat P$ is again a transition probability matrix.

(b) Using $\hat{\mathbb{P}}$ to denote probabilities computed relative to the chain determined by $\hat P$, show that

$$\hat{\mathbb{P}}(\rho_{j_0} > n \mid X_0 = i) = \frac{\mathbb{P}(n < \rho_{j_0} < \infty \mid X_0 = i)}{h(i)}$$

for all $n \in \mathbb{N}$ and $i \ne j_0$.

(c) Starting from the result in (b), show that $j_0$ is recurrent for the chain determined by $\hat P$.

$^8$ By Exercise 2.4.2, this is possible only if $\S$ is infinite.
CHAPTER 3
More about the Ergodic Theory of M arkov Chains
In Chapter 2, all of our considerations centered around one form or another of the Doeblin condition, which says that there is a state which can be reached from any other state at a uniformly fast rate. Although there are lots of chains on an infinite state space which satisfy this condition, most do not. As a result, many chains on an infinite state space will not even admit a stationary probability distribution. Indeed, the fact that there are infinitely many states means that there is enough space for the chain to "get lost and disappear." There are two ways in which this can happen. Namely, the chain can disappear because, like the nearest neighbor, nonsymmetric random walk on $\mathbb{Z}$ (cf. (1.1.13)) or even the symmetric one in $\mathbb{Z}^3$ (cf. §1.2.4), it may have no recurrent states and, as a consequence, will spend a finite amount of time in any given state. A more subtle way for the chain to disappear is for it to be recurrent but not sufficiently recurrent for there to exist a stationary distribution. Such an example is the symmetric, nearest neighbor random walk on $\mathbb{Z}$, which is recurrent, but just barely so. In particular, although this random walk returns infinitely often to the place where it begins, it does so at too sparse a set of times. More precisely, by (1.2.7) and (1.2.13), if $P$ is the transition probability matrix for the symmetric, nearest neighbor random walk on $\mathbb{Z}$, then

$$(P^{2n})_{ij} \le (P^{2n})_{ii} = \mathbb{P}(X_{2n} = 0) \le A(1)\,n^{-\frac12} \to 0,$$

and so, if $\mu$ were a probability vector which was stationary for $P$, then, by Lebesgue's Dominated Convergence Theorem, Theorem 6.1.11, we would have the contradiction

$$(\mu)_j = \lim_{n\to\infty} \sum_{i\in\mathbb{Z}} (\mu)_i (P^{2n})_{ij} = 0 \quad\text{for all } j \in \mathbb{Z}.$$

In this chapter, we will see that ergodic properties can exist in the absence of Doeblin's condition. However, as we will see, what survives does so in a weaker form. Specifically, we will no longer be looking for convergence in the $\|\cdot\|_v$-norm and instead will settle for pointwise convergence. That is, we will be looking for results of the form $(\mu P^n)_j \to (\pi)_j$ for each $j$ rather than $\|\mu P^n - \pi\|_v \to 0$.
3.1 Classification of States
In this section we deal with a topic which was hinted at but not explicitly discussed in Chapter 2. Namely, because we will no longer be making an assumption like Doeblin's, it will be necessary to take into account the possibility that the chain sees the state space as a union of disjoint parts. To be precise, given a pair $(i,j)$ of states, recall that we write $i{\rightarrow}j$ and say that $j$ is accessible from $i$ if, with positive probability, the chain can go from state $i$ to state $j$. That is, $(P^n)_{ij} > 0$ for some $n \in \mathbb{N}$. Notice that accessibility is transitive: if $(P^m)_{ij} > 0$ and $(P^n)_{j\ell} > 0$, then

(3.1.1) $\quad \displaystyle (P^{m+n})_{i\ell} = \sum_{k} (P^m)_{ik}(P^n)_{k\ell} \ge (P^m)_{ij}(P^n)_{j\ell} > 0.$
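These definitions translate directly into finite computations. A Python sketch (the chain's support graph is made up; the gcd computed at the end anticipates the period introduced in §3.1.3):

```python
from math import gcd

# Support graph of a made-up chain: adjacency[i] = states j with (P)_{ij} > 0.
adjacency = {0: [1], 1: [2], 2: [0, 3], 3: [0]}

def reachable(i):
    """All states accessible from i (i itself included, since (P^0)_{ii} = 1)."""
    seen, stack = {i}, [i]
    while stack:
        for j in adjacency[stack.pop()]:
            if j not in seen:
                seen.add(j)
                stack.append(j)
    return seen

def communicating_class(i):
    """[i] = states that are accessible from i and from which i is accessible."""
    return {j for j in reachable(i) if i in reachable(j)}

def period(i, horizon=50):
    """gcd of the return lengths n <= horizon with (P^n)_{ii} > 0."""
    d, current = 0, {i}
    for n in range(1, horizon + 1):
        current = {k for j in current for k in adjacency[j]}
        if i in current:
            d = gcd(d, n)
    return d

assert communicating_class(0) == {0, 1, 2, 3}
# Returns to 0 have lengths 3 (0-1-2-0) and 4 (0-1-2-3-0), so the gcd is 1.
assert period(0) == 1
```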
If $i$ and $j$ are accessible from one another in the sense that $i{\rightarrow}j$ and $j{\rightarrow}i$, then we write $i{\leftrightarrow}j$ and say that $i$ communicates with $j$. It should be clear that ${\leftrightarrow}$ is an equivalence relation. To wit, because $(P^0)_{ii} = 1$, $i{\leftrightarrow}i$, and it is trivial that $j{\leftrightarrow}i$ if $i{\leftrightarrow}j$. Finally, if $i{\leftrightarrow}j$ and $j{\leftrightarrow}\ell$, then (3.1.1) makes it obvious that $i{\leftrightarrow}\ell$. Thus, "${\leftrightarrow}$" leads to a partitioning of the state space into equivalence classes made up of communicating states. That is, for each state $i$, the communicating equivalence class $[i]$ of $i$ is the set of states $j$ such that $i{\leftrightarrow}j$; and, for every pair $(i,j)$, either $[i] = [j]$ or $[i] \cap [j] = \emptyset$. In the case when every state communicates with every other state, we say that the chain is irreducible.

3.1.1. Classification, Recurrence, and Transience: In this subsection, we will show that recurrence and transience are communicating class properties. That is, either all members of a communicating equivalence class are recurrent or all members are transient. Recall (cf. §2.3.2) that $\rho_j$ is the time of first return to $j$ and that we say $j$ is recurrent or transient according to whether $\mathbb{P}(\rho_j < \infty \mid X_0 = j)$ is equal to or strictly less than 1.

3.1.2 THEOREM. Assume that $i$ is recurrent and that $j \ne i$. Then $i{\rightarrow}j$ if and only if $\mathbb{P}(\rho_j < \rho_i \mid X_0 = i) > 0$. Moreover, if $i{\rightarrow}j$, then $\mathbb{P}(\rho_k < \infty \mid X_0 = \ell) = 1$ for any $(k,\ell) \in \{i,j\}^2$. In particular, $i{\rightarrow}j$ implies that $i{\leftrightarrow}j$ and that $j$ is recurrent.
PROOF: Given $j \ne i$ and $n \ge 1$, set (cf. (2.3.5))

$$G_n(k_0, \dots, k_n) = \bigl(F_{n-1,i}(k_0, \dots, k_{n-1}) - F_{n,i}(k_0, \dots, k_n)\bigr)F_{n,j}(k_0, \dots, k_n).$$

If $\{\rho_i^{(m)} : m \ge 0\}$ are defined as in §2.3.2, then, by (2.1.2),
P(ρ_i^{(m+1)} < ρ_j | X_0 = i) = Σ_{ℓ=1}^{∞} P(ρ_i^{(m)} = ℓ & ρ_i^{(m+1)} < ρ_j | X_0 = i)
  = Σ_{ℓ,n=1}^{∞} E[G_n(X_ℓ, ..., X_{ℓ+n}), ρ_i^{(m)} = ℓ < ρ_j | X_0 = i]
  = Σ_{ℓ,n=1}^{∞} E[G_n(X_0, ..., X_n) | X_0 = i] P(ρ_i^{(m)} = ℓ < ρ_j | X_0 = i)
  = P(ρ_i < ρ_j | X_0 = i) P(ρ_i^{(m)} < ρ_j | X_0 = i),

and so
(3.1.3) P(ρ_i^{(m)} < ρ_j | X_0 = i) = P(ρ_i < ρ_j | X_0 = i)^m.

Now suppose that i→j but P(ρ_j < ρ_i | X_0 = i) = 0. Then, by (3.1.3), P(ρ_i^{(m)} < ρ_j | X_0 = i) = 1 for every m ≥ 1, and so P(ρ_j < ∞ | X_0 = i) = 0, which contradicts i→j. Hence, i→j implies that P(ρ_j < ρ_i | X_0 = i) > 0. To prove the opposite implication and that P(ρ_i < ∞ | X_0 = j) = 1, first observe that

P(ρ_j < ρ_i < ∞ | X_0 = i) = lim_{n→∞} Σ_{m=1}^{∞} P(ρ_j = m < ρ_i ≤ m + n | X_0 = i)
  = lim_{n→∞} Σ_{m=1}^{∞} E[1 − F_{n,i}(X_m, ..., X_{m+n}), ρ_j = m < ρ_i | X_0 = i]
  = lim_{n→∞} Σ_{m=1}^{∞} P(ρ_j = m < ρ_i | X_0 = i) E[1 − F_{n,i}(X_0, ..., X_n) | X_0 = j]
  = P(ρ_j < ρ_i | X_0 = i) P(ρ_i < ∞ | X_0 = j).

Thus, after combining this with P(ρ_i < ∞ | X_0 = i) = 1, we have

P(ρ_j < ρ_i | X_0 = i) = P(ρ_j < ρ_i | X_0 = i) P(ρ_i < ∞ | X_0 = j),

which, because P(ρ_j < ρ_i | X_0 = i) > 0, is possible only if P(ρ_i < ∞ | X_0 = j) = 1. In particular, we have now proved that j→i and therefore that i↔j.
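The partition into communicating classes is easy to compute for a chain on a finite state space. The following sketch is an illustration added here, not part of the text; the matrix P is a made-up example. It finds each class [i] by intersecting the set of states accessible from i with the set of states from which i is accessible.

```python
# Illustrative sketch: partition a finite state space into communicating classes.
# i -> j means (P^n)_{ij} > 0 for some n >= 0; i <-> j means i -> j and j -> i.

def accessibility(P):
    """Boolean matrix A with A[i][j] = True iff i -> j (reflexive-transitive closure)."""
    n = len(P)
    A = [[P[i][j] > 0 or i == j for j in range(n)] for i in range(n)]
    for k in range(n):                     # Floyd-Warshall style closure
        for i in range(n):
            for j in range(n):
                A[i][j] = A[i][j] or (A[i][k] and A[k][j])
    return A

def communicating_classes(P):
    A = accessibility(P)
    n = len(P)
    classes, seen = [], set()
    for i in range(n):
        if i in seen:
            continue
        cls = [j for j in range(n) if A[i][j] and A[j][i]]   # [i] = {j : i <-> j}
        classes.append(cls)
        seen.update(cls)
    return classes

# A chain with two classes: 0 and 1 communicate, while 2 is absorbing.
P = [[0.5, 0.4, 0.1],
     [0.3, 0.7, 0.0],
     [0.0, 0.0, 1.0]]
print(communicating_classes(P))   # [[0, 1], [2]]
```

Note that the example chain is not irreducible: starting from 2 one can never reach {0, 1}, so the two classes stay distinct.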
P(T_m > (ℓ+1)N_m | X_0 = j) = Σ_{i∈F_m} P(T_m > ℓN_m & X_{ℓN_m} = i | X_0 = j) P(T_m > N_m | X_0 = i) ≤ δ_m P(T_m > ℓN_m | X_0 = j).

Thus, P(T_m > ℓN_m | X_0 = j) ≤ δ_m^ℓ, and so P(T_m = ∞ | X_0 = j) = 0. □

Remark: The preceding criteria are examples, of which there are many others, which relate recurrence of j ∈ § to the existence or nonexistence of certain types of functions u which satisfy either Pu = u or Pu ≤ u on § \ {j}.
All these criteria can be understood as mathematical implementations of the intuitive idea that

(Pu)_i = u(i) or (Pu)_i ≤ u(i) for i ≠ j

implies that, as long as X_n ≠ j, u(X_n) will be "nearly constant" or "nearly non-increasing" as n increases. The sense in which these "nearly's" should be interpreted is the subject of martingale theory, and our proofs of these criteria would have been simplified had we been able to call on martingale theory.

3.1.3. Periodicity: Periodicity is another important communicating class property. In order to describe this property, we must recall Euclid's concept of the greatest common divisor gcd(S) of a nonempty subset S ⊆ Z. Namely, we say that d ∈ Z⁺ is a common divisor of S, and write d | S, if s/d ∈ Z for every s ∈ S. Clearly, if S = {0}, then d | S for every d ∈ Z⁺, and so we take gcd(S) = ∞. On the other hand, if S ≠ {0}, then no common divisor of S can be larger than min{|s| : s ∈ S \ {0}}, and so we know that gcd(S) < ∞. Our interest in this concept comes from the role it plays in the ergodic theory of Markov chains. Namely, as we will see below, it allows us to distinguish between the chains for which powers of the transition probability matrix converge and those for which it is necessary to take averages. More precisely, given a state i, set

(3.1.10) S(i) = {n ≥ 0 : (P^n)_{ii} > 0} and d(i) = gcd(S(i)).
52
3 MORE ERGODIC THEORY OF MARKOV CHAINS
Then d(i) is called the period of the state i, and i is said to be aperiodic if d(i) = 1. As we will see, averaging is required unless i is aperiodic. However, before we get into this connection with ergodic theory, we need to take care of a few mundane matters. In the first place, the period is a communicating class property:
(3.1.11) i↔j ⟹ d(i) = d(j).

To see this, assume that (P^m)_{ij} > 0 and (P^n)_{ji} > 0, and let d be a common divisor of S(i). Then, for any k ∈ S(j), (P^{m+k+n})_{ii} ≥ (P^m)_{ij}(P^k)_{jj}(P^n)_{ji} > 0, and so m + k + n ∈ S(i). Hence d | {m + k + n : k ∈ S(j)}. But, because m + n ∈ S(i), and therefore d | (m + n), this is possible only if d divides S(j), and so we now know that d(i) ≤ d(j). After reversing the roles of i and j, one sees that d(j) ≤ d(i), which means that d(i) must equal d(j).

We next need the following elementary fact from number theory.

3.1.12 THEOREM. Given ∅ ≠ S

0 < (μ)_{j'}. But, again by (3.2.8), this means that μ = θπ^C + (1 − θ)ν, where C = {i : i↔j}, θ = Σ_{i∈C} (μ)_i ∈ (0, 1), and (ν)_i equals 0 or (1 − θ)^{−1}(μ)_i depending on whether i is or is not in C. Clearly ν ∈ Stat(P) and, because (ν)_{j'} > 0 = (π^C)_{j'}, ν ≠ π^C. Hence, μ cannot be extreme. Equivalently, every extreme μ is π^C for some communicating class C of positive recurrent states. Conversely, given a communicating class C of positive recurrent states, suppose that π^C = (1 − θ)μ + θν for some θ ∈ (0, 1) and pair (μ, ν) ∈ Stat(P)². Then (μ)_i = 0 for all i ∉ C, and so, by (3.2.8), we see first that μ = π^C and then, as a consequence, that ν = π^C as well. □

The last part of Theorem 3.2.10 shows that positive recurrence is a communicating class property. Indeed, if j is positive recurrent and C = {i : i↔j}, then π^C ∈ Stat(P), and so, since (π^C)_j > 0, π_{ii} = (π^C)_i > 0 for each i ∈ C. In particular, if a chain is irreducible, we are justified in saying it is positive recurrent if any one of its states is. See Exercise 3.3.3 below for a criterion which guarantees positive recurrence.

3.2.4. A Small Improvement: The next step in our program will be the replacement of Abel convergence by Cesàro convergence. That is, we will show that (3.2.5) can be replaced by

(3.2.11) lim_{n→∞} (A_n)_{ij} = π_{ij}, where A_n = (1/n) Σ_{m=0}^{n−1} P^m.
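The difference between convergence of P^n and of the Cesàro averages A_n can be seen numerically. The following sketch is an illustration added here, not part of the text; the two-state "flip" chain is a made-up example. For it, (P^n)_{00} alternates between 1 and 0, yet A_n converges to the stationary value 1/2.

```python
# Sketch: for the period-2 chain P = [[0,1],[1,0]], the powers P^n oscillate,
# but the Cesaro averages A_n = (1/n) * sum_{m=0}^{n-1} P^m converge.

def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def cesaro_average(P, n):
    d = len(P)
    power = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]   # P^0 = I
    total = [[0.0] * d for _ in range(d)]
    for _ in range(n):
        for i in range(d):
            for j in range(d):
                total[i][j] += power[i][j]
        power = mat_mul(power, P)
    return [[total[i][j] / n for j in range(d)] for i in range(d)]

P = [[0.0, 1.0], [1.0, 0.0]]
A = cesaro_average(P, 1000)
print(round(A[0][0], 3))   # 0.5, even though (P^n)_{00} alternates 1, 0, 1, 0, ...
```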
As is shown in Exercise 3.3.1, Cesàro convergence does not, in general, follow from Abel convergence. In fact, general results which say when Abel convergence implies Cesàro convergence can be quite delicate and, because the original one was proved by a man named Tauber, they are known as Tauberian theorems. Fortunately, the Tauberian theorem required here is quite straightforward. A key role in our proof will be played by the following easy estimate:
(3.2.12) {a_m}_0^∞ ⊆ [0, 1] & A_n = (1/n) Σ_{ℓ=0}^{n−1} a_ℓ ⟹ |A_n − A_{n−m}| ≤ m/n for 0 ≤ m < n.
Indeed,

A_n − A_{n−m} = (1/n) Σ_{ℓ=n−m}^{n−1} a_ℓ − (m/n) · (1/(n−m)) Σ_{ℓ=0}^{n−m−1} a_ℓ,

and both terms on the right lie in [0, m/n].
In addition, after another application of the second part of Lemma 3.2.13, this means that we have proved (3.2.11).

3.2.5. The Mean Ergodic Theorem Again: Just as we were able to use Theorem 2.2.5 in §2.3.1 to prove Theorem 2.3.4, so here we can use (3.2.11) to prove the following version of the mean ergodic theorem.

3.2.14 THEOREM. Let C be a communicating class of positive recurrent states. If P(X_0 ∈ C) = 1, then, for each j ∈ C,

lim_{n→∞} E[((1/n) Σ_{m=0}^{n−1} 1_{{j}}(X_m) − π_{jj})²] = 0.
See Exercises 3.3.9 and 3.3.11 below for a more refined statement.
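A Monte Carlo experiment makes the mean square convergence visible. The sketch below is an added illustration, not part of the text; the two-state chain and its stationary value π_{00} = 0.6 are assumptions of the example. It estimates the mean square error of the time average for a short and a long horizon n.

```python
import random

# Sketch: estimate E[((1/n) * sum 1_{0}(X_m) - pi_00)^2] for a toy two-state chain.
# For this chain, pi = (0.6, 0.4) solves pi P = pi.

P = {0: [(0, 0.8), (1, 0.2)], 1: [(0, 0.3), (1, 0.7)]}
PI_0 = 0.6

def step(i, rng):
    u, acc = rng.random(), 0.0
    for j, p in P[i]:
        acc += p
        if u < acc:
            return j
    return P[i][-1][0]

def mean_square_error(n, trials, rng):
    total = 0.0
    for _ in range(trials):
        x, visits = 0, 0
        for _ in range(n):
            visits += (x == 0)
            x = step(x, rng)
        total += (visits / n - PI_0) ** 2
    return total / trials

rng = random.Random(0)
print(mean_square_error(10, 200, rng) > mean_square_error(1000, 200, rng))   # True
```

The error for n = 1000 is roughly two orders of magnitude smaller than for n = 10, in line with the theorem.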
PROOF: Since P(X_m ∈ C for all m ∈ N) = 1, without loss in generality we may and will assume that C is the whole state space. In keeping with this assumption, we will set π = π^C. Next note that if (μ)_i = P(X_0 = i), then

E[((1/n) Σ_{m=0}^{n−1} 1_{{j}}(X_m) − π_{jj})²] = Σ_i (μ)_i E[((1/n) Σ_{m=0}^{n−1} 1_{{j}}(X_m) − π_{jj})² | X_0 = i],

and so it suffices to prove that

lim_{n→∞} E[((1/n) Σ_{m=0}^{n−1} 1_{{j}}(X_m) − π_{jj})² | X_0 = i] = 0 for each i ∈ §.

But, because π_{ii} > 0 for all i ∈ § and
it is enough to prove the result when π is the initial distribution of the Markov chain. Hence, from now on, we will be making this assumption along with C = §.
Now let f be the column vector whose ith component is 1_{{j}}(i) − π_{jj}. Then, just as in the proof of Theorem 2.3.4, one can estimate the expectation in question in terms of f and its averages. Since π ∈ Stat(P), the preceding becomes

E[((1/n) Σ_{m=0}^{n−1} 1_{{j}}(X_m) − π_{jj})²] ≤ (2/n) Σ_{m=1}^{n} ⟨π, f A_m f⟩, where (f A_m f)_i = (f)_i (A_m f)_i.

Finally, by (3.2.11), for each ε > 0 there exists an N_ε ∈ Z⁺ such that ⟨π, f A_m f⟩ ≤ ε for all m ≥ N_ε.
Hence, we find that

lim_{n→∞} E[((1/n) Σ_{m=0}^{n−1} 1_{{j}}(X_m) − π_{jj})²] = 0. □

The first step is to show that, when j is aperiodic and recurrent, there exists an N ∈ Z⁺ such that

(3.2.16) max_{1≤m≤n} P(ρ_j^{(m)} = n | X_0 = j) > 0 for all n ≥ N.

To check this, use (3.1.14) to produce an N ∈ Z⁺ such that (P^n)_{jj} > 0 for all n ≥ N. Then, since

(P^n)_{jj} = Σ_{m=1}^{n} P(ρ_j^{(m)} = n | X_0 = j) for n ≥ 1,

(3.2.16) is clear. The second, and key, step is contained in the following lemma.

3.2.17 LEMMA. Assume that j is aperiodic and recurrent, and set ā_j = lim sup_{n→∞} (P^n)_{jj} and a̲_j = lim inf_{n→∞} (P^n)_{jj}. Then there exist subsequences {n_ℓ⁺ : ℓ ≥ 1} and {n_ℓ⁻ : ℓ ≥ 1} such that

a_j^± = lim_{ℓ→∞} (P^{n_ℓ^± − r})_{jj} for all r ≥ 0.
PROOF: Choose a subsequence {n_ℓ : ℓ ≥ 1} so that (P^{n_ℓ})_{jj} → a_j⁺, and, using (3.2.16), choose N ≥ 1 so that max_{1≤m≤n} P(ρ_j^{(m)} = n | X_0 = j) > 0 for all n ≥ N. Given r ≥ N, choose 1 ≤ m ≤ r so that δ = P(ρ_j^{(m)} = r | X_0 = j) > 0. Now, for any M ∈ Z⁺, observe that, when n_ℓ ≥ M + r, (P^{n_ℓ})_{jj} is equal to

P(X_{n_ℓ} = j & ρ_j^{(m)} = r | X_0 = j) + P(X_{n_ℓ} = j & ρ_j^{(m)} ≠ r | X_0 = j)
  = δ (P^{n_ℓ − r})_{jj} + P(X_{n_ℓ} = j & ρ_j^{(m)} ≠ r | X_0 = j).

Furthermore,

P(X_{n_ℓ} = j & ρ_j^{(m)} ≠ r | X_0 = j) ≤ Σ_{k=1, k≠r}^{n_ℓ − M} P(ρ_j^{(m)} = k | X_0 = j)(P^{n_ℓ − k})_{jj} + P(ρ_j^{(m)} > n_ℓ − M | X_0 = j)
  ≤ (1 − δ) sup_{n≥M} (P^n)_{jj} + P(ρ_j^{(m)} > n_ℓ − M | X_0 = j).

Hence, since j is recurrent and therefore P(ρ_j^{(m)} < ∞ | X_0 = j) = 1, after letting ℓ → ∞ we get

a_j⁺ ≤ δ lim_{ℓ→∞} (P^{n_ℓ − r})_{jj} + (1 − δ) sup_{n≥M} (P^n)_{jj}.

Since this is true for all M ≥ 1, it leads to

a_j⁺ ≤ δ lim_{ℓ→∞} (P^{n_ℓ − r})_{jj} + (1 − δ) a_j⁺,

whence lim_{ℓ→∞} (P^{n_ℓ − r})_{jj} = a_j⁺ for each such r.
0, and so d | (r + r'), which, because 0 ≤ r, r' < d, means that r' = 0 if r = 0 and that r' = d − r if r ≥ 1. Starting from (*), it is easy to see that, for each pair (i, j) ∈ §², there is a unique 0 ≤ r < d such that (P^{md+r})_{ij} > 0 for some m ≥ 0. Namely, suppose that (P^{md+r})_{ij} > 0 < (P^{m'd+r'})_{ij} for some m, m' ∈ N and 0 ≤ r, r' < d. Then, by (*), there exists an n ≥ 1 such that (P^{nd−r})_{ji} > 0, and so (P^{(m'+n)d+(r'−r)})_{ii} > 0. Since this means that d | (r' − r), we have proved that r = r'. Now, let i₀ be a fixed reference point in §, and, for each 0 ≤ r < d, define §_r to be the set of j such that there exists an m ≥ 0 for which (P^{md+r})_{i₀j} > 0. By the preceding, we know that the §_r's are mutually disjoint. In addition, by irreducibility and the Euclidean algorithm, § = ⋃_{r=0}^{d−1} §_r. Turning to the proof of property (1), use (*) to choose n ≥ 0 and n' ≥ 1 so that (P^{nd+r(j)})_{i₀j} > 0 < (P^{n'd−r(k)})_{ki₀}. Then (P^{(n+m+n')d+(r(j)+r−r(k))})_{i₀i₀} > 0, and so d | (r(j) + r − r(k)). Equivalently, r(k) − r(j) = r mod d. Conversely, if r(k) − r(j) = r mod d, choose n ≥ 0 and n' ≥ 1 so that (P^{nd+r(k)})_{i₀k} > 0 and (P^{n'd−r(j)})_{ji₀} > 0. Then (P^{(n+m+n')d+r})_{jk} > 0 for any m ≥ 1 satisfying (P^{md})_{i₀i₀} > 0. Since, by (3.1.13), (P^{md})_{i₀i₀} > 0 for all sufficiently large m's, this completes the left hand inequality in (2), and the right hand inequality in (2) is proved in the same way. Finally, to check (3), note that, from (1), the restriction of P^d to §_r is a transition probability matrix, and by (2), it is both irreducible and aperiodic.

The existence of such a partition has several interesting consequences. In the first place, it says that the chain proceeds through the state space in a cyclic way: if it starts from i, then after n steps it is in §_{r(i)+n}, where the addition in the subscript should be interpreted modulo d. In fact, with this convention for addition, we have that

(3.2.19) P^n 1_{§_{r+n}} = 1_{§_r} for each 0 ≤ r < d and n ≥ 1.

To see this, simply observe that, on the one hand, (P^n 1_{§_{r+n}})_i = 0 unless i ∈ §_r, while, on the other hand, Σ_{r=0}^{d−1} (P^n 1_{§_{r+n}})_i = 1. Hence, i ∉ §_r ⟹ (P^n 1_{§_{r+n}})_i = 0, whereas i ∈ §_r ⟹ (P^n 1_{§_{r+n}})_i = 1.

Secondly, because, for each 0 ≤ r < d, the restriction of P^d to §_r is an irreducible, recurrent, and aperiodic transition probability matrix, we know that, for each 0 ≤ r < d and j ∈ §_r, there exists a π_{jj}^{(r)} ∈ [0, 1] with the property that (P^{md})_{ij} → π_{jj}^{(r)} for all (i, j) ∈ §_r². More generally,

(3.2.20) lim_{m→∞} (P^{md+s})_{ij} = π_{jj}^{(r(j))} if r(j) − r(i) ≡ s mod d, and = 0 otherwise.

In particular, if (i, j) ∈ §_r², then

(A_{nd})_{ij} = (1/nd) Σ_{m=0}^{n−1} Σ_{s=0}^{d−1} (P^{md+s})_{ij} = (1/d) · (1/n) Σ_{m=0}^{n−1} (P^{md})_{ij} → (1/d) π_{jj}^{(r)}.

Hence, since we already know that (A_n)_{ij} → π_{jj}, it follows that

(3.2.21) π_{jj}^{(r)} = d π_{jj} for 0 ≤ r < d and j ∈ §_r.

In the case when P is positive recurrent on §, so is the restriction of P^d to each §_r, and therefore Σ_{j∈§_r} π_{jj}^{(r)} = 1. Thus, in the positive recurrent case, (3.2.21) leads to the interesting conclusion that π^§ assigns probability 1/d to each §_r. See Exercises 3.3.12 and 5.6.7 below for other applications of these considerations.
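For a finite irreducible chain, both the period d(i) of (3.1.10) and the cyclic classes §_r can be computed directly from the support of P. The sketch below is an added illustration, not part of the text; the four-state deterministic cycle is a made-up example, and it works with the adjacency structure {j : (P)_{ij} > 0}.

```python
from math import gcd

# Sketch: d(i) = gcd{n >= 1 : (P^n)_{ii} > 0}, and S_r = states reachable from a
# reference point i0 in some number of steps congruent to r mod d.

def period(adj, i, max_n=None):
    """gcd of the lengths n <= max_n for which i is reachable from i in exactly n steps."""
    max_n = max_n or 2 * len(adj) * len(adj)
    reach, d = {i}, 0
    for n in range(1, max_n + 1):
        reach = {k for s in reach for k in adj[s]}   # reachable in exactly n steps
        if i in reach:
            d = gcd(d, n)
    return d

def cyclic_classes(adj, i0, d):
    classes = [set() for _ in range(d)]
    frontier, n = {i0}, 0
    for _ in range(4 * len(adj) * d):                # enough steps to sweep everything
        for s in frontier:
            classes[n % d].add(s)
        frontier = {k for s in frontier for k in adj[s]}
        n += 1
    return classes

# Deterministic cycle on 4 states: period 4, and each S_r is a singleton.
adj = {0: [1], 1: [2], 2: [3], 3: [0]}
print(period(adj, 0))             # 4
print(cyclic_classes(adj, 0, 4))  # [{0}, {1}, {2}, {3}]
```

Since here each §_r contains one state, the stationary distribution indeed gives each class mass 1/d = 1/4, as (3.2.21) predicts.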
3.3 Exercises

EXERCISE 3.3.1. Just as Cesàro convergence is strictly weaker than ordinary convergence (i.e., it is implied by, but does not imply, it), so, in this exercise, we will show that Abel convergence is strictly weaker than Cesàro convergence.

(a) Assume that the radius of convergence of

u(k) = (a² + Σ_{i=1}^{3} (k)_i²)^{−1/2}, k ∈ Z³,

satisfies (Pu)_k ≤ u(k) ≤ u(0), where u is the column vector determined by the function u and P is the transition probability matrix for the nearest neighbor random walk on Z³. That is,

(P)_{kℓ} = 1/6 if Σ_{i=1}^{3} |(k)_i − (ℓ)_i| = 1, and (P)_{kℓ} = 0 otherwise.

What follows are some hints.

(a) Let k ∈ Z³ be given, and set

M = 1 + a² + Σ_{i=1}^{3} (k)_i² and x_i = (k)_i/M for 1 ≤ i ≤ 3.

Show that (Pu)_k ≤ u(k) if and only if

(1 − 1/M)^{−1/2} ≥ (1/3) Σ_{i=1}^{3} [(1 + 2x_i)^{1/2} + (1 − 2x_i)^{1/2}] / [2(1 − 4x_i²)^{1/2}].

(b) Show that (1 − 1/M)^{−1/2} ≥ 1 + 1/(2M) and that (1 + ξ)^{1/2} + (1 − ξ)^{1/2} ≤ 2 − ξ²/4 for |ξ| ≤ 1, and conclude that (Pu)_k ≤ u(k) if

1 + 1/(2M) ≥ (1/3) Σ_{i=1}^{3} (1 − x_i²/2)/(1 − 4x_i²)^{1/2},

where |x| = √(Σ_i x_i²) is the Euclidean length of x = (x₁, x₂, x₃).
(c) Show that there is a constant C < ∞ such that, as long as a

P(ρ_i > n | X_0 = k) = Σ_ℓ (ν)_ℓ P(X_{n+1} = j & ρ_i > n + 1 | X_0 = ℓ).

(b) Assuming that ν = νP and (ν)_j = 0 unless i→j, show that ν = (ν)_i μ. Hint: First show that ν = 0 if (ν)_i = 0, and thereby reduce to the case when (ν)_i = 1. Starting from the result in (a), apply the Monotone Convergence Theorem to see that the right hand side tends to (μ)_j as n → ∞.
EXERCISE 3.3.9. Let C be a communicating class of positive recurrent states. The reason why Theorem 3.2.14 is called a "mean ergodic theorem" is that the asserted convergence is taking place in the sense of mean square convergence. Of course, mean square convergence implies convergence in probability, but, in general, it cannot be used to get convergence with probability 1. Nonetheless, as we will show here, when P(X_0 ∈ C) = 1 and j ∈ C,

(3.3.10) lim_{n→∞} (1/n) Σ_{m=0}^{n−1} 1_{{j}}(X_m) = π_{jj} with probability 1.

Observe that, in §2.3.3, we proved the individual ergodic theorem (2.3.10) under the assumption that Doeblin's condition holds, and (3.3.10) says that the same sort of individual ergodic theorem holds even when Doeblin's condition is not present. In fact, there is a very general result, of which (3.3.10) is a very special case, which was proved originally by G.D. Birkhoff. However, we will not follow Birkhoff and instead, as we did in §2.3.3, we base our proof on the Strong Law of Large Numbers, although this time we need the full statement, which holds (cf. Theorem 1.4.11 in [9]) for averages of mutually independent, identically distributed, integrable random variables.

(a) Show that it suffices to prove the result when P(X_0 = i) = 1 for some i ∈ C.

(b) Set ρ_i^{(0)} = 0, and use ρ_i^{(m)} to denote the time of the mth return to i. If
T_m = ρ_i^{(m)} − ρ_i^{(m−1)} and Y_m = Σ_{ℓ=ρ_i^{(m−1)}}^{ρ_i^{(m)}−1} 1_{{j}}(X_ℓ),

show that, conditional on X_0 = i, both {T_m : m ≥ 1} and {Y_m : m ≥ 1} are sequences of mutually independent, identically distributed, nonnegative, integrable random variables. In particular, as an application of the Strong Law of Large Numbers and (3.2.5) plus the result in Exercise 3.3.7, conclude that, conditional on X_0 = i,

(*) lim_{m→∞} ρ_i^{(m)}/m = 1/π_{ii} and lim_{m→∞} (1/m) Σ_{ℓ=0}^{ρ_i^{(m)}−1} 1_{{j}}(X_ℓ) = π_{ij}/π_{ii}

with probability 1. Hence, lim_{m→∞} (ρ_i^{(m)})^{−1} Σ_{ℓ=0}^{ρ_i^{(m)}−1} 1_{{j}}(X_ℓ) = π_{ij} with probability 1.

(c) In view of the results in (a) and (b), we will be done once we check that, conditional on X_0 = i,

lim_{n→∞} ((1/n) Σ_{ℓ=0}^{n−1} 1_{{j}}(X_ℓ) − (1/n) Σ_{ℓ=0}^{ρ_i^{(m_n)}−1} 1_{{j}}(X_ℓ)) = 0

with probability 1, where m_n is the Z⁺-valued random variable determined so that ρ_i^{(m_n−1)} ≤ n < ρ_i^{(m_n)}. To this end, first show that

|(1/n) Σ_{ℓ=0}^{n−1} 1_{{j}}(X_ℓ) − (1/n) Σ_{ℓ=0}^{ρ_i^{(m_n)}−1} 1_{{j}}(X_ℓ)| ≤ T_{m_n}/m_n.

Next, from the first part of (*), show that P(lim_{n→∞} m_n = ∞ | X_0 = i) = 1. Finally, check that, for any ε > 0,

P(sup_{m≥M} T_m/m ≥ ε | X_0 = i) ≤ Σ_{m=M}^{∞} P(ρ_i ≥ mε | X_0 = i) ≤ (1/ε) E[ρ_i, ρ_i ≥ Mε | X_0 = i],

and use this to complete the proof of (3.3.10).
3 MORE ERGODIC THEORY OF MARKOV CHAINS
Ln , I Ln 
which is the random probabil ( d ) Introduce the empirical measure ity vector measuring the average time spent at points. That is, = By combining the result proved here with the one in (b) � of Exercise 3.3.6, conclude that limn .oo 1r0 ll v = 0 with probability 1 when E C) = 1 . EXERCISE 3 . 3 . 1 1 . Although the statement in Exercise 3.3.9 applies only to positive recurrent states, it turns out that there is a corresponding limit theo rem for states which are not positive recurrent. Namely, show that if j is not positive recurrent, then, no matter what the initial distribution,
2.':� l l {i}(Xm )· lP'(Xo
(Ln )i
1 n ( lP' n+limoo _.!_n "' l {j}(Xm )
= 0 = 1. m=O ) When j is transient, IE [2.':� l {j} (Xm )] and therefore the result is trivial. To handle j are null recurrent, begin by noting that it suffices to handle the case when lP'(Xo 1. Next, note that, conditional on X0 j, m+l)  PJm) 0} isj) a=sequence jz{p+valued of independent, identically distributed, random variables, and apply the Strong Law of Large Numbers to see that, for any R 1 , PJm) H(R) I Xo j lP' ( lim m+00 m ) ) lP' (mlim+00 / R H(R) I X0 = j) = 1 , where H(r) �IE[pj R! Xo = j] / R/ Hence, given X0 = j, (=) P� with probability 1 . Finally, check that, for any 0, [nc]) J P ( ( sup _.!_ � l{j} (Xm ) E I Xo = j ::; lP' sup lP' n?_N n 0 n?_N n ::; �f I X0 = j) ) ::; lP' (msup?_Nc (m) ::; f1 1 Xo j) , and combine this with the preceding to reach the desired conclusion. EXERCISE 3 . 3 . 1 2 . When j is aperiodic for P, we know that limn __,00 (P n ) ij exists for all i § and is 0 unless j is positive recurrent. When d(j) 1 and j is positive recurrent, show that limn. oo ( P n )jj will fail to exist. On the other hand, even if d(j) 1 , show that limn. oo(P n )ij = 0 for any j § which is not positive recurrent. L....t
< oo
=
:
=
m ?: ?:
m
J
?:
=
1\
m
?:
oo
1\
=
?
?:
as
oo .
E>
oo
?:
!!.i_ m
=
>
E
>
E
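The almost sure convergence in (3.3.10), and hence the convergence of the empirical measure L_n of Exercise 3.3.9(d), can be watched along a single simulated trajectory. The sketch below is an added illustration, not part of the text; the two-state chain with stationary distribution π = (0.6, 0.4) is a made-up example, and one long run suffices.

```python
import random

# Sketch: along one trajectory of a positive recurrent chain, the empirical
# measure (L_n)_i = (1/n) * #{m < n : X_m = i} converges to pi.

P_ROWS = {0: (0.8, 0.2), 1: (0.3, 0.7)}   # row i = (P_{i0}, P_{i1}); pi = (0.6, 0.4)
PI = (0.6, 0.4)

def empirical_measure(n, seed=0):
    rng = random.Random(seed)
    x, counts = 0, [0, 0]
    for _ in range(n):
        counts[x] += 1
        x = 0 if rng.random() < P_ROWS[x][0] else 1
    return [c / n for c in counts]

L = empirical_measure(200_000)
print(max(abs(L[i] - PI[i]) for i in range(2)) < 0.01)   # True
```

No averaging over independent runs is needed here, which is exactly the content of the individual (as opposed to mean) ergodic theorem.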
CHAPTER 4

Markov Processes in Continuous Time

Up until now we have been dealing with Markov processes for which time has a discrete parameter n ∈ N. In this chapter we will introduce Markov processes for which time has a continuous parameter t ∈ [0, ∞), even though our processes will continue to take their values in a countable state space §.

4.1 Poisson Processes

Just as Markov processes with independent, identically distributed increments (a.k.a. random walks) on Z^d are the simplest discrete parameter Markov processes, so the simplest continuous time Markov processes are those whose increments are mutually independent and homogeneous in the sense that the distribution of an increment depends only on the length of the time interval over which the increment is taken. More precisely, we will be dealing in this section with Z^d-valued stochastic processes {X(t) : t ≥ 0} with the property that P(X(0) = 0) = 1 and

P(X(t₁) − X(t₀) = j₁, ..., X(t_n) − X(t_{n−1}) = j_n) = Π_{m=1}^{n} P(X(t_m − t_{m−1}) = j_m)

for all n ≥ 1, 0 ≤ t₀ < ··· < t_n, and (j₁, ..., j_n) ∈ (Z^d)^n.

4.1.1. The Simple Poisson Process: The simple Poisson process is the N-valued stochastic process {N(t) : t ≥ 0} which starts from 0 (i.e., N(0) = 0), sits at 0 for a unit exponential holding time E₁ (i.e., N(t) = 0 for 0 ≤ t < E₁ and P(E₁ > t) = e^{−t}), at time E₁ moves to 1 where it remains for an independent unit exponential holding time E₂, moves to 2 at time E₁ + E₂, etc. More precisely, if {E_n : n ≥ 1} is a sequence of mutually independent, unit exponential random variables and

J_n = 0 when n = 0 and J_n = Σ_{m=1}^{n} E_m when n ≥ 1,

then the stochastic process {N(t) : t ≥ 0} given by

(4.1.1) N(t) = max{n ≥ 0 : J_n ≤ t}
is a simple Poisson process. When thinking about the meaning of (4.1.1), keep in mind that, because, with probability 1,

E_n > 0 for all n ≥ 1 and Σ_{m=1}^{∞} E_m = ∞,

with probability 1 the path t ∈ [0, ∞) ↦ N(t) is piecewise constant, right continuous, and, when it jumps, it jumps by +1: N(t) − N(t−) ∈ {0, 1} for all t > 0, where N(t−) = lim_{s↗t} N(s) is the left limit of N(·) at t.

We now want to show that {N(t) : t ≥ 0} moves along in independent, homogeneous increments.¹ That is, we want to show that, for each s, t ∈ [0, ∞), N(s + t) − N(s) is independent of (cf. §6.1.3) σ({N(τ) : τ ∈ [0, s]}) and has the same distribution as N(t):

(4.1.2) P(N(s + t) − N(s) = n | N(τ), τ ∈ [0, s]) = P(N(t) = n), n ∈ N.
Equivalently, what we have to check is that when (s, t) ∈ [0, ∞)² and A ∈ σ({N(τ) : τ ∈ [0, s]}),

P({N(s + t) − N(s) ≥ n} ∩ A) = P(N(t) ≥ n) P(A) for all n ∈ N.

Since this is trivial when n = 0, we will assume that n ∈ Z⁺. In addition, since we can always write A as the disjoint union of the sets A ∩ {N(s) = m}, m ∈ N, and each of these is again in σ({N(τ) : τ ∈ [0, s]}), we may and will assume that, for some m, N(s) = m on A. But in that case we can write A = {J_{m+1} > s} ∩ B, where B ∈ σ({J₁, ..., J_m}) and J_m ≤ s on B. Hence, since N(s + t) ≥ n + m ⟺ J_{m+n} ≤ s + t, and σ({J₁, ..., J_m}) is independent of σ({E_k : k > m}), an application of (6.4.2) shows that

P({N(s + t) − N(s) ≥ n} ∩ A) = P({J_{m+n} ≤ s + t} ∩ {J_{m+1} > s} ∩ B)
  = P({J_{m+n} − J_m ≤ s + t − J_m} ∩ {J_{m+1} − J_m > s − J_m} ∩ B) = E[v(J_m), B],

where, for ξ ∈ [0, s],

v(ξ) = P({J_{m+n} − J_m ≤ s + t − ξ} ∩ {J_{m+1} − J_m > s − ξ})
  = P({J_n ≤ s + t − ξ} ∩ {E₁ > s − ξ}) = P({J_n ≤ s + t − ξ} ∩ {E_n > s − ξ})
  = P({J_{n−1} + E_n ≤ s + t − ξ} ∩ {E_n > s − ξ}) = E[w(ξ, E_n), E_n > s − ξ]

when w(ξ, η) = P(J_{n−1} ≤ s + t − ξ − η) for ξ ∈ [0, s] and η ∈ [s − ξ, s + t − ξ].

Up to this point we have not used any property of exponential random variables other than that they are positive. However, in our next step we will
1 Actually, our primary goal here is to develop a line of reasoning which will serve us well later on. A more straightforward, but less revealing, proof that the simple Poisson process has independent, homogeneous increments is given in Exercise 4.5. 1 below.
use their characteristic property, namely, the fact that an exponential random variable "has no memory." That is, P(E > a + b | E > a) = P(E > b), from which it is an easy step to

(4.1.3) E[f(E), E > a] = e^{−a} E[f(a + E)]

for any nonnegative, B_{[0,∞)}-measurable function f. In particular, this means that

v(ξ) = E[w(ξ, E_n), E_n > s − ξ] = e^{−(s−ξ)} E[w(ξ, E_n + s − ξ)]
  = e^{−(s−ξ)} P(J_{n−1} ≤ t − E_n) = e^{−(s−ξ)} P(J_n ≤ t) = e^{−(s−ξ)} P(N(t) ≥ n).

Hence, we have now shown that

P({N(s + t) − N(s) ≥ n} ∩ A) = E[e^{−(s−J_m)}, B] P(N(t) ≥ n).

Finally, since

P(A) = P({J_{m+1} > s} ∩ B) = P({E_{m+1} > s − J_m} ∩ B) = E[e^{−(s−J_m)}, B],

the proof of (4.1.2) is complete. Hence, we now know that the simple Poisson process {N(t) : t ≥ 0} has homogeneous and mutually independent increments.
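The memoryless identity P(E > a + b | E > a) = P(E > b), on which the argument turns, can be checked by simulation. The following sketch is an added illustration, not part of the text; the constants a and b are arbitrary choices.

```python
import random

# Sketch: numerical check of the memoryless property of a unit exponential E.

def tail_estimate(samples, t):
    return sum(1 for e in samples if e > t) / len(samples)

rng = random.Random(42)
samples = [rng.expovariate(1.0) for _ in range(200_000)]
a, b = 0.7, 1.1
cond = tail_estimate([e - a for e in samples if e > a], b)   # P(E > a+b | E > a)
uncond = tail_estimate(samples, b)                           # P(E > b) = e^{-1.1}
print(abs(cond - uncond) < 0.01)   # True
```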
Before moving on, we must still find out what the distribution of N(t) is. But the sum of n mutually independent, unit exponential random variables has a Γ(n) distribution, and so

P(N(t) = n) = P(J_n ≤ t) − P(J_{n+1} ≤ t) = ∫₀ᵗ (τ^{n−1}/(n−1)!) e^{−τ} dτ − ∫₀ᵗ (τ^n/n!) e^{−τ} dτ = (t^n/n!) e^{−t}.
{X(t) : t ≥ 0} is a continuous time Markov process with transition probability t ↦ P(t) in the sense that

(4.1.8) P(X(s + t) = k | X(σ), σ ∈ [0, s]) = (P(t))_{X(s)k}.

Observe that, as a consequence of (4.1.8), we find that {P(t) : t ≥ 0} is a semigroup. That is, it satisfies the Chapman-Kolmogorov equation

(4.1.9) P(s + t) = P(s)P(t), s, t ∈ [0, ∞).

Indeed,

(P(s + t))₀k = Σ_{j∈Z^d} P(X(s + t) = k & X(s) = j) = Σ_{j∈Z^d} (P(t))_{jk} (P(s))₀j = (P(s)P(t))₀k,

from which the asserted matrix equality follows immediately when one remembers that (P(τ))_{kℓ} = (P(τ))_{0(ℓ−k)}.
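The construction (4.1.1) translates directly into a simulation: sum unit exponential holding times until they exceed t, and count the number of jumps. The sketch below is an added illustration, not part of the text; it compares the empirical distribution of N(t) with the Poisson mass function t^n e^{−t}/n! computed above.

```python
import math
import random

# Sketch: simulate N(t) = max{n >= 0 : J_n <= t} and compare with Poisson(t).

def sample_N(t, rng):
    n, clock = 0, rng.expovariate(1.0)
    while clock <= t:
        n += 1
        clock += rng.expovariate(1.0)   # add the next unit exponential holding time
    return n

rng = random.Random(7)
t, trials = 2.0, 100_000
counts = {}
for _ in range(trials):
    n = sample_N(t, rng)
    counts[n] = counts.get(n, 0) + 1

for n in range(4):
    empirical = counts.get(n, 0) / trials
    exact = t**n * math.exp(-t) / math.factorial(n)
    print(n, round(empirical, 3), round(exact, 3))
```

The empirical frequencies agree with the Γ(n)-based computation to within the Monte Carlo error of the experiment.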
4.2 Markov Processes with Bounded Rates

There are two directions in which one can generalize the preceding without destroying the Markov property. For one thing, one can make the distribution of jumps depend on where the process is at the time it makes the jump. This change comes down to replacing the random walk in the compound Poisson process by a more general Markov chain. The second way to increase the randomness is to make the rate of jumping depend on the place where the process is waiting before it jumps. That is, instead of the holding times all having the same mean value, the holding time at a particular state will depend on that state.

4.2.1. Basic Construction: Let § be a countable state space and P a transition probability matrix with the property that (P)_{ii} = 0 for all i ∈ §.³ Further, let ℛ = {R_i : i ∈ §} ⊆ [0, ∞) be a family of rates. Then a continuous time Markov process on § with rates ℛ and transition probability matrix P is an §-valued family {X(t) : t ≥ 0} of random variables with the properties that

(a) t ↦ X(t) is piecewise constant and right continuous;
(b) if J₀ = 0 and, for n ≥ 1, J_n is the time of the nth jump of t ↦ X(t), then

(4.2.1) P(J_n > J_{n−1} + t & X(J_n) = j | X(τ), τ ∈ [0, J_n)) = e^{−tR_{X(J_{n−1})}} (P)_{X(J_{n−1})j} on {J_{n−1} < ∞}.

³ We make this assumption for the same reason as we assumed in §4.1.2 that (μ)₀ = 0, and, just as it resulted in no loss in generality there, so it does not reduce the generality here.
Our first task is to see that, together with the initial distribution, the preceding completely determines the distribution of {X(t) : t ≥ 0}, and, for reasons which will become clear soon, we will restrict our attention throughout this section to the case when the rates ℛ are bounded in the sense that sup_i R_i < ∞. In fact, for the moment, we will also assume that the rates ℛ are non-degenerate in the sense that ℛ ⊆ (0, ∞). Since it is clear that non-degeneracy implies that P(J_n < ∞) = 1 for each n ≥ 0, we may (cf. (6.1.5)) and will assume that J_n < ∞ for all n ≥ 0. Now set X_n = X(J_n) for n ≥ 1 and E_n = R_{X_{n−1}}(J_n − J_{n−1}), and observe that, by (b) in (4.2.1), {X_n : n ≥ 0} is a Markov chain with transition probability matrix P and the same initial distribution as the distribution of X(0), {E_n : n ≥ 1} is a sequence of mutually independent, unit exponential random variables, and σ({X_n : n ≥ 0}) is independent of σ({E_n : n ≥ 1}). Thus, the joint distribution of {X_n : n ≥ 0} and {E_n : n ≥ 1} is uniquely determined. Moreover, {X(t) : t ≥ 0} can be recovered from {X_n : n ≥ 0} ∪ {E_n : n ≥ 1}. Namely, given (e₁, ..., e_n, ...) ∈ (0, ∞)^{Z⁺} and (j₀, ..., j_n, ...) ∈ §^N, define

(4.2.2) Φ^{(ℛ,P)}(t; (e₁, ..., e_n, ...), (j₀, ..., j_n, ...)) = j_n for ξ_n ≤ t < ξ_{n+1}, where ξ₀ = 0 and ξ_n = Σ_{m=1}^{n} R_{j_{m−1}}^{−1} e_m when n ≥ 1.

Clearly,

(4.2.3) X(t) = Φ^{(ℛ,P)}(t; (E₁, ..., E_n, ...), (X₀, ..., X_n, ...)) for 0 ≤ t < Σ_{m=1}^{∞} R_{X_{m−1}}^{−1} E_m.

Thus, we will know that the distribution of {X(t) : t ≥ 0} is uniquely determined once we check that Σ_{m=1}^{∞} R_{X_{m−1}}^{−1} E_m = ∞ with probability 1. But, because the rates are bounded, Σ_{m=1}^{∞} R_{X_{m−1}}^{−1} E_m ≥ (sup_i R_i)^{−1} Σ_{m=1}^{∞} E_m = ∞ with probability 1.
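The recipe behind (4.2.2)–(4.2.3) — run the jump chain and hold at state i for an exponential time with mean 1/R_i — can be sketched as follows. This is an added illustration, not part of the text; the rates and jump matrix are a made-up two-state example.

```python
import random

# Sketch: sample a path of the process with bounded rates R and jump matrix P,
# as in the construction (4.2.2)-(4.2.3).

RATES = {0: 2.0, 1: 0.5}              # bounded rates R_i
P = {0: [(1, 1.0)], 1: [(0, 1.0)]}    # jump matrix with (P)_{ii} = 0

def sample_path(t_end, x0, rng):
    """Return the list of (jump_time, new_state) pairs up to time t_end."""
    t, x, path = 0.0, x0, []
    while True:
        t += rng.expovariate(RATES[x])    # holding time = E_n / R_{X_{n-1}}
        if t > t_end:
            return path
        u, acc = rng.random(), 0.0
        for j, p in P[x]:
            acc += p
            if u < acc:
                x = j
                break
        path.append((t, x))

rng = random.Random(3)
path = sample_path(10.0, 0, rng)
print(all(t1 < t2 for (t1, _), (t2, _) in zip(path, path[1:])))   # True
```

Since both rates are bounded, the jump times cannot accumulate, and the path is defined for all t, exactly as the argument above guarantees.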
X(t) hits §₀. That is, we are claiming that the distribution of the process {X̃(t) : t ≥ 0} is that of the process {X(t) : t ≥ 0} stopped when it hits §₀ (here ζ denotes the time at which X(·) hits §₀). To verify the preceding assertion, let {X_n^{(i)} : n ≥ 0 & i ∈ §₀} be a family of §-valued random variables and {E_n : n ≥ 1} a family of (0, ∞)-valued random variables with the properties that

(1) σ({X_n^{(i)} : n ≥ 0 & i ∈ §₀}), σ({E_n : n ≥ 1}), and σ({X(t) : t ≥ 0}) are mutually independent,
(2) for each i ∈ §₀, {X_n^{(i)} : n ≥ 0} is a Markov chain starting from i with transition probability matrix P,
(3) {E_n : n ≥ 1} are mutually independent, unit exponential random variables.

Then, for each i ∈ §, the process {X^{(i)}(t) : t ≥ 0} given by

X^{(i)}(t) = Φ^{(ℛ,P)}(t; (E₁, ..., E_n, ...), (X₀^{(i)}, ..., X_n^{(i)}, ...))

starts at i and is associated with the rates ℛ and transition probability matrix P. Finally, define {X̃(t) : t ≥ 0} so that

X̃(t) = X(t) if t < ζ and X̃(t) = X^{(X(ζ))}(t − ζ) if t ≥ ζ.

Obviously, X(t) = X̃(t ∧ ζ). Hence, we will be done once we show that the distribution of {X̃(t) : t ≥ 0} is that of a process associated with the rates ℛ and transition probability matrix P. To see this, set J̃₀ = 0 and, for m ≥ 1, let J̃_m be the time of the mth jump of t ↦ X̃(t). Next, suppose that A ∈ σ({X̃(τ) : τ ∈ [0, J̃_n)}). We need to show that

(*) P({J̃_n > J̃_{n−1} + t & X̃(J̃_n) = j} ∩ A) = E[e^{−tR_{X̃(J̃_{n−1})}} (P)_{X̃(J̃_{n−1})j}, A],

and because we can always write A as the disjoint union of sets of the form {X̃(J̃_m) = j_m for 0 ≤ m < n}, we may and will assume that A itself is of this form. If R_{j_m} > 0 for each 0 ≤ m < n, then A ∈ σ({X(σ) : σ ∈ [0, J_n)}), and (J̃_ℓ, X̃(J̃_ℓ)) = (J_ℓ, X(J_ℓ)) for 0 ≤ ℓ ≤ n on A. Thus (*) holds in this case. If j_ℓ ∈ §₀ for some 0 ≤ ℓ < n, use m to denote the first such ℓ, set i = j_m, and let {J_ℓ^{(i)} : ℓ ≥ 0} be the jump times of t ↦ X^{(i)}(t). Then we can write A = B ∩ C, where B = {X(J_ℓ) = j_ℓ for 0 ≤ ℓ ≤ m} and C is the corresponding event for the process {X^{(i)}(t) : t ≥ 0}, according to whether 0 ≤ m < n − 1 or m = n − 1. In this case,

P({J̃_n > J̃_{n−1} + t & X̃(J̃_n) = j} ∩ A) = P({J_{n−m}^{(i)} > J_{n−m−1}^{(i)} + t & X^{(i)}(J_{n−m}^{(i)}) = j} ∩ C) P(B),
and so (*) holds in this case also.

In view of the preceding, we know that the distribution of {X̃(t) : t ≥ 0} is uniquely determined by (4.2.1) plus its initial distribution, and, in fact, that its distribution is the same as the process determined by (4.2.3). Of course, in the degenerate case, although (4.2.3) holds, the exponential random variables {E_n : n ≥ 1} and the Markov chain {X_n : n ≥ 0} will not be functions of the process {X(t) : t ≥ 0}. However, now that we have proved the uniqueness of their distribution, there is no need to let this bother us, and so we may and will always assume that our process is given by (4.2.3).

4.2.2. The Markov Property: We continue with the assumption that the rates are bounded, and we now want to check that the process {X(t) : t ≥ 0} described above possesses the Markov property:

(4.2.4) P(X(s + t) = j | X(τ), τ ∈ [0, s]) = (P(t))_{X(s)j}, where (P(t))_{ij} = P(X(t) = j | X(0) = i).
�9'\m s < �m+ l Ri( s  Jm) } n Bm) = lP' { q,9'1 ,P (t; ( Em+ l  Ri ( s  Jm) , Em+ 2 , · · · Em+n, · · · ) ,
(
,
>
( i, Xm + l , . . . , Xm +n, . . . ) ) = j } n { Em+ l Ri( s  Jm) } n Bm Brn = (P(t) ) ij lP' (Am), = lP'(X(t) = j I X(O) =
i)E[e Ri(sJm), ]
)
from which ( * ) is now an immediate consequence. Just as (4.1.8) implied the semigroup property (cf. (4. 1 .9)) for the transition probability matrices there, so (4.2.4) implies the family {P(t) : t 2: 0} is a semigroup. In fact, the proof is the same as it was there:
(P(t) ) ij = L lP'(X (s + t) = j & X(s) = k I X(O) = i) kE§ = (P(t) ) kj lP' (X(s) = k) = L (P (t) ) kj (P(s) ) ik = (P(s)P(t) ) iJ ' k E§ k E§ In addition, it is important to realize that, at least in principle, the distribution of a Markov process is uniquely determined by its initial distribution together with the semigroup of transition probability matrices {P(t) : t > 0}. To be precise, suppose that {X(t) : t 2: 0} is a family of §valued random variables for which ( 4.2.4) holds, and let JL be its initial distribution (i.e. , the distribution of X(O).) Then, for all n 2: 1, 0 = to < t 1 < · · · < tn, and jo , . . . , jn E §, lP'(X (tm) = jm for 0 0 whenever i+j and there exists a nonnegative function u on § with the property that, for some E > 0,
Σ_{k : i→k} (P)_{jk} u(k) ≤ u(j) − ε/R_j whenever i→j,

then P(e = ∞ | X(0) = i) = 0.

PROOF: To prove the first part, for each N ≥ 1, set u^{(N)}(j) = u(j) when j ∈ F_N and u^{(N)}(j) = u_N when j ∉ F_N. It is an easy matter to check that if Q^{(N)} = R^{(N)}(P − I), where (R^{(N)})_{ij} = R_i^{(N)} δ_{i,j}, and u^{(N)} is the column vector determined by (u^{(N)})_i = u^{(N)}(i), then, for all i ∈ §, (Q^{(N)}u^{(N)})_i ≤ α u^{(N)}(i). Hence, by Kolmogorov's forward equation (4.2.9) applied to {P^{(N)}(t) : t ≥ 0},

(d/dt)(P^{(N)}(t)u^{(N)})_i ≤ α (P^{(N)}(t)u^{(N)})_i, and so (P^{(N)}(T)u^{(N)})_i ≤ e^{αT} u^{(N)}(i).

But, since
this means that

P(ζ_N ≤ T | X(0) = i) ≤ E[u^{(N)}(X^{(N)}(T)) | X(0) = i]/u_N = (P^{(N)}(T)u^{(N)})_i/u_N ≤ e^{αT} u^{(N)}(i)/u_N ≤ e^{αT} u(i)/u_N,

and so, by the Monotone Convergence Theorem, we have now proved that P(e ≤ T | X(0) = i) ≤ lim_{N→∞} P(ζ_N ≤ T | X(0) = i) = 0.

In proving the second assertion, we may and will assume that i→j for all j ∈ §. Now take u^{(N)}(j) = u(j) if j ∈ F_N and u^{(N)}(j) = 0 if j ∉ F_N, and use u^{(N)} to denote the associated column vector. Clearly, (Q^{(N)}u^{(N)})_i is either less than or equal to −ε or equal to 0 according to whether i ∈ F_N or i ∉ F_N. Hence, again by Kolmogorov's forward equation,

(d/dt)(P^{(N)}(t)u^{(N)})_i ≤ −ε Σ_{j∈F_N} (P^{(N)}(t))_{ij},

and so

E[u^{(N)}(X^{(N)}(T)) | X^{(N)}(0) = i] − u^{(N)}(i) ≤ −ε E[∫₀ᵀ 1_{F_N}(X^{(N)}(t)) dt | X^{(N)}(0) = i] = −ε E[T ∧ ζ_N | X^{(N)}(0) = i].
But, since u(N) :2 0, this means that JE [(N IX(Nl (O) = i] ::; u�i) for all and i) < oo. D so lE[eiX (O) = i] ::; u� 4.3.3. What to Do When Explosion Occurs: Although I do not intend to repent, I feel compelled to admit that we have ignored what, from the mathematical standpoint, is the most interesting aspect of the theory under consideration. Namely, so far we have said nothing about the options one has when explosion occurs with positive probability, and in this subsection we will discuss only the most banal of the many choices available. If one thinks of e < oo} as the event that the process escapes its state space in a finite amount of time, then continuation of the process after e might entail the introduction of at least one new state. Indeed, the situation here is very similar to the one encountered by topologists when they want to "compactify" a space. The simplest compactification of a separable, locally compact space is the one point compactification, the compactification which recognizes escape to infinity but ignores all details about the route taken. The analog of one point compactification in the present context is, at the time of explosion, to send the process to an absorbing point � rJ_ §. That is, one defines t E [0, oo ) X (t) E § U {�} by the prescription in ( 4.2.3 ) (remember that, by Theorem 4.3.1, e = J= with probability 1) as long as t E [0, J=) and
{
f+
4.4 Ergodic Properties
95
takes X ( t) = ll for t E [J oo ) . For various reasons, none of which I will explain, this extension is called that the minimal extension of the process. The minimal extension has the virtue that it always works. On the other hand, it, like the one point compactification, has the disadvantage that it completely masks all the fine structure of any particular case. For example, when § = Z , explosion can occur because, given that e < oo, limt /e X(t) = + oo with probability 1 or because, although limve I X(t) l = oo with probability 1, both limve X(t) = + oo and limve X (t) = oo occur with positive probability. In the latter case, one might want to record which of the two possibilities occurred, and this could be done by introducing two new absorbing states, ll + for those paths which escape via + oo and [l _ for those which escape via 00 ,
 00 .
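The minimal extension is easy to watch in a simulation. The sketch below is not from the text: the pure-birth chain, its rates R_i = 2^i (for which Σ_i R_i^{−1} = 2 < ∞, so explosion is almost sure), and all names are illustrative assumptions. It samples the explosion time J_∞ as a sum of independent exponential holding times and sends the path to an absorbing point "Delta" once t ≥ J_∞:

```python
import numpy as np

rng = np.random.default_rng(0)
N_STATES = 60                      # truncation; states beyond contribute < 2**-59 on average
rates = 2.0 ** np.arange(N_STATES)

def sample_holds():
    # Holding time in state i is Exp(rates[i]); J_infinity is their (a.s. finite) sum,
    # with mean sum(2**-i) = 2.
    return rng.exponential(1.0 / rates)

def minimal_extension(holds, t):
    """State occupied at time t, or the absorbing point 'Delta' once t >= J_infinity."""
    jumps = np.cumsum(holds)       # J_1, J_2, ..., (truncated) J_infinity
    if t >= jumps[-1]:
        return "Delta"
    return int(np.searchsorted(jumps, t, side="right"))

samples = np.array([sample_holds().sum() for _ in range(4000)])
print(round(samples.mean(), 2))    # sample mean of J_infinity, close to 2
```

The same `minimal_extension` helper realizes the one-point-compactification picture: all information about how the path escaped is collapsed into the single state "Delta".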
Alternatively, rather than thinking of the explosion time as a time to banish the process from §, one can turn it into a time of renewal by redistributing the process over § at time e and running it again until it explodes, etc. Obviously, there is an infinity of possibilities. Suffice it to say that the preceding discussion hardly scratches the surface. 4.4 Ergodic Properties
In this section we will examine the ergodic behavior of the Markov processes which we have been discussing in this chapter. Our running assumption will be that the process with which we are dealing is a continuous time Markov process of the sort described in §4.2.1, under the condition that there is no explosion.

4.4.1. Classification of States: Just as in §3.1, we begin by classifying the states of §. In order to make our first observation, write Q = R(P − I), where R is the diagonal matrix of rates from ℜ and P is a transition probability matrix whose diagonal entries are 0. Next, define P^ℜ to be the transition probability matrix given by⁵

(4.4.1)  (P^ℜ)_{ij} = (P)_{ij} if R_i > 0 and (P^ℜ)_{ij} = δ_{i,j} if R_i = 0.

Obviously, it is again true that Q = R(P^ℜ − I). In addition,

(4.4.2)  i→j relative to P^ℜ ⟺ ∃ n ≥ 0 (Q^n)_{ij} > 0 ⟺ (P(t))_{ij} > 0 for all t > 0.
Because of (4.2.13), this is obvious when ℜ is bounded, and, in general, it follows from the bounded case plus (cf. the notation in §4.3.1) lim_{N→∞} (P^(N)(t))_{ij} = (P(t))_{ij}.

⁵ It is worth noting that, as distinguished from P, P^ℜ is completely determined by Q.
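Footnote 5 can be checked mechanically. The following sketch is mine, not the book's (the particular Q-matrix is an arbitrary illustration): it recovers P^ℜ from Q exactly as in (4.4.1) and confirms that Q = R(P^ℜ − I):

```python
import numpy as np

def jump_matrix(Q):
    """Recover P^R of (4.4.1) from a Q-matrix: row i is Q_i / R_i off the
    diagonal when R_i = -Q_ii > 0, and the ith row of the identity when
    R_i = 0 (an absorbing state)."""
    Q = np.asarray(Q, dtype=float)
    P = np.eye(len(Q))
    for i, R_i in enumerate(-np.diag(Q)):
        if R_i > 0:
            P[i] = Q[i] / R_i
            P[i, i] = 0.0
    return P

Q = np.array([[-2.0, 2.0, 0.0],
              [1.0, -3.0, 2.0],
              [0.0, 0.0, 0.0]])    # state 2 has rate 0
P_R = jump_matrix(Q)

# Q is recovered as R(P^R - I), so P^R really is determined by Q alone.
R = np.diag(-np.diag(Q))
assert np.allclose(R @ (P_R - np.eye(3)), Q)
```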
On the basis of (4.4.2), we will write i→j when any one of the equivalent conditions in (4.4.2) holds, and we will say that i Q-communicates with j, and will write i↔j, when i→j and j→i. In this connection, § will be said to be Q-irreducible if all states communicate with all other states.

We turn next to the question of recurrence and transience. Because

P(X(t) = i for all t ≥ 0 | X(0) = i) = 1 when R_i = 0,

it is obvious that i should be considered recurrent when R_i = 0. On the other hand, since in the continuous time setting there is no "first positive time," it is less obvious what should be meant by recurrence of i when R_i > 0. However, if one adopts the attitude that time 1 in the discrete time context represents the first time that the chain can move, then it becomes clear that the role of 1 there should be played here by the first jump time J_1, and therefore that the role of ρ_i is played by

(4.4.3)  σ_i ≡ inf{t ≥ J_1 : X(t) = i}.

Hence, we will say that i is Q-recurrent if either R_i = 0 or P(σ_i < ∞ | X(0) = i) = 1, and we will say that it is Q-transient if it is not Q-recurrent. Our next observation is that i is Q-recurrent if and only if it is recurrent with respect to the transition probability matrix P^ℜ.

4.5 Exercises

EXERCISE 4.5.4. In this exercise we will give the continuous-time version of the ideas in Exercise 2.4.1. For this purpose, assume that § is irreducible and positive recurrent with respect to Q, and use π to denote the unique
probability vector which is stationary for each P(t), t > 0. Next, determine the adjoint semigroup {P(t)^⊤ : t ≥ 0} so that

(P(t)^⊤)_{ij} = (π)_j (P(t))_{ji} / (π)_i.

(a) Show that P(t)^⊤ is a transition probability matrix and that πP(t)^⊤ = π for each t ≥ 0. In addition, check that {P(t)^⊤ : t ≥ 0} is the semigroup determined by the Q-matrix Q^⊤, where

(Q^⊤)_{ij} = (π)_j (Q)_{ji} / (π)_i.

(b) Let P and P^⊤ denote the probabilities computed for the Markov processes corresponding, respectively, to Q and Q^⊤ with initial distribution π. Show that P^⊤ is the reverse of P in the sense that, for each n ∈ ℕ, 0 = t_0 < t_1 < ⋯ < t_n, and (j_0, …, j_n) ∈ §^{n+1},

P^⊤(X(t_m) = j_m for 0 ≤ m ≤ n) = P(X(t_n − t_m) = j_m for 0 ≤ m ≤ n).

(c) Define P^⊤ so that

(P^⊤)_{ij} = (π)_j R_j (P)_{ji} / ((π)_i R_i) when R_i > 0 and (P^⊤)_{ij} = δ_{i,j} when R_i = 0.
Show that P^⊤ is again a transition probability matrix, and check that Q^⊤ = R(P^⊤ − I).

EXERCISE 4.5.5. Take § = ℕ and (P)_{ij} equal to 1 or 0 according to whether j = i+1 or not. Given a set of strictly positive rates ℜ, show that, no matter what its initial distribution, the Markov process determined by ℜ and P explodes with probability 1 if Σ_{i∈ℕ} R_i^{−1} < ∞ and does not explode if Σ_{i∈ℕ} R_i^{−1} = ∞.

EXERCISE 4.5.6. Here is a more interesting example of explosion. Take § = ℤ³, and let (cf. Exercise 3.3.5) P be the transition probability matrix for the symmetric, nearest neighbor random walk on ℤ³. Given a set of positive rates ℜ with the property that Σ_{k∈ℤ³} R_k^{−1} < ∞, show that, starting at every k ∈ ℤ³, explosion occurs with probability 1 when (Q)_{kℓ} = R_k((P)_{kℓ} − δ_{k,ℓ}).

Hint: Apply the criterion in the second half of Theorem 4.3.6 to a function of the form

u(k) = Σ_{ℓ∈ℤ³} R_ℓ^{−1} (a² + Σ_{i=1}^{3} ((k)_i − (ℓ)_i)²)^{−1/2},

and use the computation in Exercise 3.3.5.
EXERCISE 4.5.7. Even when § is finite, writing down a reasonably explicit expression for the solution to (4.2.7) is seldom easy and often impossible. Nonetheless, a little linear algebra often does quite a lot of good. Throughout this exercise, Q is a Q-matrix on the state space §, and it is assumed that the associated Markov process exists (i.e., does not explode) starting from any point.

(a) If u ∈ ℂ^§ is a bounded, nonzero, right eigenvector for Q with eigenvalue α ∈ ℂ, show that the real part of α must be less than or equal to 0.

(b) Assume that N = #§ < ∞ and that Q admits a complete set of linearly independent, right eigenvectors u_1, …, u_N ∈ ℂ^§ with associated eigenvalues α_1, …, α_N. Let U be the matrix whose mth column is u_m, and show that e^{tQ} = UΛ(t)U^{−1}, where Λ(t) is the diagonal matrix whose mth diagonal entry is e^{tα_m}.

EXERCISE 4.5.8. Here is the continuous analog of the result in Exercise 3.3.7. Namely, assume that i is Q-recurrent and R_i > 0, set

μ_j ≡ E[∫_0^{σ_i} 1_{{j}}(X(t)) dt | X(0) = i] for j ∈ §,
and let μ̂ be the row vector given by (μ̂)_j = μ_j for each j ∈ §.

(a) Show that μ_i = 1/R_i, that μ_j < ∞ for all j ∈ §, and that μ_j > 0 if and only if i→j.

(b) Show that μ̂ = μ̂P(t) for all t > 0.

(c) In particular, if i is Q-positive recurrent and C = {j : i↔j}, show that μ̂ = (Σ_j μ_j) π̂_C. Equivalently,

(π̂_C)_j = E[∫_0^{σ_i} 1_{{j}}(X(t)) dt | X(0) = i] / E[σ_i | X(0) = i].
EXERCISE 4.5.9. Having given the continuous-time analog of Exercise 3.3.7, we now want to give the continuous-time analog of Exercise 3.3.8. For this purpose, again let i be a Q-recurrent element of § with R_i > 0, and define the
measure μ̂ accordingly, as in Exercise 4.5.8. Next, assume that ν ∈ [0,∞)^§ satisfies the conditions that (ν)_i > 0, that (ν)_j = 0 unless P(σ_j < ∞ | X(0) = i) = 1, and that νQ = 0 in the sense that R_j(ν)_j = Σ_{k≠j} (ν)_k (Q)_{kj} for all j ∈ §.⁶ The goal is to show that ν = R_i(ν)_i μ̂. Equivalently,

(ν)_j = R_i (ν)_i E[∫_0^{σ_i} 1_{{j}}(X(t)) dt | X(0) = i] for all j ∈ §.
In particular, by Exercise 4.5.8, this will mean that ν = νP(t) for all t > 0. In the following, the transition probability P is chosen so that its diagonal entries vanish and Q = R(P − I), where, as usual, R is the diagonal matrix whose jth diagonal entry is R_j for each j ∈ §.

(a) Begin by showing that i is P-recurrent.

(b) Define the row vector v̂ so that (v̂)_j = R_j(ν)_j / (R_i(ν)_i), and show that v̂ = v̂P.

(c) By combining (b) with Exercise 3.3.7, show that

(v̂)_j = E[Σ_{m=0}^{ρ_i − 1} 1_{{j}}(X_m) | X_0 = i],

where {X_n : n ≥ 0} is the Markov chain with transition probability matrix P.

(d) Show that

R_j E[∫_0^{σ_i} 1_{{j}}(X(t)) dt | X(0) = i] = E[Σ_{m=0}^{ρ_i − 1} 1_{{j}}(X_m) | X_0 = i],
and, after combining this with (c), arrive at the desired conclusion.

EXERCISE 4.5.10. Following the strategy used in Exercise 3.3.9, show that, under the hypotheses in Corollary 4.4.10, one has the following continuous version of the individual ergodic theorem:

P( lim_{T→∞} ||L_T − π̂_C||_v = 0 ) = 1,

where C = {i : j↔i} and L_T is the empirical measure determined by

(L_T)_i = (1/T) ∫_0^T 1_{{i}}(X(t)) dt.

In addition, following the strategy in Exercise 3.3.11, show that, for any initial distribution,

P( lim_{T→∞} (1/T) ∫_0^T 1_{{j}}(X(t)) dt = 0 ) = 1

when j is not Q-positive recurrent.
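A quick simulation makes the first assertion plausible. The sketch below is my own illustration (the three-state generator, its stationary vector, and the horizon T are arbitrary assumptions, not data from the text): it runs the jump-chain/exponential-clock construction and compares the empirical occupation measure L_T with the stationary distribution:

```python
import numpy as np

rng = np.random.default_rng(1)

Q = np.array([[-1.0, 1.0, 0.0],
              [0.5, -1.5, 1.0],
              [1.0, 1.0, -2.0]])
R = -np.diag(Q)
P = Q / R[:, None] + np.eye(3)      # jump chain: zero diagonal, rows sum to 1
pi = np.array([0.4, 0.4, 0.2])      # solves pi Q = 0, as one checks directly

T, t, state = 5000.0, 0.0, 0
occupation = np.zeros(3)
while t < T:
    hold = rng.exponential(1.0 / R[state])
    occupation[state] += min(hold, T - t)   # clip the final sojourn at T
    t += hold
    state = rng.choice(3, p=P[state])
L_T = occupation / T
print(np.round(L_T, 2))             # close to pi = (0.4, 0.4, 0.2)
```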
⁶ The reason why the condition νQ = 0 needs amplification is that, because Q has negative diagonal entries, possible infinities could cause ambiguities.
EXERCISE 4.5.11. Given a Markov process {X(t) : t ≥ 0} with Q-matrix Q, one can produce a Markov process with Q-matrix MQ by speeding up the clock of t↦X(t) by a factor of M. That is, {X(Mt) : t ≥ 0} is a Markov process with Q-matrix MQ. The purpose of this exercise is to show how to carry out the analogous procedure, known in the literature as random time change, for variable rates. To be precise, let P be a transition probability matrix on § with (P)_{ii} = 0 for all i. Choose and fix an i ∈ §, let {X_n : n ≥ 0} be a Markov chain starting from i with transition probability matrix P and {N(t) : t ≥ 0} a simple Poisson process which is independent of {X_n : n ≥ 0}, and set X⁰(t) = X_{N(t)} for t ≥ 0. In particular, {X⁰(t) : t ≥ 0} is a Markov process with Q-matrix (P − I) starting from i. Finally, let ℜ be a set of positive rates, and take Q = R(P − I) accordingly.

(a) Define

A(t) = ∫_0^t 1/R_{X⁰(τ)} dτ for t ∈ [0,∞),

observe that t↦A(t) is strictly increasing, set A(∞) = lim_{t↗∞} A(t) ∈ (0,∞], and use s ∈ [0,A(∞)) ↦ A^{−1}(s) ∈ [0,∞) to denote the inverse of t↦A(t). Show that

A^{−1}(s) = ∫_0^s R_{X⁰(A^{−1}(σ))} dσ, s ∈ [0, A(∞)).

(b) Set X(s) = X⁰(A^{−1}(s)) for s ∈ [0,A(∞)). Define J⁰_0 = J_0 = 0 and, for n ≥ 1, J⁰_n and J_n to be the times of the nth jump of t↦X⁰(t) and s↦X(s), respectively. After noting that J_n = A(J⁰_n), conclude that, for each n ≥ 1, s > 0, and j ∈ §,

P(J_n − J_{n−1} > s & X(J_n) = j | X(σ), σ ∈ [0, J_{n−1}]) = e^{−s R_{X(J_{n−1})}} (P)_{X(J_{n−1}) j} on {J_{n−1} < ∞}.

(c) Show that the explosion time for the Markov process starting from i with Q-matrix Q has the same distribution as A(∞). In particular, if A(∞) = ∞ with probability 1, use (b) to conclude that {X(s) : s ≥ 0} is a Markov process starting from i with Q-matrix Q.

(d) As a consequence of these considerations, show that if the Markov process corresponding to Q does not explode, then neither does the one corresponding to Q′, where Q′ is related to Q by (Q′)_{ij} = a_i(Q)_{ij} and {a_i : i ∈ §} is a bounded subset of (0,∞).
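The identity in (b) can be exercised numerically. In the sketch below (my own construction; the three-state jump chain and the target rates are arbitrary illustrations), the unit-rate holding times of X⁰ are divided by R at the pre-jump state — which is exactly what A does between consecutive jumps — and the resulting holding times have the Exp(R_i) means required of the process with Q-matrix R(P − I):

```python
import numpy as np

rng = np.random.default_rng(2)

P = np.array([[0.0, 1.0, 0.0],
              [0.5, 0.0, 0.5],
              [1.0, 0.0, 0.0]])     # jump chain with zero diagonal
R = np.array([1.0, 2.0, 4.0])       # target rates

n_jumps = 20000
states = np.empty(n_jumps + 1, dtype=int)
states[0] = 0
for n in range(n_jumps):
    states[n + 1] = rng.choice(3, p=P[states[n]])
unit_holds = rng.exponential(1.0, size=n_jumps)   # holding times of X0

# Between J0_{n-1} and J0_n, A has constant slope 1/R at the current state, so
# J_n - J_{n-1} = (J0_n - J0_{n-1}) / R_{X(J_{n-1})}, an Exp(R) variable.
new_holds = unit_holds / R[states[:-1]]
for s in range(3):
    print(s, round(new_holds[states[:-1] == s].mean(), 2))   # near 1/R_s
```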
CHAPTER 5
Reversible Markov Processes
This chapter is devoted to the study of a class of Markov processes which admit an initial distribution with respect to which they are reversible in the sense that, on every time interval, the distribution of the process is the same when it is run backwards as when it is run forwards. That is, for any n ≥ 1 and (i_0, …, i_n) ∈ §^{n+1}, in the discrete time setting,

(5.0.1)  P(X_m = i_m for 0 ≤ m ≤ n) = P(X_{n−m} = i_m for 0 ≤ m ≤ n),

and in the continuous time setting,

(5.0.2)  P(X(t_m) = i_m for 0 ≤ m ≤ n) = P(X(t_n − t_m) = i_m for 0 ≤ m ≤ n)

whenever 0 = t_0 < ⋯ < t_n. Notice that the initial distribution of such a process is necessarily stationary. Indeed, depending on whether the setting is that of discrete or continuous time, we have

P(X_0 = i & X_n = j) = P(X_n = i & X_0 = j) or P(X(0) = i & X(t) = j) = P(X(t) = i & X(0) = j),

from which stationarity follows after one sums over j. In fact, what the preceding argument reveals is that reversibility says that the joint distribution of, depending on the setting, (X_0, X_n) or (X(0), X(t)) is the same as that of (X_n, X_0) or (X(t), X(0)). This should be contrasted with stationarity, which gives equality only for the marginal distribution of the first components of these. In view of the preceding, one should suspect that reversible Markov processes have ergodic properties which are better than those of general stationary processes, and in this chapter we will examine some of these special properties.

5.1 Reversible Markov Chains

In this section we will discuss irreducible, reversible Markov chains. Because the initial distribution of such a chain is stationary, we know (cf. Theorem 3.2.10) that the chain must be positive recurrent and that the initial distribution must be the probability vector (cf. (3.2.9)) π = π̂_§ whose ith component
is (π)_i = E[ρ_i | X_0 = i]^{−1}. Thus, if P is the transition probability matrix, then, by taking n = 1 in (5.0.1), we see that

(π)_i (P)_{ij} = P(X_0 = i & X_1 = j) = P(X_0 = j & X_1 = i) = (π)_j (P)_{ji}.

That is, P satisfies¹

(5.1.1)  (π)_i (P)_{ij} = (π)_j (P)_{ji},

the condition of detailed balance.

Conversely, (5.1.1) implies reversibility. To see this, one works by induction on n ≥ 1 to check that

(π)_{i_0} (P)_{i_0 i_1} ⋯ (P)_{i_{n−1} i_n} = (π)_{i_n} (P)_{i_n i_{n−1}} ⋯ (P)_{i_1 i_0},

which is equivalent to (5.0.1).
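The path identity just stated is easy to watch in action. The sketch below is my own illustration (the three-state birth–death chain is an arbitrary example; birth–death chains always satisfy detailed balance): it computes the probability of a path and of its reversal under the stationary chain:

```python
import numpy as np

P = np.array([[0.5, 0.5, 0.0],
              [0.25, 0.25, 0.5],
              [0.0, 0.5, 0.5]])
pi = np.array([0.2, 0.4, 0.4])      # stationary and in detailed balance with P

def path_prob(path):
    """(pi)_{i_0} (P)_{i_0 i_1} ... (P)_{i_{n-1} i_n}."""
    p = pi[path[0]]
    for a, b in zip(path, path[1:]):
        p *= P[a, b]
    return p

# Detailed balance holds entrywise ...
assert np.allclose(pi[:, None] * P, (pi[:, None] * P).T)
# ... and therefore every path has the same probability as its reversal.
path = [0, 1, 2, 2, 1]
print(path_prob(path), path_prob(path[::-1]))   # equal (0.0125, up to rounding)
```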
5.1.1. Reversibility from Invariance: As we have already seen, reversibility implies invariance, and it should be clear that the converse is false. On the other hand, there are two canonical ways in which one can pass from an irreducible transition probability P with stationary distribution π to a transition probability for which π is reversible. Namely, define the adjoint P^⊤ of P so that

(5.1.2)  (P^⊤)_{ij} = (π)_j (P)_{ji} / (π)_i.

Obviously, π is reversible for P if and only if P = P^⊤. More generally, because πP = π, P^⊤ is again a transition probability. In addition, one can easily verify that both

(5.1.3)  ½(P + P^⊤) and P^⊤P

are transition probabilities which are reversible with respect to π. As is explained in Exercise 5.6.9 below, each of these constructions has its own virtue.

5.1.2. Measurements in Quadratic Mean: For reasons which will become increasingly clear, it turns out that we will here want to measure the size of functions using a Euclidean norm rather than the uniform norm ||f||_u. Namely, we will use the norm

(5.1.4)  ||f||_{2,π} ≡ (⟨f²⟩_π)^{1/2}, where ⟨g⟩_π ≡ Σ_{i∈§} g(i)(π)_i

¹ The reader who did Exercise 2.4.1 should recognize that the condition below is precisely the same as the statement that P = P^⊤. In particular, if one knows the conclusion of that exercise, then one has no need for the discussion which follows.
is the expected value of g with respect to π. Because (π)_i > 0 for each i ∈ §, it is clear that ||f||_{2,π} = 0 ⟺ f ≡ 0. In addition, if we define the inner product (f,g)_π to be ⟨fg⟩_π, then, for any t > 0,

0 ≤ ||tf ± t^{−1}g||²_{2,π} = t²||f||²_{2,π} ± 2(f,g)_π + t^{−2}||g||²_{2,π},

and so |(f,g)_π| ≤ ½(t²||f||²_{2,π} + t^{−2}||g||²_{2,π}) for all t > 0. To get the best estimate, we minimize the right hand side with respect to t > 0. When either f or g is identically 0, then we see that (f,g)_π = 0 by letting t → ∞ or t → 0. If neither f nor g vanishes identically, we do best by taking t = (||g||_{2,π}/||f||_{2,π})^{1/2}. Hence, in any case, we arrive at Schwarz's inequality

(5.1.5)  |(f,g)_π| ≤ ||f||_{2,π} ||g||_{2,π}.

Given Schwarz's inequality, we know that ||f + g||²_{2,π} ≤ (||f||_{2,π} + ||g||_{2,π})². That is, we have the triangle inequality:

(5.1.6)  ||f + g||_{2,π} ≤ ||f||_{2,π} + ||g||_{2,π}.

Thus, if L²(π) denotes the space of f for which ||f||_{2,π} < ∞, then L²(π) is a linear space for which (f,g) ↦ ||f − g||_{2,π} is a metric. In fact, this metric space is complete since, if lim_{m→∞} sup_{n≥m} ||f_n − f_m||_{2,π} = 0, then {f_n(i) : n ≥ 0} is Cauchy convergent in ℝ for each i ∈ §, and therefore there exists a limit function f such that f_n(i) → f(i) for each i ∈ §. Moreover, by Fatou's Lemma,

||f − f_m||_{2,π} ≤ lim inf_{n→∞} ||f_n − f_m||_{2,π} → 0 as m → ∞.
In this chapter, we will use the notation
Pf(i) = Σ_{j∈§} f(j)(P)_{ij} = (Pf)_i,

where f is the column vector determined by f. When f is bounded, it is clear that Pf(i) is well-defined and that ||Pf||_u ≤ ||f||_u, where ||g||_u ≡ sup_{i∈§} |g(i)| = ||g||_u when g is the column vector determined by g. We next want to show that, even when f ∈ L²(π), Pf(i) is well-defined for each i ∈ § and that P is a contraction in L²(π): ||Pf||_{2,π} ≤ ||f||_{2,π}. To check that Pf(i) is well-defined, we will show that the series Σ_{j∈§} f(j)(P)_{ij} is absolutely convergent. But, by the form of Schwarz's inequality in Exercise 1.3.1,

Σ_{j∈§} |f(j)|(P)_{ij} ≤ ||f||_{2,π} ( Σ_{j∈§} (P)²_{ij}/(π)_j )^{1/2},
and, by (5.1.1),

Σ_{j∈§} (P)²_{ij}/(π)_j = (1/(π)_i) Σ_{j∈§} (P)_{ji}(P)_{ij} = (P²)_{ii}/(π)_i ≤ 1/(π)_i.
As for the estimate ||Pf||_{2,π} ≤ ||f||_{2,π}, we use Exercise 5.6.2 below together with Σ_j (P)_{ij} = 1 to see that (Pf(i))² ≤ Pf²(i) for each i. Thus, since π is P-stationary,

(5.1.7)  ||Pf||²_{2,π} = Σ_{i∈§} (Pf(i))²(π)_i ≤ Σ_{i∈§} Pf²(i)(π)_i = ⟨f²⟩_π = ||f||²_{2,π}.

An important consequence of (5.1.7) is the fact that, in general (cf. (2.2.4)),

(5.1.8)  lim_{n→∞} ||A_n f − ⟨f⟩_π||_{2,π} = 0 for all f ∈ L²(π), where A_n ≡ (1/n) Σ_{m=0}^{n−1} P^m,

and

(5.1.9)  P is aperiodic ⟹ lim_{n→∞} ||P^n f − ⟨f⟩_π||_{2,π} = 0 for all f ∈ L²(π).
To see these, first observe that, by (3.2.11), (3.2.15), and Lebesgue's Dominated Convergence Theorem, there is nothing to do when f vanishes off of a finite set. Thus, if {F_N : N ≥ 1} is an exhaustion of § by finite sets and if, for f ∈ L²(π), f_N ≡ 1_{F_N} f, then, for each N ∈ ℤ⁺,

||A_n f − ⟨f⟩_π||_{2,π} ≤ ||A_n(f − f_N)||_{2,π} + ||A_n f_N − ⟨f_N⟩_π||_{2,π} + ⟨|f_N − f|⟩_π
                     ≤ 2||f − f_N||_{2,π} + ||A_n f_N − ⟨f_N⟩_π||_{2,π},

where, in the passage to the second line, we have used ||A_n g||_{2,π} ≤ ||g||_{2,π}, which follows immediately from (5.1.7). Thus, for each N, lim sup_{n→∞} ||A_n f − ⟨f⟩_π||_{2,π} ≤ 2||f − f_N||_{2,π}, which gives (5.1.8) when N → ∞. The argument
IIAnf  (!) 7r l l 2 ,1r :S II An( !  fN ) ll 2,7r + IIAnfN  (JN)1r ll 2,7r + (IJN  J l ) 7r :S 2 IIJ  JN II 2 ,7r + IIAnfN  (JN)7r ll 2 ,7r , where, in the passage to the second line, we have used IIAng ll 2 , 1r :S l l g l l 2 ,1r , which follows immediately from (5.1.7) . Thus, for each N, limn +oo IIAnf (!) 1r l l 2 ,1r :S 2 11 !  !N I I 2 . 1r , which gives (5.1 .8) when N + oo. The argument
for (5. 1.9) is essentially the same, and, as is shown in Exercise 5.6.6 below, all these results hold even in the nonreversible setting. Finally, (5.1.1) leads to
(i ,j )
( i ,j )
In other words, P is symmetric on L 2 ( 1r ) in the sense that (5. 1 .10) 5. 1.3. The Spectral Gap: The equation (5. 1 .7) combined with (5.1 . 10) say that P a selfadjoint contraction on the Hilbert space2 is L 2 (rr). For the 2 A Hilbert space is a vector space equipped with an inner product which determines a norm for which the associated metric is complete.
5.1 Reversible Markov Chains
111
reader who is unfamiliar with these concepts at this level of abstraction, think about the case when § = {1, . . . , N}. Then the space of functions f : § + lR can be identified with JRN . (Indeed, we already made this identification when we gave, in §2. 1 .3, the relation between functions and column vectors.) After making this identification, the inner product on L2 ( 1r) becomes the inner product I:i (rr)i(v)i (w)i for column vectors and w in JRN . Hence (5.1.7) says that the matrix acts as a symmetric, contraction on JRN with this inner product. Alternatively, if P where is the diagonal matrix whose ith diagonal entry is (rr)i, then, by (5. 1.1), P is symmetric with respect to the standard inner product (v, w)JR.N I:i (v)i (w)i on JR N . Moreover, because, by (5.1 .7), v
P
=
II�PII �,
II
=
II �r,
we see where g is the function determined by the column vector that, as an operator on JRN , P is length contracting. Now, by the stan dard theory of symmetric matrices on JR N , we know that P admits eigen values 1 2: A1 2: · · · A N 2: 1 with associated eigenvectors ( e 1 , . . . , e N ) which are orthonormal for ( · , · )JRN : (ek , ee)JRN = ok,C· Moreover, because � = I:f= 1 (P)ij y'(7!1j, we know that A 1 = 1 and can take ( el)i = � Finally, by setting ge = (II)  � ee and letting ge be the associated function on §, we see that P ge = Aege, 91 1 , and ( gkl ge ) "' = ok,C · To summarize, when § has N elements, we have shown that P on L2 (rr) has eigenvalues 1 = A1 2: · · · 2: A N 2: 1 with corresponding eigenfunctions g1 , . . . , gN which are orthonormal with respect to ( · , · ) . Of course, since L2 (rr) has dimension N and, by orthonormality, the ge's are linearly independent, (g1 , . . . , gN ) is an orthonormal basis in L2 ( 1r ) . In particular, =
"'
( 5.1.11)
pn f 
(!)"'
N = L X£ ( f , ge ) "' ge for all n 2: 0 and f E L2(rr), £= 2
and so

||P^n f − ⟨f⟩_π||²_{2,π} = Σ_{ℓ=2}^{N} λ_ℓ^{2n} (f, g_ℓ)²_π ≤ (1−β)^{2n} ||f − ⟨f⟩_π||²_{2,π},

where β = (1−λ_2) ∧ (1+λ_N) is the spectral gap between {−1, 1} and {λ_2, …, λ_N}. In other words,

(5.1.12)  ||P^n f − ⟨f⟩_π||_{2,π} ≤ (1−β)^n ||f − ⟨f⟩_π||_{2,π} for all n ≥ 0 and f ∈ L²(π).
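For a concrete finite case, both β and the bound (5.1.12) can be computed directly from the symmetrized matrix Π^{1/2}PΠ^{−1/2}. The sketch below is my own numerical illustration (the reversible three-state chain is an arbitrary example, not one from the text):

```python
import numpy as np

P = np.array([[0.5, 0.5, 0.0],
              [0.25, 0.25, 0.5],
              [0.0, 0.5, 0.5]])
pi = np.array([0.2, 0.4, 0.4])      # reversible for this P

s = np.sqrt(pi)
P_tilde = (s[:, None] * P) / s[None, :]       # symmetric, same eigenvalues as P
lam = np.sort(np.linalg.eigvalsh(P_tilde))    # here (-0.25, 0.5, 1.0)
beta = min(1.0 - lam[-2], 1.0 + lam[0])       # spectral gap, here 0.5

# Check (5.1.12) for one f and one n:
f = np.array([1.0, -2.0, 3.0])
mean = pi @ f
n = 5
lhs = np.sqrt(pi @ (np.linalg.matrix_power(P, n) @ f - mean) ** 2)
rhs = (1.0 - beta) ** n * np.sqrt(pi @ (f - mean) ** 2)
assert lhs <= rhs + 1e-12
```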
When § is not finite, it will not be true in general that one can find an orthonormal basis of eigenfunctions for P. Instead, the closest approximation
to the preceding development requires a famous result, known as the Spectral Theorem (cf. §107 in [7]), about bounded, symmetric operators on a Hilbert space. Nonetheless, seeing as it is the estimate (5.1.12) in which we are most interested, we can get away without having to invoke the Spectral Theorem. To be more precise, observe that (5.1.12) holds when

(5.1.13)  β = 1 − sup{ ||Pf − ⟨f⟩_π||_{2,π} : f ∈ L²(π) with ||f||_{2,π} = 1 }.

To see this, first note that, when β is given by (5.1.13),
||Pf − ⟨f⟩_π||_{2,π} = ||f||_{2,π} || P(f/||f||_{2,π}) − ⟨f/||f||_{2,π}⟩_π ||_{2,π} ≤ (1−β) ||f||_{2,π} when f ∈ L²(π)\{0},

from which (5.1.12) follows by induction, since ⟨P^{n−1}f⟩_π = ⟨f⟩_π.

Next, note that, by (5.1.1), (P)_{ij} > 0 ⟺ (P)_{ji} > 0. In particular, since, for each i, (P)_{ij} > 0 for some j and therefore (P²)_{ii} = Σ_j (P)_{ij}(P)_{ji} > 0, we see that the period must divide 2. To complete the proof at this point, first suppose that d = 1. If f ∈ L²(π) satisfies f = −Pf, then, as noted before, ⟨f⟩_π = 0, and yet, because of aperiodicity and (5.1.9), lim_{n→∞} P^n f(i) = ⟨f⟩_π = 0 for each i ∈ §. Since f = P^{2n}f for all n ≥ 0, this means that f ≡ 0. Conversely, if d = 2, take §₀ and §₁ accordingly, as in §3.2.7, and consider f = 1_{§₀} − 1_{§₁}. Because of (3.2.19), Pf = −f, and clearly ||f||_{2,π} = 1. □

As an immediate corollary of the preceding, we can give the following graph theoretic picture of aperiodicity for irreducible, reversible Markov chains. Namely, if we use P to define a graph structure in which the elements of § are the "vertices" and an "edge" between i and j exists if and only if (P)_{ij} > 0, then the first part of Theorem 5.1.14, in combination with the considerations in §3.2.7, says that the resulting graph is bipartite (i.e., splits into two parts in such a way that all edges run from one part to the other) if and only if the chain fails to be aperiodic, and the second part says that this is possible if and only if there exists an f ∈ L²(π)\{0} satisfying Pf = −f.

5.1.5. Relation to Convergence in Variation: Before discussing methods for finding or estimating the β in (5.1.13), it might be helpful to compare the sort of convergence result contained in (5.1.12) to the sort of results we have been getting heretofore. To this end, first observe that

|P^n f(i) − ⟨f⟩_π| ≤ ||f||_u ||δ_i P^n − π||_v for each i ∈ §, and hence ||P^n f − ⟨f⟩_π||_{2,π} ≤ ||f||_u sup_i ||δ_i P^n − π||_v.
In particular, if one knows, as one does when Theorem 2.2.1 applies, that

(*)  sup_i ||δ_i P^n − π||_v ≤ C(1−ε)^n

for some C < ∞ and ε ∈ (0,1], then one has that

||P^n f − ⟨f⟩_π||_{2,π} ≤ C(1−ε)^n ||f||_u,

which looks a lot like (5.1.12). Indeed, the only difference is that on the right hand side of (5.1.12), C = 1 and the norm is ||f||_{2,π} instead of ||f||_u. Thus, one should suspect that (*) implies that the β in (5.1.12) is at least as large as the ε in (*). If, as will always be the case when § is finite, there exists a g ∈ L²(π)
with the properties that ||g||_{2,π} = 1, ⟨g⟩_π = 0, and either Pg = (1−β)g or Pg = −(1−β)g, this suspicion is easy to verify. Namely, set f̄ = g1_{[−R,R]}(g), where R > 0 is chosen so that α ≡ (f̄,g)_π ≥ ½, and set f = f̄ − ⟨f̄⟩_π. Then, after writing f = αg + (f−αg) and noting that (g, f−αg)_π = 0, we see that, for any n ≥ 0,

||P^n f||²_{2,π} = α²(1−β)^{2n} ± 2α(1−β)^n (g, P^n(f−αg))_π + ||P^n(f−αg)||²_{2,π} ≥ (1−β)^{2n}/4,

since (g, P^n(f−αg))_π = (P^n g, f−αg)_π = (±(1−β))^n (g, f−αg)_π = 0. On the other hand, by (*), ||P^n f||²_{2,π} ≤ C²(1−ε)^{2n}||f||²_u, and so (1−β)^{2n}/4 ≤ C²(1−ε)^{2n} for all n ≥ 0, which is possible only if 1−β ≤ 1−ε. That is, β ≥ ε.

5.2 Dirichlet Forms and Estimation of β

To go further, we need to borrow the following simple result from the theory of symmetric operators on a Hilbert space. Namely,
(5.2.2)  sup{ ||Pf||_{2,π} : f ∈ L₀²(π) & ||f||_{2,π} = 1 } = sup{ |(f,Pf)_π| : f ∈ L₀²(π) & ||f||_{2,π} = 1 }.

That the right hand side of (5.2.2) is dominated by the left is Schwarz's inequality: |(f,Pf)_π| ≤ ||f||_{2,π} ||Pf||_{2,π}. To prove the opposite inequality, let f ∈ L₀²(π) with ||f||_{2,π} = 1 be given, assume that ||Pf||_{2,π} > 0, and set g = Pf/||Pf||_{2,π}. Then g ∈ L₀²(π) and ||g||_{2,π} = 1. Hence, if γ denotes the supremum on the right hand side of (5.2.2), then, from the symmetry of P,

4||Pf||_{2,π} = 4(g,Pf)_π = ((f+g), P(f+g))_π − ((f−g), P(f−g))_π
            ≤ γ( ||f+g||²_{2,π} + ||f−g||²_{2,π} ) = 2γ( ||f||²_{2,π} + ||g||²_{2,π} ) = 4γ.
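In finite dimensions both suprema in (5.2.2) can be computed exactly from the eigenvalues of the symmetrized matrix, since on L₀²(π) — the orthogonal complement of the constants — each equals the largest absolute eigenvalue other than λ_1 = 1. A small check of mine (the reversible chain is an arbitrary example):

```python
import numpy as np

P = np.array([[0.5, 0.5, 0.0],
              [0.25, 0.25, 0.5],
              [0.0, 0.5, 0.5]])
pi = np.array([0.2, 0.4, 0.4])      # reversible for this P
s = np.sqrt(pi)
lam = np.sort(np.linalg.eigvalsh((s[:, None] * P) / s[None, :]))

# Both sides of (5.2.2) restricted to L_0^2(pi): max(|lambda_2|, ..., |lambda_N|).
both = max(abs(lam[0]), abs(lam[-2]))
print(round(both, 10))              # 0.5
```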
5 REVERSIBLE MARKOV PROCESSES
1 16
The advantage won for us by (5.2.2) is that, in conjunction with (5.2.1), it shows that

(5.2.3)  β = β₊ ∧ β₋, where β± ≡ inf{ (f, (I∓P)f)_π : f ∈ L₀²(π) & ||f||_{2,π} = 1 }.

Notice that, because (I−P)f = (I−P)(f − ⟨f⟩_π),

(5.2.4)  β₊ = inf{ (f, (I−P)f)_π : f ∈ L²(π) & Var_π(f) ≡ ||f − ⟨f⟩_π||²_{2,π} = 1 }.

At the same time, because

(f, (I+P)f)_π = (f − ⟨f⟩_π, (I+P)(f − ⟨f⟩_π))_π + 2⟨f⟩²_π,
it is clear that the infimum in the definition of β₋ is the same whether we consider all f ∈ L²(π) or just those in L₀²(π). Hence, another expression for β₋ is

(5.2.5)  β₋ = inf{ (f, (I+P)f)_π : f ∈ L²(π) & ||f||_{2,π} = 1 }.

Two comments are in order here. First, when P is nonnegative definite, abbreviated by P ≥ 0, in the sense that (f,Pf)_π ≥ 0 for all f ∈ L²(π), then β₊ ≤ 1 ≤ β₋, and so β = β₊. Hence,

(5.2.6)  P ≥ 0 ⟹ β = inf{ (f, (I−P)f)_π : f ∈ L²(π) & Var_π(f) = 1 }.
L f(i) (7r )i (P)ij ( J (i)  J (j)) ( i, j ) = L f(i)( 7r )j (P)ji ( J (i)  f(j)) = L J (j)(7r )i (P)ij ( J (j)  f( i )) . ( i,j) (i,j )
(!, (I  P) J ) rr =
Hence, when we add the second expression to the last, we find that
�
P) J)rr = E (f, f) == L( 7r)i(P)ij ( J (j)  f(i)) 2 . i#j Because the quadratic form [ ( !, f) is a discrete analog of the famous quadratic form I V fl 2 (x ) dx introduced by Dirichlet, it is called a Dirichlet form. Extending this metaphor, one interprets {3+ as the Poincare constant (5.2.7)
�J
(5.2.8)
(!, (I 
117
5.2 Dirichlet Forms and Estimation of (3 in the Poincare inequality
( 5.2.9) To make an analogous application of ( 5.2.5), observe that
L ( J (j ) + ! ( i) ) 2 (n) i( P)ij = (P !2 ) rr + 2 (!, P f)rr + 11!11 � , rr ( i ,j ) = 2 11!11 �, rr + 2 (!, p f ) rr , and therefore
( 5.2.10)
(!, ( I + P ) f) rr = £ ( !, ! )
=
1
2 "' L..) n ) i ( P ) ij ( J (j )
(i ,j )
+ f(i) ) 2 .
Hence
L2 ( n ) & Varrr ( f ) = 1 } = inf{ E(f, f ) : f E L6 ( n) & IIJII 2 ,rr = 1 } .
(3_ = inf{ E( f, f ) : f E
( 5.2.11)
In order to give an immediate application of ( 5.2.9) and ( 5.2.11), suppose ( cf. Exercise 2.4.3) that (P) ij :::: Ej for all ( i, j) , set e = Lj Ej , assume that e > 0, and define the probability vector JL so that ( JL)i = � · Then, by Schwarz's inequality for expectations with respect to JL and the variational characterization of Varrr ( f ) as the minimum value of a� ( ( !  a) 2 )rr ,
2£ ( !, f) = L (n)i(P)ij ( J (j)  f( i) ) 2 ( i ,j ) :::: e L ( J (j)  f (i)) 2 ( n) i ( JL) j :::: L ( ( f)JL  f (i)) 2 ( n ) i :::: eVarrr ( f ) , (i ,j ) e
and, similarly, E( f, f ) :::: �Varrr ( f ) . Hence, by ( 5.2.9) , ( 5.2.11), and ( 5.2.3), :::: �. Of course, this result is a significantly less strong than the one we get by combining Exercise 2.4.3 with the reasoning in §5.1.4. Namely, by that exercise we know that llbip n  n llv ::::: 2 ( 1  ) n , and so the reasoning in §5.1.4 tells us that (3 :::: which is twice as good as the estimate we are getting here. On the other hand, as we already advertised, there are circumstances when the considerations here yield results when a Doeblintype approach does not. 5.2.2. Estimating (3+ : The origin of many applications of ( 5.2.9) and ( 5.2.11) to estimate (3+ and (3_ is the simple observation that (3
e
e,
( 5.2.12)
Varrr ( f ) =
� L:: (J (i)  f (j) ) 2 (n)i (n)j , 2]
5 REVERSIBLE MARKOV PROCESSES
1 18
which is easily checked by expanding (f ( i)  f (j)) 2 and seeing that the sum on the right equals 2 ( ! 2 ) 7r  2 ( !)';, . The importance of (5.2.12) is that it expresses Var7r (f) in terms of difference between the values of f at different points in §, and clearly E ( f , f ) is also given in terms of such differences. However, the differences which appear in E(f, f) are only between the values of f at pairs of points ( i, j) for which (P)ij > 0, whereas the right hand of side of (5.2.12) entails sampling all pairs ( i, j). Thus, in order to estimate Var7r ( f ) in terms of E( f, f ) , it is necessary to choose, for each (i, j) with i =/= j, a path p(i, j) = (ko, . . . , kn ) E §n + l with k0 i and kn = j , which is allowable in the sense (P)k= ' k= > 0 for each 1 :S m :S n , and to write 2 2 (f (i)  t (J ) ) = L !J.et eEp (i ,j ) where the summation in e is taken over the oriented segments (km 1, km) in the path p(i,j), and, for e = (k, R.), !J. e f f(R.)  f(k). At this point there are various ways in which one can proceed. For example, given any {a(e) : e E p(i, j)} 0 for each 1 _::::; m _::::; 2n + 1) , is closed (i.e. , = and starts at i (i.e., = i ) . Note our insistence that this path have an odd number of steps. The reason for our doing so is that when the number of steps is odd, an elementary exercise in telescoping sums shows that
(rr (p)) 
(rr(p)) + (rr)j
p
(3_
(3+
(ko, . . . , kzn+d
k0
(3_
ko kzn+ l) ,
m ) 1 m=O (f(km) 2n
2f(i) = L (
+ J ( km+ l)).
Thus, if $\mathcal{P}$ is a selection of such paths, one path $p(i)$ for each $i \in \mathbb{S}$, and we make an associated choice of coefficients $A = \{a(e,p) : e \in p \in \mathcal{P}\} \subseteq (0,\infty)$, then, just as in the preceding section,
$$4\|f\|_{2,\pi}^2 = \sum_i \Bigl(\sum_{e\in p(i)} (-1)^{m(e)}\,\tilde\Delta_e f\Bigr)^2 (\pi)_i \le \sum_{p\in\mathcal{P}} \pi(p)\Bigl(\sum_{e\in p} a(e,p)\Bigr)\Bigl(\sum_{e\in p} \frac{(\tilde\Delta_e f)^2}{a(e,p)}\Bigr),$$
where, when $p = (k_0,\dots,k_{2n+1})$, $\pi(p) = (\pi)_{k_0}$ and, for $0 \le m \le 2n$, $m(e) = m$ and $\tilde\Delta_e f = f(k_m) + f(k_{m+1})$ if $e = (k_m,k_{m+1})$. Hence, if
$$w(p) = \pi(p)\Bigl(\max_{e\in p}\frac{1}{a(e,p)\,\rho(e)}\Bigr)\sum_{e'\in p} a(e',p), \quad\text{where } \rho(e) = (\pi)_k(\mathbf{P})_{k\ell} \text{ for } e = (k,\ell),$$
then
$$2\|f\|_{2,\pi}^2 \le W(\mathcal{P},A)\,\bigl(f,(\mathbf{I}+\mathbf{P})f\bigr)_\pi \quad\text{where } W(\mathcal{P},A) = \sup_e \sum_{p\ni e} w(p),$$
and, by (5.2.11), this proves that
$$\beta_- \ge \frac{2}{W(\mathcal{P},A)}$$
for any choice of paths $\mathcal{P}$ and coefficients $A$ satisfying the stated requirements. When we take $a(e,p) \equiv 1$, this specializes to
(5.2.15) $$\beta_- \ge \frac{2}{W_1(\mathcal{P})}, \quad\text{where } W_1(\mathcal{P}) = \sup_e \sum_{p\ni e} \pi(p)\,|p| \max_{e'\in p} \frac{1}{\rho(e')}$$
and $|p|$ denotes the number of steps in $p$.
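Path-counting estimates of this sort are easy to compute. The sketch below (an illustration, not from the text: the five-state chain, the choice of paths, and the standard canonical-path normalization for the gap $\beta_+$ are all assumptions made here, with `numpy` as the only dependency) computes a path bound of the type discussed in this subsection and compares it with the exact spectral gap:

```python
import numpy as np

def spectral_gap(P, pi):
    # 1 - second-largest eigenvalue of P, via the symmetrization
    # S = D^{1/2} P D^{-1/2}, which shares P's eigenvalues when P is reversible.
    d = np.sqrt(pi)
    S = (d[:, None] * P) / d[None, :]
    ev = np.sort(np.linalg.eigvalsh((S + S.T) / 2))
    return 1.0 - ev[-2]

def canonical_path_bound(P, pi, paths):
    # W = max over directed edges e of (1/rho(e)) * sum of pi_i*pi_j*|p(i,j)|
    # over chosen paths through e, with rho(e) = pi_k P_kl; the gap is >= 1/W.
    load = {}
    for (i, j), p in paths.items():
        for m in range(1, len(p)):
            e = (p[m - 1], p[m])
            load[e] = load.get(e, 0.0) + pi[i] * pi[j] * (len(p) - 1)
    W = max(w / (pi[k] * P[k, l]) for (k, l), w in load.items())
    return 1.0 / W

# Nearest-neighbour walk on {0,...,4}, holding probability 1/2 at the ends;
# its stationary distribution is uniform and it is reversible.
n = 5
P = np.zeros((n, n))
for k in range(n):
    for l in (k - 1, k + 1):
        if 0 <= l < n:
            P[k, l] = 0.5
    P[k, k] = 1.0 - P[k].sum()
pi = np.full(n, 1.0 / n)

# One allowable path per ordered pair: march monotonically along the line.
paths = {(i, j): tuple(range(i, j + 1)) if i < j else tuple(range(i, j - 1, -1))
         for i in range(n) for j in range(n) if i != j}

lb = canonical_path_bound(P, pi, paths)
gap = spectral_gap(P, pi)
assert 0 < lb <= gap + 1e-12
```

The bound is never above the true gap, and for a chain like this it is reasonably tight; congested edges (here, the middle of the line) dominate the supremum, exactly as the formulas above predict.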
It should be emphasized that the preceding method for getting estimates on $\beta_-$ is inherently flawed in that it appears incapable of recognizing spectral properties of $\mathbf{P}$ like nonnegative definiteness. That is, when $\mathbf{P}$ is nonnegative definite, then $\beta_- \ge 1$, but it seems unlikely that one could get that conclusion out of the arguments being used here. On the other hand, because our real interest is in $\beta = \beta_+ \wedge \beta_-$ and the estimates given by (5.2.14) and (5.2.15) are likely to be comparable, the inadequacy of (5.2.15) causes less damage than one otherwise might fear.
5.3
Reversible Markov Processes in Continuous Time
Here we will see what the preceding theory looks like in the continuous-time context and will learn that it is both easier and more aesthetically pleasing there. We will be working with the notation and theory developed in Chapter 4. In particular, $\mathbf{Q}$ will denote a matrix of the form $\mathbf{R}(\mathbf{P}-\mathbf{I})$, where $\mathbf{R}$ is a diagonal matrix whose diagonal entries are the rates $\mathfrak{R} = \{R_i : i \in \mathbb{S}\}$ and $\mathbf{P}$ is a transition probability matrix. We will assume throughout that $\mathbf{Q}$ is irreducible in the sense that (cf. (4.4.2)), for each pair $(i,j) \in \mathbb{S}^2$, $(\mathbf{Q}^n)_{ij} > 0$ for some $n \ge 0$.

5.3.1. Criterion for Reversibility: Let $\mathbf{Q}$ be given, assume that the associated Markov process never explodes (cf. §4.3.1), and use $\{\mathbf{P}(t) : t \ge 0\}$ to denote the semigroup determined by $\mathbf{Q}$ (cf. Corollary 4.3.2). Our purpose
in this subsection is to show that if $\hat\mu$ is a probability vector for which the detailed balance condition
(5.3.1) $$(\hat\mu)_i(\mathbf{Q})_{ij} = (\hat\mu)_j(\mathbf{Q})_{ji} \quad\text{for all } (i,j) \in \mathbb{S}^2$$
holds relative to $\mathbf{Q}$, then the detailed balance condition
(5.3.2) $$(\hat\mu)_i(\mathbf{P}(t))_{ij} = (\hat\mu)_j(\mathbf{P}(t))_{ji} \quad\text{for all } t > 0 \text{ and } (i,j) \in \mathbb{S}^2$$
also holds. The proof that (5.3.1) implies (5.3.2) is trivial in the case when the rates $\mathfrak{R}$ are bounded. Indeed, all that we need to do in that case is first use a simple inductive argument to check that (5.3.1) implies $(\hat\mu)_i(\mathbf{Q}^n)_{ij} = (\hat\mu)_j(\mathbf{Q}^n)_{ji}$ for all $n \ge 0$ and then use the expression for $\mathbf{P}(t)$ given in (4.2.13). When the rates are unbounded, we will use the approximation procedure introduced in §4.3.1. Namely, refer to §4.3.1, and take $\mathbf{Q}^{(N)}$ corresponding to the choice of rates $\mathfrak{R}^{(N)}$ described there. Equivalently, take $(\mathbf{Q}^{(N)})_{ij}$ to be $(\mathbf{Q})_{ij}$ if $i \in F_N$ and $0$ when $i \notin F_N$. Using induction, one finds first that $\bigl((\mathbf{Q}^{(N)})^n\bigr)_{ij} = 0$ for all $n \ge 1$ and $i \notin F_N$ and second that $(\hat\mu)_i\bigl((\mathbf{Q}^{(N)})^n\bigr)_{ij} = (\hat\mu)_j\bigl((\mathbf{Q}^{(N)})^n\bigr)_{ji}$ for $(i,j) \in F_N^2$. Hence, if $\{\mathbf{P}^{(N)}(t) : t > 0\}$ is the semigroup determined by $\mathbf{Q}^{(N)}$, then, since the rates for $\mathbf{Q}^{(N)}$ are bounded, (4.2.13) shows that
(5.3.3) $$(\mathbf{P}^{(N)}(t))_{ij} = \delta_{i,j} \text{ if } i \notin F_N \quad\text{and}\quad (\hat\mu)_i(\mathbf{P}^{(N)}(t))_{ij} = (\hat\mu)_j(\mathbf{P}^{(N)}(t))_{ji} \text{ if } (i,j) \in F_N^2.$$
In particular, because, by (4.3.3), $(\mathbf{P}^{(N)}(t))_{ij} \nearrow (\mathbf{P}(t))_{ij}$, we are done.

As a consequence of the preceding, we now know that (5.3.1) implies that $\hat\mu$ is stationary for $\mathbf{P}(t)$. Hence, because we are assuming that $\mathbf{Q}$ is irreducible, the results in §4.4.2 and §4.4.3 allow us to identify $\hat\mu$ as the probability vector $\hat\pi = \hat\pi_{\mathbb{S}}$ introduced in Theorem 4.4.8 and discussed in §4.4.3, especially (4.4.11). To summarize, if $\hat\mu$ is a probability vector for which (5.3.1) holds, then $\hat\mu = \hat\pi$.

5.3.2. Convergence in $L^2(\hat\pi)$ for Bounded Rates: In view of the results just obtained, from now on we will be assuming that $\hat\pi$ is a probability vector for which (5.3.1) holds when $\hat\mu = \hat\pi$. In particular, this means that
(5.3.4) $$(\hat\pi)_i(\mathbf{P}(t))_{ij} = (\hat\pi)_j(\mathbf{P}(t))_{ji} \quad\text{for all } t > 0 \text{ and } (i,j) \in \mathbb{S}^2.$$
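The implication "detailed balance for $\mathbf{Q}$ implies detailed balance for $\mathbf{P}(t) = e^{t\mathbf{Q}}$" can be checked numerically on a small state space. Everything concrete below is an invented example, not the text's: the random conductance matrix, the reversing vector, and the plain Taylor-series exponential are assumptions made for illustration.

```python
import numpy as np

def expm_taylor(M, terms=80):
    # Matrix exponential via a plain Taylor series (fine after rescaling).
    out, term = np.eye(len(M)), np.eye(len(M))
    for k in range(1, terms):
        term = term @ M / k
        out += term
    return out

rng = np.random.default_rng(0)
n = 4
A = rng.uniform(0.1, 1.0, (n, n))
A = (A + A.T) / 2                      # symmetric off-diagonal "conductances"
mu = rng.uniform(0.5, 1.5, n)
mu /= mu.sum()                         # the would-be reversing measure

# Off-diagonal rates Q_ij = A_ij / mu_i put Q in detailed balance with mu.
Q = A / mu[:, None]
np.fill_diagonal(Q, 0.0)
np.fill_diagonal(Q, -Q.sum(axis=1))
Q /= np.abs(np.diag(Q)).max()          # rescale time to keep the series tame

# (5.3.1): mu_i Q_ij = mu_j Q_ji.
assert np.allclose(mu[:, None] * Q, (mu[:, None] * Q).T)

# (5.3.2): the same balance holds for P(t) = e^{tQ}, here at t = 0.7.
P = expm_taylor(0.7 * Q)
assert np.allclose(P.sum(axis=1), 1.0)
assert np.allclose(mu[:, None] * P, (mu[:, None] * P).T)
```

The check mirrors the inductive argument in the text: if $D = \operatorname{diag}(\mu)$, then $DQ$ symmetric forces $DQ^n$ symmetric for every $n$, hence $De^{tQ}$ symmetric term by term.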
Knowing (5.3.4), one is tempted to use the ideas in §5.2 to get an estimate on the rate, as measured by convergence in $L^2(\hat\pi)$, at which $\mathbf{P}(t)f$ tends to $\langle f\rangle_{\hat\pi}$. To be precise, first note that, for each $h > 0$,
$$\bigl(f,\mathbf{P}(h)f\bigr)_{\hat\pi} = \bigl\|\mathbf{P}(\tfrac{h}{2})f\bigr\|_{2,\hat\pi}^2 \ge 0,$$
and therefore (cf. (5.1.13), (5.2.6), and (5.2.7)) that
$$\beta(h) = 1 - \sup\bigl\{(f,\mathbf{P}(h)f)_{\hat\pi} : f \in L_0^2(\hat\pi) \text{ with } \|f\|_{2,\hat\pi} = 1\bigr\} = \inf\bigl\{\bigl(f,(\mathbf{I}-\mathbf{P}(h))f\bigr)_{\hat\pi} : \operatorname{Var}_{\hat\pi}(f) = 1\bigr\} = \inf\bigl\{\mathcal{E}^h(f,f) : \operatorname{Var}_{\hat\pi}(f) = 1\bigr\},$$
where
$$\mathcal{E}^h(f,f) = \frac12\sum_{i\ne j}(\hat\pi)_i(\mathbf{P}(h))_{ij}\bigl(f(j)-f(i)\bigr)^2$$
is the Dirichlet form for $\mathbf{P}(h)$ on $L^2(\hat\pi)$. Hence (cf. (5.1.12)), for any $t > 0$ and $n \in \mathbb{Z}^+$,
$$\|\mathbf{P}(t)f - \langle f\rangle_{\hat\pi}\|_{2,\hat\pi} \le \bigl(1 - \beta(\tfrac{t}{n})\bigr)^n\|f\|_{2,\hat\pi}.$$
To take the next step, we add the assumption that the rates $\mathfrak{R}$ are bounded. One can then use (4.2.8) to see that, uniformly in $f$ satisfying $\operatorname{Var}_{\hat\pi}(f) = 1$, $h^{-1}\mathcal{E}^h(f,f) \longrightarrow \mathcal{E}_{\mathbf{Q}}(f,f)$ as $h \searrow 0$, where
(5.3.5) $$\mathcal{E}_{\mathbf{Q}}(f,f) = \frac12\sum_{i\ne j}(\hat\pi)_i(\mathbf{Q})_{ij}\bigl(f(j)-f(i)\bigr)^2;$$
and from this it follows that the limit $\lim_{h\searrow 0} h^{-1}\beta(h)$ exists and is equal to
(5.3.6) $$\lambda \equiv \inf\bigl\{\mathcal{E}_{\mathbf{Q}}(f,f) : \operatorname{Var}_{\hat\pi}(f) = 1\bigr\}.$$
Thus, at least when the rates are bounded, we know that
(5.3.7) $$\|\mathbf{P}(t)f - \langle f\rangle_{\hat\pi}\|_{2,\hat\pi} \le e^{-\lambda t}\,\|f - \langle f\rangle_{\hat\pi}\|_{2,\hat\pi}.$$

5.3.3. $L^2(\hat\pi)$-Convergence Rate in General: When the rates are unbounded, the preceding line of reasoning is too naïve. In order to treat the unbounded case, one needs to make some additional observations, all of which have their origins in the following lemma.³

5.3.8 LEMMA. Given $f \in L^2(\hat\pi)$, the function $t \in [0,\infty) \mapsto \|\mathbf{P}(t)f\|_{2,\hat\pi}^2$ is continuous, nonincreasing, nonnegative, and convex. In particular, $t \in (0,\infty) \mapsto \frac{(f,(\mathbf{I}-\mathbf{P}(t))f)_{\hat\pi}}{t}$ is nonincreasing, and therefore
$$\lim_{h\searrow 0}\frac{\|f\|_{2,\hat\pi}^2 - \|\mathbf{P}(h)f\|_{2,\hat\pi}^2}{h}$$
exists in $[0,\infty]$.
3 If one knows spectral theory, especially Stone's Theorem, the rather cumbersome argument which follows can be avoided.
In fact (cf. (5.3.5)),
$$\lim_{h\searrow 0}\frac{\|f\|_{2,\hat\pi}^2 - \|\mathbf{P}(h)f\|_{2,\hat\pi}^2}{h} \ge 2\,\mathcal{E}_{\mathbf{Q}}(f,f).$$

PROOF: Let $f \in L^2(\hat\pi)$ be given, and (cf. the notation in §4.3.1) define $f_N = \mathbf{1}_{F_N} f$. Then, by Lebesgue's Dominated Convergence Theorem, $\|f - f_N\|_{2,\hat\pi} \to 0$ as $N \to \infty$. Because $\|\mathbf{P}(t)g\|_{2,\hat\pi} \le \|g\|_{2,\hat\pi}$ for all $t > 0$ and $g \in L^2(\hat\pi)$, we know that $\|\mathbf{P}(t)f_N\|_{2,\hat\pi}^2 \longrightarrow \|\mathbf{P}(t)f\|_{2,\hat\pi}^2$ uniformly in $t \in (0,\infty)$ as $N \to \infty$. Hence, by part (a) of Exercise 5.6.1 below, in order to prove the initial statement, it suffices to do so when $f$ vanishes off of $F_M$ for some $M$. Now let $f$ be a function which vanishes off of $F_M$, and set $\psi(t) = \|\mathbf{P}(t)f\|_{2,\hat\pi}^2$. At the same time, set $\psi_N(t) = \|\mathbf{P}^{(N)}(t)f\|_{2,\hat\pi}^2$ for $N \ge M$. Then, because, by (4.3.3), $\psi_N \to \psi$ uniformly on finite intervals, another application of part (a) in Exercise 5.6.1 allows us to restrict our attention to the $\psi_N$'s. That is, we will have proved that $\psi$ is a continuous, nonincreasing, nonnegative, convex function as soon as we show that each $\psi_N$ is. The nonnegativity requires no comment. To prove the other properties, we apply (4.2.7) to see that
$$\dot\psi_N(t) = \bigl(\mathbf{Q}^{(N)}\mathbf{P}^{(N)}(t)f,\,\mathbf{P}^{(N)}(t)f\bigr)_{\hat\pi} + \bigl(\mathbf{P}^{(N)}(t)f,\,\mathbf{Q}^{(N)}\mathbf{P}^{(N)}(t)f\bigr)_{\hat\pi}.$$
Next, by the first line of (5.3.3), we know that, because $N \ge M$, $\mathbf{P}^{(N)}(t)f$ vanishes off of $F_N$, and so, because $(\hat\pi)_i\mathbf{Q}^{(N)}_{ij} = (\hat\pi)_j\mathbf{Q}^{(N)}_{ji}$ for $(i,j) \in F_N^2$, the preceding becomes
$$\dot\psi_N(t) = 2\bigl(\mathbf{Q}^{(N)}\mathbf{P}^{(N)}(t)f,\,\mathbf{P}^{(N)}(t)f\bigr)_{\hat\pi}.$$
Similarly, we see that
$$\ddot\psi_N(t) = 2\bigl(\mathbf{Q}^{(N)}\mathbf{P}^{(N)}(t)f,\,\mathbf{Q}^{(N)}\mathbf{P}^{(N)}(t)f\bigr)_{\hat\pi} + 2\bigl(\mathbf{P}^{(N)}(t)f,\,(\mathbf{Q}^{(N)})^2\mathbf{P}^{(N)}(t)f\bigr)_{\hat\pi} = 4\bigl\|\mathbf{Q}^{(N)}\mathbf{P}^{(N)}(t)f\bigr\|_{2,\hat\pi}^2 \ge 0.$$
Clearly, the second of these proves the convexity of $\psi_N$. In order to see that the first implies that $\psi_N$ is nonincreasing, we will show that
$$(*)\qquad \bigl(g,-\mathbf{Q}^{(N)}g\bigr)_{\hat\pi} = \frac12\sum_{(i,j)\in F_N^2}(\hat\pi)_i(\mathbf{Q})_{ij}\bigl(g(j)-g(i)\bigr)^2 + \sum_{i\in F_N}(\hat\pi)_i V_i^{(N)}\,g(i)^2$$
if $g = 0$ off $F_N$, where $V_i^{(N)}$ is the total rate of jumps from $i$ to states outside of $F_N$; since the right-hand side of $(*)$ is nonnegative, the expression for $\dot\psi_N$ above then shows that $\dot\psi_N \le 0$. Moreover, by convexity, for $0 < s < t$ and $h > 0$,
$$\frac{\psi(0)-\psi(s)}{s} \ge \frac{\psi(s)-\psi(t)}{t-s} \ge \frac{\psi(t)-\psi(t+h)}{h}$$
for any $h > 0$. Moreover, because
$$\frac{\|\mathbf{P}(t)f\|_{2,\hat\pi}^2 - \|\mathbf{P}(h)\mathbf{P}(t)f\|_{2,\hat\pi}^2}{h} = \frac{\psi(t)-\psi(t+h)}{h},$$
the last part of Lemma 5.3.8 applied with $\mathbf{P}(t)f$ replacing $f$ yields the second asserted estimate. $\square$
With the preceding at hand, we can now complete our program. Namely, by writing
$$\|f\|_{2,\hat\pi}^2 - \|\mathbf{P}(t)f\|_{2,\hat\pi}^2 = \sum_{m=0}^{n-1}\Bigl(\bigl\|\mathbf{P}(\tfrac{mt}{n})f\bigr\|_{2,\hat\pi}^2 - \bigl\|\mathbf{P}(\tfrac{(m+1)t}{n})f\bigr\|_{2,\hat\pi}^2\Bigr),$$
we can use the result in Lemma 5.3.9 to obtain the estimate
$$\|f\|_{2,\hat\pi}^2 - \|\mathbf{P}(t)f\|_{2,\hat\pi}^2 \ge \frac{2t}{n}\sum_{m=1}^{n}\mathcal{E}_{\mathbf{Q}}\bigl(\mathbf{P}(\tfrac{mt}{n})f,\,\mathbf{P}(\tfrac{mt}{n})f\bigr).$$
Hence, if $\lambda$ is defined as in (5.3.6), then, for any $f \in L_0^2(\hat\pi)$,
$$\|f\|_{2,\hat\pi}^2 - \|\mathbf{P}(t)f\|_{2,\hat\pi}^2 \ge \frac{2\lambda t}{n}\sum_{m=1}^{n}\bigl\|\mathbf{P}(\tfrac{mt}{n})f\bigr\|_{2,\hat\pi}^2,$$
which, when $n \to \infty$, leads to
$$\|f\|_{2,\hat\pi}^2 - \|\mathbf{P}(t)f\|_{2,\hat\pi}^2 \ge 2\lambda\int_0^t \|\mathbf{P}(\tau)f\|_{2,\hat\pi}^2\,d\tau.$$
Finally, by Gronwall's inequality (cf. Exercise 5.6.4), the preceding yields the estimate $\|\mathbf{P}(t)f\|_{2,\hat\pi}^2 \le e^{-2\lambda t}\|f\|_{2,\hat\pi}^2$. After replacing a general $f \in L^2(\hat\pi)$ by $f - \langle f\rangle_{\hat\pi}$, we have now proved that (5.3.7) holds even when $\mathfrak{R}$ is unbounded.

5.3.4. Estimating $\lambda$: Proceeding in exactly the same way that we did in §5.2.2, we can estimate the $\lambda$ in (5.3.6) in the same way as we estimated $\beta_+$ there. Namely, we make a selection $\mathcal{P}$ consisting of paths $p(i,j)$, one for each pair $(i,j) \in \mathbb{S}^2\setminus D$ ($D$ being the diagonal), with the properties that if $p(i,j) = (k_0,\dots,k_n)$, then $k_0 = i$, $k_n = j$, and $p(i,j)$ is allowable in the sense that $(\mathbf{Q})_{k_{m-1}k_m} > 0$ for each $1 \le m \le n$. Then, just as in §5.2.2, we can say that
(5.3.10) $$\lambda \ge \frac{1}{W(\mathcal{P})} \quad\text{where } W(\mathcal{P}) = \sup_e \sum_{p\ni e}\sum_{e'\in p}\frac{(\hat\pi(p))_-\,(\hat\pi(p))_+}{\rho(e')},$$
where the supremum is over oriented edges $e = (k,\ell)$ with $(\mathbf{Q})_{k\ell} > 0$, the first sum is over $p \in \mathcal{P}$ in which the edge $e$ appears, the second sum is over edges $e'$ which appear in the path $p$, $(\hat\pi(p))_- = (\hat\pi)_i$ if the path $p$ starts at $i$, $(\hat\pi(p))_+ = (\hat\pi)_j$ if the path $p$ ends at $j$, and $\rho(e') = (\hat\pi)_k(\mathbf{Q})_{k\ell}$ if $e' = (k,\ell)$.
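In matrix terms, the $\lambda$ of (5.3.6) is the smallest non-zero eigenvalue of $-\mathbf{Q}$ viewed as a self-adjoint operator on $L^2(\hat\pi)$, and the exponential estimate (5.3.7) that it controls can be observed directly. The sketch below is an invented illustration (the birth–death rates and the Taylor-series exponential are assumptions, not from the text):

```python
import numpy as np

def expm(M, terms=60):
    # Plain Taylor-series matrix exponential; fine for small, well-scaled M.
    out, term = np.eye(len(M)), np.eye(len(M))
    for k in range(1, terms):
        term = term @ M / k
        out += term
    return out

# A small birth-death generator: reversible, with pi found from detailed balance.
up = np.array([1.0, 2.0, 1.0])    # rates i -> i+1
dn = np.array([2.0, 1.0, 2.0])    # rates i+1 -> i
n = 4
Q = np.zeros((n, n))
for i in range(n - 1):
    Q[i, i + 1], Q[i + 1, i] = up[i], dn[i]
np.fill_diagonal(Q, -Q.sum(axis=1))

pi = np.ones(n)
for i in range(n - 1):
    pi[i + 1] = pi[i] * up[i] / dn[i]
pi /= pi.sum()

# lambda of (5.3.6) = smallest non-zero eigenvalue of -Q on L^2(pi).
d = np.sqrt(pi)
S = (d[:, None] * Q) / d[None, :]          # symmetric because of reversibility
lam = np.sort(-np.linalg.eigvalsh((S + S.T) / 2))[1]

# Check the exponential decay of (5.3.7) on a mean-zero function.
f = np.array([1.0, -2.0, 0.5, 3.0])
g = f - pi @ f                             # subtract the mean <f>_pi
for t in (0.2, 0.7, 1.5):
    r = expm(t * Q) @ g
    assert np.sqrt(pi @ r**2) <= np.exp(-lam * t) * np.sqrt(pi @ g**2) + 1e-8
```

By the spectral theorem the inequality is exact for the eigenfunction attaining $\lambda$, so for small chains it is easy to see that (5.3.7) cannot be improved.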
5.4 Gibbs States and Glauber Dynamics
Loosely speaking, the physical principle underlying statistical mechanics can be summarized in the statement that, when a system is in equilibrium, states with lower energy are more likely than those with higher energy. In fact, J.W. Gibbs sharpened this statement by saying that the probability of a state $i$ will be proportional to $e^{-\frac{H(i)}{kT}}$, where $k$ is Boltzmann's constant, $T$ is temperature, and $H(i)$ is the energy of the system when it is in state $i$. For this reason, a distribution which assigns probabilities in this Gibbsian manner is called a Gibbs state. Since a Gibbs state is to be a model of equilibrium, it is only reasonable to ask what is the dynamics for which it is the equilibrium. From our point of view, this means that we should seek a Markov process for which the Gibbs state is the stationary distribution. Further, because dynamics in physics should be reversible, we should be looking for Markov processes which are reversible with respect to the Gibbs state, and, because such processes were introduced in this context by R. Glauber, we will call a Markov process which is reversible with respect to a Gibbs state a Glauber dynamics for that Gibbs state. In this section, we will give a rather simplistic treatment of Gibbs states and their associated Glauber dynamics.

5.4.1. Formulation: Throughout this section, we will be working in the following setting. As usual, $\mathbb{S}$ is either a finite or countably infinite space. On $\mathbb{S}$ there is given some "natural" background assignment $\nu \in (0,\infty)^{\mathbb{S}}$ of (not necessarily summable) weights, which should be thought of as a row vector. In many applications, $\nu$ is uniform: it assigns each $i$ weight $1$, but in other situations it is convenient to not have to assume that it is uniform. Next, there is a function $H : \mathbb{S} \to [0,\infty)$ (alias, the energy function) with the property that
(5.4.1) $$Z(\beta) = \sum_{i\in\mathbb{S}} e^{-\beta H(i)}(\nu)_i < \infty \quad\text{for each } \beta \in (0,\infty).$$
In the physics metaphor, $\beta = \frac{1}{kT}$ is, apart from Boltzmann's constant $k$, the reciprocal temperature, and physicists would call $\beta \rightsquigarrow Z(\beta)$ the partition function. Finally, for each $\beta \in (0,\infty)$, the Gibbs state $\gamma(\beta)$ is the probability vector given by
(5.4.2) $$\bigl(\gamma(\beta)\bigr)_i = \frac{e^{-\beta H(i)}(\nu)_i}{Z(\beta)}, \quad i \in \mathbb{S}.$$
From a physical standpoint, everything of interest is encoded in the partition function. For example, it is elementary to compute both the average and variance of the energy by taking logarithmic derivatives:
(5.4.3) $$\langle H\rangle_{\gamma(\beta)} = -\frac{d}{d\beta}\log Z(\beta) \quad\text{and}\quad \operatorname{Var}_{\gamma(\beta)}(H) = \frac{d^2}{d\beta^2}\log Z(\beta).$$
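The relations (5.4.1)–(5.4.3) are easy to experiment with numerically. In the sketch below the six-point energy landscape is invented for illustration; the mean and variance of $H$ under $\gamma(\beta)$ are recovered from finite-difference logarithmic derivatives of $Z$:

```python
import numpy as np

# A toy energy landscape on six states with uniform background weights nu.
H = np.array([0.0, 1.0, 0.5, 2.0, 1.5, 0.25])
nu = np.ones_like(H)

def Z(beta):                         # the partition function (5.4.1)
    return float(np.sum(nu * np.exp(-beta * H)))

def gibbs(beta):                     # the Gibbs state (5.4.2)
    return nu * np.exp(-beta * H) / Z(beta)

beta, eps = 1.3, 1e-5
g = gibbs(beta)

# (5.4.3): logarithmic derivatives of Z give the mean and variance of H.
dlogZ = (np.log(Z(beta + eps)) - np.log(Z(beta - eps))) / (2 * eps)
d2logZ = (np.log(Z(beta + eps)) - 2 * np.log(Z(beta))
          + np.log(Z(beta - eps))) / eps**2

assert abs(-dlogZ - g @ H) < 1e-8
assert abs(d2logZ - (g @ H**2 - (g @ H) ** 2)) < 1e-4
# As beta grows, gamma(beta) concentrates on the minimizers of H.
assert int(np.argmax(gibbs(40.0))) == int(np.argmin(H))
```

The last assertion is the low-temperature limit announced above: for large $\beta$, essentially all of the mass of $\gamma(\beta)$ sits on the states of minimal energy.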
The final ingredient is the description of the Glauber dynamics. For this purpose, we start with a matrix $\mathbf{A}$ all of whose entries are nonnegative and whose diagonal entries are $0$. Further, we assume that $\mathbf{A}$ is irreducible in the sense that
(5.4.4) $$\sup_{n\ge 0}\,(\mathbf{A}^n)_{ij} > 0 \quad\text{for all } (i,j) \in \mathbb{S}^2$$
and that it is reversible in the sense that
(5.4.5) $$(\nu)_i(\mathbf{A})_{ij} = (\nu)_j(\mathbf{A})_{ji} \quad\text{for all } (i,j) \in \mathbb{S}^2.$$
Finally, we insist that
(5.4.6) $$\sum_{j\in\mathbb{S}} e^{\beta H(j)}(\mathbf{A})_{ij} < \infty \quad\text{for each } i \in \mathbb{S} \text{ and } \beta > 0.$$
At this point there are many ways in which to construct a Glauber dynamics. However, for our purposes, the one which will serve us best is the one whose Q-matrix is given by
(5.4.7) $$(\mathbf{Q}(\beta))_{ij} = e^{-\beta(H(j)-H(i))^+}(\mathbf{A})_{ij} \text{ when } j \ne i \quad\text{and}\quad (\mathbf{Q}(\beta))_{ii} = -\sum_{j\ne i}(\mathbf{Q}(\beta))_{ij},$$
where $a^+ \equiv a \vee 0$ is the nonnegative part of the number $a \in \mathbb{R}$. Because
(5.4.8) $$\bigl(\gamma(\beta)\bigr)_i\bigl(\mathbf{Q}(\beta)\bigr)_{ij} = \frac{e^{-\beta(H(i)\vee H(j))}(\nu)_i(\mathbf{A})_{ij}}{Z(\beta)} \quad\text{for } j \ne i$$
is, by (5.4.5), symmetric in $(i,j)$, $\gamma(\beta)$ clearly is reversible for $\mathbf{Q}(\beta)$. There are many other possibilities, and the optimal choice is often dictated by special features of the situation under consideration. However, whatever choice is made, it should be made in such a way that, for each $\beta > 0$, $\mathbf{Q}(\beta)$ determines a Markov process which never explodes.

5.4.2. The Dirichlet Form: In this subsection we will modify the ideas developed in §5.2.3 to get a lower bound on
(5.4.9) $$\lambda_\beta = \inf\bigl\{\mathcal{E}_\beta(f,f) : \operatorname{Var}_\beta(f) = 1\bigr\} \quad\text{where } \mathcal{E}_\beta(f,f) = \frac12\sum_{j\ne i}\bigl(\gamma(\beta)\bigr)_i\bigl(\mathbf{Q}(\beta)\bigr)_{ij}\bigl(f(j)-f(i)\bigr)^2$$
and $\operatorname{Var}_\beta(f)$ is shorthand for $\operatorname{Var}_{\gamma(\beta)}(f)$, the variance of $f$ with respect to $\gamma(\beta)$. For this purpose, we introduce the notation
$$\mathrm{Elev}(p) = \max_{0\le m\le n} H(i_m) \quad\text{and}\quad e(p) = \mathrm{Elev}(p) - H(i_0) - H(i_n)$$
for a path $p = (i_0,\dots,i_n)$. Then, when $\mathbf{Q}(\beta)$ is given by (5.4.7), one sees that, when $p = (i_0,\dots,i_n)$,
$$w_\beta(p) \equiv \sum_{e'\in p}\frac{\bigl(\gamma(\beta)(p)\bigr)_-\bigl(\gamma(\beta)(p)\bigr)_+}{\rho_\beta(e')} \le \frac{e^{\beta e(p)}}{Z(\beta)}\,w(p), \quad\text{where } w(p) = \sum_{e'=(k,\ell)\in p}\frac{(\nu)_{i_0}(\nu)_{i_n}}{(\nu)_k(\mathbf{A})_{k\ell}}.$$
Hence, for any choice of paths $\mathcal{P}$, we know that (cf. (5.3.10))
$$W_\beta(\mathcal{P}) = \sup_e \sum_{p\ni e} w_\beta(p) \le Z(\beta)^{-1}e^{\beta E(\mathcal{P})}\,W(\mathcal{P}),$$
where
$$W(\mathcal{P}) = \sup_e \sum_{p\ni e} w(p) \quad\text{and}\quad E(\mathcal{P}) = \sup_{p\in\mathcal{P}} e(p),$$
and therefore that
(5.4.10) $$\lambda_\beta \ge \frac{Z(\beta)\,e^{-\beta E(\mathcal{P})}}{W(\mathcal{P})}.$$
On the one hand, it is clear that (5.4.10) gives information only when $W(\mathcal{P}) < \infty$. At the same time, it shows that, at least if one's interest is in large $\beta$'s, then it is important to choose $\mathcal{P}$ so that $E(\mathcal{P})$ is as small as possible. When $\mathbb{S}$ is finite, reconciling these two creates no problem. Indeed, the finiteness of $\mathbb{S}$ guarantees that $W(\mathcal{P})$ will be finite for every choice of allowable paths. In addition, finiteness allows one to find for each $(i,j)$ a path $p(i,j)$ which minimizes $\mathrm{Elev}(p)$ among allowable paths from $i$ to $j$, and clearly any $\mathcal{P}$ consisting of such paths will minimize $E(\mathcal{P})$. Of course, it is sensible to choose such a $\mathcal{P}$ so as to minimize $W(\mathcal{P})$ as well. In any case, whenever $\mathbb{S}$ is finite and $\mathcal{P}$ consists of paths $p(i,j)$ which minimize $\mathrm{Elev}(p)$ among paths $p$ between $i$ and $j$, $E(\mathcal{P})$ has a nice interpretation. Namely, think of $\mathbb{S}$ as being sites on a map and of $H$ as giving the altitude of the sites. That is, in this metaphor, $H(i)$ is the distance of $i$ "above sea level." Without loss in generality, we will assume that at least one site $k_0$ is at sea level: $H(k_0) = 0$.⁴ When such a $k_0$ exists, the metaphorical interpretation of $E(\mathcal{P})$ is as the least upper bound on the altitude a hiker must gain, no matter where he starts or what allowable path he chooses to follow, in order to reach the sea. To see this, first observe that if $p$ and $p'$ are a pair of allowable paths and if the end point of $p$ is the initial point of $p'$, then the path $q$ is allowable and $\mathrm{Elev}(q) = \mathrm{Elev}(p) \vee \mathrm{Elev}(p')$ when $q$ is obtained by concatenating $p$ and

⁴ If that is not already so, we can make it so by choosing $k_0$ to be a point at which $H$ takes its minimum value and replacing $H$ by $H - H(k_0)$. Such a replacement leaves both $\gamma(\beta)$ and $\mathbf{Q}(\beta)$ as well as the quantity on the right-hand side of (5.4.10) unchanged.
$p'$: if $p = (i_0,\dots,i_n)$ and $p' = (i_0',\dots,i_{n'}')$, then $q = (i_0,\dots,i_n,i_1',\dots,i_{n'}')$. Hence, for any $(i,j)$, $e\bigl(p(i,j)\bigr) \le e\bigl(p(i,k_0)\bigr) \vee e\bigl(p(j,k_0)\bigr)$, from which it should be clear that $E(\mathcal{P}) = \max_i e\bigl(p(i,k_0)\bigr)$. Finally, since, for each $i$, $e\bigl(p(i,k_0)\bigr) = H(\ell) - H(i)$, where $\ell$ is a highest point along the path $p(i,k_0)$, the explanation is complete. When $\mathbb{S}$ is infinite, the same interpretation is valid in various circumstances. For example, it applies when $H$ "tends to infinity at infinity" in the sense that $\{i : H(i) \le M\}$ is finite for each $M < \infty$.

When $\mathbb{S}$ is finite, we can show that, at least for large $\beta$, (5.4.10) is quite good. To be precise, we have the following result:

5.4.11 THEOREM. Assume that $\mathbb{S}$ is finite and that $\mathbf{Q}(\beta)$ is given by (5.4.7). Set $m = \min_{i\in\mathbb{S}} H(i)$ and $\mathbb{S}_0 = \{i : H(i) = m\}$, and let $e$ be the minimum value $E(\mathcal{P})$ takes as $\mathcal{P}$ runs over all selections of allowable paths. Then $e \ge -m$, and $e = -m$ if and only if for each $(i,j) \in \mathbb{S}\times\mathbb{S}_0$ there is an allowable path $p$ from $i$ to $j$ with $\mathrm{Elev}(p) = H(i)$. (See also Exercise 5.6.5 below.) More generally, whatever the value of $e$, there exist constants $0 < c_- \le c_+ < \infty$, which are independent of $H$, such that
$$c_-\,e^{-\beta(e+m)} \le \lambda_\beta \le c_+\,e^{-\beta(e+m)} \quad\text{for all } \beta \ge 0.$$
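Before turning to the proof, note that the constant $e$ is computable: following the hiking interpretation given above, minimizing $\mathrm{Elev}(p)$ over allowable paths to a sea-level site is a minimax ("bottleneck") path problem, solvable by a Dijkstra-type sweep. The landscape below is invented for illustration; site 5 sits in a small valley, so $e > 0$ here:

```python
import heapq
import itertools

# An invented landscape: H gives altitudes, `edges` the allowable moves of A.
H = {0: 0.0, 1: 2.0, 2: 1.0, 3: 3.0, 4: 1.5, 5: 0.5}
edges = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 4], 3: [1, 4], 4: [2, 3, 5], 5: [4]}

def min_elevation_to(k0):
    # Dijkstra variant: minimise the *maximum* altitude met on a path to k0.
    best = {k0: H[k0]}
    heap = [(H[k0], k0)]
    while heap:
        elev, i = heapq.heappop(heap)
        if elev > best.get(i, float("inf")):
            continue
        for j in edges[i]:
            cand = max(elev, H[j])        # Elev of the path extended to j
            if cand < best.get(j, float("inf")):
                best[j] = cand
                heapq.heappush(heap, (cand, j))
    return best

m_elev = min_elevation_to(0)              # site 0 is at "sea level": H(0) = 0
e = max(m_elev[i] - H[i] for i in H)      # the constant of Theorem 5.4.11

def brute(i, k0=0):
    # Exhaustive check over all simple paths from i to k0.
    if i == k0:
        return H[k0]
    rest = [k for k in H if k not in (i, k0)]
    best = float("inf")
    for r in range(len(rest) + 1):
        for mid in itertools.permutations(rest, r):
            p = (i, *mid, k0)
            if all(b in edges[a] for a, b in zip(p, p[1:])):
                best = min(best, max(H[k] for k in p))
    return best

assert all(abs(m_elev[i] - brute(i)) < 1e-12 for i in H)
assert e == 1.0   # only site 5 is in a valley: it must climb from 0.5 to 1.5
```

Every other site can walk downhill to $k_0 = 0$, so only the valley at site 5 contributes to $e$; this is exactly the situation the theorem says makes $\lambda_\beta$ decay like $e^{-\beta e}$.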
PROOF: Because neither $\gamma(\beta)$ nor $\mathbf{Q}(\beta)$ is changed if $H$ is replaced by $H - m$, whereas $e$ changes to $e + m$, we may and will assume that $m = 0$. Choose a collection $\mathcal{P} = \{p(i,j) : (i,j) \in \mathbb{S}^2\}$ of allowable paths so that, for each $(i,j)$, $e\bigl(p(i,j)\bigr)$ minimizes $e(p)$ over allowable paths from $i$ to $j$. Next choose and fix a $k_0 \in \mathbb{S}_0$. By the reasoning given above, $(*)\ e = \max_{i\in\mathbb{S}} e\bigl(p(i,k_0)\bigr)$. In particular, since $e\bigl(p(i,k_0)\bigr) = \mathrm{Elev}\bigl(p(i,k_0)\bigr) - H(i) \ge 0$ for all $i \in \mathbb{S}$, this proves that $e \ge 0$ and makes it clear that $e = 0$ if and only if $H(i) = \mathrm{Elev}\bigl(p(i,k_0)\bigr)$ for all $i \in \mathbb{S}$. Turning to the lower bound for $\lambda_\beta$, observe that, because $m = 0$, $Z(\beta) \ge (\nu)_{k_0} > 0$ and therefore, by (5.4.10), that we can take $c_- = \frac{(\nu)_{k_0}}{W(\mathcal{P})}$. Finally, to prove the upper bound, choose $\ell_0 \in \mathbb{S}\setminus\{k_0\}$ so that $e(p_0) = e$ when $p_0 = p(\ell_0,k_0)$, let $\Gamma$ be the set of $i \in \mathbb{S}$ with the property that either $i = k_0$ or $\mathrm{Elev}\bigl(p(i)\bigr) < \mathrm{Elev}(p_0)$ for the path $p(i) = p(i,k_0) \in \mathcal{P}$ from $i$ to $k_0$, and set $f = \mathbf{1}_\Gamma$. Then, because $k_0 \in \Gamma$ and $\ell_0 \notin \Gamma$,
$$\operatorname{Var}_\beta(f) \ge \bigl(\gamma(\beta)\bigr)_{k_0}\bigl(\gamma(\beta)\bigr)_{\ell_0} = \frac{(\nu)_{k_0}(\nu)_{\ell_0}\,e^{-\beta(H(k_0)+H(\ell_0))}}{Z(\beta)^2}.$$
At the same time,
$$\mathcal{E}_\beta(f,f) = \sum_{(i,j)\in\Gamma\times\Gamma^\complement}\bigl(\gamma(\beta)\bigr)_i\bigl(\mathbf{Q}(\beta)\bigr)_{ij} = \frac{1}{Z(\beta)}\sum_{(i,j)\in\Gamma\times\Gamma^\complement} e^{-\beta(H(i)\vee H(j))}(\nu)_i(\mathbf{A})_{ij}.$$
But if $i \in \Gamma\setminus\{k_0\}$, $j \notin \Gamma$, and $(\mathbf{A})_{ij} > 0$, then $H(j) \ge \mathrm{Elev}(p_0)$. To see this, consider the path $q$ obtained by going in one step from $j$ to $i$ and then following $p(i)$ from $i$ to $k_0$. Clearly, $q$ is an allowable path from $j$ to $k_0$, and therefore $\mathrm{Elev}(q) \ge \mathrm{Elev}\bigl(p(j)\bigr) \ge \mathrm{Elev}(p_0)$. But this means that
$$\mathrm{Elev}(p_0) \le \mathrm{Elev}\bigl(p(j)\bigr) \le \mathrm{Elev}(q) = \mathrm{Elev}\bigl(p(i)\bigr) \vee H(j),$$
which, together with $\mathrm{Elev}\bigl(p(i)\bigr) < \mathrm{Elev}(p_0)$, forces the conclusion that $H(j) \ge \mathrm{Elev}(p_0)$. Even easier is the observation that $H(j) \ge \mathrm{Elev}(p_0)$ if $j \notin \Gamma$ and $(\mathbf{A})_{k_0j} > 0$, since in that case the path $(j,k_0)$ is allowable and
$$H(j) = \mathrm{Elev}\bigl((j,k_0)\bigr) \ge \mathrm{Elev}\bigl(p(j)\bigr) \ge \mathrm{Elev}(p_0).$$
Hence, after plugging this into the preceding expression for $\mathcal{E}_\beta(f,f)$, we get
$$\mathcal{E}_\beta(f,f) \le \frac{e^{-\beta\,\mathrm{Elev}(p_0)}}{Z(\beta)}\sum_{(i,j)\in\Gamma\times\Gamma^\complement}(\nu)_i(\mathbf{A})_{ij},$$
which, because $e = \mathrm{Elev}(p_0) - H(k_0) - H(\ell_0)$, means that
$$\frac{\mathcal{E}_\beta(f,f)}{\operatorname{Var}_\beta(f)} \le \frac{Z(\beta)\,e^{-\beta e}}{(\nu)_{k_0}(\nu)_{\ell_0}}\sum_{(i,j)\in\Gamma\times\Gamma^\complement}(\nu)_i(\mathbf{A})_{ij}.$$
Finally, because $Z(\beta) \le \|\nu\|_v$, the upper bound follows. $\square$

5.5 Simulated Annealing
This concluding section deals with an application of the ideas in the preceding section. Namely, given a function $H : \mathbb{S} \to [0,\infty)$, we want to describe a procedure, variously known as the simulated annealing or the Metropolis algorithm, for locating a place where $H$ achieves its minimum value. In order to understand the intuition which underlies this procedure, let $\mathbf{A}$ be a matrix of the sort discussed in §5.4, assume that $0$ is the minimum value of $H$, set $\mathbb{S}_0 = \{i : H(i) = 0\}$, and think about dynamic procedures which would lead you from any initial point to $\mathbb{S}_0$ via paths which are allowable according to $\mathbf{A}$ (i.e., $(\mathbf{A})_{k\ell} > 0$ if $k$ and $\ell$ are successive points along the path). One procedure is based on the steepest descent strategy. That is, if one is at $k$, one moves to any one of the points $\ell$ for which $(\mathbf{A})_{k\ell} > 0$ and $H(\ell)$ is minimal if $H(\ell) \le H(k)$ for at least one such point, and one stays put if $H(\ell) > H(k)$ for every $\ell$ with $(\mathbf{A})_{k\ell} > 0$. This procedure works beautifully as long as you avoid, in the metaphor suggested at the end of §5.4, getting trapped in some "mountain valley." The point is that the steepest descent procedure is the most efficient strategy for getting to some local minimum of $H$. However, if that minimum is not global, then, in general, you will get stuck! Thus, if you are going to avoid this fate, occasionally you will have to go "up hill" even when
you have the option to go "down hill." However, unless you have a detailed a priori knowledge of the whole terrain, there is no way to know when you should decide to do so. For this reason, it may be best to abandon rationality and let the decision be made randomly. Of course, after a while, you should hope that you will have worked your way out of the mountain valleys and that a steepest descent strategy should become increasingly reasonable.

5.5.1. The Algorithm: In order to eliminate as many technicalities as possible, we will assume throughout that $\mathbb{S}$ is finite and has at least $2$ elements. Next, let $H : \mathbb{S} \to [0,\infty)$ be the function for which we want to locate a place where it achieves its minimum, and, without loss in generality, we will assume that $0$ is its minimum. Now take $\nu$ so that $(\nu)_i = 1$ for all $i \in \mathbb{S}$, and choose a matrix $\mathbf{A}$ so that $(\mathbf{A})_{ii} = 0$ for all $i \in \mathbb{S}$, $(\mathbf{A})_{ij} = (\mathbf{A})_{ji} \ge 0$ if $j \ne i$, and $\mathbf{A}$ is irreducible (cf. (5.4.4)) on $\mathbb{S}$. In practice, the selection of $\mathbf{A}$ should be made so that the evaluation of $H(j) - H(i)$ when $(\mathbf{A})_{ij} > 0$ is as "easy as possible." For example, if $\mathbb{S}$ has some sort of natural neighborhood structure with respect to which $\mathbb{S}$ is connected and the computation of $H(j) - H(i)$ when $j$ is a neighbor of $i$ requires very little time, then it is reasonable to take $\mathbf{A}$ so that $(\mathbf{A})_{ij} = 0$ unless $j \ne i$ is a neighbor of $i$.

Now define $\gamma(\beta)$ as in (5.4.2) and $\mathbf{Q}(\beta)$ as in (5.4.7). Clearly, $\gamma(0)$ is just the normalized, uniform distribution on $\mathbb{S}$: $(\gamma(0))_i = L^{-1}$, where $L = \#\mathbb{S} \ge 2$ is the number of elements in $\mathbb{S}$. On the one hand, as $\beta$ gets larger, $\gamma(\beta)$ becomes more concentrated on $\mathbb{S}_0$. More precisely, since $Z(\beta) \ge \#\mathbb{S}_0 \ge 1$,
(5.5.1) $$\sum_{i\notin\mathbb{S}_0}\bigl(\gamma(\beta)\bigr)_i \le \frac{1}{\#\mathbb{S}_0}\sum_{i\notin\mathbb{S}_0} e^{-\beta H(i)} \longrightarrow 0 \quad\text{as } \beta \to \infty.$$
On the other hand, as $\beta$ gets larger, Theorem 5.4.11 says that, at least when $e > 0$, $\lambda(\beta)$ will be getting smaller. Thus, we are confronted by a conflict. In view of the introductory discussion, this conflict between the virtues of taking $\beta$ large, which is tantamount to adopting an approximately steepest descent strategy, versus those of taking $\beta$ small, which is tantamount to keeping things fluid and thereby diminishing the danger of getting trapped, should be expected. Moreover, a resolution is suggested at the end of that discussion. Namely, in order to maximize the advantages of each, one should start with $\beta = 0$ and allow $\beta$ to increase with time.⁵ That is, we will make $\beta$ an increasing, continuous function $t \rightsquigarrow \beta(t)$ with $\beta(0) = 0$. In the interest of unencumbering our formulae, we will adopt the notation
$$Z(t) = Z\bigl(\beta(t)\bigr),\quad \gamma_t = \gamma\bigl(\beta(t)\bigr),\quad \langle\,\cdot\,\rangle_t = \langle\,\cdot\,\rangle_{\gamma_t},\quad \|\cdot\|_{2,t} = \|\cdot\|_{2,\gamma_t},\quad \operatorname{Var}_t = \operatorname{Var}_{\gamma_t},$$
$$\mathbf{Q}(t) = \mathbf{Q}\bigl(\beta(t)\bigr),\quad \mathcal{E}_t = \mathcal{E}_{\beta(t)},\quad\text{and}\quad \lambda_t = \lambda_{\beta(t)}.$$

⁵ Actually, there is good reason to doubt that monotonically increasing $\beta$ is the best way to go. Indeed, the name "simulated annealing" derives from the idea that what one wants to do is simulate the annealing process familiar to chemists, material scientists, skilled carpenters, and followers of Metropolis. Namely, what these people do is alternately heat and cool to achieve their goal, and there is reason to believe we should be following their example. However, I have chosen not to follow them on the unforgivable, but understandable, grounds that my analysis is capable of handling only the monotone case.
Because, in the physical model, $\beta$ is proportional to the reciprocal of temperature and $\beta$ increases with time, $t \rightsquigarrow \beta(t)$ is called the cooling schedule.

5.5.2. Construction of the Transition Probabilities: Because the Q-matrix here is time dependent, the associated transition probabilities will be time-inhomogeneous. Thus, instead of $t \rightsquigarrow \mathbf{Q}(t)$ determining a one-parameter family of transition probability matrices, for each $s \in [0,\infty)$ it will determine a map $t \rightsquigarrow \mathbf{P}(s,t)$ from $[s,\infty)$ into transition probability matrices by the time-inhomogeneous Kolmogorov forward equation
(5.5.2) $$\frac{d}{dt}\mathbf{P}(s,t) = \mathbf{P}(s,t)\,\mathbf{Q}(t) \text{ on } (s,\infty) \quad\text{with}\quad \mathbf{P}(s,s) = \mathbf{I}.$$
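The scheme of §5.5.1 — proposals from $\mathbf{A}$, acceptance probability $e^{-\beta(t)(H(j)-H(i))^+}$ as in (5.4.7), and a schedule $t \rightsquigarrow \beta(t)$ — can be sketched as a discrete-time Metropolis-style simulation. Everything concrete below (the landscape, the logarithmic schedule, the step count) is an assumption made for illustration, not the text's continuous-time prescription:

```python
import math
import random

# Invented landscape (same shape as the hiking metaphor of Section 5.4).
H = [0.0, 2.0, 1.0, 3.0, 1.5, 0.5]
nbrs = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 4], 3: [1, 4], 4: [2, 3, 5], 5: [4]}

def anneal(start, steps=20000, seed=7):
    rng = random.Random(seed)
    i = best = start
    for t in range(steps):
        beta = math.log(2 + t)                # a slowly growing schedule
        j = rng.choice(nbrs[i])               # propose a neighbour via A
        # Accept with probability exp(-beta * (H(j) - H(i))^+), as in (5.4.7).
        if rng.random() < math.exp(-beta * max(H[j] - H[i], 0.0)):
            i = j
            if H[i] < H[best]:
                best = i
    return i, best

final_state, best_seen = anneal(start=3)
assert H[best_seen] == min(H)
```

The logarithmic growth of $\beta$ mirrors the tension recorded in (5.5.1) and Theorem 5.4.11: cooling too fast freezes the walk in a valley, while cooling this slowly keeps enough uphill moves available to escape.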
Although (5.5.2) is not exactly covered by our earlier analysis of Kolmogorov equations, it nearly is. To see this, we solve (5.5.2) via an approximation procedure in which $\mathbf{Q}(t)$ is replaced on the right-hand side by
$$\mathbf{Q}^{(N)}(t) = \mathbf{Q}\bigl([t]_N\bigr) \quad\text{where } [t]_N = \frac{m}{N} \text{ for } t \in \Bigl[\frac{m}{N},\frac{m+1}{N}\Bigr).$$
The solution $t \rightsquigarrow \mathbf{P}^{(N)}(s,t)$ to the resulting equation is then given by the prescription $\mathbf{P}^{(N)}(s,s) = \mathbf{I}$ and
$$\mathbf{P}^{(N)}(s,t) = \mathbf{P}^{(N)}\bigl(s,\,s\vee[t]_N\bigr)\,e^{(t - s\vee[t]_N)\mathbf{Q}([t]_N)} \quad\text{for } t > s.$$
As this construction makes obvious, $\mathbf{P}^{(N)}(s,t)$ is a transition probability matrix for each $N \ge 1$ and $t \ge s$, and $(s,t) \rightsquigarrow \mathbf{P}^{(N)}(s,t)$ is continuous. Moreover,
$$\bigl\|\mathbf{P}^{(N)}(s,t) - \mathbf{P}^{(M)}(s,t)\bigr\|_{u,v} \le \int_s^t \bigl\|\mathbf{Q}([\tau]_N) - \mathbf{Q}([\tau]_M)\bigr\|_{u,v}\bigl\|\mathbf{P}^{(N)}(s,\tau)\bigr\|_{u,v}\,d\tau + \int_s^t \bigl\|\mathbf{Q}([\tau]_M)\bigr\|_{u,v}\bigl\|\mathbf{P}^{(N)}(s,\tau) - \mathbf{P}^{(M)}(s,\tau)\bigr\|_{u,v}\,d\tau.$$
But
$$\bigl\|\mathbf{Q}([\tau]_N) - \mathbf{Q}([\tau]_M)\bigr\|_{u,v} \le \|\mathbf{A}\|_{u,v}\|H\|_u\,\bigl|\beta([\tau]_N) - \beta([\tau]_M)\bigr|,$$
and so
$$\bigl\|\mathbf{P}^{(N)}(s,t) - \mathbf{P}^{(M)}(s,t)\bigr\|_{u,v} \le \|\mathbf{A}\|_{u,v}\|H\|_u\int_s^t \bigl|\beta([\tau]_N) - \beta([\tau]_M)\bigr|\,d\tau + \|\mathbf{A}\|_{u,v}\int_s^t \bigl\|\mathbf{P}^{(N)}(s,\tau) - \mathbf{P}^{(M)}(s,\tau)\bigr\|_{u,v}\,d\tau.$$
Hence, after an application of Gronwall's inequality, we find that
$$\sup_{0\le s\le t\le T}\bigl\|\mathbf{P}^{(N)}(s,t) - \mathbf{P}^{(M)}(s,t)\bigr\|_{u,v} \le \|\mathbf{A}\|_{u,v}\|H\|_u\,e^{\|\mathbf{A}\|_{u,v}T}\int_0^T \bigl|\beta([\tau]_N) - \beta([\tau]_M)\bigr|\,d\tau.$$
Because $\tau \rightsquigarrow \beta(\tau)$ is continuous, this proves that the sequence $\{\mathbf{P}^{(N)}(s,t) : N \ge 1\}$ is Cauchy convergent in the sense that, for each $T > 0$, the supremum above tends to $0$ as $M, N \to \infty$. As a consequence, we know that there exists a continuous $(s,t) \rightsquigarrow \mathbf{P}(s,t)$ to which the $\mathbf{P}^{(N)}(s,t)$'s converge with respect to $\|\cdot\|_{u,v}$ uniformly on finite intervals. In particular, for each $t \ge s$, $\mathbf{P}(s,t)$ is a transition probability matrix and $t \in [s,\infty) \mapsto \mathbf{P}(s,t)$ is a continuous solution to
$$\mathbf{P}(s,t) = \mathbf{I} + \int_s^t \mathbf{P}(s,\tau)\,\mathbf{Q}(\tau)\,d\tau, \quad t \in [s,\infty),$$
which is the equivalent, integrated form of (5.5.2). Furthermore, if $t \in [s,\infty) \mapsto \mu_t \in \mathbf{M}_1(\mathbb{S})$ is continuously differentiable, then
(5.5.3) $$\dot\mu_t = \mu_t\mathbf{Q}(t) \text{ for } t \in [s,\infty) \iff \mu_t = \mu_s\mathbf{P}(s,t) \text{ for } t \in [s,\infty).$$
Since the "if" assertion is trivial, we turn to the "only if" statement. Thus, suppose that $t \in [s,\infty) \mapsto \mu_t \in \mathbf{M}_1(\mathbb{S})$ satisfying $\dot\mu_t = \mu_t\mathbf{Q}(t)$ is given, and set $\omega_t = \mu_t - \mu_s\mathbf{P}(s,t)$. Then
$$\omega_t = \int_s^t \omega_\tau\,\mathbf{Q}(\tau)\,d\tau,$$
and so $\|\omega_t\|_v \le \int_s^t \|\omega_\tau\|_v\,\|\mathbf{Q}(\tau)\|_{u,v}\,d\tau$. Hence, after another application of Gronwall's inequality, we see that $\omega_t = \mathbf{0}$ for all $t \ge s$. Of course, by applying this uniqueness result when $\mu_t = \boldsymbol{\delta}_i\mathbf{P}(s,t)$ for each $i \in \mathbb{S}$, we learn that $(s,t) \rightsquigarrow \mathbf{P}(s,t)$ is the one and only solution to (5.5.2). In addition, it leads to the following time-inhomogeneous version of the Chapman–Kolmogorov equation:
(5.5.4) $$\mathbf{P}(r,t) = \mathbf{P}(r,s)\,\mathbf{P}(s,t) \quad\text{for } 0 \le r \le s \le t.$$
EXERCISE 5.6.3. Let $\mu \in \mathbf{M}_1(\mathbb{S})$ and suppose that $\psi : [0,\infty) \to \mathbb{R}$ is a continuous, convex function. Show that, for any $f : \mathbb{S} \to [0,\infty)$,
(5.6.3) $$\psi\bigl(\langle f\rangle_\mu\bigr) \le \langle\psi\circ f\rangle_\mu,$$
where the meaning of the left hand side when $\langle f\rangle_\mu = \infty$ is given by taking $\psi(\infty) = \lim_{t\to\infty}\psi(t)$. The inequality (5.6.3) is an example of a more general statement known as Jensen's inequality (cf. Theorem 6.1.1 in [8]).

(a) Use induction on $n \ge 2$ to show that
$$\psi\Bigl(\sum_{k=1}^n \theta_k x_k\Bigr) \le \sum_{k=1}^n \theta_k\,\psi(x_k) \quad\text{for } (\theta_1,\dots,\theta_n) \in [0,1]^n \text{ with } \sum_{k=1}^n \theta_k = 1 \text{ and } (x_1,\dots,x_n) \in [0,\infty)^n.$$

(b) Let $\{F_N\}_1^\infty$ be a nondecreasing exhaustion of $\mathbb{S}$ by finite sets satisfying $\mu(F_N) = \sum_{i\in F_N}(\mu)_i > 0$, apply part (a) to see that
$$\psi\Bigl(\frac{\sum_{i\in F_N} f(i)(\mu)_i}{\mu(F_N)}\Bigr) \le \frac{\sum_{i\in F_N}\psi\bigl(f(i)\bigr)(\mu)_i}{\mu(F_N)}$$
for each $N$, and get the asserted result after letting $N \to \infty$.

(c) As an application of (5.6.3), show that, for any $0 < p \le q < \infty$ and $f : \mathbb{S} \to [0,\infty)$, $\langle f^p\rangle_\mu^{\frac1p} \le \langle f^q\rangle_\mu^{\frac1q}$.
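Both (5.6.3) and the power-mean monotonicity of part (c) are easy to sanity-check numerically. The sketch below (random $\mu$ and $f$, invented for illustration) verifies (5.6.3) for the convex choice $\psi(t) = t^2$ and the monotonicity of $p \mapsto \langle f^p\rangle_\mu^{1/p}$:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8
mu = rng.uniform(0.1, 1.0, n)
mu /= mu.sum()                      # a probability vector on an 8-point space
f = rng.uniform(0.0, 5.0, n)        # f : S -> [0, infinity)

# (5.6.3) with the convex choice psi(t) = t^2: psi(<f>_mu) <= <psi o f>_mu.
assert (mu @ f) ** 2 <= mu @ f**2 + 1e-12

# Part (c): p -> <f^p>_mu^(1/p) is nondecreasing on (0, infinity).
def mean_p(p):
    return float((mu @ f**p) ** (1.0 / p))

vals = [mean_p(p) for p in (0.3, 0.5, 1.0, 2.0, 3.7)]
assert all(a <= b + 1e-12 for a, b in zip(vals, vals[1:]))
```

Part (c) follows from (5.6.3) with $\psi(t) = t^{q/p}$ applied to $f^p$, which is exactly the pattern the assertions above exercise.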
EXERCISE 5.6.4. Gronwall's is an inequality which has many forms, the most elementary of which states that if $u : [0,T] \to [0,\infty)$ is a continuous function which satisfies
$$u(t) \le A + B\int_0^t u(\tau)\,d\tau \quad\text{for } t \in [0,T],$$
then $u(t) \le Ae^{Bt}$ for $t \in [0,T]$. Prove this form of Gronwall's inequality.

Hint: Set $U(t) = \int_0^t u(\tau)\,d\tau$, show that $\dot U(t) \le A + BU(t)$, and conclude that $U(t) \le \frac{A}{B}\bigl(e^{Bt} - 1\bigr)$.
EXERCISE 5.6.5. Define $e$ as in Theorem 5.4.11, and show that $e = -m$ if and only if for each $(i,j) \in \mathbb{S}\times\mathbb{S}_0$ there is an allowable path $(i_0,\dots,i_n)$ starting at $i$ and ending at $j$ along which $H$ is nonincreasing. That is, $i_0 = i$, $i_n = j$, and, for each $1 \le m \le n$, $(\mathbf{A})_{i_{m-1}i_m} > 0$ and $H(i_m) \le H(i_{m-1})$.

EXERCISE 5.6.6. This exercise deals with the material in §5.1.3 and demonstrates that, with the exception of (5.1.10), more or less everything in that section extends to general irreducible, positive recurrent $\mathbf{P}$'s, whether or not they are reversible. Again let $\pi = \pi_{\mathbb{S}}$ be the unique $\mathbf{P}$-stationary probability vector. In addition, for the exercise which follows, it will be important to consider the space $L^2(\pi;\mathbb{C})$ of complex-valued functions which are square-integrable with respect to $\pi$.

Take $\mathbb{S} = \{-1,1\}^N$ and, for $w \in \mathbb{S}$ and $1 \le k \le N$, let $\hat w^k \in \mathbb{S}$ be obtained from $w$ by reversing the sign of its $k$th coordinate. Given $\mu \in \mathbf{M}_1(\mathbb{S})$ with $(\mu)_w > 0$ for all $w \in \mathbb{S}$, define $\mathbf{Q}^\mu$ so that $(\mathbf{Q}^\mu)_{w\eta} = 0$ unless $\eta \in \{w,\hat w^1,\dots,\hat w^N\}$.

(a) Check that $\mathbf{Q}^\mu$ is an irreducible Q-matrix on $\mathbb{S}$ and that the detailed balance condition $(\mu)_w\mathbf{Q}^\mu_{w\eta} = (\mu)_\eta\mathbf{Q}^\mu_{\eta w}$ holds.

(b) Let $\lambda$ be the uniform probability vector on $\mathbb{S}$; that is, $(\lambda)_w = 2^{-N}$ for each $w \in \mathbb{S}$. Show that $\mathcal{E}^\lambda(f,f) \le m_\mu\,\mathcal{E}^\mu(f,f)$ for a constant $m_\mu$ depending only on $\mu$.

(c) For each $\delta > 0$, $N \in \mathbb{Z}^+$, and $w \in \{-1,1\}^N$.
CHAPTER 6
Some Mild Measure Theory
On Easter 1933, A.N. Kolmogorov published Foundations of Probability, a book which set out the foundations on which most of probability theory has rested ever since. Because Kolmogorov's model is given in terms of Lebesgue's theory of measures and integration, its full appreciation requires a thorough understanding of that theory. Thus, although it is far too sketchy to provide anything approaching a thorough understanding, this chapter is an attempt to provide an introduction to Lebesgue's ideas and Kolmogorov's application of them in his model of probability theory.

6.1 A Description of Lebesgue's Measure Theory
In this section, we will introduce the terminology used in Lebesgue's theory. However, we will systematically avoid giving rigorous proofs of any hard results. There are many places in which these proofs can be found, one of them being [8].

6.1.1. Measure Spaces: The essential components in measure theory are a set $\Omega$, the space, a collection $\mathcal{F}$ of subsets of $\Omega$, the collection of measurable subsets, and a function $\mu$ from $\mathcal{F}$ into $[0,\infty]$, called the measure. Being a space on which a measure might exist, the pair $(\Omega,\mathcal{F})$ is called a measurable space, and when a measurable space $(\Omega,\mathcal{F})$ comes equipped with a measure $\mu$, the triple $(\Omega,\mathcal{F},\mu)$ is called a measure space. In order to avoid stupid trivialities, we will always assume that the space $\Omega$ is nonempty. Also, we will assume that the collection $\mathcal{F}$ of measurable sets forms a $\sigma$-algebra over $\Omega$:
$$\Omega \in \mathcal{F}, \qquad A \in \mathcal{F} \implies A^\complement \equiv \Omega\setminus A \in \mathcal{F}, \qquad \{A_n\}_1^\infty \subseteq \mathcal{F} \implies \bigcup_{n=1}^\infty A_n \in \mathcal{F}.$$
It is important to emphasize that, as distinguished from point-set topology (i.e., the description of open and closed sets), only finite or countable set-theoretic operations are permitted in measure theory.
6 SOME M ILD MEASURE THEORY
146
Using elementary set-theoretic manipulations, it is easy to check that
$$A, B \in \mathcal{F} \implies A\cap B \in \mathcal{F} \text{ and } B\setminus A \in \mathcal{F}, \qquad \{A_n\}_1^\infty \subseteq \mathcal{F} \implies \bigcap_{n=1}^\infty A_n \in \mathcal{F}.$$
Finally, the measure μ will be a function which assigns¹ 0 to ∅ and is countably additive in the sense that
(6.1.1)   {Aₙ}₁^∞ ⊆ 𝓕 and Aₘ ∩ Aₙ = ∅ when m ≠ n
          ⟹ μ( ∪₁^∞ Aₙ ) = Σ₁^∞ μ(Aₙ).

In particular, for A, B ∈ 𝓕,

(6.1.2)   A ⊆ B ⟹ μ(B) = μ(A) + μ(B \ A) ≥ μ(A)
          μ(A ∩ B) < ∞ ⟹ μ(A ∪ B) = μ(A) + μ(B) − μ(A ∩ B).

The first line comes from writing B as the union of the disjoint sets A and B \ A, and the second line comes from writing A ∪ B as the union of the disjoint sets A and B \ (A ∩ B) and then applying the first line to B and A ∩ B. The finiteness condition is needed when one moves the term μ(A ∩ B) to the left hand side in μ(B) = μ(A ∩ B) + μ(B \ (A ∩ B)). That is, one wants to avoid having to subtract ∞ from ∞.

When the set Ω is finite or countable, there is no problem constructing measure spaces. Namely, one can take 𝓕 = {A : A ⊆ Ω}, the set of all subsets of Ω, and make any assignment of ω ∈ Ω ↦ μ({ω}) ∈ [0, ∞], at which point countable additivity demands that we take

μ(A) = Σ_{ω∈A} μ({ω})   for A ⊆ Ω.
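For a finite or countable Ω this recipe is directly implementable. A small Python sketch (the names are mine, not the text's):

```python
def make_measure(point_masses):
    """Measure on a countable space determined by point masses:
    mu(A) = sum of mu({w}) over w in A, as countable additivity demands."""
    def mu(A):
        return sum(point_masses[w] for w in A)
    return mu

# any assignment of values in [0, infinity] to the points works
mu = make_measure({"a": 0.5, "b": 1.5, "c": 2.0})
```

Countable additivity is then automatic: the measure of a disjoint union is the corresponding sum of the point masses.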
However, when Ω is uncountable, it is far from obvious that interesting measures can be constructed on a nontrivial collection of measurable sets. Indeed, it is reasonable to think that Lebesgue's most significant achievement was his construction of a measure space in which Ω = ℝ, 𝓕 is a σ-algebra of which every interval (open, closed, or semi-closed) is an element, and μ assigns each interval its length (i.e., μ(I) = b − a if I is an interval whose right and left end points are b and a).

Although the terminology is misleading, a measure space (Ω, 𝓕, μ) is said to be finite if μ(Ω) < ∞. That is, the "finiteness" here is not determined by
¹ In view of additivity, it is clear that either μ(∅) = 0 or μ(A) = ∞ for all A ∈ 𝓕. Indeed, by additivity, μ(∅) = μ(∅ ∪ ∅) = 2μ(∅), and therefore μ(∅) is either 0 or ∞. Moreover, if μ(∅) = ∞, then μ(A) = μ(A ∪ ∅) = μ(A) + μ(∅) = ∞ for all A ∈ 𝓕.
the size of Ω but instead by how large Ω looks to μ. Even if a measure space is not finite, it may be decomposable into a countable number of finite pieces, in which case it is said to be σ-finite. Equivalently, (Ω, 𝓕, μ) is σ-finite if there exists […] (μ)ᵢ > 0 for all i ∈ 𝕊. Next observe that, under this assumption, the hypotheses become supₙ |fₙ(i)| ≤ g(i) and fₙ(i) → f(i) for each i ∈ 𝕊. In particular, |f| ≤ g. Hence, by considering fₙ − f instead of {fₙ}₁^∞ and replacing g by 2g, we may and will assume that f ≡ 0. Now let ε > 0 be given, and choose L so that Σ_{i>L} g(i)(μ)ᵢ < ε. Then
6.1.8. Fubini's Theorem: Fubini's Theorem deals with products of measure spaces. Namely, given measurable spaces (Ω₁, 𝓕₁) and (Ω₂, 𝓕₂), the product of 𝓕₁ and 𝓕₂ is the σ-algebra 𝓕₁ × 𝓕₂ over Ω₁ × Ω₂ which is generated by the set {A₁ × A₂ : A₁ ∈ 𝓕₁ & A₂ ∈ 𝓕₂} of measurable rectangles. An important technical fact about this construction is that, if f is a measurable function on (Ω₁ × Ω₂, 𝓕₁ × 𝓕₂), then, for each ω₁ ∈ Ω₁ and ω₂ ∈ Ω₂, ω₂ ↦ f(ω₁, ω₂) and ω₁ ↦ f(ω₁, ω₂) are measurable functions on, respectively, (Ω₂, 𝓕₂) and (Ω₁, 𝓕₁).

In the following statement, it is important to emphasize exactly which variable is being integrated. For this reason, we use the more detailed notation ∫_Ω f(ω) μ(dω) instead of the more abbreviated ∫ f dμ.
6.1.15 THEOREM. (Fubini)⁷ Let (Ω₁, 𝓕₁, μ₁) and (Ω₂, 𝓕₂, μ₂) be a pair of σ-finite measure spaces, and set Ω = Ω₁ × Ω₂ and 𝓕 = 𝓕₁ × 𝓕₂. Then there is a unique measure μ = μ₁ × μ₂ on (Ω, 𝓕) with the property that μ(A₁ × A₂) = μ₁(A₁)μ₂(A₂) for all A₁ ∈ 𝓕₁ and A₂ ∈ 𝓕₂. Moreover, if f is a nonnegative, measurable function on (Ω, 𝓕), then both

ω₁ ↦ ∫_{Ω₂} f(ω₁, ω₂) μ₂(dω₂)   and   ω₂ ↦ ∫_{Ω₁} f(ω₁, ω₂) μ₁(dω₁)
⁷ Although this theorem is usually attributed to Fubini, it seems that Tonelli deserves, but seldom receives, a good deal of credit for it.
are measurable functions, and

∫_{Ω₁} ( ∫_{Ω₂} f(ω₁, ω₂) μ₂(dω₂) ) μ₁(dω₁) = ∫_Ω f dμ = ∫_{Ω₂} ( ∫_{Ω₁} f(ω₁, ω₂) μ₁(dω₁) ) μ₂(dω₂).

Finally, for any integrable function f on (Ω, 𝓕, μ),

A₁ = { ω₁ : ∫_{Ω₂} |f(ω₁, ω₂)| μ₂(dω₂) < ∞ } ∈ 𝓕₁,
A₂ = { ω₂ : ∫_{Ω₁} |f(ω₁, ω₂)| μ₁(dω₁) < ∞ } ∈ 𝓕₂,

the functions

ω₁ ↦ f₁(ω₁) := 1_{A₁}(ω₁) ∫_{Ω₂} f(ω₁, ω₂) μ₂(dω₂)   and
ω₂ ↦ f₂(ω₂) := 1_{A₂}(ω₂) ∫_{Ω₁} f(ω₁, ω₂) μ₁(dω₁)

are integrable, and

∫_{Ω₁} f₁ dμ₁ = ∫_Ω f dμ = ∫_{Ω₂} f₂ dμ₂.
In the case when Ω₁ and Ω₂ are countable, all this becomes very easy. Namely, in the notation which we used in §6.1.6, the measure μ₁ × μ₂ corresponds to the row vector μ₁ × μ₂ ∈ [0, ∞)^{𝕊₁×𝕊₂} given by (μ₁ × μ₂)_{(i₁,i₂)} = (μ₁)_{i₁}(μ₂)_{i₂}, and so, by (6.1.14), Fubini's Theorem reduces to the statement that

Σ_{i₁∈𝕊₁} ( Σ_{i₂∈𝕊₂} a_{i₁i₂} ) = Σ_{(i₁,i₂)∈𝕊₁×𝕊₂} a_{i₁i₂} = Σ_{i₂∈𝕊₂} ( Σ_{i₁∈𝕊₁} a_{i₁i₂} )

whenever {a_{i₁i₂} : (i₁, i₂) ∈ 𝕊₁ × 𝕊₂} ⊆ (−∞, ∞] satisfies either a_{i₁i₂} ≥ 0 for all (i₁, i₂) or Σ_{(i₁,i₂)} |a_{i₁i₂}| < ∞. In proving this, we may and will assume that 𝕊₁ = ℤ⁺ = 𝕊₂ throughout and will start with the case when a_{i₁i₂} ≥ 0. Given any pair (n₁, n₂) ∈ (ℤ⁺)²,

Σ_{(i₁,i₂)∈𝕊₁×𝕊₂} a_{i₁i₂} ≥ Σ_{i₁=1}^{n₁} ( Σ_{i₂=1}^{n₂} a_{i₁i₂} ).

Hence, by first letting n₁ → ∞ and then letting n₂ → ∞, we arrive at

Σ_{(i₁,i₂)∈𝕊₁×𝕊₂} a_{i₁i₂} ≥ Σ_{i₁∈𝕊₁} ( Σ_{i₂∈𝕊₂} a_{i₁i₂} ).

Similarly, for any n ∈ ℤ⁺,

Σ_{(i₁,i₂): i₁∨i₂ ≤ n} a_{i₁i₂} = Σ_{i₂=1}^{n} ( Σ_{i₁=1}^{n} a_{i₁i₂} ) ≤ Σ_{i₂∈𝕊₂} ( Σ_{i₁∈𝕊₁} a_{i₁i₂} ),

and so, after letting n → ∞, we get the opposite inequality. Next, when Σ_{(i₁,i₂)} |a_{i₁i₂}| < ∞, then Σ_{i₁∈𝕊₁} |a_{i₁i₂}| < ∞ for all i₂ ∈ 𝕊₂, and so

Σ_{(i₁,i₂)∈𝕊₁×𝕊₂} a_{i₁i₂} = Σ_{(i₁,i₂)∈𝕊₁×𝕊₂} a⁺_{i₁i₂} − Σ_{(i₁,i₂)∈𝕊₁×𝕊₂} a⁻_{i₁i₂}
= Σ_{i₂∈𝕊₂} ( Σ_{i₁∈𝕊₁} a⁺_{i₁i₂} ) − Σ_{i₂∈𝕊₂} ( Σ_{i₁∈𝕊₁} a⁻_{i₁i₂} ) = Σ_{i₂∈𝕊₂} ( Σ_{i₁∈𝕊₁} a_{i₁i₂} ).

Finally, after reversing the roles of 1 and 2, we get the relation with the order of summation reversed.
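The countable case can be sanity-checked numerically. The sketch below (my own illustration, not from the text) truncates the nonnegative array a_{i₁i₂} = 2^{−i₁}·3^{−i₂} and confirms that the two iterated sums agree, as the identity above asserts:

```python
# a_{i1 i2} = 2^{-i1} * 3^{-i2} >= 0, truncated to 1 <= i1, i2 < N
N = 200
a = [[2.0 ** -i1 * 3.0 ** -i2 for i2 in range(1, N)] for i1 in range(1, N)]

sum_i1_last = sum(sum(row) for row in a)                  # inner sum over i2
sum_i2_last = sum(sum(a[i1][i2] for i1 in range(N - 1))   # inner sum over i1
                  for i2 in range(N - 1))

# both approximate (sum_i1 2^-i1) * (sum_i2 3^-i2) = 1 * (1/2) = 0.5
```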
6.2 Modeling Probability

To understand how these considerations relate to probability theory, think about the problem of modeling the tosses of a fair coin. When the game ends after the nth toss, a Kolmogorov model is provided by taking Ω = {0, 1}ⁿ, 𝓕 the set of all subsets of Ω, and setting μ({ω}) = 2⁻ⁿ for each ω ∈ Ω. More generally, any measure space in which Ω has total measure 1 can be thought of as a model of probability, for which reason such a measure space is called a probability space, and the measure μ is called a probability measure and is often denoted by ℙ. In this connection, when dealing with probability spaces, one's intuition is aided by extending the metaphor to other objects. Namely, one calls Ω the sample space, its elements are called sample points, the elements of 𝓕 are called events, and the number that ℙ assigns an event is called the probability of that event. In addition, a measurable map is called a random variable, it tends to be denoted by X instead of f, and, when it is ℝ-valued, its integral is called its expected value. Moreover, the latter convention is reflected in the use of 𝔼[X], or, when more precision is required, 𝔼^ℙ[X], to denote ∫ X dℙ. Also, 𝔼[X, A] or 𝔼^ℙ[X, A] is used to denote ∫_A X dℙ, the expected value of X on the event A. Finally, the distribution of a random variable whose values lie in the measurable space (E, 𝓑) is the probability measure X∗ℙ on (E, 𝓑) given by (X∗ℙ)(B) = ℙ(X ∈ B). In particular, when X is ℝ-valued, its distribution function F_X is defined so that F_X(x) = (X∗ℙ)((−∞, x]) = ℙ(X ≤ x). Obviously, F_X is nondecreasing. Moreover, as an application of (6.1.3), one can show that F_X is continuous from the right, in the sense that F_X(x) = lim_{y↘x} F_X(y) for each x ∈ ℝ, lim_{x→−∞} F_X(x) = 0, and lim_{x→+∞} F_X(x) = 1. At the same time, one sees that ℙ(X < x) = F_X(x−) ≡ lim_{y↗x} F_X(y), and, as a consequence, that ℙ(X = x) = F_X(x) − F_X(x−) is the jump in F_X at x.
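This vocabulary can be made concrete on the finite coin-tossing space. A short sketch (the helper names are my own): take Ω = {0,1}³ with ℙ({ω}) = 2⁻³ and let X count heads; then the jump of F_X at x recovers ℙ(X = x).

```python
from itertools import product
from fractions import Fraction

n = 3
omega = list(product([0, 1], repeat=n))         # sample space {0,1}^n
P = {w: Fraction(1, 2 ** n) for w in omega}     # fair-coin probability measure

def prob(event):
    """P(A) = sum of P({w}) over the sample points w in the event A."""
    return sum(P[w] for w in event)

def X(w):
    """Random variable: the number of heads in the outcome w."""
    return sum(w)

E_X = sum(X(w) * P[w] for w in omega)           # expected value E[X]

def F(x):
    """Distribution function F_X(x) = P(X <= x)."""
    return prob({w for w in omega if X(w) <= x})
```

Here E_X = 3/2, and F(1) − F(0) = ℙ(X = 1) = 3/8, the jump of F_X at 1.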
6.2.1. Modeling Infinitely Many Tosses of a Fair Coin: Just as in the general theory of measure spaces, the construction of probability spaces presents no analytic (as opposed to combinatorial) problems as long as the sample space is finite or countable. The only change from the general theory is that the assignment of probabilities to the sample points must satisfy the condition that Σ_{ω∈Ω} ℙ({ω}) = 1. The technical analytic problems arise when Ω is uncountable. For example, suppose that, instead of stopping the game after n tosses, one thinks about a coin tossing game of indefinite duration. Clearly, the sample space will now be Ω = {0, 1}^{ℤ⁺}. […] choose open sets Gₖ ⊇ Γₖ so that ℙ̄(Gₖ \ Γₖ) ≤ 2⁻ᵏε. Then G = ∪₁^∞ Gₖ is open, G ⊇ Γ = ∪₁^∞ Γₖ, and

ℙ̄(G \ Γ) ≤ ℙ̄( ∪₁^∞ (Gₖ \ Γₖ) ) ≤ Σ₁^∞ ℙ̄(Gₖ \ Γₖ) ≤ ε.
Hence, Γ ∈ 𝓑, and so 𝓑 is closed under countable unions. To see that it is also closed under complementation, let Γ ∈ 𝓑 be given, and, for each n ≥ 1, choose an open set Gₙ ⊇ Γ so that ℙ̄(Gₙ \ Γ) ≤ 1/n. Then D = ∩₁^∞ Gₙ ⊇ Γ and ℙ̄(D \ Γ) ≤ ℙ̄(Gₙ \ Γ) ≤ 1/n for all n ≥ 1. Thus, ℙ̄(D \ Γ) = 0, and so D \ Γ ∈ 𝓑. Now set Fₙ = Gₙ∁ and C = ∪₁^∞ Fₙ. Because each Fₙ is closed, and therefore measurable, C ∈ 𝓑. Hence, since Γ∁ = C ∪ (D \ Γ) ∈ 𝓑, we are done.

6.2.4 THEOREM. Referring to the preceding, 𝓑₀ = σ(𝓐) and Γ ∈ 𝓑 if and only if there exist C, D ∈ 𝓑₀ such that C ⊆ Γ ⊆ D and ℙ̄(D \ C) = 0. Next, again use ℙ to denote the restriction ℙ̄|𝓑 of ℙ̄ to 𝓑. Then (Ω, 𝓑, ℙ) is a probability space with the property that ℙ|𝓐ₙ = ℙₙ for every n ≥ 1.
PROOF: We have already established that 𝓑 is a σ-algebra. Since all elements of 𝓐 are open, the equality 𝓑₀ = σ(𝓐) follows from the fact that every open set can be written as the countable union of elements of 𝓐. To prove the characterization of 𝓑 in terms of 𝓑₀, first observe that, since open sets are in 𝓑, 𝓑₀ ⊆ 𝓑. Moreover, since sets of ℙ̄-measure 0 are also in 𝓑, C ⊆ Γ ⊆ D for C, D ∈ 𝓑₀ with ℙ̄(D \ C) = 0 implies that Γ = C ∪ (Γ \ C) ∈ 𝓑, since ℙ̄(Γ \ C) ≤ ℙ̄(D \ C) = 0. Conversely, if Γ ∈ 𝓑, then we can find sequences {Gₙ}₁^∞ and {Hₙ}₁^∞ of open sets such that Gₙ ⊇ Γ, Hₙ ⊇ Γ∁, and ℙ̄(Gₙ \ Γ) ∨ ℙ̄(Hₙ \ Γ∁) ≤ 1/n. Thus, if D = ∩₁^∞ Gₙ and C = ∪₁^∞ Hₙ∁, then C, D ∈ 𝓑₀, C ⊆ Γ ⊆ D, and
ℙ̄(D \ C) ≤ ℙ̄(Gₙ \ Γ) + ℙ̄(Hₙ \ Γ∁) ≤ 2/n

for all n ≥ 1. Hence, ℙ̄(D \ C) = 0.

All that remains is to check that ℙ̄ is countably additive on 𝓑. To this end, let {Γₖ}₁^∞ ⊆ 𝓑 be a sequence of mutually disjoint sets, and set Γ = ∪₁^∞ Γₖ. By the second part of Lemma 6.2.2, we know that ℙ̄(Γ) ≤ Σ₁^∞ ℙ̄(Γₖ). To get the opposite inequality, let ε > 0 be given, and, for each k, choose an open Gₖ ⊇ Γₖ∁ so that ℙ̄(Gₖ \ Γₖ∁) ≤ 2⁻ᵏε. Then each Fₖ = Gₖ∁ is closed, and
Fₖ ⊆ Γₖ with ℙ̄(Γₖ) ≤ ℙ̄(Fₖ) + ℙ̄(Gₖ \ Γₖ∁) ≤ ℙ̄(Fₖ) + 2⁻ᵏε.

In addition, because the Γₖ's are mutually disjoint, the Fₖ's are mutually disjoint. Hence, by the last part of Lemma 6.2.2, we know first that

ℙ̄(Γ) ≥ Σ₁ⁿ ℙ̄(Fₖ)   for all n ≥ 1,
and then that

Σ₁^∞ ℙ̄(Γₖ) ≤ Σ₁^∞ ℙ̄(Fₖ) + ε ≤ ℙ̄(Γ) + ε. □
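The countable additivity just established can be probed by simulation. In the infinite-toss model, the events {first head occurs at toss k} are mutually disjoint with probabilities 2⁻ᵏ summing to 1; the Monte Carlo sketch below (my own illustration, not from the text) checks the first few of these frequencies:

```python
import random

random.seed(0)
trials = 100_000
counts = {}
for _ in range(trials):
    k = 1                               # toss index of the first head
    while random.random() < 0.5:        # tails with probability 1/2
        k += 1
    counts[k] = counts.get(k, 0) + 1

# empirical frequency of {first head at toss k} should approximate 2^-k
freqs = {k: counts.get(k, 0) / trials for k in range(1, 6)}
```

That every trial terminates reflects the fact that, since Σₖ 2⁻ᵏ = 1, a head eventually appears with probability one.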
6.3 Independent Random Variables

In Kolmogorov's model, independence is best described in terms of σ-algebras. Namely, if (Ω, 𝓕, ℙ) is a probability space and 𝓕₁ and 𝓕₂ are σ-subalgebras (i.e., are σ-algebras which are subsets) of 𝓕, then we say that 𝓕₁ and 𝓕₂ are independent if ℙ(Γ₁ ∩ Γ₂) = ℙ(Γ₁)ℙ(Γ₂) for all Γ₁ ∈ 𝓕₁ and Γ₂ ∈ 𝓕₂. It should be comforting to recognize that, when A₁, A₂ ∈ 𝓕 and, for i ∈ {1, 2}, 𝓕ᵢ = σ({Aᵢ}) = {∅, Aᵢ, Aᵢ∁, Ω}, then, as is easily checked, 𝓕₁ is independent of 𝓕₂ precisely when, in the terminology of elementary probability theory, "A₁ is independent of A₂": ℙ(A₁ ∩ A₂) = ℙ(A₁)ℙ(A₂).

The notion of independence gets inherited by random variables. Namely, the members of a collection {X_α : α ∈ I} of random variables on (Ω, 𝓕, ℙ)
are said to be mutually independent if, for each pair of disjoint subsets J₁ and J₂ of I, the σ-algebras σ({X_α : α ∈ J₁}) and σ({X_α : α ∈ J₂}) are independent. One can use Theorem 6.1.6 to show that this definition is equivalent to saying that if X_α takes its values in the measurable space (E_α, 𝓑_α), then, for every finite subset {α_m}₁ⁿ of distinct elements of I and choice of B_{α_m} ∈ 𝓑_{α_m}, 1 ≤ m ≤ n,

ℙ( X_{α_m} ∈ B_{α_m} for 1 ≤ m ≤ n ) = ∏ₘ₌₁ⁿ ℙ( X_{α_m} ∈ B_{α_m} ).

As a dividend of this definition, it is essentially obvious that if {X_α : α ∈ I} are mutually independent and if, for each α ∈ I, F_α is a measurable map on the range of X_α, then {F_α(X_α) : α ∈ I} are again mutually independent. Finally, by starting with simple functions, one can show that if {X_m}₁ⁿ are mutually independent and, for each 1 ≤ m ≤ n, f_m is a measurable ℝ-valued function on the range of X_m, then

𝔼[ f₁(X₁) ⋯ fₙ(Xₙ) ] = ∏ₘ₌₁ⁿ 𝔼[ fₘ(Xₘ) ]
whenever the fₘ are all bounded or are all nonnegative.

6.3.1. Existence of Lots of Independent Random Variables: In the preceding section, we constructed a countable family of mutually independent random variables. Namely, if (Ω, 𝓑, ℙ) is the probability space discussed in Theorem 6.2.4 and Xₘ(ω) = ω(m) is the mth coordinate of ω, then, for any choice of n ≥ 1 and (η₁, …, ηₙ) ∈ {0, 1}ⁿ, ℙ(Xₘ = ηₘ, 1 ≤ m ≤ n) = 2⁻ⁿ = ∏₁ⁿ ℙ(Xₘ = ηₘ). Thus, the random variables {Xₘ}₁^∞ are mutually independent. Mutually independent random variables, like these, which take on only two values are called Bernoulli random variables.

As we are about to see, Bernoulli random variables can be used as building blocks to construct many other families of mutually independent random variables. The key to such constructions is contained in the following lemma.

6.3.1 LEMMA. Given any family {Bₘ}₁^∞ of mutually independent, {0, 1}-valued Bernoulli random variables satisfying ℙ(Bₘ = 0) = ½ = ℙ(Bₘ = 1) for all m ∈ ℤ⁺, set U = Σ₁^∞ 2⁻ᵐ Bₘ. Then U is uniformly distributed on [0, 1). That is, ℙ(U ≤ u) is 0 if u < 0, u if u ∈ [0, 1), and 1 if u ≥ 1.

PROOF: Given N ≥ 1 and 0 ≤
n < 2ᴺ, we want to show that

(*)   ℙ( n2⁻ᴺ < U ≤ (n+1)2⁻ᴺ ) = 2⁻ᴺ.

To this end, note that n2⁻ᴺ < U ≤ (n+1)2⁻ᴺ if and only if either Σ₁ᴺ 2⁻ᵐ Bₘ = (n+1)2⁻ᴺ and Bₘ = 0 for all m > N, or Σ₁ᴺ 2⁻ᵐ Bₘ = n2⁻ᴺ and Bₘ = 1 for some m > N. Hence, since ℙ(Bₘ = 0 for all m > N) = 0, the left hand
side of (*) is equal to the probability that Σ₁ᴺ 2⁻ᵐ Bₘ = n2⁻ᴺ. However, elementary considerations show that, for any 0 ≤ n < 2ᴺ there is exactly one choice of (η₁, …, η_N) ∈ {0, 1}ᴺ for which Σ₁ᴺ 2⁻ᵐ ηₘ = n2⁻ᴺ. Hence,

ℙ( Σ₁ᴺ 2⁻ᵐ Bₘ = n2⁻ᴺ ) = ℙ( Bₘ = ηₘ for 1 ≤ m ≤ N ) = 2⁻ᴺ.

Having proved (*), the rest is easy. Namely, since ℙ(U = 0) = ℙ(Bₘ = 0 for all m ≥ 1) = 0, (*) tells us that, for any 1 ≤ k ≤ 2ᴺ,

ℙ(U ≤ k2⁻ᴺ) = Σₘ₌₀^{k−1} ℙ( m2⁻ᴺ < U ≤ (m+1)2⁻ᴺ ) = k2⁻ᴺ.

Hence, because u ↦ ℙ(U ≤ u) is continuous from the right, it is now clear that F_U(u) = u for all u ∈ [0, 1). Finally, since ℙ(U ∈ [0, 1]) = 1, this completes
the proof. □

Now let I be a nonempty, finite or countably infinite set. Then I × ℤ⁺ is countable, and so we can construct a one-to-one map (α, n) ↦ N(α, n) from I × ℤ⁺ onto ℤ⁺. Next, for each α ∈ I, define ω ∈ Ω = {0, 1}^{ℤ⁺} ↦ X_α(ω) ∈ Ω so that the nth coordinate of X_α(ω) is ω(N(α, n)). Then, as random variables on the probability space (Ω, 𝓑, ℙ) in Theorem 6.2.4, {X_α : α ∈ I} are mutually independent and each has distribution ℙ. Hence, if Φ : Ω → [0, 1] is the continuous map given by

Φ(η) = Σₘ₌₁^∞ 2⁻ᵐ η(m)   for η ∈ Ω,

and if U_α = Φ(X_α), then the random variables {U_α = Φ(X_α) : α ∈ I} are mutually independent and, by Lemma 6.3.1, each is uniformly distributed on [0, 1].

The final step in this section combines the preceding construction with the well-known fact that any ℝ-valued random variable can be represented in terms of a uniform random variable. More precisely, a map F : ℝ → [0, 1] is called a distribution function if F is nondecreasing, continuous from the right, and tends to 0 at −∞ and 1 at +∞. Given such an F, define
F⁻¹(u) = inf{ x ∈ ℝ : F(x) ≥ u }   for u ∈ [0, 1].

Notice that, by right continuity, F(x) ≥ u ⟺ F⁻¹(u) ≤ x. Hence, if U is uniformly distributed on [0, 1], then F⁻¹(U) is a random variable whose distribution function is F.

6.3.2 THEOREM. Let (Ω, 𝓑, ℙ) be the probability space in Theorem 6.2.4. Given any finite or countably infinite index set I and a collection {F_α : α ∈ I}
of distribution functions, there exist mutually independent random variables {X_α : α ∈ I} on (Ω, 𝓑, ℙ) with the property that, for each α ∈ I, F_α is the distribution function of X_α.
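This construction can be sketched in Python: build a uniform from fair Bernoulli bits as in Lemma 6.3.1, then push it through F⁻¹. I use the exponential distribution function F(x) = 1 − e⁻ˣ purely as an illustration (all names here are my own), since its generalized inverse has a closed form:

```python
import math
import random

random.seed(42)

def uniform_from_bits(nbits=53):
    """U = sum over m of 2^-m B_m with independent fair Bernoulli bits,
    as in Lemma 6.3.1 (truncated to nbits terms)."""
    return sum(2.0 ** -(m + 1) * random.randint(0, 1) for m in range(nbits))

def F_inv(u):
    """Generalized inverse F^{-1}(u) = inf{x : F(x) >= u} for the
    exponential distribution function F(x) = 1 - exp(-x)."""
    return -math.log(1.0 - u)

samples = [F_inv(uniform_from_bits()) for _ in range(50_000)]

# empirical distribution function at 1 should be near F(1) = 1 - e^{-1}
frac_below_1 = sum(x <= 1.0 for x in samples) / len(samples)
```

Using disjoint blocks of bits for different indices α would yield the mutually independent family of the theorem.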
6.4 Conditional Probabilities and Expectations

Just as they are to independence, σ-algebras are central to Kolmogorov's definition of conditioning. Namely, given a probability space (Ω, 𝓕, ℙ), a sub-σ-algebra Σ, and a random variable X which is nonnegative or integrable, Kolmogorov says that the random variable X_Σ is a conditional expectation of X given Σ if X_Σ is a nonnegative random variable which is measurable with respect to Σ (i.e., σ({X})